A machine learning (ML) specialist is training a linear regression model. The specialist notices that the model is overfitting. The specialist applies an L1 regularization parameter and runs the model again. This change results in all features having zero weights. What should the ML specialist do to improve the model results?
A. Increase the L1 regularization parameter. Do not change any other training parameters.
B. Decrease the L1 regularization parameter. Do not change any other training parameters.
C. Introduce a large L2 regularization parameter. Do not change the current L1 regularization value.
D. Introduce a small L2 regularization parameter. Do not change the current L1 regularization value.
Explanation: Applying L1 regularization encourages sparsity by penalizing weights
directly, often driving many weights to zero. In this case, the ML specialist observes that all
weights become zero, which suggests that the L1 regularization parameter is set too
high. This high value overly penalizes non-zero weights, effectively removing all features
from the model.
To improve the model, the ML specialist should reduce the L1 regularization parameter,
allowing some features to retain non-zero weights. This adjustment will make the model
less prone to excessive sparsity, allowing it to better capture essential patterns in the data
without dropping all features. Introducing L2 regularization is another approach but may not
directly resolve this specific issue of all-zero weights as effectively as reducing L1.
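The effect can be reproduced in a few lines with scikit-learn. The following is a minimal sketch; the alpha values are illustrative only and are not taken from the question.

```python
# Minimal sketch (assumes scikit-learn and NumPy); the alpha values are
# illustrative and only demonstrate the behavior described above.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + X[:, 1] * -2.0 + rng.normal(scale=0.5, size=200)

# An excessive L1 penalty drives every weight to zero (the symptom in the question).
too_strong = Lasso(alpha=100.0).fit(X, y)
print(too_strong.coef_)            # all zeros

# Reducing the L1 parameter lets informative features keep non-zero weights.
reduced = Lasso(alpha=0.1).fit(X, y)
print(reduced.coef_)               # sparse, but features 0 and 1 survive
```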
A Machine Learning Specialist is required to build a supervised image-recognition model to
identify a cat. The ML Specialist performs some tests and records the following results for a
neural network-based image classifier:
Total number of images available = 1,000. Test set images = 100 (constant test set).
The ML Specialist notices that, in over 75% of the misclassified images, the cats were held
upside down by their owners.
Which techniques can be used by the ML Specialist to improve this specific test error?
A. Increase the training data by adding variation in rotation for training images.
B. Increase the number of epochs for model training.
C. Increase the number of layers for the neural network.
D. Increase the dropout rate for the second-to-last layer.
Explanation: To improve the test error for the image classifier, the Machine Learning Specialist should use the technique of increasing the training data by adding variation in rotation for training images. This technique is called data augmentation, which is a way of artificially expanding the size and diversity of the training dataset by applying various transformations to the original images, such as rotation, flipping, cropping, scaling, etc. Data augmentation can help the model learn more robust features that are invariant to the orientation, position, and size of the objects in the images. This can improve the generalization ability of the model and reduce the test error, especially for cases where the images are not well-aligned or have different perspectives1.
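As an illustration, the following is a minimal data-augmentation sketch assuming PyTorch and torchvision; the train_images/ directory and the specific transforms are hypothetical choices that add the rotation variation described above.

```python
# Minimal sketch (assumes torchvision); random rotation up to 180 degrees
# exposes the model to upside-down cats during training.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

train_transforms = T.Compose([
    T.RandomRotation(degrees=180),   # covers upside-down orientations
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.Resize((224, 224)),
    T.ToTensor(),
])

# "train_images/" is a hypothetical directory of labeled training images.
train_ds = ImageFolder("train_images/", transform=train_transforms)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
```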
A company wants to predict stock market price trends. The company stores stock market
data each business day in Amazon S3 in Apache Parquet format. The company stores 20
GB of data each day for each stock code.
A data engineer must use Apache Spark to perform batch preprocessing data
transformations quickly so the company can complete prediction jobs before the stock
market opens the next day. The company plans to track more stock market codes and
needs a way to scale the preprocessing data transformations.
Which AWS service or feature will meet these requirements with the LEAST development
effort over time?
A. AWS Glue jobs
B. Amazon EMR cluster
C. Amazon Athena
D. AWS Lambda
Explanation: AWS Glue jobs are the AWS service or feature that will meet the requirements with the least development effort over time. AWS Glue is a fully managed service that enables data engineers to run Apache Spark applications in a serverless Spark environment. AWS Glue jobs can perform batch preprocessing data transformations on
large datasets stored in Amazon S3, such as converting data formats, filtering data, joining data, and aggregating data. AWS Glue jobs can also scale the Spark environment
automatically based on the data volume and processing needs, without requiring any
infrastructure provisioning or management. AWS Glue jobs can reduce the development
effort and time by providing a graphical interface to create and monitor Spark applications,
as well as a code generation feature that can generate Scala or Python code based on the
data sources and targets. AWS Glue jobs can also integrate with other AWS services, such
as Amazon Athena, Amazon EMR, and Amazon SageMaker, to enable further data
analysis and machine learning tasks1.
The other options are either more complex or less scalable than AWS Glue jobs. An Amazon EMR cluster is a managed service that enables data engineers to run Apache Spark applications on a cluster of Amazon EC2 instances. However, an EMR cluster requires more development effort and time than AWS Glue jobs, as it involves setting up, configuring, and managing the cluster, as well as writing and deploying the Spark code. An EMR cluster also does not scale as seamlessly; it typically requires managed scaling policies or manual resizing based on the data volume and processing needs2.
Amazon Athena is a serverless interactive query service that enables data engineers to
analyze data stored in Amazon S3 using standard SQL. However, Amazon Athena is oriented toward interactive SQL queries rather than large-scale batch preprocessing, and custom transformation logic beyond what SQL can express easily is difficult to implement. Amazon Athena is also not designed for running the Spark applications that this workload requires3. AWS Lambda is a
serverless compute service that enables data engineers to run code without provisioning or
managing servers. However, AWS Lambda is not optimized for running Spark applications,
as it has limitations on the execution time, memory size, and concurrency of the functions.
AWS Lambda would also require substantial custom code to coordinate reading, transforming, and writing the large daily datasets in S3 without the distributed processing that Spark provides.
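For reference, a Glue Spark job for this kind of batch preprocessing typically looks like the following sketch; the S3 paths, column names, and aggregation are hypothetical and stand in for the company's actual transformations.

```python
# Minimal AWS Glue (PySpark) job sketch using the standard Glue boilerplate.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the day's Parquet files for all stock codes from S3 (hypothetical path).
raw = spark.read.parquet("s3://example-bucket/stock-data/2024-01-02/")

# Example transformation: aggregate per stock code (hypothetical columns).
daily = raw.groupBy("stock_code").agg({"price": "avg", "volume": "sum"})

daily.write.mode("overwrite").parquet("s3://example-bucket/preprocessed/2024-01-02/")
job.commit()
```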
A company ingests machine learning (ML) data from web advertising clicks into an Amazon S3 data lake. Click data is added to an Amazon Kinesis data stream by using the Kinesis Producer Library (KPL). The data is loaded into the S3 data lake from the data stream by using an Amazon Kinesis Data Firehose delivery stream. As the data volume increases, an ML specialist notices that the rate of data ingested into Amazon S3 is relatively constant. There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest. Which next step is MOST likely to improve the data ingestion rate into Amazon S3?
A. Increase the number of S3 prefixes for the delivery stream to write to.
B. Decrease the retention period for the data stream.
C. Increase the number of shards for the data stream.
D. Add more consumers using the Kinesis Client Library (KCL).
Explanation: Option C is the most likely to improve the data ingestion rate into
Amazon S3 because it increases the number of shards for the data stream. The number of
shards determines the throughput capacity of the data stream, which affects the rate of
data ingestion. Each shard can support up to 1 MB per second of data input and 2 MB per
second of data output. By increasing the number of shards, the company can increase the
data ingestion rate proportionally. The company can use the UpdateShardCount API
operation to modify the number of shards in the data stream1.
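A minimal boto3 sketch of that resharding call, with a hypothetical stream name and target shard count, might look like this:

```python
# Minimal boto3 sketch; the stream name and target count are hypothetical --
# size the target shard count to the observed ingest rate.
import boto3

kinesis = boto3.client("kinesis")
kinesis.update_shard_count(
    StreamName="click-stream",          # hypothetical stream name
    TargetShardCount=8,                 # e.g., double the current shard count
    ScalingType="UNIFORM_SCALING",
)
```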
The other options are not likely to improve the data ingestion rate into Amazon S3
because:
Option A: Increasing the number of S3 prefixes for the delivery stream to write to
will not affect the data ingestion rate, as it only changes the way the data is
organized in the S3 bucket. The number of S3 prefixes can help to optimize the
performance of downstream applications that read the data from S3, but it does
not impact the performance of Kinesis Data Firehose2.
Option B: Decreasing the retention period for the data stream will not affect the
data ingestion rate, as it only changes the amount of time the data is stored in the
data stream. The retention period can help to manage the data availability and
durability, but it does not impact the throughput capacity of the data stream3.
Option D: Adding more consumers using the Kinesis Client Library (KCL) will not
affect the data ingestion rate, as it only changes the way the data is processed by
downstream applications. The consumers can help to scale the data processing
and handle failures, but they do not impact the data ingestion into S3 by Kinesis
Data Firehose4.
A data scientist is designing a repository that will contain many images of vehicles. The repository must scale automatically in size to store new images every day. The repository must support versioning of the images. The data scientist must implement a solution that maintains multiple immediately accessible copies of the data in different AWS Regions. Which solution will meet these requirements?
A. Amazon S3 with S3 Cross-Region Replication (CRR)
B. Amazon Elastic Block Store (Amazon EBS) with snapshots that are shared in a secondary Region
C. Amazon Elastic File System (Amazon EFS) Standard storage that is configured with Regional availability
D. AWS Storage Gateway Volume Gateway
Explanation: For a repository containing a large and dynamically scaling collection of images, Amazon S3 is ideal due to its scalability and versioning capabilities. Amazon S3 natively supports automatic scaling to accommodate increasing storage needs and allows versioning, which enables tracking and managing different versions of objects. To meet the requirement of maintaining multiple, immediately accessible copies of data across AWS Regions, S3 Cross-Region Replication (CRR) can be enabled. CRR automatically replicates new or updated objects to a specified destination bucket in another AWS Region, ensuring low-latency access and disaster recovery. By setting up CRR with versioning enabled, the data scientist can achieve a multi-Region, scalable, and version-controlled repository in Amazon S3.
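A minimal boto3 sketch of enabling versioning and CRR is shown below; the bucket names and IAM role ARN are hypothetical, and in practice the destination bucket is created in the second Region before replication is configured.

```python
# Minimal boto3 sketch (hypothetical bucket names and role ARN).
import boto3

s3 = boto3.client("s3")

# Versioning is required on both buckets; the destination bucket in the
# second Region must be versioned as well (use a client for that Region).
s3.put_bucket_versioning(
    Bucket="vehicle-images-us-east-1",               # hypothetical source bucket
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket="vehicle-images-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",   # hypothetical role
        "Rules": [
            {
                "ID": "replicate-all-images",
                "Prefix": "",                        # replicate every object
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::vehicle-images-eu-west-1"},
            }
        ],
    },
)
```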
An online delivery company wants to choose the fastest courier for each delivery at the
moment an order is placed. The company wants to implement this feature for existing users
and new users of its application. Data scientists have trained separate models with
XGBoost for this purpose, and the models are stored in Amazon S3. There is one model for
each city where the company operates.
The engineers are hosting these models in Amazon EC2 for responding to the web client
requests, with one instance for each model, but the instances have only a 5% utilization in
CPU and memory. The operations engineers want to avoid managing unnecessary resources.
Which solution will enable the company to achieve its goal with the LEAST operational
overhead?
A. Create an Amazon SageMaker notebook instance for pulling all the models from Amazon S3 using the boto3 library. Remove the existing instances and use the notebook to perform a SageMaker batch transform for performing inferences offline for all the possible users in all the cities. Store the results in different files in Amazon S3. Point the web client to the files.
B. Prepare an Amazon SageMaker Docker container based on the open-source multi-model server. Remove the existing instances and create a multi-model endpoint in SageMaker instead, pointing to the S3 bucket containing all the models. Invoke the endpoint from the web client at runtime, specifying the TargetModel parameter according to the city of each request.
C. Keep only a single EC2 instance for hosting all the models. Install a model server in the instance and load each model by pulling it from Amazon S3. Integrate the instance with the web client using Amazon API Gateway for responding to the requests in real time, specifying the target resource according to the city of each request.
D. Prepare a Docker container based on the prebuilt images in Amazon SageMaker. Replace the existing instances with separate SageMaker endpoints, one for each city where the company operates. Invoke the endpoints from the web client, specifying the URL and EndpointName parameter according to the city of each request.
Explanation: The best solution for this scenario is to use a multi-model endpoint in Amazon SageMaker, which allows hosting multiple models on the same endpoint and invoking them dynamically at runtime. This way, the company can reduce the operational overhead of managing multiple EC2 instances and model servers, and leverage the scalability, security, and performance of SageMaker hosting services. By using a multi-model endpoint, the company can also save on hosting costs by improving endpoint utilization and paying only for the models that are loaded in memory and the API calls that are made. To use a multi-model endpoint, the company needs to prepare a Docker container based on the open-source multi-model server, which is a framework-agnostic library that supports loading and serving multiple models from Amazon S3. The company can then create a multi-model endpoint in SageMaker, pointing to the S3 bucket containing all the models, and invoke the endpoint from the web client at runtime, specifying the TargetModel parameter according to the city of each request. This solution also enables the company to add or remove models from the S3 bucket without redeploying the endpoint, and to use different versions of the same model for different cities if needed.
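A minimal sketch of invoking such a multi-model endpoint with boto3 follows; the endpoint name, per-city artifact keys, and CSV payload format are hypothetical.

```python
# Minimal boto3 sketch of invoking a SageMaker multi-model endpoint;
# names and payload format are hypothetical.
import boto3

runtime = boto3.client("sagemaker-runtime")

def predict_fastest_courier(city: str, payload: str) -> str:
    response = runtime.invoke_endpoint(
        EndpointName="courier-xgboost-mme",        # hypothetical endpoint name
        TargetModel=f"{city}.tar.gz",              # one model artifact per city in S3
        ContentType="text/csv",
        Body=payload,                              # CSV feature row for XGBoost
    )
    return response["Body"].read().decode("utf-8")

# Example call for an order placed in a hypothetical city:
print(predict_fastest_courier("seattle", "3.2,0.7,12,1"))
```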
A data scientist is building a forecasting model for a retail company by using the most recent 5 years of sales records that are stored in a data warehouse. The dataset contains sales records for each of the company's stores across five commercial regions. The data scientist creates a working dataset with StoreID, Region, Date, and Sales Amount as columns. The data scientist wants to analyze yearly average sales for each region. The scientist also wants to compare how each region performed relative to average sales across all commercial regions. Which visualization will help the data scientist better understand the data trend?
A. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, faceted by year, of average sales for each store. Add an extra bar in each facet to represent average sales.
B. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each store. Create a bar plot, colored by region and faceted by year, of average sales for each store. Add a horizontal line in each facet to represent average sales.
C. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot of average sales for each region. Add an extra bar in each facet to represent average sales.
D. Create an aggregated dataset by using the Pandas GroupBy function to get average sales for each year for each region. Create a bar plot, faceted by year, of average sales for each region. Add a horizontal line in each facet to represent average sales.
Explanation: The best visualization for this task is to create a bar plot, faceted by year, of average sales for each region and add a horizontal line in each facet to represent average sales. This way, the data scientist can easily compare the yearly average sales for each region with the overall average sales and see the trends over time. The bar plot also allows the data scientist to see the relative performance of each region within each year and across years. The other options are less effective because they either do not show the yearly trends, do not show the overall average sales, or do not group the data by region.
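A minimal sketch of option D using pandas, seaborn, and matplotlib is shown below; the synthetic data and column names stand in for the working dataset described in the question.

```python
# Minimal sketch (assumes pandas, seaborn, matplotlib); synthetic data only.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
dates = pd.date_range("2019-01-01", "2023-12-31", freq="W")
regions = ["North", "South", "East", "West", "Central"]
df = pd.DataFrame(
    [(d, r, rng.gamma(5, 1000)) for d in dates for r in regions],
    columns=["Date", "Region", "Sales Amount"],
)

df["Year"] = pd.to_datetime(df["Date"]).dt.year
regional = df.groupby(["Year", "Region"], as_index=False)["Sales Amount"].mean()

# Bar plot of average sales per region, faceted by year.
g = sns.catplot(
    data=regional, x="Region", y="Sales Amount",
    col="Year", kind="bar", col_wrap=3,
)

# Horizontal line in each facet: the average across all regions for that year.
for year, ax in g.axes_dict.items():
    overall = regional.loc[regional["Year"] == year, "Sales Amount"].mean()
    ax.axhline(overall, linestyle="--", color="gray")
plt.show()
```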
A Machine Learning Specialist is building a supervised model that will evaluate customers' satisfaction with their mobile phone service based on recent usage. The model's output should infer whether or not a customer is likely to switch to a competitor in the next 30 days. Which of the following modeling techniques should the Specialist use?
A. Time-series prediction
B. Anomaly detection
C. Binary classification
D. Regression
Explanation: The modeling technique that the Machine Learning Specialist should use is binary classification. Binary classification is a type of supervised learning that predicts whether an input belongs to one of two possible classes. In this case, the input is the customer’s recent usage data and the output is whether or not the customer is likely to switch to a competitor in the next 30 days. This is a binary outcome, either yes or no, so binary classification is suitable for this problem. The other options are not appropriate for this problem. Time-series prediction is a type of supervised learning that forecasts future values based on past and present data. Anomaly detection is a type of unsupervised learning that identifies outliers or abnormal patterns in the data. Regression is a type of supervised learning that estimates a continuous numerical value based on the input features.
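As a small illustration, a binary classifier for this kind of churn problem can be sketched with scikit-learn; the features and labels below are synthetic placeholders, not the company's usage data.

```python
# Minimal sketch (assumes scikit-learn); synthetic usage-style features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
# Hypothetical usage features: minutes used, dropped calls, support tickets.
X = rng.normal(size=(1000, 3))
y = (X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=1000) > 1).astype(int)  # 1 = will switch

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```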
A company is launching a new product and needs to build a mechanism to monitor
comments about the company and its new product on social media. The company needs to
be able to evaluate the sentiment expressed in social media posts, and visualize trends
and configure alarms based on various thresholds.
The company needs to implement this solution quickly, and wants to minimize the
infrastructure and data science resources needed to evaluate the messages. The company
already has a solution in place to collect posts and store them within an Amazon S3
bucket.
What services should the data science team use to deliver this solution?
A. Train a model in Amazon SageMaker by using the BlazingText algorithm to detect sentiment in the corpus of social media posts. Expose an endpoint that can be called by AWS Lambda. Trigger a Lambda function when posts are added to the S3 bucket to invoke the endpoint and record the sentiment in an Amazon DynamoDB table and in a custom Amazon CloudWatch metric. Use CloudWatch alarms to notify analysts of trends.
B. Train a model in Amazon SageMaker by using the semantic segmentation algorithm to model the semantic content in the corpus of social media posts. Expose an endpoint that can be called by AWS Lambda. Trigger a Lambda function when objects are added to the S3 bucket to invoke the endpoint and record the sentiment in an Amazon DynamoDB table. Schedule a second Lambda function to query recently added records and send an Amazon Simple Notification Service (Amazon SNS) notification to notify analysts of trends.
C. Trigger an AWS Lambda function when social media posts are added to the S3 bucket. Call Amazon Comprehend for each post to capture the sentiment in the message and record the sentiment in an Amazon DynamoDB table. Schedule a second Lambda function to query recently added records and send an Amazon Simple Notification Service (Amazon SNS) notification to notify analysts of trends.
D. Trigger an AWS Lambda function when social media posts are added to the S3 bucket. Call Amazon Comprehend for each post to capture the sentiment in the message and record the sentiment in a custom Amazon CloudWatch metric and in S3. Use CloudWatch alarms to notify analysts of trends.
Explanation: The solution that uses Amazon Comprehend and Amazon CloudWatch is the most suitable for the given scenario. Amazon Comprehend is a natural language processing (NLP) service that can analyze text and extract insights such as sentiment, entities, topics, and syntax. Amazon CloudWatch is a monitoring and observability service that can collect and track metrics, create dashboards, and set alarms based on various thresholds. By using these services, the data science team can quickly and easily implement a solution to monitor the sentiment of social media posts without requiring much infrastructure or data science resources. The solution also meets the requirements of storing the sentiment in both S3 and CloudWatch, and using CloudWatch alarms to notify analysts of trends.
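A minimal sketch of the Lambda handler for this design follows; the metric namespace, object layout, and truncation limit are hypothetical choices.

```python
# Minimal Lambda handler sketch for option D, triggered by S3 object creation.
import json
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        post = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Comprehend limits input size; truncate defensively for this sketch.
        result = comprehend.detect_sentiment(Text=post[:4500], LanguageCode="en")
        sentiment = result["Sentiment"]          # POSITIVE, NEGATIVE, NEUTRAL, MIXED

        # Publish a custom metric so CloudWatch alarms can fire on trends.
        cloudwatch.put_metric_data(
            Namespace="SocialMedia/Sentiment",   # hypothetical namespace
            MetricData=[{"MetricName": sentiment, "Value": 1, "Unit": "Count"}],
        )

        # Keep the scored result alongside the post in S3 (hypothetical layout).
        s3.put_object(
            Bucket=bucket,
            Key=f"sentiment/{key}.json",
            Body=json.dumps({**result["SentimentScore"], "Sentiment": sentiment}),
        )
```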
A data science team is working with a tabular dataset that the team stores in Amazon S3. The team wants to experiment with different feature transformations such as categorical feature encoding. Then the team wants to visualize the resulting distribution of the dataset. After the team finds an appropriate set of feature transformations, the team wants to automate the workflow for feature transformations. Which solution will meet these requirements with the MOST operational efficiency?
A. Use Amazon SageMaker Data Wrangler preconfigured transformations to explore feature transformations. Use SageMaker Data Wrangler templates for visualization. Export the feature processing workflow to a SageMaker pipeline for automation.
B. Use an Amazon SageMaker notebook instance to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.
C. Use AWS Glue Studio with custom code to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package the feature processing steps into an AWS Lambda function for automation.
D. Use Amazon SageMaker Data Wrangler preconfigured transformations to experiment with different feature transformations. Save the transformations to Amazon S3. Use Amazon QuickSight for visualization. Package each feature transformation step into a separate AWS Lambda function. Use AWS Step Functions for workflow automation.
Explanation: Option A will meet the requirements with the most operational efficiency because it uses Amazon SageMaker Data Wrangler, a service that simplifies data preparation and feature engineering for machine learning. Option A involves the following steps:
Use Amazon SageMaker Data Wrangler preconfigured transformations to explore
feature transformations. Amazon SageMaker Data Wrangler provides a visual
interface that allows data scientists to apply various transformations to their tabular
data, such as encoding categorical features, scaling numerical features, imputing
missing values, and more. Amazon SageMaker Data Wrangler also supports
custom transformations using Python code or SQL queries1.
Use SageMaker Data Wrangler templates for visualization. Amazon SageMaker
Data Wrangler also provides a set of templates that can generate visualizations of
the data, such as histograms, scatter plots, box plots, and more. These
visualizations can help data scientists to understand the distribution and
characteristics of the data, and to compare the effects of different feature
transformations1.
Export the feature processing workflow to a SageMaker pipeline for automation.
Amazon SageMaker Data Wrangler can export the feature processing workflow as
a SageMaker pipeline, which is a service that orchestrates and automates
machine learning workflows. A SageMaker pipeline can run the feature processing
steps as a preprocessing step, and then feed the output to a training step or an
inference step. This can reduce the operational overhead of managing the feature
processing workflow and ensure its consistency and reproducibility2.
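For orientation, the pipeline that a Data Wrangler export produces is broadly similar in shape to the following simplified sketch (the generated code is more detailed); the role ARN, S3 paths, and processing script are hypothetical.

```python
# Simplified sketch of a SageMaker pipeline wrapping a processing step;
# role ARN, bucket paths, and script name are hypothetical.
import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=session,
)

transform_step = ProcessingStep(
    name="FeatureTransformations",
    processor=processor,
    inputs=[ProcessingInput(source="s3://example-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://example-bucket/features/")],
    code="transform.py",          # hypothetical script with the chosen transformations
)

pipeline = Pipeline(name="feature-transform-pipeline",
                    steps=[transform_step],
                    sagemaker_session=session)
pipeline.upsert(role_arn=role)
pipeline.start()
```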
The other options are not suitable because:
Option B: Using an Amazon SageMaker notebook instance to experiment with
different feature transformations, saving the transformations to Amazon S3, using
Amazon QuickSight for visualization, and packaging the feature processing steps
into an AWS Lambda function for automation will incur more operational overhead
than using Amazon SageMaker Data Wrangler. The data scientist will have to
write the code for the feature transformations, the data storage, the data
visualization, and the Lambda function. Moreover, AWS Lambda has limitations on
the execution time, memory size, and package size, which may not be sufficient
for complex feature processing tasks3.
Option C: Using AWS Glue Studio with custom code to experiment with different
feature transformations, saving the transformations to Amazon S3, using Amazon
QuickSight for visualization, and packaging the feature processing steps into an
AWS Lambda function for automation will incur more operational overhead than
using Amazon SageMaker Data Wrangler. AWS Glue Studio is a visual interface
that allows data engineers to create and run extract, transform, and load (ETL)
jobs on AWS Glue. However, AWS Glue Studio does not provide preconfigured
transformations or templates for feature engineering or data visualization. The data
scientist will have to write custom code for these tasks, as well as for the Lambda
function. Moreover, AWS Glue Studio is not integrated with SageMaker pipelines,
and it may not be optimized for machine learning workflows4.
Option D: Using Amazon SageMaker Data Wrangler preconfigured
transformations to experiment with different feature transformations, saving the
transformations to Amazon S3, using Amazon QuickSight for visualization,
packaging each feature transformation step into a separate AWS Lambda function,
and using AWS Step Functions for workflow automation will incur more operational
overhead than using Amazon SageMaker Data Wrangler. The data scientist will
have to create and manage multiple AWS Lambda functions and AWS Step
Functions, which can increase the complexity and cost of the solution. Moreover,
AWS Lambda and AWS Step Functions may not be compatible with SageMaker
pipelines, and they may not be optimized for machine learning workflows5.
A machine learning (ML) specialist at a retail company must build a system to forecast the daily sales for one of the company's stores. The company provided the ML specialist with sales data for this store from the past 10 years. The historical dataset includes the total amount of sales on each day for the store. Approximately 10% of the days in the historical dataset are missing sales data. The ML specialist builds a forecasting model based on the historical dataset. The specialist discovers that the model does not meet the performance standards that the company requires. Which action will MOST likely improve the performance for the forecasting model?
A. Aggregate sales from stores in the same geographic area.
B. Apply smoothing to correct for seasonal variation.
C. Change the forecast frequency from daily to weekly.
D. Replace missing values in the dataset by using linear interpolation.
Explanation: When forecasting sales data, missing values can significantly impact model accuracy, especially for time series models. Approximately 10% of the days in this dataset lack sales data, which may cause gaps in patterns and disrupt seasonal trends. Linear interpolation is an effective technique for estimating and filling in missing data points based on adjacent known values, thus preserving the continuity of the time series. By interpolating the missing values, the ML specialist can provide the model with a more complete and consistent dataset, potentially enhancing performance. This approach maintains the daily data granularity, which is important for accurately capturing trends at that frequency.
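A minimal pandas sketch of the interpolation step follows; the sample values are synthetic and show how a missing day is filled from its neighbors.

```python
# Minimal sketch (assumes pandas): fill missing days with linear interpolation.
import pandas as pd

sales = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"],
    "SalesAmount": [1200.0, 1350.0, 1500.0, 1420.0],   # Jan 3 is missing entirely
})
sales["Date"] = pd.to_datetime(sales["Date"])

# Reindex to a complete daily calendar so missing days appear as NaN rows,
# then interpolate linearly between the neighboring known values.
daily = (
    sales.set_index("Date")
    .asfreq("D")
    .interpolate(method="linear")
    .reset_index()
)
print(daily)   # Jan 3 is filled with the midpoint of Jan 2 and Jan 4 (1425.0)
```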
A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy. What is the MOST effective way to encode this categorical feature into a numeric feature?
A. Spell check the column. Use Amazon SageMaker one-hot encoding on the column to transform a categorical feature to a numerical feature.
B. Fix the spelling in the column by using char-RNN. Use Amazon SageMaker Data Wrangler one-hot encoding to transform a categorical feature to a numerical feature.
C. Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers.
D. Use Amazon SageMaker Data Wrangler ordinal encoding on the column to encode categories into an integer between 0 and the total number of categories in the column.
Explanation: The most effective way to encode this categorical feature into a numeric feature is to use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers. Similarity encoding is a technique that transforms categorical features into numerical features by computing the similarity between the categories. Similarity encoding can handle high cardinality and redundancy in categorical features, as it can group similar categories together based on their string similarity. For example, if the column contains the values “aspirin”, “asprin”, and “ibuprofen”, similarity encoding will assign a high similarity score to “aspirin” and “asprin”, and a low similarity score to “ibuprofen”. Similarity encoding can also create embeddings of vectors of real numbers, which can be used as input for machine learning models. Amazon SageMaker Data Wrangler is a feature of Amazon SageMaker that enables you to prepare data for machine learning quickly and easily. You can use SageMaker Data Wrangler to apply similarity encoding to a column of categorical data, and generate embeddings of vectors of real numbers that capture the similarity between the categories1. The other options are either less effective or more complex to implement. Spell checking the column and using one-hot encoding would require additional steps and resources, and may not capture all the misspellings or redundancies. One-hot encoding would also create a large number of features, which could increase the dimensionality and sparsity of the data. Ordinal encoding would assign an arbitrary order to the categories, which could introduce bias or noise in the data.
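As a conceptual illustration only (not Data Wrangler's internal implementation), character n-gram vectors show how similarity encoding keeps misspelled medication names close together while distinct medications stay far apart.

```python
# Conceptual sketch of similarity encoding using character n-gram vectors;
# this illustrates the idea, not the Data Wrangler implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

medications = ["aspirin", "asprin", "ibuprofen", "ibuprofin", "metformin"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
embeddings = vectorizer.fit_transform(medications)   # numeric feature vectors

sim = cosine_similarity(embeddings)
print(round(sim[0, 1], 2))   # aspirin vs asprin: high similarity
print(round(sim[0, 2], 2))   # aspirin vs ibuprofen: low similarity
```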