A Machine Learning Specialist needs to move and transform data in preparation for training. Some of the data needs to be processed in near-real time, and other data can be moved hourly. There are existing Amazon EMR MapReduce jobs that perform data cleaning and feature engineering on the data. Which of the following services can feed data to the MapReduce jobs? (Select TWO.)
A. AWSDMS
B. Amazon Kinesis
C. AWS Data Pipeline
D. Amazon Athena
E. Amazon ES
Explanation: Amazon Kinesis and AWS Data Pipeline are two services that can feed data
to the Amazon EMR MapReduce jobs. Amazon Kinesis is a service that can ingest,
process, and analyze streaming data in real time. Amazon Kinesis can be integrated with
Amazon EMR to run MapReduce jobs on streaming data sources, such as web logs, social
media, IoT devices, and clickstreams. Amazon Kinesis can handle data that needs to be
processed in near-real time, such as for anomaly detection, fraud detection, or
dashboarding. AWS Data Pipeline is a service that can orchestrate and automate data
movement and transformation across various AWS services and on-premises data
sources. AWS Data Pipeline can be integrated with Amazon EMR to run MapReduce jobs
on batch data sources, such as Amazon S3, Amazon RDS, Amazon DynamoDB, and
Amazon Redshift. AWS Data Pipeline can handle data that can be moved hourly, such as
for data warehousing, reporting, or machine learning.
AWSDMS is not a valid service name. AWS Database Migration Service (AWS DMS) is a
service that can migrate data from various sources to various targets, but it does not
support streaming data or MapReduce jobs.
Amazon Athena is a service that can query data stored in Amazon S3 using standard SQL,
but it does not feed data to Amazon EMR or run MapReduce jobs.
Amazon ES is a service that provides a fully managed Elasticsearch cluster, which can be
used for search, analytics, and visualization, but it does not feed data to Amazon EMR or
run MapReduce jobs.
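As an illustration of the batch side of this pattern, the sketch below uses boto3 to submit a Hadoop streaming step to an existing EMR cluster once AWS Data Pipeline (or any hourly scheduler) has staged the data in S3. The cluster ID, bucket paths, and mapper/reducer script names are hypothetical placeholders.

```python
import boto3

# Minimal sketch: submit an existing MapReduce (Hadoop streaming) step to an
# EMR cluster after the hourly batch data has been staged in S3.
# The cluster ID, S3 paths, and script names below are hypothetical.
emr = boto3.client("emr")

response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "hourly-clean-and-feature-engineer",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-input", "s3://example-bucket/raw/hourly/",
                    "-output", "s3://example-bucket/features/hourly/",
                    "-mapper", "clean_mapper.py",
                    "-reducer", "feature_reducer.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```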
A power company wants to forecast future energy consumption for its customers in
residential properties and commercial business properties. Historical power consumption
data for the last 10 years is available. A team of data scientists, who performed the initial
data analysis and feature selection, will include the historical power consumption data
along with related data such as weather, the number of individuals on the property, and
public holidays.
The data scientists are using Amazon Forecast to generate the forecasts.
Which algorithm in Forecast should the data scientists use to meet these requirements?
A. Autoregressive Integrated Moving Average (ARIMA)
B. Exponential Smoothing (ETS)
C. Convolutional Neural Network - Quantile Regression (CNN-QR)
D. Prophet
Explanation: CNN-QR is a proprietary machine learning algorithm for forecasting time series using causal convolutional neural networks (CNNs). CNN-QR works best with large datasets containing hundreds of time series. It accepts item metadata, and is the only Forecast algorithm that accepts related time series data without future values. In this case, the power company has historical power consumption data for the last 10 years, which is a large dataset with multiple time series. The data also includes related data such as weather, number of individuals on the property, and public holidays, which can be used as item metadata or related time series data. Therefore, CNN-QR is the most suitable algorithm for this scenario.
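A minimal boto3 sketch of selecting CNN-QR when creating a Forecast predictor is shown below; the predictor name, dataset group ARN, and forecast horizon are hypothetical, and the related time series and item metadata are assumed to be already imported into the dataset group.

```python
import boto3

# Minimal sketch: create an Amazon Forecast predictor that uses the CNN-QR
# algorithm. Predictor name, dataset group ARN, and horizon are hypothetical.
forecast = boto3.client("forecast")

forecast.create_predictor(
    PredictorName="power-consumption-cnn-qr",
    AlgorithmArn="arn:aws:forecast:::algorithm/CNN-QR",
    ForecastHorizon=14,   # e.g., predict 14 days ahead
    PerformAutoML=False,
    InputDataConfig={
        # Dataset group containing the target series plus related time series
        # (weather, holidays) and item metadata.
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:111122223333:dataset-group/power"
    },
    FeaturizationConfig={"ForecastFrequency": "D"},
)
```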
A Data Scientist is building a model to predict customer churn using a dataset of 100
continuous numerical features. The Marketing team has not provided any insight about
which features are relevant for churn prediction. The Marketing team wants to interpret the
model and see the direct impact of relevant features on the model outcome. While training
a logistic regression model, the Data Scientist observes that there is a wide gap between
the training and validation set accuracy.
Which methods can the Data Scientist use to improve the model performance and satisfy
the Marketing team’s needs? (Choose two.)
A. Add L1 regularization to the classifier
B. Add features to the dataset
C. Perform recursive feature elimination
D. Perform t-distributed stochastic neighbor embedding (t-SNE)
E. Perform linear discriminant analysis
Explanation:
The Data Scientist is building a model to predict customer churn using a dataset of
100 continuous numerical features. The Marketing team wants to interpret the
model and see the direct impact of relevant features on the model outcome.
However, the Data Scientist observes that there is a wide gap between the training
and validation set accuracy, which indicates that the model is overfitting the data
and generalizing poorly to new data.
To improve the model performance and satisfy the Marketing team’s needs, the
Data Scientist can use the following methods:
Add L1 regularization to the classifier. L1 regularization penalizes the absolute
values of the model weights, shrinking the weights of irrelevant features to zero.
This reduces overfitting and yields a sparse, interpretable model in which the
remaining nonzero coefficients show the direct impact of the relevant features.
Perform recursive feature elimination (RFE). RFE repeatedly fits the model and
removes the least important features, leaving a smaller set of relevant features.
This narrows the gap between training and validation accuracy and gives the
Marketing team a clear view of which features drive the prediction.
The other options do not satisfy both requirements: adding features would worsen
the overfitting, while t-SNE and linear discriminant analysis transform the original
features into embeddings or components, obscuring their direct impact on the
model outcome.
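A minimal scikit-learn sketch of both techniques follows, using synthetic data in place of the real churn dataset; the regularization strength and feature counts are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn data: 100 continuous features, few informative.
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

# Option A: L1 regularization drives irrelevant feature weights to zero,
# reducing overfitting and leaving an interpretable sparse model.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
l1_model.fit(X, y)
kept = (l1_model.coef_ != 0).sum()
print(f"{kept} features retained by L1 regularization")

# Option C: recursive feature elimination repeatedly drops the least
# important features, keeping only the most relevant ones.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20)
rfe.fit(X, y)
print("Selected feature mask:", rfe.support_)
```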
A company ingests machine learning (ML) data from web advertising clicks into an Amazon
S3 data lake. Click data is added to an Amazon Kinesis data stream by using the Kinesis
Producer Library (KPL). The data is loaded into the S3 data lake from the data stream by
using an Amazon Kinesis Data Firehose delivery stream. As the data volume increases, an
ML specialist notices that the rate of data ingested into Amazon S3 is relatively constant.
There also is an increasing backlog of data for Kinesis Data Streams and Kinesis Data
Firehose to ingest.
Which next step is MOST likely to improve the data ingestion rate into Amazon S3?
A. Increase the number of S3 prefixes for the delivery stream to write to.
B. Decrease the retention period for the data stream.
C. Increase the number of shards for the data stream.
D. Add more consumers using the Kinesis Client Library (KCL).
Explanation: The data ingestion rate into Amazon S3 can be improved by increasing the number of shards for the data stream. A shard is the base throughput unit of a Kinesis data stream. One shard provides 1 MB/second data input and 2 MB/second data output. Increasing the number of shards increases the data ingestion capacity of the stream. This can help reduce the backlog of data for Kinesis Data Streams and Kinesis Data Firehose to ingest.
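A minimal boto3 sketch of resharding the stream is below; the stream name and target shard count are hypothetical placeholders.

```python
import boto3

# Minimal sketch: increase a Kinesis data stream's shard count to raise its
# ingest capacity (each shard accepts 1 MB/s or 1,000 records/s of input).
# The stream name and target count are hypothetical.
kinesis = boto3.client("kinesis")

kinesis.update_shard_count(
    StreamName="click-stream",
    TargetShardCount=8,   # per call, at most double / at least half the current count
    ScalingType="UNIFORM_SCALING",
)
```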
A company plans to build a custom natural language processing (NLP) model to classify
and prioritize user feedback. The company hosts the data and all machine learning (ML)
infrastructure in the AWS Cloud. The ML team works from the company's office, which has
an IPsec VPN connection to one VPC in the AWS Cloud.
The company has set both the enableDnsHostnames attribute and the enableDnsSupport
attribute of the VPC to true. The company's DNS resolvers point to the VPC DNS. The
company does not allow the ML team to access Amazon SageMaker notebooks through
connections that use the public internet. The connection must stay within a private network
and within the AWS internal network.
Which solution will meet these requirements with the LEAST development effort?
A. Create a VPC interface endpoint for the SageMaker notebook in the VPC. Access the notebook through a VPN connection and the VPC endpoint.
B. Create a bastion host by using Amazon EC2 in a public subnet within the VPC. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host.
C. Create a bastion host by using Amazon EC2 in a private subnet within the VPC with a NAT gateway. Log in to the bastion host through a VPN connection. Access the SageMaker notebook from the bastion host.
D. Create a NAT gateway in the VPC. Access the SageMaker notebook HTTPS endpoint through a VPN connection and the NAT gateway.
Explanation: In this scenario, the company requires that access to the Amazon
SageMaker notebook remain within the AWS internal network, avoiding the public internet.
By creating a VPC interface endpoint for SageMaker, the company can ensure that traffic
to the SageMaker notebook remains internal to the VPC and is accessible over a private
connection. The VPC interface endpoint allows private network access to AWS services,
and it operates over AWS’s internal network, respecting the security and connectivity
policies the company requires.
This solution requires minimal development effort compared to options involving bastion
hosts or NAT gateways, as it directly provides private network access to the SageMaker
notebook.
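A hedged boto3 sketch of creating the interface endpoint follows; the VPC, subnet, and security group IDs are hypothetical, and `aws.sagemaker.<region>.notebook` is the service name SageMaker publishes for notebook access (the SageMaker API and runtime use their own `com.amazonaws.<region>.sagemaker.*` names).

```python
import boto3

# Minimal sketch: create a VPC interface endpoint so SageMaker notebook
# traffic stays on the AWS internal network. All IDs below are hypothetical.
ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="aws.sagemaker.us-east-1.notebook",  # notebook-specific endpoint service
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,  # resolve the notebook URL to the endpoint's private IPs
)
```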
A Machine Learning Specialist works for a credit card processing company and needs to
predict which transactions may be fraudulent in near-real time. Specifically, the Specialist
must train a model that returns the probability that a given transaction may be fraudulent.
How should the Specialist frame this business problem?
A. Streaming classification
B. Binary classification
C. Multi-category classification
D. Regression classification
Explanation: The business problem of predicting whether a transaction may be fraudulent can be framed as a binary classification problem. Binary classification is the task of predicting a discrete class label for an example, where the label can take only one of two possible values. In this case, the label can be either “fraudulent” or “not fraudulent.” A binary classification model can return the probability that a given transaction belongs to each class, which is exactly what the Specialist needs. For example, if the model predicts that a transaction has a 0.8 probability of being fraudulent and a 0.2 probability of being legitimate, the transaction can be flagged for review. Binary classification fits this problem because the outcome of interest is categorical with two classes, and the model must return the probability of each outcome.
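As an illustrative sketch of this framing, the snippet below trains a binary classifier on synthetic, imbalanced data standing in for transaction features and reads off the fraud probability for each example; the class weights and decision threshold are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: a binary classifier that returns fraud probabilities,
# trained on synthetic data (97% legitimate, 3% fraudulent).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.97], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Probability that each transaction belongs to the positive ("fraudulent") class.
fraud_prob = model.predict_proba(X[:5])[:, 1]

# The decision threshold can be tuned to the business's fraud-review capacity.
flagged = fraud_prob >= 0.5
print(np.round(fraud_prob, 3), flagged)
```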
A company has set up and deployed its machine learning (ML) model into production with an endpoint using Amazon SageMaker hosting services. The ML team has configured automatic scaling for its SageMaker instances to support workload changes. During testing, the team notices that additional instances are being launched before the new instances are ready. This behavior needs to change as soon as possible. How can the ML team solve this issue?
A. Decrease the cooldown period for the scale-in activity. Increase the configured maximum capacity of instances.
B. Replace the current endpoint with a multi-model endpoint using SageMaker.
C. Set up Amazon API Gateway and AWS Lambda to trigger the SageMaker inference endpoint.
D. Increase the cooldown period for the scale-out activity.
Explanation: The correct solution for changing the scaling behavior of the SageMaker
instances is to increase the cooldown period for the scale-out activity. The cooldown period
is the amount of time, in seconds, after a scaling activity completes before another scaling
activity can start. By increasing the cooldown period for the scale-out activity, the ML team
can ensure that the new instances are ready before launching additional instances. This
will prevent over-scaling and reduce costs.
The other options are incorrect because they either do not solve the issue or require
unnecessary steps. For example:
Option A decreases the cooldown period for the scale-in activity and increases the
configured maximum capacity of instances. This option does not address the issue
of launching additional instances before the new instances are ready. It may also
cause under-scaling and performance degradation.
Option B replaces the current endpoint with a multi-model endpoint using
SageMaker. A multi-model endpoint is an endpoint that can host multiple models
using a single endpoint. It does not affect the scaling behavior of the SageMaker
instances. It also requires creating a new endpoint and updating the application
code to use it.
Option C sets up Amazon API Gateway and AWS Lambda to trigger the
SageMaker inference endpoint. Amazon API Gateway is a service that allows
users to create, publish, maintain, monitor, and secure APIs. AWS Lambda is a
service that lets users run code without provisioning or managing servers. These
services do not affect the scaling behavior of the SageMaker instances. They also
require creating and configuring additional resources and services.
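A minimal sketch of setting a longer scale-out cooldown on a SageMaker variant's target-tracking policy through the Application Auto Scaling API is below; the endpoint name, variant name, target value, and cooldown seconds are hypothetical.

```python
import boto3

# Minimal sketch: lengthen the scale-out cooldown so newly launched instances
# finish provisioning before another scale-out fires. Names and values are
# hypothetical placeholders.
autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 600,  # wait longer before adding more instances
        "ScaleInCooldown": 300,
    },
)
```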
A Machine Learning Specialist is given a structured dataset on the shopping habits of a company’s customer base. The dataset contains thousands of columns of data and hundreds of numerical columns for each customer. The Specialist wants to identify whether there are natural groupings for these columns across all customers and visualize the results as quickly as possible. What approach should the Specialist take to accomplish these tasks?
A. Embed the numerical features using the t-distributed stochastic neighbor embedding (t- SNE) algorithm and create a scatter plot.
B. Run k-means using the Euclidean distance measure for different values of k and create an elbow plot.
C. Embed the numerical features using the t-distributed stochastic neighbor embedding (t- SNE) algorithm and create a line graph.
D. Run k-means using the Euclidean distance measure for different values of k and create box plots for each numerical column within each cluster.
Explanation:
The best approach to identify and visualize the natural groupings for the
numerical columns across all customers is to embed the numerical features using the
t-distributed stochastic neighbor embedding (t-SNE) algorithm and create a scatter plot. t-
SNE is a dimensionality reduction technique that can project high-dimensional data into a
lower-dimensional space, while preserving the local structure and distances of the data
points. A scatter plot can then show the clusters of data points in the reduced space, where
each point represents a customer and the color indicates the cluster membership. This
approach can help the Specialist quickly explore the patterns and similarities among the
customers based on their numerical features.
The other options are not as effective or efficient as the t-SNE approach. Running k-means
for different values of k and creating an elbow plot can help determine the optimal number
of clusters, but it does not provide a visual representation of the clusters or the customers.
Embedding the numerical features using t-SNE and creating a line graph does not make
sense, as a line graph is used to show the change of a variable over time, not the
distribution of data points in a space. Running k-means for different values of k and
creating box plots for each numerical column within each cluster can provide some insights
into the statistics of each cluster, but it is very time-consuming and cumbersome to create
and compare thousands of box plots.
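A minimal scikit-learn/matplotlib sketch of this approach follows, using synthetic blobs in place of the customer table; the perplexity value and the quick k-means pass (used purely to color the points) are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Illustrative sketch: project high-dimensional customer features to 2-D with
# t-SNE and scatter-plot the result. Synthetic blobs stand in for real data.
X, _ = make_blobs(n_samples=500, n_features=100, centers=4, random_state=0)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# A quick k-means pass supplies colors for the plot (optional).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=10, cmap="tab10")
plt.title("t-SNE embedding of customer features")
plt.show()
```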
A machine learning specialist is running an Amazon SageMaker endpoint using the built-in object detection algorithm on a P3 instance for real-time predictions in a company's production application. When evaluating the model's resource utilization, the specialist notices that the model is using only a fraction of the GPU. Which architecture changes would ensure that provisioned resources are being utilized effectively?
A. Redeploy the model as a batch transform job on an M5 instance.
B. Redeploy the model on an M5 instance. Attach Amazon Elastic Inference to the instance.
C. Redeploy the model on a P3dn instance.
D. Deploy the model onto an Amazon Elastic Container Service (Amazon ECS) cluster using a P3 instance.
Explanation: The best way to ensure that provisioned resources are being utilized
effectively is to redeploy the model on an M5 instance and attach Amazon Elastic Inference
to the instance. Amazon Elastic Inference allows you to attach low-cost GPU-powered
acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of
running deep learning inference by up to 75%. By using Amazon Elastic Inference, you can
choose the instance type that is best suited to the overall CPU and memory needs of your
application, and then separately configure the amount of inference acceleration that you
need with no code changes. This way, you can avoid wasting GPU resources and pay only
for what you use.
Option A is incorrect because a batch transform job is not suitable for real-time predictions.
Batch transform is a high-performance and cost-effective feature for generating inferences
using your trained models. Batch transform manages all of the compute resources required
to get inferences. Batch transform is ideal for scenarios where you’re working with large
batches of data, don’t need sub-second latency, or need to process data that is stored in
Amazon S3.
Option C is incorrect because redeploying the model on a P3dn instance would not
improve the resource utilization. P3dn instances are designed for distributed machine
learning and high performance computing applications that need high network throughput
and packet rate performance. They are not optimized for inference workloads.
Option D is incorrect because deploying the model onto an Amazon ECS cluster using a
P3 instance would not ensure that provisioned resources are being utilized effectively.
Amazon ECS is a fully managed container orchestration service that allows you to run and
scale containerized applications on AWS. However, using Amazon ECS would not address
the issue of underutilized GPU resources. In fact, it might introduce additional overhead
and complexity in managing the cluster.
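For reference, attaching an EI accelerator was a single `accelerator_type` argument in the EI-era SageMaker Python SDK, as in the hedged sketch below; the model artifact location, role ARN, and accelerator size are placeholders.

```python
from sagemaker.tensorflow import TensorFlowModel

# Minimal sketch (EI-era SageMaker Python SDK): host on a CPU instance and
# attach a right-sized Elastic Inference accelerator. Names are hypothetical.
model = TensorFlowModel(
    model_data="s3://example-bucket/model/model.tar.gz",
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    framework_version="2.3",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",        # sized for the CPU/memory needs
    accelerator_type="ml.eia2.medium",   # GPU-powered acceleration for inference only
)
```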
A data scientist is using an Amazon SageMaker notebook instance and needs to securely access data stored in a specific Amazon S3 bucket. How should the data scientist accomplish this?
A. Add an S3 bucket policy allowing GetObject, PutObject, and ListBucket permissions to the Amazon SageMaker notebook ARN as principal.
B. Encrypt the objects in the S3 bucket with a custom AWS Key Management Service (AWS KMS) key that only the notebook owner has access to.
C. Attach the policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket.
D. Use a script in a lifecycle configuration to configure the AWS CLI on the instance with an access key ID and secret.
Explanation: The best way to securely access data stored in a specific Amazon S3 bucket from an Amazon SageMaker notebook instance is to attach a policy to the IAM role associated with the notebook that allows GetObject, PutObject, and ListBucket operations to the specific S3 bucket. This way, the notebook can use the AWS SDK or CLI to access the S3 bucket without exposing any credentials or requiring any additional configuration. This is also the recommended approach by AWS for granting access to S3 from SageMaker.
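A hedged sketch of attaching such a scoped inline policy to the notebook's execution role with boto3 follows; the role, policy, and bucket names are placeholders. Note that `s3:ListBucket` applies to the bucket ARN itself, while the object actions apply to `bucket/*`.

```python
import json
import boto3

# Minimal sketch: grant the notebook's execution role scoped access to one
# bucket. Role, policy, and bucket names are hypothetical placeholders.
iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Object-level actions apply to the objects inside the bucket.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-data-bucket/*",
        },
        {   # ListBucket applies to the bucket itself, not its objects.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-data-bucket",
        },
    ],
}

iam.put_role_policy(
    RoleName="SageMakerNotebookExecutionRole",
    PolicyName="scoped-s3-access",
    PolicyDocument=json.dumps(policy),
)
```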
A company processes millions of orders every day. The company uses Amazon
DynamoDB tables to store order information. When customers submit new orders, the new
orders are immediately added to the DynamoDB tables. New orders arrive in the
DynamoDB tables continuously.
A data scientist must build a peak-time prediction solution. The data scientist must also
create an Amazon QuickSight dashboard to display near real-time order insights. The data
scientist needs to build a solution that will give QuickSight access to the data as soon as
new order information arrives.
Which solution will meet these requirements with the LEAST delay between when a new
order is processed and when QuickSight can access the new order information?
A. Use AWS Glue to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3.
B. Use Amazon Kinesis Data Streams to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3.
C. Use an API call from QuickSight to access the data that is in Amazon DynamoDB directly.
D. Use Amazon Kinesis Data Firehose to export the data from Amazon DynamoDB to Amazon S3. Configure QuickSight to access the data in Amazon S3.
Explanation: The best solution for this scenario is to use Amazon Kinesis Data Streams to
export the data from Amazon DynamoDB to Amazon S3, and then configure QuickSight to
access the data in Amazon S3. This solution has the following advantages:
It allows near real-time data ingestion from DynamoDB to S3 using Kinesis Data
Streams, which can capture and process data continuously and at scale.
It enables QuickSight to access the data in S3 using the Athena connector, which
supports federated queries to multiple data sources, including Kinesis Data
Streams.
It avoids the need to create and manage a Lambda function or a Glue crawler,
which are required for the other solutions.
The other solutions have the following drawbacks:
Using AWS Glue to export the data from DynamoDB to S3 introduces additional
latency and complexity, as Glue is a batch-oriented service that requires
scheduling and configuration.
Using an API call from QuickSight to access the data in DynamoDB directly is not
possible, as QuickSight does not support direct querying of DynamoDB.
Using Kinesis Data Firehose to export the data from DynamoDB to S3 is less
efficient and flexible than using Kinesis Data Streams, as Firehose does not
support custom data processing or transformation, and has a minimum buffer
interval of 60 seconds.
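For reference, the DynamoDB side of the chosen solution can be enabled with a single API call, as in the hedged sketch below; the table name and stream ARN are hypothetical, and downstream delivery to S3 (and the QuickSight connection) would be configured separately.

```python
import boto3

# Minimal sketch: stream every item-level change from the orders table into a
# Kinesis data stream. Table name and stream ARN are hypothetical placeholders.
dynamodb = boto3.client("dynamodb")

dynamodb.enable_kinesis_streaming_destination(
    TableName="Orders",
    StreamArn="arn:aws:kinesis:us-east-1:111122223333:stream/orders-changes",
)
```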
A company has an ecommerce website with a product recommendation engine built in
TensorFlow. The recommendation engine endpoint is hosted by Amazon SageMaker.
Three compute-optimized instances support the expected peak load of the website.
Response times on the product recommendation page are increasing at the beginning of
each month. Some users are encountering errors. The website receives the majority of its
traffic between 8 AM and 6 PM on weekdays in a single time zone.
Which of the following options are the MOST effective in solving the issue while keeping
costs to a minimum? (Choose two.)
A. Configure the endpoint to use Amazon Elastic Inference (EI) accelerators.
B. Create a new endpoint configuration with two production variants.
C. Configure the endpoint to automatically scale with the Invocations Per Instance metric.
D. Deploy a second instance pool to support a blue/green deployment of models.
E. Reconfigure the endpoint to use burstable instances.
Explanation: The solution A and C are the most effective in solving the issue while
keeping costs to a minimum. The solution A and C involve the following steps:
Configure the endpoint to use Amazon Elastic Inference (EI) accelerators. This will
enable the company to reduce the cost and latency of running TensorFlow
inference on SageMaker. Amazon EI provides GPU-powered acceleration for deep
learning models without requiring the use of GPU instances. Amazon EI can attach
to any SageMaker instance type and provide the right amount of acceleration
based on the workload.
Configure the endpoint to automatically scale with the Invocations Per Instance
metric. This will enable the company to adjust the number of instances based on
the demand and traffic patterns of the website. The Invocations Per Instance
metric measures the average number of requests that each instance processes
over a period of time. By using this metric, the company can scale out the endpoint
when the load increases and scale in when the load decreases. This can improve
the response time and availability of the product recommendation engine.
The other options are not suitable because:
Option B: Creating a new endpoint configuration with two production variants will
not solve the issue of increasing response time and errors. Production variants are
used to split the traffic between different models or versions of the same model.
They can be useful for testing, updating, or A/B testing models. However, they do
not provide any scaling or acceleration benefits for the inference workload.
Option D: Deploying a second instance pool to support a blue/green deployment of
models will not solve the issue of increasing response time and errors. Blue/green
deployment is a technique for updating models without downtime or disruption. It
involves creating a new endpoint configuration with a different instance pool and
model version, and then shifting the traffic from the old endpoint to the new
endpoint gradually. However, this technique does not provide any scaling or
acceleration benefits for the inference workload.
Option E: Reconfiguring the endpoint to use burstable instances will not solve the
issue of increasing response time and errors. Burstable instances are instances
that provide a baseline level of CPU performance with the ability to burst above the
baseline when needed. They can be useful for workloads that have moderate CPU
utilization and occasional spikes. However, they are not suitable for workloads that
have high and consistent CPU utilization, such as the product recommendation
engine. Moreover, burstable instances may incur additional charges when they
exceed their CPU credits.
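To make the scaling half of this answer concrete: registering the endpoint variant as a scalable target bounds how far it can scale between the off-peak floor and the peak load. A hedged sketch is below, with the endpoint name and capacity limits as placeholders; the target-tracking policy itself would use the SageMakerVariantInvocationsPerInstance predefined metric, as shown in the earlier cooldown sketch.

```python
import boto3

# Minimal sketch: register the endpoint variant so Application Auto Scaling can
# add instances at weekday peaks and remove them overnight. Names and capacity
# bounds are hypothetical placeholders.
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/recommender/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,   # off-peak floor keeps costs down
    MaxCapacity=3,   # matches the instances needed at peak load
)
# A TargetTrackingScaling policy on the SageMakerVariantInvocationsPerInstance
# metric (see the earlier cooldown sketch) then scales between these bounds.
```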