A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The
current process runs at regular time intervals and uses PySpark to combine and format
multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
* Combine multiple data sources
* Reuse existing PySpark logic
* Run the solution on the existing schedule
* Minimize the number of servers that will need to be managed
Which architecture should the Data Scientist use to build this solution?
A. Write the raw data to Amazon S3. Schedule an AWS Lambda function to submit a Spark step to a persistent Amazon EMR cluster based on the existing schedule. Use the existing PySpark logic to run the ETL job on the EMR cluster. Output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.
C. Write the raw data to Amazon S3. Schedule an AWS Lambda function to run on the existing schedule and process the input data from Amazon S3. Write the Lambda logic in Python and implement the existing PySpark logic to perform the ETL process. Have the Lambda function output the results to a "processed" location in Amazon S3 that is accessible for downstream use.
D. Use Amazon Kinesis Data Analytics to stream the input data and perform real-time SQL queries against the stream to carry out the required transformations within the stream. Deliver the output results to a "processed" location in Amazon S3 that is accessible for downstream use.
Explanation:
The Data Scientist needs to migrate an existing on-premises ETL process to the
cloud, using a solution that can combine multiple data sources, reuse existing
PySpark logic, run on the existing schedule, and minimize the number of servers
that need to be managed. The best architecture for this scenario is to use AWS
Glue, which is a serverless data integration service that can create and run ETL
jobs on AWS.
AWS Glue meets each requirement: Glue ETL jobs are written in PySpark, so the existing
logic can be reused; a job can read and combine multiple data sources through crawlers and
the Glue Data Catalog; a time-based Glue trigger can run the job on the existing schedule;
and because Glue is serverless, there are no clusters or servers to manage.
Therefore, the Data Scientist should use the architecture in option B: write the raw data to
Amazon S3, create a PySpark AWS Glue ETL job that reuses the existing logic, schedule it
with a Glue trigger, and write the output to a "processed" location in Amazon S3 for
downstream use.
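A minimal sketch of such a Glue PySpark job is shown below; the database, table, column, and bucket names are hypothetical placeholders, and the join stands in for the existing PySpark logic.

```python
# Minimal sketch of the Glue ETL job in option B; all names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw sources that were landed in Amazon S3 (catalogued by a Glue crawler).
orders = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders").toDF()
customers = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="customers").toDF()

# Reuse the existing PySpark logic on plain Spark DataFrames.
combined = orders.join(customers, "customer_id")

# Write the consolidated output to the "processed" location for downstream use.
combined.write.mode("overwrite").parquet("s3://example-bucket/processed/")

job.commit()
```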
A data scientist is trying to improve the accuracy of a neural network classification model. The data scientist wants to run a large hyperparameter tuning job in Amazon SageMaker. However, previous smaller tuning jobs on the same model often ran for several weeks. The data scientist wants to reduce the computation time required to run the tuning job. Which actions will MOST reduce the computation time for the hyperparameter tuning job? (Select TWO.)
A. Use the Hyperband tuning strategy.
B. Increase the number of hyperparameters.
C. Set a lower value for the MaxNumberOfTrainingJobs parameter.
D. Use the grid search tuning strategy
E. Set a lower value for the MaxParallelTrainingJobs parameter.
Explanation: The Hyperband tuning strategy is a multi-fidelity based tuning strategy that
dynamically reallocates resources to the most promising hyperparameter configurations.
Hyperband uses both intermediate and final results of training jobs to stop underperforming
jobs and reallocate epochs to well-utilized hyperparameter configurations.
Hyperband can provide up to three times faster hyperparameter tuning compared to other
strategies. Setting a lower value for the MaxNumberOfTrainingJobs parameter can also
reduce the computation time for the hyperparameter tuning job by limiting the number of
training jobs that the tuning job can launch. This can help avoid unnecessary or redundant
training jobs that do not improve the objective metric.
The other options are not effective ways to reduce the computation time for the
hyperparameter tuning job. Increasing the number of hyperparameters will increase the
complexity and dimensionality of the search space, which can result in longer computation
time and lower performance. Using the grid search tuning strategy will also increase the
computation time, as grid search methodically searches through every combination of
hyperparameter values, which can be very expensive and inefficient for large search
spaces. Setting a lower value for the MaxParallelTrainingJobs parameter will reduce the
number of training jobs that can run in parallel, which can slow down the tuning process
and increase the waiting time.
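A minimal sketch of configuring this with the SageMaker Python SDK follows; the training image, IAM role, metric regex, hyperparameter ranges, and job limits are all placeholder assumptions.

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Placeholder training image and role; the key settings are the strategy and the job limits.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy", "Regex": "val_acc=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(32, 256),
    },
    strategy="Hyperband",   # multi-fidelity strategy that stops underperforming jobs early
    max_jobs=50,            # corresponds to MaxNumberOfTrainingJobs; keep it modest
    max_parallel_jobs=5,
)

# tuner.fit({"train": "s3://example-bucket/train/"})
```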
A company wants to forecast the daily price of newly launched products based on 3 years
of data for older product prices, sales, and rebates. The time-series data has irregular
timestamps and is missing some values.
A data scientist must build a dataset to replace the missing values. The data scientist needs
a solution that resamples the data daily and exports the data for further modeling.
Which solution will meet these requirements with the LEAST implementation effort?
A. Use Amazon EMR Serverless with PySpark.
B. Use AWS Glue DataBrew.
C. Use Amazon SageMaker Studio Data Wrangler.
D. Use Amazon SageMaker Studio Notebook with Pandas.
Explanation: Amazon SageMaker Studio Data Wrangler is a visual data preparation tool
that enables users to clean and normalize data without writing any code. Using Data
Wrangler, the data scientist can easily import the time-series data from various sources,
such as Amazon S3, Amazon Athena, or Amazon Redshift. Data Wrangler can
automatically generate data insights and quality reports, which can help identify and fix
missing values, outliers, and anomalies in the data. Data Wrangler also provides over 250
built-in transformations, such as resampling, interpolation, aggregation, and filtering, which
can be applied to the data with a point-and-click interface. Data Wrangler can also export
the prepared data to different destinations, such as Amazon S3, Amazon SageMaker
Feature Store, or Amazon SageMaker Pipelines, for further modeling and analysis. Data
Wrangler is integrated with Amazon SageMaker Studio, a web-based IDE for machine
learning, which makes it easy to access and use the tool. Data Wrangler is a serverless
and fully managed service, which means the data scientist does not need to provision,
configure, or manage any infrastructure or clusters.
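For context, the daily resampling and gap-filling that Data Wrangler applies through its point-and-click interface is conceptually equivalent to the pandas sketch below (this is roughly what option D would require writing and maintaining by hand); the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical extract: product price history with irregular timestamps and gaps.
df = pd.read_csv("product_prices.csv", parse_dates=["timestamp"])

daily = (
    df.set_index("timestamp")
      .sort_index()
      .resample("D")            # resample to a daily frequency
      .mean(numeric_only=True)  # aggregate readings that fall on the same day
      .interpolate("time")      # fill missing days with time-based interpolation
)

daily.to_csv("daily_prices.csv")  # export for further modeling
```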
A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet. How should the company set up the job?
A. Launch the notebook instances in a public subnet and access the data through the public S3 endpoint
B. Launch the notebook instances in a private subnet and access the data through a NAT gateway
C. Launch the notebook instances in a public subnet and access the data through a NAT gateway
D. Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.
Explanation: A private subnet is a subnet that does not have a route to the internet gateway, which means that the resources in the private subnet cannot access the internet or be accessed from the internet. An S3 VPC endpoint is a gateway endpoint that allows the resources in the VPC to access the S3 service without going through the internet. By launching the notebook instances in a private subnet and accessing the data through an S3 VPC endpoint, the company can set up the job in a secure and compliant way, as the data never leaves the AWS network and is not exposed to the internet. This can also improve the performance and reliability of the data transfer, as the traffic does not depend on the internet bandwidth or availability.
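A minimal sketch of creating the S3 gateway endpoint with boto3 is shown below; the Region, VPC ID, and route table ID are placeholder assumptions, and the same result can be achieved from the console or CloudFormation.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A gateway endpoint keeps S3 traffic on the AWS network; IDs below are placeholders.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route tables used by the private subnet
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```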
A Machine Learning Specialist observes several performance problems with the training portion of a machine learning solution on Amazon SageMaker. The solution uses a large training dataset that is 2 TB in size and the SageMaker k-means algorithm. The observed issues include an unacceptable length of time before the training job launches and poor I/O throughput while training the model. What should the Specialist do to address the performance issues with the current solution?
A. Use the SageMaker batch transform feature
B. Compress the training data into Apache Parquet format.
C. Ensure that the input mode for the training job is set to Pipe.
D. Copy the training dataset to an Amazon EFS volume mounted on the SageMaker instance.
Explanation: The input mode for the training job determines how the training data is transferred from Amazon S3 to the SageMaker instance. There are two input modes: File and Pipe. File mode copies the entire training dataset from S3 to the local file system of the instance before starting the training job, which can cause a long delay before the job launches, especially when the dataset is large. Pipe mode streams the data from S3 to the instance as the training job runs, which reduces the startup time and improves I/O throughput because the data is read in smaller batches. Therefore, to address the performance issues with the current solution, the Specialist should ensure that the input mode for the training job is set to Pipe. This can be done with the SageMaker Python SDK by setting the input_mode parameter to Pipe when creating the estimator, or with the AWS CLI by setting the InputMode parameter to Pipe when creating the training job.
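A minimal sketch of setting Pipe mode with the SageMaker Python SDK follows; the image URI, IAM role, instance type, and S3 paths are placeholder assumptions.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Placeholder image and role; in practice use the built-in k-means image for your Region.
estimator = Estimator(
    image_uri="<k-means-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    input_mode="Pipe",  # stream data from S3 instead of copying the 2 TB dataset first
    output_path="s3://example-bucket/output/",
    sagemaker_session=session,
)

estimator.fit({"train": TrainingInput("s3://example-bucket/train/",
                                      content_type="application/x-recordio-protobuf")})
```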
A Data Scientist received a set of insurance records, each consisting of a record ID, the final outcome among 200 categories, and the date of the final outcome. Some partial information on claim contents is also provided, but only for a few of the 200 categories. For each outcome category, there are hundreds of records distributed over the past 3 years. The Data Scientist wants to predict how many claims to expect in each category from month to month, a few months in advance. What type of machine learning model should be used?
A. Classification month-to-month using supervised learning of the 200 categories based on claim contents.
B. Reinforcement learning using claim IDs and timestamps where the agent will identify how many claims in each category to expect from month to month.
C. Forecasting using claim IDs and timestamps to identify how many claims in each category to expect from month to month.
D. Classification with supervised learning of the categories for which partial information on claim contents is provided, and forecasting using claim IDs and timestamps for all other categories.
Explanation: Forecasting is a type of machine learning model that predicts future values of
a target variable based on historical data and other features. Forecasting is suitable for
problems that involve time-series data, such as the number of claims in each category from
month to month. Forecasting can handle multiple categories of the target variable, as well
as missing or partial information on some features. Therefore, option C is the best choice
for the given problem.
Option A is incorrect because classification is a type of machine learning model that
assigns a label to an input based on predefined categories. Classification is not suitable for
predicting continuous or numerical values, such as the number of claims in each category
from month to month. Moreover, classification requires sufficient and complete information
on the features that are relevant to the target variable, which is not the case for the given
problem.
Option B is incorrect because reinforcement learning is a type of machine
learning model that learns from its own actions and rewards in an interactive environment.
Reinforcement learning is not suitable for problems that involve historical data and do not
require an agent to take actions.
Option D is incorrect because it combines two different
types of machine learning models, which is unnecessary and inefficient. Moreover,
classification is not suitable for predicting the number of claims in some categories, as
explained in option A.
A Machine Learning Specialist previously trained a logistic regression model using scikit-learn on a local machine, and the Specialist now wants to deploy it to production for inference only. What steps should be taken to ensure Amazon SageMaker can host a model that was trained locally?
A. Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR.
B. Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3.
C. Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub.
D. Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
Explanation: To deploy a model that was trained locally to Amazon SageMaker, the steps
are:
Build the Docker image with the inference code. The inference code should
include the model loading, data preprocessing, prediction, and postprocessing
logic. The Docker image should also include the dependencies and libraries
required by the inference code and the model.
Tag the Docker image with the registry hostname and upload it to Amazon ECR.
Amazon ECR is a fully managed container registry that makes it easy to store,
manage, and deploy container images. The registry hostname is the Amazon ECR
registry URI for your account and Region. You can use the AWS CLI or the
Amazon ECR console to tag and push the Docker image to Amazon ECR.
Create a SageMaker model entity that points to the Docker image in Amazon ECR
and the model artifacts in Amazon S3. The model entity is a logical representation
of the model that contains the information needed to deploy the model for
inference. The model artifacts are the files generated by the model training
process, such as the model parameters and weights. You can use the AWS CLI,
the SageMaker Python SDK, or the SageMaker console to create the model entity.
Create an endpoint configuration that specifies the instance type and number of
instances to use for hosting the model. The endpoint configuration also defines the
production variants, which are the different versions of the model that you want to
deploy. You can use the AWS CLI, the SageMaker Python SDK, or the
SageMaker console to create the endpoint configuration.
Create an endpoint that uses the endpoint configuration to deploy the model. The
endpoint is a web service that exposes an HTTP API for inference requests. You
can use the AWS CLI, the SageMaker Python SDK, or the SageMaker console to
create the endpoint.
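A minimal sketch of the model, endpoint configuration, and endpoint creation steps with boto3 follows; the ECR image URI, IAM role, S3 model path, and resource names are placeholder assumptions.

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Model entity pointing to the inference image in ECR and the model artifacts in S3.
sm.create_model(
    ModelName="sklearn-logreg",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/sklearn-inference:latest",
        "ModelDataUrl": "s3://example-bucket/models/model.tar.gz",
    },
)

# 2. Endpoint configuration with the production variant (instance type and count).
sm.create_endpoint_config(
    EndpointConfigName="sklearn-logreg-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "sklearn-logreg",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# 3. Endpoint that serves inference requests over HTTPS.
sm.create_endpoint(EndpointName="sklearn-logreg-endpoint",
                   EndpointConfigName="sklearn-logreg-config")
```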
An automotive company uses computer vision in its autonomous cars. The company
trained its object detection models successfully by using transfer learning from a
convolutional neural network (CNN). The company trained the models by using PyTorch
through the Amazon SageMaker SDK.
The vehicles have limited hardware and compute power. The company wants to optimize
the model to reduce memory, battery, and hardware consumption without a significant
sacrifice in accuracy.
Which solution will improve the computational efficiency of the models?
A. Use Amazon CloudWatch metrics to gain visibility into the SageMaker training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set new weights based on the pruned set of filters. Run a new training job with the pruned model.
B. Use Amazon SageMaker Ground Truth to build and run data labeling workflows. Collect a larger labeled dataset with the labelling workflows. Run a new training job that uses the new labeled data with previous training data.
C. Use Amazon SageMaker Debugger to gain visibility into the training weights, gradients, biases, and activation outputs. Compute the filter ranks based on the training information. Apply pruning to remove the low-ranking filters. Set the new weights based on the pruned set of filters. Run a new training job with the pruned model.
D. Use Amazon SageMaker Model Monitor to gain visibility into the ModelLatency metric and OverheadLatency metric of the model after the company deploys the model. Increase the model learning rate. Run a new training job.
Explanation: Solution C will improve the computational efficiency of the models because it
uses Amazon SageMaker Debugger and pruning, techniques that reduce the size and
complexity of the convolutional neural network (CNN) models without a significant sacrifice
in accuracy. SageMaker Debugger captures the training weights, gradients, biases, and
activation outputs; from this information the convolutional filters can be ranked, the
low-ranking filters pruned, new weights set for the pruned network, and a new training job
run on the smaller model. The pruned model has fewer parameters and therefore consumes
less memory, battery, and hardware at inference time. Amazon CloudWatch (option A)
provides operational metrics but not tensor-level visibility into weights and gradients,
collecting a larger labeled dataset (option B) does not reduce model size, and Model Monitor
with a higher learning rate (option D) addresses deployment monitoring rather than
computational efficiency.
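For illustration only, structured filter pruning in PyTorch can be sketched as follows; the small CNN and the 30% pruning amount are hypothetical, and in the question's workflow SageMaker Debugger's captured tensors would guide which filters to remove.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical small CNN standing in for the transfer-learned model.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

# Remove 30% of each conv layer's output filters by L2 norm (structured pruning).
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module, "weight")  # make pruning permanent; pruned weights stay zeroed

# The pruned model would then be retrained (fine-tuned) before deployment to the vehicles.
```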
A city wants to monitor its air quality to address the consequences of air pollution. A Machine Learning Specialist needs to forecast the air quality, in parts per million of contaminants, for the next 2 days in the city. As this is a prototype, only daily data from the last year is available. Which model is MOST likely to provide the best results in Amazon SageMaker?
A. Use the Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
B. Use Amazon SageMaker Random Cut Forest (RCF) on the single time series consisting of the full year of data.
C. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of regressor.
D. Use the Amazon SageMaker Linear Learner algorithm on the single time series consisting of the full year of data with a predictor_type of classifier.
Explanation: The Amazon SageMaker k-Nearest-Neighbors (kNN) algorithm is a supervised learning algorithm that can perform both classification and regression tasks. It can also handle time series data, such as the air quality data in this case. The kNN algorithm works by finding the k most similar instances in the training data to a given query instance, and then predicting the output based on the average or majority of the outputs of the k nearest neighbors. The kNN algorithm can be configured to use different distance metrics, such as Euclidean or cosine, to measure the similarity between instances. To use the kNN algorithm on the single time series consisting of the full year of data, the Machine Learning Specialist needs to set the predictor_type parameter to regressor, as the output variable (air quality in parts per million of contaminates) is a continuous value. The kNN algorithm can then forecast the air quality for the next 2 days by finding the k most similar days in the past year and averaging their air quality values.
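A minimal sketch of configuring the built-in kNN algorithm as a regressor with the SageMaker Python SDK follows; the IAM role, instance type, feature construction, and hyperparameter values are placeholder assumptions (the random arrays simply stand in for one year of daily readings).

```python
import numpy as np
import sagemaker
from sagemaker import KNN

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role

# Hypothetical features (e.g., day-of-year and lagged readings); target is ppm of contaminants.
train_features = np.random.rand(365, 3).astype("float32")
train_target = np.random.rand(365).astype("float32")

knn = KNN(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    k=5,
    sample_size=365,
    predictor_type="regressor",  # continuous target, so regression rather than classification
    sagemaker_session=session,
)

knn.fit(knn.record_set(train_features, labels=train_target))
```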
A company wants to create an artificial intelligence (Al) yoga instructor that can lead large
classes of students. The company needs to create a feature that can accurately count the
number of students who are in a class. The company also needs a feature that can
differentiate students who are performing a yoga stretch correctly from students who are
performing a stretch incorrectly.
To determine whether students are performing a stretch correctly, the solution needs to
measure the location and angle of each student's arms and legs. A data scientist must use
Amazon SageMaker to process video footage of a yoga class by extracting image frames and
applying computer vision models.
Which combination of models will meet these requirements with the LEAST effort? (Select TWO.)
A. Image Classification
B. Optical Character Recognition (OCR)
C. Object Detection
D. Pose estimation
E. Image Generative Adversarial Networks (GANs)
Explanation: To count the number of students who are in a class, the solution needs to detect and locate each student in the video frame. Object detection is a computer vision model that can identify and locate multiple objects in an image. To differentiate students who are performing a stretch correctly from students who are performing a stretch incorrectly, the solution needs to measure the location and angle of each student’s arms and legs. Pose estimation is a computer vision model that can estimate the pose of a person by detecting the position and orientation of key body parts. Image classification, OCR, and image GANs are not relevant for this use case.
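As a rough illustration outside SageMaker, a pretrained torchvision Keypoint R-CNN combines both ideas in one model: detection boxes give the student count and body keypoints give the joint positions from which arm and leg angles can be computed. The frame file name and score threshold are hypothetical.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Keypoint R-CNN performs person detection (boxes) and pose estimation (17 COCO keypoints).
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = to_tensor(Image.open("frame_0001.jpg"))  # hypothetical extracted video frame

with torch.no_grad():
    output = model([frame])[0]

people = output["scores"] > 0.8           # keep confident detections
num_students = int(people.sum())          # class size = number of detected people
keypoints = output["keypoints"][people]   # shape (N, 17, 3): x, y, visibility per joint

print(num_students, keypoints.shape)
```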
This graph shows the training and validation loss against the epochs for a neural network.
The network being trained is as follows:
• Two dense layers, one output neuron
• 100 neurons in each layer
• 100 epochs
• Random initialization of weights
Which technique can be used to improve model performance in terms of accuracy in the
validation set?
A. Early stopping
B. Random initialization of weights with appropriate seed
C. Increasing the number of epochs
D. Adding another layer with the 100 neurons
Explanation: Early stopping is a technique that can be used to prevent overfitting and improve model performance on the validation set. Overfitting occurs when the model learns the training data too well and fails to generalize to new and unseen data. This can be seen in the graph, where the training loss keeps decreasing, but the validation loss starts to increase after some point. This means that the model is fitting the noise and patterns in the training data that are not relevant for the validation data. Early stopping is a way of stopping the training process before the model overfits the training data. It works by monitoring the validation loss and stopping the training when the validation loss stops decreasing or starts increasing. This way, the model is saved at the point where it has the best performance on the validation set. Early stopping can also save time and resources by reducing the number of epochs needed for training.
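A minimal sketch of early stopping with a Keras callback follows, using the network described in the question; the patience value is an assumption, and the training data placeholders are left commented out.

```python
import tensorflow as tf

# Two dense layers of 100 neurons and one output neuron, matching the question.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop when validation loss stops improving and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# x_train, y_train, x_val, y_val are placeholders for the actual dataset.
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, callbacks=[early_stop])
```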
A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based
ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the
objective metric This workflow will eventually be deployed in a pipeline that retrains and
tunes hyperparameters each night to model click-through on data that goes stale every 24
hours.
With the goal of decreasing the amount of time it takes to train these models, and ultimately
to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s)
Which visualization will accomplish this?
A. A histogram showing whether the most important input feature is Gaussian.
B. A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension.
C. A scatter plot showing the performance of the objective metric over each training iteration
D. A scatter plot showing the correlation between maximum tree depth and the objective metric.
Explanation: A scatter plot showing the correlation between maximum tree depth and the
objective metric is a visualization that can help the Machine Learning Specialist reconfigure
the input hyperparameter range(s) for the tree-based ensemble model. A scatter plot is a
type of graph that displays the relationship between two variables using dots, where each
dot represents one observation. A scatter plot can show the direction, strength, and shape
of the correlation between the variables, as well as any outliers or clusters. In this case, the
scatter plot can show how the maximum tree depth, which is a hyperparameter that
controls the complexity and depth of the decision trees in the ensemble model, affects the
AUC, which is the objective metric that measures the performance of the model in terms of
the trade-off between true positive rate and false positive rate. By looking at the scatter
plot, the Machine Learning Specialist can see if there is a positive, negative, or no
correlation between the maximum tree depth and the AUC, and how strong or weak the
correlation is. The Machine Learning Specialist can also see if there is an optimal value or
range of values for the maximum tree depth that maximizes the AUC, or if there is a point
of diminishing returns or overfitting where increasing the maximum tree depth does not
improve or even worsens the AUC. Based on the scatter plot, the Machine Learning
Specialist can reconfigure the input hyperparameter range(s) for the maximum tree depth
to focus on the values that yield the best AUC, and avoid the values that result in poor
AUC. This can decrease the amount of time and cost it takes to train the model, as the
hyperparameter tuning job can explore fewer and more promising combinations of
values. A scatter plot can be created using various tools and libraries, such as Matplotlib,
Seaborn, or Plotly.
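A minimal sketch of producing this visualization from a completed tuning job follows; the tuning job name is a placeholder, and "max_depth" is assumed to be the name of the tuned hyperparameter for the tree-based model.

```python
import matplotlib.pyplot as plt
from sagemaker.analytics import HyperparameterTuningJobAnalytics

# Placeholder tuning job name; the job must already have completed training jobs.
analytics = HyperparameterTuningJobAnalytics("clickthrough-tree-tuning-job")
df = analytics.dataframe()  # one row per training job: hyperparameters + FinalObjectiveValue

plt.scatter(df["max_depth"].astype(float), df["FinalObjectiveValue"])
plt.xlabel("max_depth")
plt.ylabel("Validation AUC")
plt.title("Objective metric vs. maximum tree depth")
plt.show()
```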
The other options are not valid or relevant for reconfiguring the input hyperparameter
range(s) for the tree-based ensemble model. A histogram showing whether the most
important input feature is Gaussian is a visualization that can help the Machine Learning
Specialist understand the distribution and shape of the input data, but not the
hyperparameters. A histogram is a type of graph that displays the frequency or count of
values in a single variable using bars, where each bar represents a bin or interval of
values. A histogram can show if the variable is symmetric, skewed, or multimodal, and if it
follows a normal or Gaussian distribution, which is a bell-shaped curve that is often
assumed by many machine learning algorithms. In this case, the histogram can show if the
most important input feature, which is a variable that has the most influence or predictive
power on the output variable, is Gaussian or not. However, this does not help the Machine
Learning Specialist reconfigure the input hyperparameter range(s) for the tree-based
ensemble model, as the input feature is not a hyperparameter that can be tuned or
optimized. A histogram can be created using various tools and libraries, such as Matplotlib,
Seaborn, or Plotly.
A scatter plot with points colored by target variable that uses t-Distributed Stochastic
Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-
to-read dimension is a visualization that can help the Machine Learning Specialist
understand the structure and clustering of the input data, but not the hyperparameters. t-
SNE is a technique that can reduce the dimensionality of high-dimensional data, such as
images, text, or gene expression, and project it onto a lower-dimensional space, such as
two or three dimensions, while preserving the local similarities and distances between the
data points. t-SNE can help visualize and explore the patterns and relationships in the data,
such as the clusters, outliers, or separability of the classes. In this case, the scatter plot can
show how the input variables, which are the features or predictors of the output variable,
are mapped onto a two-dimensional space using t-SNE, and how the points are colored by
the target variable, which is the output or response variable that the model tries to predict.
However, this does not help the Machine Learning Specialist reconfigure the input
hyperparameter range(s) for the tree-based ensemble model, as the input variables and
the target variable are not hyperparameters that can be tuned or optimized. A scatter plot
with t-SNE can be created using various tools and libraries, such as Scikit-learn,
TensorFlow, or PyTorch.
A scatter plot showing the performance of the objective metric over each training iteration is
a visualization that can help the Machine Learning Specialist understand the learning curve
and convergence of the model, but not the hyperparameters. In this case, the
scatter plot can show how the objective metric, which is the performance measure that the
model tries to optimize, changes over each training iteration, which is the number of times
that the model updates its parameters using a batch of data. A scatter plot can show if the
objective metric improves, worsens, or stagnates over time, and if the model converges to
a stable value or oscillates or diverges. However, this does not help the Machine Learning
Specialist reconfigure the input hyperparameter range(s) for the tree-based ensemble
model, as the objective metric and the training iteration are not hyperparameters that can
be tuned or optimized. A scatter plot can be created using various tools and libraries, such
as Matplotlib, Seaborn, or Plotly.