A Data Scientist is training a multilayer perceptron (MLP) on a dataset with multiple classes. The target class of interest is unique compared to the other classes within the dataset, but it does not achieve an acceptable recall metric. The Data Scientist has already tried varying the number and size of the MLP’s hidden layers, which has not significantly improved the results. A solution to improve recall must be implemented as quickly as possible. Which techniques should be used to meet these requirements?
A. Gather more data using Amazon Mechanical Turk and then retrain
B. Train an anomaly detection model instead of an MLP
C. Train an XGBoost model instead of an MLP
D. Add class weights to the MLP’s loss function and then retrain
Explanation: The best technique to improve the recall of the MLP for the target class of interest is to add class weights to the MLP’s loss function and then retrain. Class weights are a way of assigning different importance to each class in the dataset, such that the model will pay more attention to the classes with higher weights. This can help mitigate the class imbalance problem, where the model tends to favor the majority class and ignore the minority class. By increasing the weight of the target class of interest, the model will try to reduce the false negatives and increase the true positives, which will improve the recall metric. Adding class weights to the loss function is also a quick and easy solution, as it does not require gathering more data, changing the model architecture, or switching to a different algorithm.
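As an illustration, here is a minimal Keras sketch of the technique; the dataset, architecture, class indices, and weight values are hypothetical, and in practice the weight for the under-represented target class would be derived from the class frequencies:

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for the real dataset: 3 classes, class 2 is rare.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.choice(3, size=1000, p=[0.55, 0.4, 0.05])

# Same kind of MLP as before; only the loss weighting changes.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Give the rare target class a larger weight so its errors count more in the
# loss; the exact values here are illustrative assumptions.
class_weight = {0: 1.0, 1: 1.0, 2: 10.0}
model.fit(X, y, epochs=5, batch_size=32, class_weight=class_weight)
```

Because only the loss weighting changes, the model can be retrained immediately without collecting new data or redesigning the architecture.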
A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker.
When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?
A. Choose the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible
B. Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.
C. Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible
D. Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments
Explanation: Using log-scaled hyperparameters is a guideline that can improve the
automatic model tuning in Amazon SageMaker. Log-scaled hyperparameters are
hyperparameters that have values that span several orders of magnitude, such as learning
rate, regularization parameter, or number of hidden units. Log-scaled hyperparameters can
be specified by using a log-uniform distribution, which assigns equal probability to each
order of magnitude within a range. For example, a log-uniform distribution between 0.001
and 1000 is equally likely to sample a value from each of the intervals 0.001–0.01,
0.01–0.1, 0.1–1, 1–10, 10–100, and 100–1000. Using log-scaled hyperparameters can allow the hyperparameter optimization
feature to search the hyperparameter space more efficiently and effectively, as it can
explore different scales of values and avoid sampling values that are too small or too large.
Using log-scaled hyperparameters can also help avoid numerical issues, such as underflow
or overflow, that may occur when using linear-scaled hyperparameters. Using log-scaled
hyperparameters can be done by setting the ScalingType parameter to Logarithmic when
defining the hyperparameter ranges in Amazon SageMaker.
The other options are not valid or relevant guidelines for improving the automatic model
tuning in Amazon SageMaker. Choosing the maximum number of hyperparameters
supported by Amazon SageMaker to search the largest number of combinations possible is
not a good practice, as it can increase the time and cost of the tuning job and make it
harder to find the optimal values. Amazon SageMaker supports up to 20 hyperparameters
for tuning, but it is recommended to choose only the most important and influential
hyperparameters for the model and algorithm, and use default or fixed values for the
rest.
Specifying a very large hyperparameter range to allow Amazon SageMaker to cover
every possible value is not a good practice, as it can result in sampling values that are
irrelevant or impractical for the model and algorithm, and waste the tuning budget. It is
recommended to specify a reasonable and realistic hyperparameter range based on the
prior knowledge and experience of the model and algorithm, and use the results of the
tuning job to refine the range if needed.
Executing only one hyperparameter tuning job at a
time and improving tuning through successive rounds of experiments is not a good
practice, as it can limit the exploration and exploitation of the hyperparameter space and
make the tuning process slower and less efficient. It is recommended to use parallelism
and concurrency to run multiple training jobs simultaneously and leverage the Bayesian
optimization algorithm that Amazon SageMaker uses to guide the search for the best
hyperparameter values.
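As a reference point, the sketch below shows how this might look with the SageMaker Python SDK; the container image, IAM role, S3 paths, objective metric, and hyperparameter ranges are all placeholder assumptions rather than values from the question:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Placeholder estimator; the image, role, and instance type are assumptions.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training-image:latest",
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/output/",
)

# The learning rate spans several orders of magnitude, so it is searched on a
# logarithmic scale; only a few influential hyperparameters are tuned.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-1, scaling_type="Logarithmic"),
    "hidden_units": IntegerParameter(32, 512),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "validation accuracy: ([0-9\\.]+)"}],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=4,  # run several training jobs concurrently
)
tuner.fit({"train": "s3://example-bucket/train/"})
```

Setting `max_parallel_jobs` greater than 1 lets the Bayesian optimization explore the space with concurrent training jobs instead of one sequential experiment at a time.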
An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?
A. Listwise deletion
B. Last observation carried forward
C. Multiple imputation
D. Mean substitution
Explanation: Multiple imputation is a technique that uses machine learning to generate multiple plausible values for each missing value in a dataset, based on the observed data and the relationships among the variables. Multiple imputation preserves the integrity of the dataset by accounting for the uncertainty and variability of the missing data, and avoids the bias and loss of information that may result from other methods, such as listwise deletion, last observation carried forward, or mean substitution. Multiple imputation can improve the accuracy and validity of statistical analysis and machine learning models that use the imputed dataset.
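A hedged sketch of multiple imputation with scikit-learn's IterativeImputer is shown below; the column names and data are hypothetical, and running the imputer several times with different random seeds yields the multiple completed datasets:

```python
import numpy as np
import pandas as pd
# IterativeImputer is still flagged as experimental, so this import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic stand-in for the reseller's dataset; 'price' is missing ~30%.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "units_sold": rng.integers(1, 100, 500),
    "discount": rng.random(500),
    "price": rng.normal(50, 10, 500),
})
df.loc[df.sample(frac=0.3, random_state=0).index, "price"] = np.nan

# Generate several plausible completed datasets by varying the random state;
# downstream analysis is run on each version and the results are pooled.
imputed_versions = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    imputed_versions.append(completed)
```

Pooling results across the imputed versions is what lets multiple imputation reflect the uncertainty in the missing values rather than treating a single guess as truth.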
A retail company wants to combine its customer orders with the product description data from its product catalog. The structure and format of the records in each dataset is different. A data analyst tried to use a spreadsheet to combine the datasets, but the effort resulted in duplicate records and records that were not properly combined. The company needs a solution that it can use to combine similar records from the two datasets and remove any duplicates. Which solution will meet these requirements?
A. Use an AWS Lambda function to process the data. Use two arrays to compare equal strings in the fields from the two datasets and remove any duplicates.
B. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Call the AWS Glue SearchTables API operation to perform a fuzzy-matching search on the two datasets, and cleanse the data accordingly.
C. Create AWS Glue crawlers for reading and populating the AWS Glue Data Catalog. Use the FindMatches transform to cleanse the data.
D. Create an AWS Lake Formation custom transform. Run a transformation for matching products from the Lake Formation console to cleanse the data automatically.
Explanation: The FindMatches transform is a machine learning transform that can identify
and match similar records from different datasets, even when the records do not have a
common unique identifier or exact field values. The FindMatches transform can also
remove duplicate records from a single dataset. The FindMatches transform can be used
with AWS Glue crawlers and jobs to process the data from various sources and store it in a
data lake. The FindMatches transform can be created and managed using the AWS Glue
console, API, or AWS Glue Studio.
The other options are not suitable for this use case. Comparing strings for equality in an AWS Lambda function (option A) cannot match records whose structure and format differ and does not scale well. The AWS Glue SearchTables API (option B) searches table metadata in the Data Catalog; it does not perform fuzzy matching of individual records. AWS Lake Formation (option D) does not provide a custom transform for matching and deduplicating records; record matching is handled by the AWS Glue FindMatches ML transform.
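The sketch below shows roughly how a pre-trained FindMatches transform is applied inside a Glue ETL job; the database, table, transform ID, and S3 path are placeholders:

```python
from awsglue.context import GlueContext
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# Read the combined records that the crawlers cataloged (placeholder names).
records = glue_context.create_dynamic_frame.from_catalog(
    database="retail_db", table_name="orders_and_catalog")

# Apply the pre-trained FindMatches ML transform; it adds a match_id column
# that groups records it considers to be the same real-world item.
matched = FindMatches.apply(frame=records, transformId="tfm-EXAMPLEID")

# Persist the matched output so duplicates can be collapsed downstream.
glue_context.write_dynamic_frame.from_options(
    frame=matched,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/matched/"},
    format="parquet")
```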
A data scientist wants to improve the fit of a machine learning (ML) model that predicts house prices. The data scientist makes a first attempt to fit the model, but the fitted model has poor accuracy on both the training dataset and the test dataset. Which steps must the data scientist take to improve model accuracy? (Select THREE.)
A. Increase the amount of regularization that the model uses.
B. Decrease the amount of regularization that the model uses.
C. Increase the number of training examples that the model uses.
D. Increase the number of test examples that the model uses.
E. Increase the number of model features that the model uses.
F. Decrease the number of model features that the model uses.
Explanation: When a model shows poor accuracy on both the training and test datasets, it
often indicates underfitting: the model is too simple to capture the underlying patterns. To improve the model’s accuracy, the data scientist should decrease the amount of regularization the model uses, increase the number of training examples, and increase the number of model features (options B, C, and E). Increasing regularization or decreasing the number of features would constrain the model further, and adding test examples does not change how the model is fit.
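As a hedged illustration of two of these remedies (weaker regularization and a richer feature set), the scikit-learn sketch below lowers the ridge penalty and adds polynomial features; the house-price data and parameter values are synthetic assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for house-price data with a nonlinear component.
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = 100 * X[:, 0] + 50 * X[:, 1] ** 2 + rng.normal(0, 5, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Underfit baseline: strong regularization, purely linear features.
underfit = make_pipeline(StandardScaler(), Ridge(alpha=1000.0)).fit(X_train, y_train)

# Remedy: much weaker regularization plus richer (polynomial) features.
improved = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=1.0),
).fit(X_train, y_train)

print("baseline R^2:", underfit.score(X_test, y_test))
print("improved R^2:", improved.score(X_test, y_test))
```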
A manufacturing company wants to use machine learning (ML) to automate quality control
in its facilities. The facilities are in remote locations and have limited internet connectivity.
The company has 20 of training data that consists of labeled images of defective product
parts. The training data is in the corporate on-premises data center.
The company will use this data to train a model for real-time defect detection in new parts
as the parts move on a conveyor belt in the facilities. The company needs a solution that
minimizes costs for compute infrastructure and that maximizes the scalability of resources
for training. The solution also must facilitate the company’s use of an ML model in the low-connectivity environments.
Which solution will meet these requirements?
A. Move the training data to an Amazon S3 bucket. Train and evaluate the model by using Amazon SageMaker. Optimize the model by using SageMaker Neo. Deploy the model on a SageMaker hosting services endpoint.
B. Train and evaluate the model on premises. Upload the model to an Amazon S3 bucket. Deploy the model on an Amazon SageMaker hosting services endpoint.
C. Move the training data to an Amazon S3 bucket. Train and evaluate the model by using Amazon SageMaker. Optimize the model by using SageMaker Neo. Set up an edge device in the manufacturing facilities with AWS IoT Greengrass. Deploy the model on the edge device.
D. Train the model on premises. Upload the model to an Amazon S3 bucket. Set up an edge device in the manufacturing facilities with AWS IoT Greengrass. Deploy the model on the edge device.
Explanation: Solution C meets the requirements because it minimizes costs for compute infrastructure, maximizes the scalability of resources for training, and supports inference in low-connectivity environments. It involves moving the training data to Amazon S3, training and evaluating the model with Amazon SageMaker (which provides on-demand, scalable training infrastructure that is paid for only while jobs run), optimizing the trained model with SageMaker Neo so it runs efficiently on edge hardware, and deploying the optimized model to an edge device in each facility with AWS IoT Greengrass so that defect detection runs locally without relying on internet connectivity.
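A hedged sketch of the SageMaker Neo compilation step with boto3 is shown below; the job name, role ARN, S3 paths, framework, input shape, and target device are all placeholder assumptions:

```python
import boto3

sagemaker_client = boto3.client("sagemaker", region_name="us-east-1")

# Compile the trained model artifact for the target edge hardware.
sagemaker_client.create_compilation_job(
    CompilationJobName="defect-detector-neo-example",               # placeholder
    RoleArn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",   # placeholder
    InputConfig={
        "S3Uri": "s3://example-bucket/model/model.tar.gz",           # placeholder
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',             # assumed input shape
        "Framework": "MXNET",                                        # assumed framework
    },
    OutputConfig={
        "S3OutputLocation": "s3://example-bucket/compiled/",         # placeholder
        "TargetDevice": "jetson_xavier",                             # assumed edge device
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
# The compiled artifact in S3 is then packaged as an AWS IoT Greengrass ML
# resource and deployed to the edge device for local, offline inference.
```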
A company supplies wholesale clothing to thousands of retail stores. A data scientist must
create a model that predicts the daily sales volume for each item for each store. The data
scientist discovers that more than half of the stores have been in business for less than 6
months. Sales data is highly consistent from week to week. Daily data from the database
has been aggregated weekly, and weeks with no sales are omitted from the current
dataset. Five years (100 MB) of sales data is available in Amazon S3.
Which factors will adversely impact the performance of the forecast model to be developed,
and which actions should the data scientist take to mitigate them? (Choose two.)
A. Detecting seasonality for the majority of stores will be an issue. Request categorical data to relate new stores with similar stores that have more historical data.
B. The sales data does not have enough variance. Request external sales data from other industries to improve the model's ability to generalize.
C. Sales data is aggregated by week. Request daily sales data from the source database to enable building a daily model.
D. The sales data is missing zero entries for item sales. Request that item sales data from the source database include zero entries to enable building the model.
E. Only 100 MB of sales data is available in Amazon S3. Request 10 years of sales data, which would provide 200 MB of training data for the model.
Explanation: The factors that will adversely impact the performance of the forecast model
are:
Sales data is aggregated by week. This will reduce the granularity and resolution
of the data, and make it harder to capture the daily patterns and variations in sales
volume. The data scientist should request daily sales data from the source
database to enable building a daily model, which will be more accurate and useful
for the prediction task.
Sales data is missing zero entries for item sales. This will introduce bias and
incompleteness in the data, and make it difficult to account for the items that have
no demand or are out of stock. The data scientist should request that item sales
data from the source database include zero entries to enable building the model,
which will be more robust and realistic.
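For the missing zero entries specifically, a minimal pandas sketch of rebuilding a complete daily series is shown below; the column names are hypothetical and the data is a synthetic stand-in for the source database extract:

```python
import pandas as pd

# Days with no sales for an item/store combination are simply absent.
sales = pd.DataFrame({
    "store_id": ["s1", "s1", "s1"],
    "item_id": ["i1", "i1", "i1"],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"]),
    "units_sold": [3, 5, 2],
})

# Resample each (store, item) series to a daily frequency; empty days sum to
# zero, so the model sees explicit zero-demand records.
completed = (
    sales.set_index("date")
    .groupby(["store_id", "item_id"])["units_sold"]
    .resample("D")
    .sum()
    .reset_index()
)
print(completed)
```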
A data scientist wants to use Amazon Forecast to build a forecasting model for inventory demand for a retail company. The company has provided a dataset of historic inventory demand for its products as a .csv file stored in an Amazon S3 bucket. The table below shows a sample of the dataset. How should the data scientist transform the data?
A. Use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. Upload both datasets as .csv files to Amazon S3.
B. Use a Jupyter notebook in Amazon SageMaker to separate the dataset into a related time series dataset and an item metadata dataset. Upload both datasets as tables in Amazon Aurora.
C. Use AWS Batch jobs to separate the dataset into a target time series dataset, a related time series dataset, and an item metadata dataset. Upload them directly to Forecast from a local machine.
D. Use a Jupyter notebook in Amazon SageMaker to transform the data into the optimized protobuf recordIO format. Upload the dataset in this format to Amazon S3.
Explanation: Amazon Forecast requires the input data to be in a specific format. The data scientist should use ETL jobs in AWS Glue to separate the dataset into a target time series dataset and an item metadata dataset. The target time series dataset should contain the timestamp, item_id, and demand columns, while the item metadata dataset should contain the item_id, category, and lead_time columns. Both datasets should be uploaded as .csv files to Amazon S3.
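The Glue job sketch below illustrates the split; the database, table, and S3 names are placeholders, and the column names follow the explanation above:

```python
from awsglue.context import GlueContext
from awsglue.transforms import SelectFields
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# Read the raw demand history cataloged by a crawler (placeholder names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="forecast_db", table_name="inventory_demand")

# Target time series: timestamp, item_id, demand.
target_ts = SelectFields.apply(frame=raw, paths=["timestamp", "item_id", "demand"])

# Item metadata: item_id plus static attributes such as category and lead_time
# (in practice this frame would also be deduplicated per item_id).
item_meta = SelectFields.apply(frame=raw, paths=["item_id", "category", "lead_time"])

for frame, prefix in [(target_ts, "target_time_series"), (item_meta, "item_metadata")]:
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": f"s3://example-bucket/forecast/{prefix}/"},
        format="csv")
```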
A financial services company is building a robust serverless data lake on Amazon S3. The
data lake should be flexible and meet the following requirements:
A. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data catalog to search and discover metadata.
B. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.
C. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.
D. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.
Explanation: To build a robust serverless data lake on Amazon S3 that meets the requirements, the financial services company should use an AWS Glue crawler to crawl the S3 data and populate the AWS Glue Data Catalog, an AWS Lambda function to trigger an AWS Glue ETL job in response to new data, and the AWS Glue Data Catalog to search and discover metadata (option A). All three services are serverless, and the Data Catalog provides a central, searchable metadata repository that integrates with Amazon Athena and other analytics services, whereas AWS Batch and an external Apache Hive metastore would add infrastructure to manage.
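The Lambda handler below is a hedged sketch of the event-driven trigger; the Glue job name and argument are placeholders:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered (for example, by an S3 event or a crawler-completion event)
    to start the downstream AWS Glue ETL job."""
    response = glue.start_job_run(
        JobName="example-curation-etl-job",      # placeholder job name
        Arguments={"--triggered_by": "lambda"},  # placeholder job argument
    )
    return {"JobRunId": response["JobRunId"]}
```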
A credit card company wants to build a credit scoring model to help predict whether a new credit card applicant will default on a credit card payment. The company has collected data from a large number of sources with thousands of raw attributes. Early experiments to train a classification model revealed that many attributes are highly correlated, that the large number of features slows down the training speed significantly, and that there are some overfitting issues.
The Data Scientist on this project would like to speed up the model training time without losing a lot of information from the original dataset.
Which feature engineering technique should the Data Scientist use to meet the objectives?
A. Run self-correlation on all features and remove highly correlated features
B. Normalize all numerical values to be between 0 and 1
C. Use an autoencoder or principal component analysis (PCA) to replace original features with new features
D. Cluster raw data using k-means and use sample data from each cluster to build a new dataset
Explanation: The best feature engineering technique to speed up the model training time without losing a lot of information from the original dataset is to use an autoencoder or principal component analysis (PCA) to replace original features with new features. An autoencoder is a type of neural network that learns a compressed representation of the input data, called the latent space, by minimizing the reconstruction error between the input and the output. PCA is a statistical technique that reduces the dimensionality of the data by finding a set of orthogonal axes, called the principal components, that capture the maximum variance of the data. Both techniques can help reduce the number of features and remove the noise and redundancy in the data, which can improve the model performance and speed up the training process.
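A hedged scikit-learn sketch of the PCA variant is shown below; the synthetic dataset and the 95% variance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for many highly correlated raw attributes: 200 columns
# that are all linear combinations of 20 underlying factors.
rng = np.random.default_rng(0)
base = rng.random((2000, 20))
X = np.hstack([base, base @ rng.random((20, 180))])

# Standardize, then keep enough principal components to explain 95% of the
# variance; the redundant, correlated directions collapse into far fewer features.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = reducer.fit_transform(X)

print("original features:", X.shape[1])
print("reduced features:", X_reduced.shape[1])
```

Training the classifier on the reduced features is faster, and because the retained components capture most of the variance, little information from the original dataset is lost.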
A Machine Learning Specialist is working for a credit card processing company and receives an unbalanced dataset containing credit card transactions. It contains 99,000 valid transactions and 1,000 fraudulent transactions. The Specialist is asked to score a model that was run against the dataset. The Specialist has been advised that identifying valid transactions is equally as important as identifying fraudulent transactions. What metric is BEST suited to score the model?
A. Precision
B. Recall
C. Area Under the ROC Curve (AUC)
D. Root Mean Square Error (RMSE)
Explanation: Area Under the ROC Curve (AUC) is a metric that is best suited to score the
model for the given scenario. AUC is a measure of the performance of a binary classifier,
such as a model that predicts whether a credit card transaction is valid or fraudulent. AUC
is calculated based on the Receiver Operating Characteristic (ROC) curve, which is a plot that shows the trade-off between the true positive rate (TPR) and the false positive rate
(FPR) of the classifier as the decision threshold is varied. The TPR, also known as recall or
sensitivity, is the proportion of actual positive cases (fraudulent transactions) that are
correctly predicted as positive by the classifier. The FPR, also known as the fall-out, is the
proportion of actual negative cases (valid transactions) that are incorrectly predicted as
positive by the classifier. The ROC curve illustrates how well the classifier can distinguish
between the two classes, regardless of the class distribution or the error costs. A perfect
classifier would have a TPR of 1 and an FPR of 0 for all thresholds, resulting in a ROC
curve that goes from the bottom left to the top left and then to the top right of the plot. A
random classifier would have a TPR and an FPR that are equal for all thresholds, resulting
in a ROC curve that goes from the bottom left to the top right of the plot along the diagonal
line. AUC is the area under the ROC curve, and it ranges from 0 to 1. A higher AUC
indicates a better classifier, as it means that the classifier has a higher TPR and a lower
FPR for all thresholds. AUC is a useful metric for imbalanced classification problems, such
as the credit card transaction dataset, because it is insensitive to the class imbalance and
the error costs. AUC can capture the overall performance of the classifier across all
possible scenarios, and it can be used to compare different classifiers based on their ROC
curves.
The other options are not as suitable as AUC for the given scenario for the following
reasons:
Precision: Precision is the proportion of predicted positive cases (fraudulent
transactions) that are actually positive. Precision is a useful metric when the cost
of a false positive is high, such as in spam detection or medical diagnosis.
However, precision alone is not a good metric for this scenario, because it ignores
false negatives and says nothing about how well valid transactions are identified.
For example, a classifier that flags only a single obviously fraudulent transaction
could achieve a precision of 1 while missing almost all of the fraud. Precision also
depends on the decision threshold and the error costs, which may vary for
different scenarios.
Recall: Recall is the same as the TPR, and it is the proportion of actual positive
cases (fraudulent transactions) that are correctly predicted as positive by the
classifier. Recall is a useful metric when the cost of a false negative is high, such
as in fraud detection or cancer diagnosis. However, recall alone is not a good metric
for this scenario, because it ignores false positives and says nothing about how well
valid transactions are identified. For example, a classifier that predicts all
transactions as fraudulent would have a recall of 1, but an accuracy of only 1%,
because every valid transaction would be misclassified. Recall also depends on the
decision threshold and the error costs, which may vary for different scenarios.
Root Mean Square Error (RMSE): RMSE is a metric that measures the average
difference between the predicted and the actual values. RMSE is a useful metric
for regression problems, where the goal is to predict a continuous value, such as
the price of a house or the temperature of a city. However, RMSE is not a good
metric for classification problems, where the goal is to predict a discrete value,
such as the class label of a transaction. RMSE is not meaningful for classification
problems, because it does not capture the accuracy or the error costs of the
predictions.
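A hedged sketch of scoring such a model with scikit-learn is shown below; the labels and scores are synthetic stand-ins for the model's output on the transaction dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in: 99,000 valid (0) and 1,000 fraudulent (1) transactions,
# with model scores that are somewhat higher for the fraudulent class.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(99_000, dtype=int), np.ones(1_000, dtype=int)])
y_score = np.concatenate([rng.normal(0.2, 0.1, 99_000), rng.normal(0.6, 0.2, 1_000)])

# AUC summarizes how well the model ranks fraud above valid transactions
# across all possible decision thresholds.
auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"AUC: {auc:.3f}")
```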
A Data Scientist needs to create a serverless ingestion and analytics solution for high-velocity, real-time streaming data.
The ingestion process must buffer and convert incoming records from JSON to a query-optimized, columnar format without data loss. The output datastore must be highly
available, and Analysts must be able to run SQL queries against the data and connect to
existing business intelligence dashboards.
Which solution should the Data Scientist build to satisfy the requirements?
A. Create a schema in the AWS Glue Data Catalog of the incoming data format. Use an Amazon Kinesis Data Firehose delivery stream to stream the data and transform the data to Apache Parquet or ORC format using the AWS Glue Data Catalog before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
B. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and writes the data to a processed data location in Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena, and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
C. Write each JSON record to a staging location in Amazon S3. Use the S3 Put event to trigger an AWS Lambda function that transforms the data into Apache Parquet or ORC format and inserts it into an Amazon RDS PostgreSQL database. Have the Analysts query and run dashboards from the RDS database.
D. Use Amazon Kinesis Data Analytics to ingest the streaming data and perform real-time SQL queries to convert the records to Apache Parquet before delivering to Amazon S3. Have the Analysts query the data directly from Amazon S3 using Amazon Athena and connect to BI tools using the Athena Java Database Connectivity (JDBC) connector.
Explanation: To create a serverless ingestion and analytics solution for high-velocity, real-time streaming data, the Data Scientist should define a schema for the incoming JSON records in the AWS Glue Data Catalog, use an Amazon Kinesis Data Firehose delivery stream to buffer the data and convert it to Apache Parquet or ORC using that schema before delivering it to Amazon S3, and have the Analysts query the data in place with Amazon Athena and connect their existing BI dashboards through the Athena JDBC connector (option A). Every component in this pipeline is serverless and highly available, and Firehose record format conversion produces a query-optimized, columnar format without data loss.
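The boto3 sketch below illustrates the Firehose record format conversion at the center of this design; the stream name, ARNs, Glue database and table, and buffering values are placeholder assumptions:

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="example-json-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/ExampleFirehoseRole",
        "BucketARN": "arn:aws:s3:::example-analytics-bucket",
        "Prefix": "events/",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Schema registered in the AWS Glue Data Catalog (placeholder names).
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/ExampleFirehoseRole",
                "DatabaseName": "streaming_db",
                "TableName": "events",
                "Region": "us-east-1",
            },
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        },
    },
)
# Analysts can then point Amazon Athena at the Parquet data in S3 and connect
# BI tools through the Athena JDBC connector.
```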