A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications. The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields. Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
A. The Tungsten encoding used by Databricks is optimized for storing string data; newly added native support for querying JSON strings means that string types are always most efficient.
B. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
C. Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
D. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
E. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
Explanation:
Option D is correct: it accurately presents information about
Delta Lake and Databricks that may impact the decision-making process of a junior data
engineer who is trying to determine the best approach for dealing with schema declaration
given the highly-nested structure of the data and the numerous fields. Delta Lake and
Databricks support schema inference and evolution, which means that they can
automatically infer the schema of a table from the source data and allow adding new
columns or changing column types without affecting existing queries or pipelines. However,
schema inference and evolution may not always be desirable or reliable, especially when
dealing with complex or nested data structures or when enforcing data quality and
consistency across different systems. Therefore, setting types manually can provide
greater assurance of data quality enforcement and avoid potential errors or conflicts due to
incompatible or unexpected data types. Verified References: [Databricks Certified Data
Engineer Professional], under “Delta Lake” section; Databricks Documentation, under
“Schema inference and partition of streaming DataFrames/Datasets” section.
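For illustration, below is a minimal PySpark sketch (not from the original question) of declaring types manually instead of relying on inference; the field names and source path are hypothetical placeholders.

# Explicitly declare only the nested fields the downstream applications need.
# Values that cannot be cast to these types surface as nulls or errors instead of
# silently widening the inferred type (e.g., everything becoming STRING).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

recording_schema = StructType([
    StructField("device_id", StringType(), nullable=False),
    StructField("reading", StructType([
        StructField("value", DoubleType()),
        StructField("recorded_at", TimestampType()),
    ])),
])

df = (spark.read
      .schema(recording_schema)               # manual declaration: predictable types
      .json("/mnt/raw/device_recordings/"))   # hypothetical source path

# By contrast, inference (spark.read.json(path) without .schema) picks types loose
# enough to accommodate all observed values, which may not match what downstream
# dashboards and models expect.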
The business intelligence team has a dashboard configured to track various summary
metrics for retail stores. This includes total sales for the previous day alongside totals and
averages for a variety of time periods. The fields required to populate this dashboard have
the following schema:
For demand forecasting, the Lakehouse contains a validated table of all itemized sales
updated incrementally in near real-time. This table, named products_per_order, includes the
following fields:
Because reporting on long-term sales trends is less volatile, analysts using the new
dashboard only require data to be refreshed once daily. Because the dashboard will be
queried interactively by many users throughout a normal business day, it should return
results quickly and reduce total compute associated with each materialization.
Which solution meets the expectations of the end users while controlling and limiting
possible costs?
A. Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.
B. Populate the dashboard by configuring a nightly batch job to save the required values to a table that can quickly update the dashboard with each query.
C. Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.
D. Define a view against the products_per_order table and define the dashboard against this view.
Explanation:
Given the requirement for daily refresh of data and the need to ensure quick
response times for interactive queries while controlling costs, a nightly batch job to precompute and save the required summary metrics is the most suitable approach.
By pre-aggregating data during off-peak hours, the dashboard can serve queries
quickly without requiring on-the-fly computation, which can be resource-intensive
and slow, especially with many users.
This approach also limits the cost by avoiding continuous computation throughout
the day and instead leverages a batch process that efficiently computes and stores
the necessary data.
The other options (A, C, D) either do not address the cost and performance
requirements effectively or are not suitable for the use case of less frequent data
refresh and high interactivity.
References:
Databricks Documentation on Batch Processing: Databricks Batch Processing
Data Lakehouse Patterns: Data Lakehouse Best Practices
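For illustration, a minimal PySpark sketch of such a nightly batch job follows; the column names (store_id, order_date, sales_amount) and the target table name are assumptions, since the question's schema is not shown here.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Pre-aggregate the validated itemized sales into daily per-store summary metrics.
daily_summary = (
    spark.table("products_per_order")
    .groupBy("store_id", F.to_date("order_date").alias("sale_date"))
    .agg(
        F.sum("sales_amount").alias("total_sales"),
        F.avg("sales_amount").alias("avg_sale"),
    )
)

# Overwrite the small summary table once per night; interactive dashboard queries
# hit this table instead of recomputing against the full itemized sales table.
(daily_summary.write
    .mode("overwrite")
    .saveAsTable("store_sales_summary"))   # hypothetical target table name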
A data engineer needs to capture the pipeline settings from an existing pipeline in the workspace and use them to create and version a JSON file that can be used to create a new pipeline. Which command should the data engineer enter in a web terminal configured with the Databricks CLI?
A. Use the get command to capture the settings for the existing pipeline; remove the pipeline_id and rename the pipeline; use this in a create command
B. Stop the existing pipeline; use the returned settings in a reset command
C. Use the clone command to create a copy of an existing pipeline; use the get JSON command to get the pipeline definition; save this to git
D. Use list pipelines to get the specs for all pipelines; parse the required pipeline spec from the returned results and use it to create a pipeline
Explanation:
The Databricks CLI provides a way to automate interactions with Databricks
services. When dealing with pipelines, you can use the databricks pipelines get --
pipeline-id command to capture the settings of an existing pipeline in JSON format. This
JSON can then be modified by removing the pipeline_id to prevent conflicts and renaming
the pipeline to create a new pipeline. The modified JSON file can then be used with the
databricks pipelines create command to create a new pipeline with those settings.
References:
Databricks Documentation on CLI for Pipelines: Databricks CLI - Pipelines
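A rough Python sketch of the workflow in option A is shown below, driving the CLI command named in the explanation. The pipeline ID placeholder, the key names in the returned JSON, and the output file name are assumptions; inspect the actual CLI output before relying on them.

import json
import subprocess

# 1. Capture the existing pipeline's settings as JSON.
result = subprocess.run(
    ["databricks", "pipelines", "get", "--pipeline-id", "<existing-pipeline-id>"],
    capture_output=True, text=True, check=True,
)
settings = json.loads(result.stdout)

# 2. Remove the pipeline_id and rename the pipeline so the spec can be reused.
settings.pop("pipeline_id", None)   # key name assumed
settings["name"] = "new_pipeline"   # hypothetical new name

# 3. Save (and version-control) the modified JSON, then supply it to the
#    databricks pipelines create command (see the CLI docs for the exact flags).
with open("new_pipeline.json", "w") as f:
    json.dump(settings, f, indent=2)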
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?
A. Whenever a database is being created, make sure that the location keyword is used
B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
C. Whenever a table is being created, make sure that the location keyword is used.
D. When tables are created, make sure that the external keyword is used in the create table statement.
E. When the workspace is being configured, make sure that external cloud object storage has been mounted.
Explanation:
Option C is correct because it ensures that this requirement is met.
The requirement is that all tables in the Lakehouse should be configured as external Delta
Lake tables. An external table is a table that is stored outside of the default warehouse
directory and whose metadata is not managed by Databricks. An external table can be
created by using the location keyword to specify the path to an existing directory in a cloud
storage system, such as DBFS or S3. By creating external tables, the data engineering
team can avoid losing data if they drop or overwrite the table, as well as leverage existing
data without moving or copying it. Verified References: [Databricks Certified Data Engineer
Professional], under “Delta Lake” section; Databricks Documentation, under “Create an
external table” section.
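A minimal sketch of creating an external Delta table by supplying the LOCATION keyword follows; the table name, columns, and storage path are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bronze (
        order_id STRING,
        amount   DOUBLE
    )
    USING DELTA
    LOCATION 'dbfs:/mnt/external_storage/sales_bronze'
""")

# Because LOCATION points outside the default warehouse directory, dropping the
# table removes only the metastore entry; the underlying Delta files remain in place.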
The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and wants to develop and test against data that is as similar to production data as possible. A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team. Which statement captures best practices for this situation?
A. Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
B. All developer, testing and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.
C. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
D. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
Explanation:
The best practice in such scenarios is to ensure that production data is
handled securely and with proper access controls. By granting only read access to
production data in development and testing environments, it mitigates the risk of
unintended data modification. Additionally, maintaining isolated databases for different
environments helps to avoid accidental impacts on production data and systems.
References:
Databricks best practices for securing data:
https://docs.databricks.com/security/index.html
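A small sketch of the read-only pattern is given below, assuming Unity Catalog (or legacy table ACLs) is enabled; the schema and group names are hypothetical, and USE CATALOG / USE SCHEMA privileges may also be required depending on the setup.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Developers can SELECT from production tables but cannot modify them.
spark.sql("GRANT SELECT ON SCHEMA prod TO `data_engineers`")

# Interactive development work happens in an isolated database instead of prod.
spark.sql("CREATE SCHEMA IF NOT EXISTS dev_sandbox")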
Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?
A. Regex
B. Julia
C. pyspark.ml.feature
D. Scala Datasets
E. C++
Explanation:
Regex, or regular expressions, are a powerful way of matching patterns in
text. They can be used to identify key areas of text when parsing Spark Driver log4j output,
such as the log level, the timestamp, the thread name, the class name, the method name,
and the message. Regex can be applied in various languages and frameworks, such as
Scala, Python, Java, Spark SQL, and Databricks notebooks.
References:
https://docs.databricks.com/notebooks/notebooks-use.html#use-regularexpressions
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regularexpressions-in-udfs
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html
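As a small illustrative example (the log line below is made up), a regular expression can pull the key fields out of a log4j-style driver log entry:

import re

log_line = "21/07/01 04:12:45 WARN TaskSetManager: Lost task 0.0 in stage 3.0"

pattern = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>[^:]+): "
    r"(?P<message>.*)$"
)

match = pattern.match(log_line)
if match:
    print(match.group("timestamp"))  # 21/07/01 04:12:45
    print(match.group("level"))      # WARN
    print(match.group("source"))     # TaskSetManager
    print(match.group("message"))    # Lost task 0.0 in stage 3.0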
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
A. Set the configuration delta.deduplicate = true.
B. VACUUM the Delta table after each batch completes.
C. Perform an insert-only merge with a matching condition on a unique key
D. Perform a full outer join on a unique key and overwrite existing data.
E. Rely on Delta Lake schema enforcement to prevent duplicate records.
Explanation:
To deduplicate data against previously processed records as it is inserted
into a Delta table, you can use the merge operation with an insert-only clause. This allows
you to insert new records that do not match any existing records based on a unique key,
while ignoring duplicate records that match existing records. For example, you can use the
following syntax:
MERGE INTO target_table USING source_table ON target_table.unique_key =
source_table.unique_key WHEN NOT MATCHED THEN INSERT *
This will insert only the records from the source table that have a unique key that is not
present in the target table, and skip the records that have a matching key. This way, you
can avoid inserting duplicate records into the Delta table.
References:
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-usingmerge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge
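The same insert-only merge can be expressed with the Delta Lake Python API; the sketch below reuses the generic table and column names from the SQL example above.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "target_table")
source_df = spark.table("source_table")  # the already de-duplicated incoming batch

(target.alias("t")
    .merge(source_df.alias("s"), "t.unique_key = s.unique_key")
    .whenNotMatchedInsertAll()   # insert only rows whose key is not already present
    .execute())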
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
A. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
B. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
C. All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.
D. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
E. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
Explanation:
Option A is correct. Databricks Jobs do not wrap their tasks in a single transaction, and a
failed task does not trigger any rollback. Because tasks A and B completed successfully, all
logic expressed in their notebooks has been committed. Task C failed partway through its
notebook, so any operations that finished before the failure (for example, individual table
writes that completed) may also have been committed, while the statements after the point
of failure were simply never executed. Each write to a Delta table is an independent ACID
transaction, so commits already made by task C, or by its upstream tasks, are not undone
when the task fails.
Which statement describes the default execution mode for Databricks Auto Loader?
A. New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.
B. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.
C. Webhook trigger Databricks job to run anytime new data arrives in a source directory; new data automatically merged into target tables using rules inferred from the data.
D. New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
Explanation:
Databricks Auto Loader simplifies and automates the process of loading data
into Delta Lake. The default execution mode of the Auto Loader identifies new files by
listing the input directory. It incrementally and idempotently loads these new files into the
target Delta Lake table. This approach ensures that files are not missed and are processed
exactly once, avoiding data duplication. The other options describe different mechanisms
or integrations that are not part of the default behavior of the Auto Loader.
References:
Databricks Auto Loader Documentation: Auto Loader Guide
Delta Lake and Auto Loader: Delta Lake Integration
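A minimal Auto Loader sketch in its default (directory listing) mode follows; the source path, schema/checkpoint locations, target table name, and the availableNow trigger choice are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")       # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/devices/_schema")
    .load("/mnt/raw/devices/"))

(stream.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/devices")
    .trigger(availableNow=True)                # process newly listed files, then stop
    .toTable("bronze_devices"))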
Which statement characterizes the general programming model used by Spark Structured Streaming?
A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
C. Structured Streaming uses specialized hardware and I/O streams to achieve subsecond latency for data transfer.
D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
Explanation:
Option D is correct because it characterizes the general programming
model used by Spark Structured Streaming, which is to treat a live data stream as a table
that is being continuously appended. This leads to a new stream processing model that is
very similar to a batch processing model, where users can express their streaming
computation using the same Dataset/DataFrame API as they would use for static data. The
Spark SQL engine will take care of running the streaming query incrementally and
continuously and updating the final result as streaming data continues to arrive.
Verified
References: [Databricks Certified Data Engineer Professional], under “Structured
Streaming” section; Databricks Documentation, under “Overview” section.
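A small sketch of the unbounded-table model is shown below: each micro-batch appends new rows to a conceptually infinite input table, and the same DataFrame API used for batch queries defines the incremental computation. The built-in rate source is used only so the example is self-contained.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# New rows keep arriving in this stream, as if appended to an unbounded table.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The aggregation is written exactly as it would be for a static DataFrame;
# Spark SQL runs it incrementally and keeps the result up to date.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())
# query.awaitTermination()  # uncomment to keep the stream running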
A Delta Lake table was created with the below query: Realizing that the original query had a typographical error, the below code was executed: ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store. Which result will occur after running the second command?
A. The table reference in the metastore is updated and no data is changed.
B. The table name change is recorded in the Delta transaction log.
C. All related files and metadata are dropped and recreated in a single ACID transaction.
D. The table reference in the metastore is updated and all data files are moved.
E. A new Delta transaction log is created for the renamed table.
Explanation:
The query uses the CREATE TABLE USING DELTA syntax to create a Delta
Lake table from an existing Parquet file stored in DBFS. The query also uses the
LOCATION keyword to specify the path to the Parquet file as
/mnt/finance_eda_bucket/tx_sales.parquet. By using the LOCATION keyword, the query
creates an external table, which is a table that is stored outside of the default warehouse
directory and whose metadata is not managed by Databricks. An external table can be
created from an existing directory in a cloud storage system, such as DBFS or S3, that
contains data files in a supported format, such as Parquet or CSV.
The result that will occur after running the second command is that the table reference in
the metastore is updated and no data is changed. The metastore is a service that stores
metadata about tables, such as their schema, location, properties, and partitions. The
metastore allows users to access tables using SQL commands or Spark APIs without
knowing their physical location or format. When renaming an external table using the
ALTER TABLE RENAME TO command, only the table reference in the metastore is
updated with the new name; no data files or directories are moved or changed in the
storage system. The table will still point to the same location and use the same format as
before. However, if renaming a managed table, which is a table whose metadata and data
are both managed by Databricks, both the table reference in the metastore and the data
files in the default warehouse directory are moved and renamed accordingly.
Verified
References: [Databricks Certified Data Engineer Professional], under “Delta Lake” section;
Databricks Documentation, under “ALTER TABLE RENAME TO” section; Databricks
Documentation, under “Metastore” section; Databricks Documentation, under “Managed
and external tables” section.
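A short sketch of the rename is given below, along with one way to confirm that the underlying data location does not change (DESCRIBE DETAIL reports the storage path of a Delta table); it uses the table names from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# The location column still points at the original external directory;
# only the metastore reference was updated.
spark.sql("DESCRIBE DETAIL prod.sales_by_store").select("location").show(truncate=False)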
Which statement describes the correct use of pyspark.sql.functions.broadcast?
A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
Explanation:
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.broadca
st.html
The broadcast function in PySpark is used in the context of joins. When you mark a
DataFrame with broadcast, Spark tries to send this DataFrame to all worker nodes so that
it can be joined with another DataFrame without shuffling the larger DataFrame across the
nodes. This is particularly beneficial when the DataFrame is small enough to fit into the
memory of each node. It helps to optimize the join process by reducing the amount of data
that needs to be shuffled across the cluster, which can be a very expensive operation in
terms of computation and time.
The pyspark.sql.functions.broadcast function in PySpark is used to hint to Spark that a
DataFrame is small enough to be broadcast to all worker nodes in the cluster. When this
hint is applied, Spark can perform a broadcast join, where the smaller DataFrame is sent to
each executor only once and joined with the larger DataFrame on each executor. This can
significantly reduce the amount of data shuffled across the network and can improve the
performance of the join operation.
In a broadcast join, the entire smaller DataFrame is sent to each executor, not just a
specific column or a cached version on attached storage. This function is particularly useful
when one of the DataFrames in a join operation is much smaller than the other, and can fit
comfortably in the memory of each executor node.
References:
Databricks Documentation on Broadcast Joins: Databricks Broadcast Join Guide
PySpark API Reference: pyspark.sql.functions.broadcast
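For illustration, a minimal sketch of hinting a broadcast join follows; the DataFrames are tiny in-memory examples used only to show the API.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
small_df = spark.createDataFrame(
    [(0, "gold"), (1, "silver")], ["customer_id", "tier"]
)

# Mark the small DataFrame for broadcast: each executor receives a full copy,
# so the large DataFrame does not need to be shuffled for the join.
joined = large_df.join(broadcast(small_df), on="customer_id", how="left")
joined.explain()  # the physical plan should show a BroadcastHashJoin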