Version: v3.0

Delta Lake Datasets

In Amorphic, users can create Delta Lake datasets with the Lakeformation target location, which creates a Delta Lake table in the backend to store the data.

What is Delta Lake?

Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. To learn more, check out the Delta Lake Documentation.

What does Amorphic support?

Amorphic Delta Lake datasets support the following features:

  • Delta Lake provides ACID transaction guarantees between reads and writes. This means that:
    • For supported storage systems, multiple writers across multiple clusters can simultaneously modify a table partition while seeing a consistent snapshot view of the table, and there is a serial order for these writes.
    • Readers continue to see a consistent snapshot view of the table that the Apache Spark job started with, even when a table is modified during a job.
  • Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
  • Time travel: Data versioning enables rollbacks and full historical audit trails.
  • Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
  • Supported file types: CSV, JSON, Parquet
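The upsert (merge) behavior described above can be illustrated with a minimal pure-Python sketch; no Spark or Delta Lake is required, and `merge_upsert` is an illustrative helper name, not part of any Amorphic or Delta Lake API:

```python
# Sketch of MERGE (upsert) semantics: rows whose key already exists in the
# target table are updated, rows whose key does not exist are inserted.
# `merge_upsert` is an illustrative helper, not a real Delta Lake API.

def merge_upsert(target, updates, key="id"):
    """Apply MERGE-style semantics to a dict-of-rows table keyed by `key`."""
    for row in updates:
        # Merge the incoming row over any existing row with the same key.
        target[row[key]] = {**target.get(row[key], {}), **row}
    return target

# Target table keyed by primary key.
table = {
    1: {"id": 1, "name": "alice", "city": "NYC"},
    2: {"id": 2, "name": "bob", "city": "SFO"},
}

# Incoming change-data-capture batch: one update, one insert.
cdc_batch = [
    {"id": 2, "city": "LAX"},                   # matched -> update
    {"id": 3, "name": "carol", "city": "CHI"},  # not matched -> insert
]

merge_upsert(table, cdc_batch)
print(table[2]["city"])  # LAX
print(sorted(table))     # [1, 2, 3]
```

In a real Delta Lake table, the same matched/not-matched logic is expressed with a MERGE operation, which Delta executes transactionally.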

Creating Delta Lake Datasets

In Amorphic, users can create Delta Lake datasets in the same way as Lakeformation datasets: select Lakeformation as the target location, choose a file type (CSV, JSON, or Parquet), and select Yes in the 'Delta Lake Table' dropdown. The update method can be either 'append' (appends the new data to the existing dataset) or 'overwrite' (replaces the existing data with the new data).

Create DeltaLake dataset

Upon successful registration of the dataset metadata, you can specify Delta Lake-related information with the following attributes:

  • Partition Column Name: The column to partition the table by; it must be one of the columns in the dataset schema.
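For intuition, Delta Lake tables use Hive-style partition layouts, grouping data files under one directory per partition value, so queries that filter on the partition column only scan the matching directories. A minimal sketch of that naming rule, assuming a hypothetical `partition_path` helper and a made-up S3 prefix:

```python
# Sketch of the Hive-style partition directory naming used by Delta Lake
# tables. `partition_path` and the bucket/prefix are illustrative assumptions.

def partition_path(base, column, value):
    """Build the directory path for one partition of a partitioned table."""
    return f"{base}/{column}={value}"

base = "s3://example-bucket/delta/sales"  # hypothetical target location
print(partition_path(base, "region", "us-east"))
# s3://example-bucket/delta/sales/region=us-east
```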

Loading Data into Delta Lake Datasets

  • Data upload follows Amorphic's Data Reloads process.
  • Files enter a Pending state, and users must select and process pending files.
  • Pending files can be deleted before processing.
  • Restrictions:
    • Cannot tag or delete completed files.
    • No Truncate Dataset, Download File, Apply ML or View AI/ML Results options.

Query Delta Lake Datasets

Once data is loaded into a Delta Lake dataset, users can query and analyze it directly from the Amorphic Playground feature, subject to the Athena limitations listed in the Limitations section.

Access Delta Lake Datasets from ETL Jobs and Data Labs

Users can add Delta Lake datasets to their ETL Jobs and Data Labs in the same way as any other dataset, either by editing an existing ETL job/Data Lab or while creating a new one in the Amorphic UI. When a user adds a Delta Lake dataset with read/write access to an ETL job/Data Lab, the corresponding read/write access is granted on the Delta Lake table.

In order to access Delta Lake datasets from Data Labs, users need to run the following magic commands in a Glue-session-enabled Notebook Data Lab:

%%configure
{
    "--enable-glue-datacatalog": "true",
    "--conf": "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore",
    "--datalake-formats": "delta"
}
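Once the session is configured, the Delta Lake table can be read with standard Spark SQL through the Glue Data Catalog. A minimal sketch, assuming it runs in the Glue notebook session (where `spark` is provided); the database and table names are placeholders to replace with the names shown for your dataset in Amorphic:

```python
# Runs inside the Glue-session-enabled Notebook Data Lab configured above;
# the Glue session provides the `spark` SparkSession automatically.
# `my_database.my_delta_dataset` is a placeholder for your dataset's table.
df = spark.sql("SELECT * FROM my_database.my_delta_dataset LIMIT 10")
df.show()
```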

Limitations (Both AWS and Amorphic)

  • Only a limited set of data types is supported
  • Applicable to ONLY 'Lakeformation' Target Location
  • Restricted/Non-Applicable Amorphic features for Delta Lake datasets
    • DataValidation
    • Skip LZ (Validation) Process
    • Malware Detection
    • Data Profiling
    • Data Cleanup
    • Data Metrics collection
    • Life Cycle Policy
  • Currently, Amorphic does not support schema or partition evolution for Delta Lake datasets.
  • No time travel support through Playground
  • Write DML statements like UPDATE, INSERT, or DELETE are not supported through Playground

For more limitations, check the Athena Delta Lake documentation.