Version: v3.1

Iceberg Datasets

In Amorphic, users can create Iceberg datasets with S3Athena or Lake Formation as the target location. Amorphic then creates an Iceberg table in the backend to store the data.

Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.

info

Only the Parquet file type is supported for Iceberg datasets.

What is Apache Iceberg?

Apache Iceberg is an open table format for big data analysis. It manages large collections of files as tables and provides modern data lake operations such as record-level inserts, updates, deletes, and time travel queries. Iceberg also makes it possible to update a table's schema and partitions, and it is optimized for use on Amazon S3. Additionally, it helps keep data correct when multiple users write at the same time. To learn more, see the Apache Iceberg documentation.

What does Amorphic support?

Amorphic Iceberg datasets support the following features:

ACID transactions
- ACID (atomic, consistent, isolated, and durable) transactions protect the integrity of Data Catalog operations such as creating or updating a table. They let multiple users concurrently add and delete objects in the Amazon S3 data lake while queries and ML models still return consistent, up-to-date results. Iceberg tables take part in reads and writes and use transactions to protect the manifest metadata. AWS services such as Amazon Athena support Iceberg tables. To use transactions in AWS Glue ETL jobs, begin a transaction before performing reads or writes, and commit it upon completion. For more information, see Reading from and Writing to the Data Lake Within Transactions.
Set/unset table properties (for example, write compression and data optimization configuration)
Schema evolution (add, drop, rename, and update (change data type) columns)
Hidden partitioning
Time travel queries to a specified date and time
Version travel queries to a specified snapshot ID (table version)
Queries combining time and version travel
Managing Iceberg table data directly in Athena using INSERT, UPDATE, and DELETE queries
Optimizing Iceberg table data with the REWRITE DATA compaction action
Row-level deletes
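The time travel and version travel features listed above map to two Athena query clauses. The following is a minimal sketch that builds such queries as strings, assuming the Athena engine version 3 syntax (FOR TIMESTAMP AS OF and FOR VERSION AS OF). The helper names and the table name used with them are hypothetical and are not part of Amorphic.

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a SELECT that reads the table as it was at a given instant.

    `table` is a hypothetical "database.table" name supplied by the caller.
    """
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Normalize to UTC so the literal is unambiguous.
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"


def version_travel_query(table: str, snapshot_id: int) -> str:
    """Build a SELECT that reads a specific snapshot (table version)."""
    return f"SELECT * FROM {table} FOR VERSION AS OF {int(snapshot_id)}"


if __name__ == "__main__":
    when = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
    print(time_travel_query("sales_db.orders", when))
    print(version_travel_query("sales_db.orders", 949530903748831860))
```

The generated statements can be pasted into the query editor. A query that combines both kinds of travel would reference the same table twice, once with each clause, and join the two references.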

For more information about the Iceberg features and limitations in Athena, see the Athena Iceberg documentation.

How to Create Iceberg Datasets?

Users can create Iceberg datasets in the same way as Athena datasets by selecting S3Athena or Lake Formation as the target location, Parquet as the file type, and Yes in the Iceberg Table dropdown. A dataset can be created in any of three ways:
- Using already defined Iceberg Datasets Templates
- Importing required JSON payload
- Using the form and entering the required details
Add Iceberg table properties as key-value pairs in the Iceberg Table Properties section. Refer to the Iceberg documentation for the supported table properties.
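For readers who want to see what key-value table properties look like at the Athena level, the sketch below builds ALTER TABLE ... SET TBLPROPERTIES and UNSET TBLPROPERTIES statements. It is an illustration only: how Amorphic applies the properties entered in the form is not described here, the helper names and the table name are hypothetical, and the property keys in the example (write_compression, vacuum_min_snapshots_to_keep) are taken from the Athena Iceberg table property list.

```python
from typing import Iterable, Mapping


def _quote(value: object) -> str:
    """Render a value as a single-quoted SQL string literal."""
    return "'" + str(value).replace("'", "''") + "'"


def set_tblproperties(table: str, properties: Mapping[str, object]) -> str:
    """Build a statement that sets one or more table properties."""
    if not properties:
        raise ValueError("at least one property is required")
    pairs = ", ".join(f"{_quote(k)}={_quote(v)}" for k, v in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


def unset_tblproperties(table: str, keys: Iterable[str]) -> str:
    """Build a statement that removes one or more table properties."""
    names = ", ".join(_quote(k) for k in keys)
    if not names:
        raise ValueError("at least one property key is required")
    return f"ALTER TABLE {table} UNSET TBLPROPERTIES ({names})"


if __name__ == "__main__":
    # Hypothetical table name; property keys come from the Athena property list.
    print(set_tblproperties("sales_db.orders", {
        "write_compression": "snappy",
        "vacuum_min_snapshots_to_keep": "3",
    }))
    print(unset_tblproperties("sales_db.orders", ["write_compression"]))
```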

Create Iceberg Dataset

Iceberg Table Options

After the dataset metadata is registered successfully, users can specify partition information through Custom Partition Options with the following attributes:

Column Name: The partition column. It must be one of the columns in the schema.
Transformation: Iceberg (hidden partitioning) converts column data using partition transform functions. Available functions: year, month, day, hour, bucket, truncate, None (no transformation).
Transformation Input: Required when Transformation is bucket or truncate. The value must be a positive number.

For more information, see the Iceberg partitioning documentation.
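The rules for the three attributes above can be expressed as a small validation routine. The sketch below checks one partition option and renders it in the form Athena uses inside PARTITIONED BY, for example bucket(16, id). The function name and its parameters are hypothetical, and treating the input as an integer is an assumption; Amorphic's own validation may differ.

```python
from typing import Optional, Sequence

TEMPORAL_TRANSFORMS = {"year", "month", "day", "hour"}
PARAMETERIZED_TRANSFORMS = {"bucket", "truncate"}


def partition_expression(
    column: str,
    transformation: Optional[str],
    transformation_input: Optional[int],
    schema_columns: Sequence[str],
) -> str:
    """Validate one partition option and return its partition expression."""
    if column not in schema_columns:
        raise ValueError(f"{column!r} is not a column in the schema")

    name = (transformation or "none").lower()
    if name == "none":
        # No transformation: partition on the raw column value.
        return column
    if name in TEMPORAL_TRANSFORMS:
        return f"{name}({column})"
    if name in PARAMETERIZED_TRANSFORMS:
        valid = (
            isinstance(transformation_input, int)
            and not isinstance(transformation_input, bool)
            and transformation_input > 0
        )
        if not valid:
            raise ValueError(f"{name} requires a positive number as input")
        return f"{name}({transformation_input}, {column})"
    raise ValueError(f"unsupported transformation: {transformation!r}")


if __name__ == "__main__":
    columns = ["id", "name", "created_at"]  # hypothetical schema
    print(partition_expression("created_at", "day", None, columns))
    print(partition_expression("id", "bucket", 16, columns))
    print(partition_expression("name", "truncate", 4, columns))
```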

Iceberg Dataset Partitions

Loading Data into Iceberg Datasets

Data upload follows Amorphic's Data Reloads process.
Files enter a Pending State and users have to select and process pending files.
Pending files can be deleted before processing.
Restrictions:
- Cannot tag or delete completed files.
- No Truncate Dataset, Download File, Apply ML or View AI/ML Results options.

Query Iceberg Datasets

Once data is loaded into an Iceberg dataset, users can query and analyze it directly from the Amorphic Playground by selecting the AmazonAthenaEngineV3 workgroup.

The following additional commands can be run against Iceberg datasets:

Manage Iceberg table data directly in Athena
- INSERT INTO, UPDATE, DELETE FROM and MERGE INTO
- For more information, see the AWS documentation
View metadata
- DESCRIBE, SHOW TBLPROPERTIES
- SHOW COLUMNS
- For more information, see the AWS documentation
Optimize data with the REWRITE DATA compaction action
- OPTIMIZE
- For more information, see the AWS documentation
Perform snapshot expiration and orphan file removal
- VACUUM
- For more information, see the AWS documentation
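As an illustration of the data management commands listed above, the sketch below builds a MERGE INTO statement that updates matching rows and inserts the rest (an upsert). It assumes the standard Athena MERGE INTO syntax; the helper name, the table names, and the column names are hypothetical.

```python
from typing import Sequence


def merge_upsert(target: str, source: str, key: str, columns: Sequence[str]) -> str:
    """Build a MERGE INTO statement that upserts `source` rows into `target`.

    Rows are matched on `key`; every other column in `columns` is updated
    on a match, and unmatched source rows are inserted.
    """
    if key not in columns:
        raise ValueError("key must be one of the columns")
    updates = ", ".join(f"{c} = s.{c}" for c in columns if c != key)
    if not updates:
        raise ValueError("at least one non-key column is required")
    column_list = ", ".join(columns)
    values = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} AS t USING {source} AS s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {updates} "
        f"WHEN NOT MATCHED THEN INSERT ({column_list}) VALUES ({values})"
    )


if __name__ == "__main__":
    # Hypothetical dataset and staging table names.
    print(merge_upsert("sales_db.orders", "sales_db.orders_staging",
                       "order_id", ["order_id", "status"]))
```

Because the statement rewrites table data, it is worth running it against a test dataset first.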

IMPORTANT

AVOID the above commands if you are not familiar with them, because they can change or delete the data and its metadata, depending on the command.
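With that caution in mind, the metadata and maintenance commands have a simple shape. The sketch below builds them as strings so the difference between read-only statements (DESCRIBE, SHOW TBLPROPERTIES, SHOW COLUMNS) and statements that rewrite or remove files (OPTIMIZE, VACUUM) is explicit. It assumes the Athena syntax for these commands; the helper names and the table name are hypothetical.

```python
from typing import List, Optional


def metadata_statements(table: str) -> List[str]:
    """Read-only statements that inspect the table definition."""
    return [
        f"DESCRIBE {table}",
        f"SHOW TBLPROPERTIES {table}",
        f"SHOW COLUMNS IN {table}",
    ]


def optimize_statement(table: str, where: Optional[str] = None) -> str:
    """Compaction: rewrites data files, optionally limited by a predicate."""
    sql = f"OPTIMIZE {table} REWRITE DATA USING BIN_PACK"
    if where:
        sql += f" WHERE {where}"
    return sql


def vacuum_statement(table: str) -> str:
    """Snapshot expiration and orphan file removal; old versions are lost."""
    return f"VACUUM {table}"


if __name__ == "__main__":
    table = "sales_db.orders"  # hypothetical dataset table
    for statement in metadata_statements(table):
        print(statement)
    print(optimize_statement(table, "order_date >= DATE '2024-01-01'"))
    print(vacuum_statement(table))
```

After VACUUM expires a snapshot, time travel and version travel queries can no longer reach it.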

The image below shows the result of the DESCRIBE command on an Iceberg dataset:

Query Dataset

Limitations (Both AWS and Amorphic)

Supported data types
Applicable ONLY to the S3Athena and Lake Formation target locations and the Parquet file type.
Amorphic dataset features that are restricted or not applicable for Iceberg datasets:
- Data Validation
- Skip LZ (Validation) Process
- Malware Detection
- Data Profiling
- Data Cleanup
- Data Metrics collection
- Life Cycle Policy
No partition evolution (changing partitions after table creation).
Only a predefined list of key-value pairs is allowed in the table properties when creating or altering Iceberg tables. See the AWS documentation.
Schema evolution:
- Only the following column updates (data type promotions) are allowed:
  - Change an integer column to a big integer column
  - Change a float column to a double
  - Increase the precision of a decimal type column
- Column reordering is not supported.
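The allowed data type promotions can be checked before submitting a schema change. The sketch below encodes the three rules listed above. The function name and the type spellings (int, integer, bigint, float, double, decimal(p,s)) are assumptions, and requiring the decimal scale to stay unchanged follows the Apache Iceberg specification rather than anything stated on this page.

```python
import re

_DECIMAL = re.compile(r"^decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)$")

_SIMPLE_PROMOTIONS = {
    ("int", "bigint"),
    ("integer", "bigint"),
    ("float", "double"),
}


def is_allowed_promotion(old_type: str, new_type: str) -> bool:
    """Return True if changing a column from old_type to new_type is allowed."""
    old, new = old_type.strip().lower(), new_type.strip().lower()
    if (old, new) in _SIMPLE_PROMOTIONS:
        return True

    old_dec, new_dec = _DECIMAL.match(old), _DECIMAL.match(new)
    if old_dec and new_dec:
        old_precision, old_scale = (int(g) for g in old_dec.groups())
        new_precision, new_scale = (int(g) for g in new_dec.groups())
        # Assumption from the Iceberg spec: precision may grow, scale may not change.
        return new_precision > old_precision and new_scale == old_scale
    return False


if __name__ == "__main__":
    for old, new in [("int", "bigint"), ("decimal(10,2)", "decimal(12,2)"),
                     ("bigint", "int"), ("string", "int")]:
        print(old, "->", new, is_allowed_promotion(old, new))
```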
