Datasets | Amorphic Docs

📄️ Intro

The Amorphic Dataset portal enables the creation of unstructured, semi-structured, and structured datasets while providing comprehensive data lake visibility.

📄️ Edit Schema

The Edit Schema functionality allows users to modify the schema of existing datasets, providing flexibility in managing dataset structures.

📄️ Data Profiling

Data Profiling is the process of analyzing an existing data source to gather statistics and generate summaries about the data. It helps identify anomalies, assess data quality, and gain insights into the dataset’s structure and characteristics.

📄️ Files

The Files section provides a centralized interface for managing files that contain actual data for datasets. It allows users to view, track, and manage files associated with their datasets, along with their respective statuses.

📄️ Dataset Lifecycle Policy

The Dataset Lifecycle Policy is a feature that helps manage objects in Amazon S3 to optimize storage costs over time. It enables users to control object transition and expiration using predefined rules that dictate how Amazon S3 handles stored objects.

📄️ Athena Datasets

Athena Datasets allow users to store structured data in Glue tables and files in Amazon S3 and to run SQL queries on that data. Data validation can be enabled to check for corrupt data, and the playground can be used to run queries on the dataset.

📄️ Delta Lake Datasets

In Amorphic, users can create Delta Lake datasets with Lakeformation as the target location, which creates a Delta Lake table in the backend to store the data.

📄️ DynamoDB Datasets

Amorphic DynamoDB Datasets refer to structured data stored as key-value pairs, acting as a single, reliable source of truth across all organizational departments.

📄️ External Datasets

External datasets in Amorphic allow users to directly consume their existing data stored in S3 buckets, without the need to ingest it into a new Amorphic dataset.

📄️ View Type Datasets

View Type Datasets in Amorphic are a specialized type of dataset that allows users to create structured representations of data. These view-type datasets can be shared with other authorized users and tags within the organization, providing a flexible way to interact with data.

📄️ Hudi Datasets

Introduction

📄️ Iceberg Datasets

In Amorphic, users can create Iceberg datasets with the S3Athena and Lake Formation target locations, which creates an Iceberg table in the backend to store the data.

📄️ Lakeformation Datasets

Lakeformation extends S3-Athena datasets with added security and supports CSV, TSV, XLSX, JSON, NDJSON, JSONL and Parquet files. It also checks data integrity and offers ACID transactions, data compaction, and time-travel queries.

📄️ Redshift Datasets

Redshift datasets allow users to create datasets in Amorphic backed by Amazon Redshift. Amorphic enables users to store CSV, TSV, XLSX, JSON, NDJSON, JSONL and Parquet files in Amazon S3 with Redshift as the target location. This feature includes optional partial data validation, which is enabled by default. The validation process helps detect and correct corrupt or invalid data files while supporting a variety of data types.

📄️ Data Quality Checks

Amorphic provides data quality checks to help detect errors in data before it is utilized by other systems or machine learning algorithms. Users can create rules for columns in structured datasets and run checks to identify rule violations.

📄️ Data Retrieval API

The Amorphic Data Retrieval API provides direct, query-based access to the row-level data stored within an Amorphic dataset. It allows you to retrieve the actual data contained within the source files, making it a powerful tool for previews and integrations.

📄️ Spatial Datasets

Spatial datasets in Amorphic enable users to store, manage, and analyze geospatial data with built-in support for various spatial data formats and coordinate reference systems. These datasets provide specialized functionality for handling geographic information, including support for Well-Known Text (WKT), Well-Known Binary (WKB), GeoJSON, and coordinate-based data.

📄️ S3 Tables Datasets

In Amorphic, S3 Tables datasets use Lake Formation as the target location and Amazon S3 Tables as the backend to store Iceberg tables. S3 Tables is an alternative to the AWS Glue catalog for Iceberg: you select S3 Tables as the Iceberg catalog when creating the dataset. The platform registers S3 Tables with Lake Formation and exposes them in Athena via a federated Glue catalog named s3tablescatalog.