S3 Datasource
From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs will have encryption enabled automatically.
S3 Datasources are used to migrate data from a remote S3 bucket to Amorphic. There are two types of S3 Datasources available: Bucket Policy and Access Keys.
Bucket Policy
To create an S3 Datasource using a Bucket Policy, the user must first select the bucket from which the data needs to be migrated and specify the bucket region.
After the datasource is established, the user has the option to download a bucket policy and a KMS key policy. The generated bucket policy should be attached to the source bucket that was added during datasource creation. If the source bucket is associated with a KMS key, then the KMS key policy generated by Amorphic should also be added to the policy of that KMS key.
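For reference, a cross-account read policy of this kind typically looks like the sketch below. This is only an illustration of the general shape such a policy takes; the principal ARN and bucket name are placeholders, and the actual policy to attach is the one downloaded from Amorphic.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAmorphicRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/amorphic-ingestion-role" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-test-bucket",
        "arn:aws:s3:::example-test-bucket/*"
      ]
    }
  ]
}
```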
Access Keys
To create an S3 Datasource using Access Keys, the user must select the bucket from which the data has to be migrated and provide the Access Key and Secret Access Key of an IAM user who has permission to read data from the bucket.
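As a minimal sketch, the IAM user whose access keys are supplied needs at least read and list permissions on the source bucket, along the lines of the policy below (the bucket name is a placeholder; adjust the resource ARNs to the actual bucket).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-test-bucket",
        "arn:aws:s3:::example-test-bucket/*"
      ]
    }
  ]
}
```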
How to create an S3 Datasource?
Attribute | Description |
---|---|
Datasource Name | Give the datasource a unique name. |
Datasource Type | Type of datasource, in this case, it is S3. |
Description | User can describe the datasource purpose and important information about the datasource. |
Authorized Users | Amorphic users who can access this datasource. |
S3 Bucket | Name of the bucket from which the dataset files have to be imported. |
Datasource Access Type | There are two access types for this datasource: Access Keys and Bucket Policy. |
Version | Enables the user to select which version of the ingestion scripts to use (Amorphic specific). Whenever a new feature or Glue version is added to the underlying ingestion script, a new version is added to Amorphic. |
S3 Bucket Region | Region where the source S3 bucket is created. If the source bucket is in one of the regions eu-south-1, af-south-1, me-south-1, or ap-east-1, then this property must be provided and the region must be enabled in Amorphic, otherwise ingestion fails. |
For Redshift use cases involving a substantial volume of incoming files, it is advisable for the user to enable data load throttling and configure a maximum limit of 90 for Redshift.
Additionally, the timeout for the ingestion process can be set during datasource creation by adding a key IngestionTimeout to DatasourceDetails in the input payload. The value is expected in minutes and must be between 1 and 2880. If no value is provided, the default of 480 (8 hours) is used. Please note that this feature is available exclusively via the API.
{
    "DatasourceDetails": {
        "S3Bucket": "example-test-bucket",
        "S3ConnectionType": "bucket_policy",
        "S3BucketRegion": "us-east-1",
        "IngestionTimeout": 222
    }
}
This timeout can be overridden during schedule creation and schedule run by providing an argument MaxTimeOut.
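As an illustration, the override could be passed as a schedule argument along these lines; only the MaxTimeOut key is documented here, and the surrounding payload shape is an assumption, so refer to the schedule API documentation for the exact structure.

```json
{
  "ScheduleArguments": {
    "MaxTimeOut": 120
  }
}
```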
Test Datasource
This functionality allows users to quickly verify connectivity to the specified S3 bucket. By initiating this test, users can confirm that the S3 bucket details provided are accurate and functional, ensuring seamless access to the bucket.
Data migration to Amorphic
To migrate data to Amorphic, users must follow these steps:
- Create a new Dataset and select the Datasource Type as S3, then choose the S3 datasource from the drop-down list.
- Provide the FileType and DirectoryPath if required, ensuring that only the data from the specified path and file type is extracted.
- If necessary, create Partitions for the dataset (S3Athena and LakeFormation only); files are then ingested into the specified partition, which improves read and query performance.
After creating the dataset, the next step for the user is to set up a Schedule for data ingestion.
- Create a schedule with the Type specified as DataIngestion and select the previously created dataset from the list.
- If the user created any Partitions for the dataset, specify values for each partition.
This schedule is responsible for ingesting data from the source S3 bucket into the target dataset created earlier. Once the schedule has been set up and run successfully, the user can check the Amorphic dataset to verify that the files from the source S3 bucket have been ingested and are now present in the dataset.
On the details page, the Estimated Cost of the datasource is also displayed, showing the approximate cost incurred since creation.
Upgrade S3 Datasource
User can upgrade a datasource if a new version is available. Upgrading a datasource upgrades the underlying Glue version and the data ingestion script with new features.
Downgrade S3 Datasource
User can downgrade a datasource to a previous version if the upgrade does not meet the requirements. A datasource can only be downgraded if it has been upgraded. The option to downgrade is available in the top right corner if the datasource is downgrade-compatible.
Datasource Versions
1.6
In this version of the S3 datasource, data ingestion works by comparing the ETags of the files in the source and the target.
The first step checks the filename and confirms whether the file already exists and its size is unchanged. The ETags of the source file and the current file in the dataset are then retrieved, and if these ETags match, the ingestion proceeds. This approach is adopted because a file's ETag remains consistent even if the filename changes; therefore, if a user duplicates files by renaming them, ingestion is not affected solely by the ETag.
In this version, only files stored in the S3 Standard class are supported for S3 data ingestion. If files from other storage classes exist, the ingestion process will fail.
1.7
In this version of the S3 datasource, the storage classes of files do not affect the flow of ingestion.
This means the ingestion process is not terminated even if files in S3 Glacier-type storage classes exist. Those files are simply skipped, the details of the skipped files are shown, and the ingestion of all other files completes without failure.
Files stored in the S3 Glacier and S3 Glacier Deep Archive classes will be skipped during ingestion.
1.8
In this version of the S3 datasource, we added support for the Skip LZ feature.
This feature enables users to upload data directly to the data lake zone by skipping data validation. Please refer to the Skip LZ related docs for more details.
1.9
In this version of the S3 datasource, we have incorporated multithreading to achieve faster data ingestion. This can be configured by providing an argument 'FileConcurrency' during schedule execution for the ingestion. This argument accepts values ranging from 1 to 100, allowing users to fine-tune their concurrency preferences. If the 'FileConcurrency' argument is not provided, it defaults to 20.
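For illustration, the argument could be supplied alongside other schedule-run arguments as sketched below; only the FileConcurrency key (range 1 to 100, default 20) is documented here, and the surrounding payload shape is an assumption.

```json
{
  "ScheduleArguments": {
    "FileConcurrency": 50
  }
}
```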
2.0
The update in this version is specifically to ensure FIPS compliance, with no changes made to the script.
2.1
In this S3 datasource update, we've introduced Dynamic Partitioning support.
For partitioned datasets, users can now input wildcard (*) patterns during schedule creation to dynamically generate partitions upon S3 ingestion. Each partition captures values from the source S3 objects based on the user-defined patterns.
This feature is exclusive to Append-type datasets. For other dataset types there is no change; ingestion creates a single partition with the value specified at schedule creation, following the usual flow.
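As a hypothetical example, for an Append-type dataset partitioned by year and month, a user might enter wildcard values at schedule creation so that the actual partition values are captured from the source S3 object keys. The payload shape below is an assumption for illustration; only the wildcard (*) behaviour is described above.

```json
{
  "PartitionValues": {
    "year": "*",
    "month": "*"
  }
}
```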
2.2
In this S3 datasource update, we've introduced additional configuration options for Dynamic Partitioning. These enhancements provide users with more control over which partitions are ingested, including the ability to rename partitions.
Configuration Explanation:
- Include: Specifies which partitions should be included for ingestion.
- Exclude: Defines which partitions should be excluded from ingestion into Amorphic.
- Rename: Allows users to rename partitions based on specified criteria.
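As a rough sketch of how these options fit together, a dynamic partitioning configuration could look like the example below; the Include, Exclude, and Rename keys come from the list above, while the wrapper object and value formats are assumptions for illustration only.

```json
{
  "DynamicPartitionConfig": {
    "Include": ["region=us-*"],
    "Exclude": ["region=eu-*"],
    "Rename": { "region": "aws_region" }
  }
}
```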
2.3
This version does not introduce any new features but focuses on optimizing performance and enhancing error handling capabilities.
2.4
This version also focuses on optimizing performance and enhancing error handling capabilities.
2.5
This version brings full support for Amorphic 3.0 along with an upgrade to the latest Python Glue job version.