S3 Datasource
From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs will have encryption enabled automatically.
S3 Datasources are used to migrate data from a remote S3 bucket to Amorphic. There are two types of S3 Datasources available: Bucket Policy and Access Keys.
Bucket Policy
To create an S3 Datasource using a Bucket Policy, the user must first select the bucket from which the data needs to be migrated and specify the bucket region.
After the datasource is established, the user has the option to download a bucket policy and a KMS key policy. The generated bucket policy should be attached to the source bucket that was added during datasource creation. If the source bucket is associated with a KMS key, then the KMS key policy generated by Amorphic should also be added to the policy of that KMS key.
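For reference, a cross-account read policy of this kind typically looks like the sketch below. This is only an illustration of the general shape such a policy takes; the principal ARN and bucket name are placeholders, and the actual policy to attach is the one downloaded from Amorphic.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAmorphicRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/amorphic-ingestion-role" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-test-bucket",
        "arn:aws:s3:::example-test-bucket/*"
      ]
    }
  ]
}
```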
Access Keys
To create an S3 Datasource using Access Keys, the user must select the bucket from which the data has to be migrated and provide the Access Key and Secret Access Key of an IAM user who has permission to read data from the bucket.
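As a minimal sketch, the IAM user whose access keys are supplied needs at least read and list permissions on the source bucket, along the lines of the policy below (the bucket name is a placeholder; adjust the resource ARNs to the actual bucket).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-test-bucket",
        "arn:aws:s3:::example-test-bucket/*"
      ]
    }
  ]
}
```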
How to create an S3 Datasource?
Attribute | Description |
---|---|
Datasource Name | Give the datasource a unique name. |
Datasource Type | Type of datasource, in this case, it is S3. |
Description | User can describe the datasource purpose and important information about the datasource. |
Authorized Users | Amorphic users who can access this datasource. |
S3 Bucket | Name of the bucket from which the dataset files have to be imported. |
Datasource Access Type | There are two access types for this datasource: Access Keys and Bucket Policy. |
Version | Enables the user to select which version of the ingestion scripts to use (Amorphic specific). Whenever a new feature or Glue version is added to the underlying ingestion script, a new version is added to Amorphic. |
S3 Bucket Region | Region where the source S3 bucket is created. If the source bucket is in one of the regions eu-south-1, af-south-1, me-south-1, or ap-east-1, then this property must be provided and the region must be enabled in Amorphic, otherwise ingestion fails. |
For Redshift use cases involving a substantial volume of incoming files, it is advisable for the user to enable data load throttling and configure a maximum limit of 90 for Redshift.
Additionally, the timeout for the ingestion process can be set during datasource creation by adding a key IngestionTimeout to DatasourceDetails in the input payload. The value is expected in minutes and must be between 1 and 2880. If no value is provided, the default of 480 (8 hours) is used. Please note that this feature is available exclusively via the API.
{
    "DatasourceDetails": {
        "S3Bucket": "example-test-bucket",
        "S3ConnectionType": "bucket_policy",
        "S3BucketRegion": "us-east-1",
        "IngestionTimeout": 222
    }
}
This timeout can be overridden during schedule creation and schedule run by providing an argument MaxTimeOut.
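As an illustration, the override could be passed as a schedule argument along these lines; only the MaxTimeOut key is documented here, and the surrounding payload shape is an assumption, so refer to the schedule API documentation for the exact structure.

```json
{
  "ScheduleArguments": {
    "MaxTimeOut": 120
  }
}
```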
Test Datasource
This functionality allows users to quickly verify connectivity to the specified S3 bucket. By initiating this test, users can confirm that the S3 bucket details provided are accurate and functional, ensuring seamless access to the bucket.
Data migration to Amorphic
To migrate data to Amorphic, users must follow these steps:
- Create a new Dataset and select the Datasource Type as S3, then choose the S3 datasource from the drop-down list.
- Provide the FileType and DirectoryPath if required, ensuring that only the data from the specified path and file type is extracted.
- If necessary, create Partitions for the dataset (S3Athena and LakeFormation only); files are then ingested into the specified partition, which improves read and query performance.
After creating the dataset, the next step for the user is to set up a Schedule for data ingestion.
- Create a schedule with the Type specified as DataIngestion and select the previously created dataset from the list.
- If the user created any Partitions for the dataset, specify values for each partition.
This schedule is responsible for ingesting data from the source S3 bucket into the target dataset created earlier. Once the schedule has been set up and run successfully, the user can check the Amorphic dataset to verify that the files from the source S3 bucket have been ingested and are now present in the dataset.
On the details page, the Estimated Cost of the datasource is also displayed, showing the approximate cost incurred since creation.
Upgrade S3 Datasource
User can upgrade a datasource if a new version is available. Upgrading a datasource upgrades the underlying Glue version and the data ingestion script with new features.
Downgrade S3 Datasource
User can downgrade a datasource to a previous version if the upgrade does not meet the requirements. A datasource can only be downgraded if it has been upgraded. The option to downgrade is available in the top right corner if the datasource is downgrade-compatible.
Datasource Versions
1.6
In this version of the S3 datasource, data ingestion works by comparing the ETags of the files in the source and the target.
The first step checks the filename and confirms whether the file already exists and its size is unchanged. The ETags of the source file and the current file in the dataset are then retrieved, and if these ETags match, the ingestion proceeds. This approach is adopted because a file's ETag remains consistent even if the filename changes; therefore, if a user duplicates files by renaming them, ingestion is not affected solely by the ETag.
In this version, only files stored in the S3 Standard class are supported for S3 data ingestion. If files from other storage classes exist, the ingestion process will fail.
1.7
In this version of the S3 datasource, the storage classes of files do not affect the flow of ingestion.
This means the ingestion process is not terminated even if files in S3 Glacier-type storage classes exist. Those files are simply skipped, the details of the skipped files are shown, and the ingestion of all other files completes without failure.
Files stored in the S3 Glacier and S3 Glacier Deep Archive classes will be skipped during ingestion.
1.8
In this version of the S3 datasource, we added support for the Skip LZ feature.
This feature enables users to upload data directly to the data lake zone by skipping data validation. Please refer to the Skip LZ related docs for more details.
1.9
In this version of the S3 datasource, we have incorporated multithreading to achieve faster data ingestion. This can be configured by providing an argument 'FileConcurrency' during schedule execution for the ingestion. This argument accepts values ranging from 1 to 100, allowing users to fine-tune their concurrency preferences. If the 'FileConcurrency' argument is not provided, it defaults to 20.
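For illustration, the argument could be supplied alongside other schedule-run arguments as sketched below; only the FileConcurrency key (range 1 to 100, default 20) is documented here, and the surrounding payload shape is an assumption.

```json
{
  "ScheduleArguments": {
    "FileConcurrency": 50
  }
}
```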
2.0
The update in this version is specifically to ensure FIPS compliance, with no changes made to the script.
2.1
In this S3 datasource update, we've introduced Dynamic Partitioning support.
For partitioned datasets, users can now input wildcard (*) patterns during schedule creation to dynamically generate partitions upon S3 ingestion. Each partition captures values from the source S3 objects based on the user-defined patterns.
This feature is exclusive to Append-type datasets. For other dataset types there is no change; ingestion creates a single partition with the value specified at schedule creation, following the usual flow.
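As a hypothetical example, for an Append-type dataset partitioned by year and month, a user might enter wildcard values at schedule creation so that the actual partition values are captured from the source S3 object keys. The payload shape below is an assumption for illustration; only the wildcard (*) behaviour is described above.

```json
{
  "PartitionValues": {
    "year": "*",
    "month": "*"
  }
}
```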
2.2
In this S3 datasource update, we've introduced additional configuration options for Dynamic Partitioning. These enhancements provide users with more control over which partitions are ingested, including the ability to rename partitions.
Configuration Explanation:
- Include: Specifies which partitions should be included for ingestion.
- Exclude: Defines which partitions should be excluded from ingestion into Amorphic.
- Rename: Allows users to rename partitions based on specified criteria.
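As a rough sketch of how these options fit together, a dynamic partitioning configuration could look like the example below; the Include, Exclude, and Rename keys come from the list above, while the wrapper object and value formats are assumptions for illustration only.

```json
{
  "DynamicPartitionConfig": {
    "Include": ["region=us-*"],
    "Exclude": ["region=eu-*"],
    "Rename": { "region": "aws_region" }
  }
}
```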
2.3
This version does not introduce any new features but focuses on optimizing performance and enhancing error handling capabilities.
2.4
This version also focuses on optimizing performance and enhancing error handling capabilities.
2.5
This version brings full support for Amorphic 3.0 along with an upgrade to the latest Python Glue job version.