Version: v3.0

External Datasets

External datasets in Amorphic allow users to directly consume their existing data stored in S3 buckets, without the need to ingest it into a new Amorphic dataset.

Creating External Datasets

Users can create external datasets by setting the Dataset Type attribute to 'external' and providing the source S3 location as the Dataset S3 Path. External datasets can be targeted to S3 and Lake Formation target locations, including Hudi datasets.

[Image: Create external dataset]

On creation of the dataset, users are provided with a Dataset Role ARN and a bucket policy. They need to attach the bucket policy to the source bucket referenced in the Dataset S3 Path so that Amorphic can access the data in it.

[Image: External dataset details]
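The bucket policy can be attached from the S3 console (bucket Permissions → Bucket policy) or programmatically. Below is a minimal sketch using boto3, assuming the policy JSON provided by Amorphic has been saved locally as bucket_policy.json and that your credentials are allowed to modify the source bucket:

import boto3

S3_BUCKET_NAME = '<your-s3-bucket-name>'  # the bucket referenced in the Dataset S3 Path

# Read the bucket policy provided by Amorphic on dataset creation
with open('bucket_policy.json') as policy_file:
    bucket_policy = policy_file.read()

# Attach the policy so the dataset role can access the bucket.
# Note: this replaces any existing bucket policy, so merge statements
# manually if the bucket already has one.
s3_client = boto3.client('s3')
s3_client.put_bucket_policy(Bucket=S3_BUCKET_NAME, Policy=bucket_policy)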

If the source S3 bucket is encrypted with a custom KMS key, users will also have to allow the dataset role to access that particular KMS key. A sample key policy is provided below; replace the placeholder principal ARN with the Dataset Role ARN provided on dataset creation.

{
  "Version": "2012-10-17",
  "Id": "amorphic-external-dataset-access",
  "Statement": [
    {
      "Sid": "AllowAccessForExternalDataset",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::XXXXXXXXXXXX:role/<your-external-dataset-role-name>"
        ]
      },
      "Action": [
        "kms:Decrypt",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    }
  ]
}
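One way to apply this without overwriting the key's existing policy is to merge the statement into it. Below is a minimal sketch using boto3, assuming your credentials have kms:GetKeyPolicy and kms:PutKeyPolicy on the key; the key ID and role ARN are placeholders:

import json
import boto3

KMS_KEY_ID = '<your-kms-key-id>'
DATASET_ROLE_ARN = 'arn:aws:iam::XXXXXXXXXXXX:role/<your-external-dataset-role-name>'

kms_client = boto3.client('kms')

# Fetch the current key policy ('default' is the only policy name KMS supports)
key_policy = json.loads(
    kms_client.get_key_policy(KeyId=KMS_KEY_ID, PolicyName='default')['Policy']
)

# Append the statement granting the dataset role access to the key
key_policy['Statement'].append({
    "Sid": "AllowAccessForExternalDataset",
    "Effect": "Allow",
    "Principal": {"AWS": [DATASET_ROLE_ARN]},
    "Action": ["kms:Decrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
    "Resource": "*"
})

# Write the merged policy back to the key
kms_client.put_key_policy(
    KeyId=KMS_KEY_ID, PolicyName='default', Policy=json.dumps(key_policy)
)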

Partitioned Datasets

If the data in the source S3 location is partitioned, users need to specify this while completing the dataset registration. Once the registration is complete, users need to run the MSCK REPAIR TABLE query through the Playground to load the partitions into Glue.

[Image: External dataset partitions]
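For reference, the query has the following form; the database and table names are placeholders for the Glue database and table backing the external dataset:

MSCK REPAIR TABLE <your_database>.<your_table>;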

To view the partitions in Glue, users can run the SHOW PARTITIONS query in the Playground against the specific external dataset.

[Image: Show partitions query]
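As with the repair query, the database and table names below are placeholders:

SHOW PARTITIONS <your_database>.<your_table>;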

Querying Datasets

Querying external datasets through the Playground works the same as for internal datasets. The queries supported on top of external datasets are:

Query Athena Datasets

Dataset Access From ETL Jobs and Data Labs

For an external S3 dataset, in order to consume the data from ETL jobs or Data Labs, users should create STS credentials with the dataset role, create a boto3 session with them, and use it to retrieve the S3 data. The code snippet below shows an example of listing the objects in the source S3 path from an ETL job:

import boto3

DATASET_ROLE_ARN = 'arn:aws:iam::XXXXXXXXXXXX:role/<your-external-dataset-role-name>'  # the external dataset role ARN
S3_BUCKET_NAME = '<your-s3-bucket-name>'
AWS_REGION = '<region-where-amorphic-is-deployed>'

# Assume the dataset role to get temporary credentials
# (the session duration is capped at 1 hour; see Limitations below)
sts_client = boto3.client('sts', region_name=AWS_REGION)
sts_credentials = sts_client.assume_role(
    RoleArn=DATASET_ROLE_ARN,
    RoleSessionName='job-role',
    DurationSeconds=900
)['Credentials']

# Build a boto3 session from the temporary credentials
session = boto3.Session(
    aws_access_key_id=sts_credentials['AccessKeyId'],
    aws_secret_access_key=sts_credentials['SecretAccessKey'],
    aws_session_token=sts_credentials['SessionToken'],
    region_name=AWS_REGION
)

s3_client = session.client('s3')

# List the objects in the source S3 path
resp = s3_client.list_objects_v2(Bucket=S3_BUCKET_NAME)

  • For an external LF dataset, in order to consume the data from ETL jobs or Data Labs, users need to add the JOB_ROLE_ARN/DATALAB_ROLE_ARN to the Principal of the bucket policy attached to the source bucket, as shown in the fragment below.
  • JOB/DATALAB_ROLE_ARN = arn:<aws_partition>:iam::<account_id>:role/<project_short_name>-custom-<datalab_id/job_id>-Role
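For illustration, the Principal of the bucket policy provided by Amorphic would then list both the dataset role and the job or Data Lab role (all ARNs below are placeholders):

"Principal": {
  "AWS": [
    "arn:aws:iam::XXXXXXXXXXXX:role/<your-external-dataset-role-name>",
    "arn:<aws_partition>:iam::<account_id>:role/<project_short_name>-custom-<job_id>-Role"
  ]
}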

Views

Views targeting Lake Formation can be created on top of external datasets; no other target locations are supported.

Limitations

  • The maximum number of resources (ETL jobs/Data Labs) that can be attached to a dataset is limited by the role trust policy length quota of the AWS account: a maximum of 20 resources can be attached with the default quota of 2048 characters, and 44 resources with the maximum quota of 4096 characters.

  • If an external dataset is created with an S3 path, then that specific S3 path cannot be reused to create another external dataset.

  • Dataset repair cannot be performed for external datasets.

  • Data validation, malware detection, data profiling, data metrics collection, data cleanup, and lifecycle policies are disabled for external datasets.

  • There is a 1-hour limit on the session duration for which you can assume the dataset role from an ETL Job or Data Lab (https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html).

Important
  • Write Access: Write access is not permitted for external datasets. This means file uploads, data ingestion, writing data through Jobs/DataLabs, and other file operations are disabled.
  • Read Access:
    • External datasets targeted to LF and S3 can be accessed through the Amorphic Playground.
    • To read external datasets targeted to S3/LF in ETL Jobs and Data Labs, users can add the datasets to the 'Dataset Read Access' dropdown and access the data.
    • Write permissions (e.g., PutObject/DeleteObject) cannot be granted to Jobs/Data Labs, as write access is restricted for external datasets.