Version: cicd.4.0

Datasets

Amorphic supports unstructured, semi-structured, and structured Datasets while also providing comprehensive data lake visibility.

Below is a sample resource definition file for a Dataset:

{
  "rCicdDataset": {
    "Type": "Dataset",
    "Properties": {
      "DatasetName": "cicd_dataset",
      "DatasetDescription": "Dataset created from CICD",
      "Domain": {
        "!DependsOn": "rDomain.DomainName"
      },
      "Keywords": ["Owner: johndoe"],
      "DatasourceType": "api",
      "IsDataValidationEnabled": true,
      "SerDe": "OpenCSVSerde",
      "FileDelimiter": ",",
      "FileType": "csv",
      "IsDataCleanupEnabled": false,
      "IsDataProfilingEnabled": true,
      "LifeCyclePolicyStatus": "Disabled",
      "TargetLocation": "s3athena",
      "SkipFileHeader": true,
      "SkipRowCount": { "header": 1, "footer": 0 },
      "SkipLZProcess": false,
      "TableUpdate": "append",
      "DataMetricsCollectionOptions": { "IsMetricsCollectionEnabled": false },
      "DatasetType": "internal",
      "DatasetSchema": [
        {
          "name": "FirstName",
          "description": "",
          "type": "varchar(256)",
          "is_not_null": false
        },
        {
          "name": "LastName",
          "description": "",
          "type": "varchar(256)",
          "is_not_null": false
        }
      ]
    }
  }
}
Note: Deployment dependencies

Dataset has dependencies on Domain and Tag.

Do not delete the resources that a Dataset depends on (its Domain and Tag) before deleting the Dataset itself; doing so may lead to failures or inconsistencies during the deletion process.
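The ordering rule above can be checked locally before a deployment. The following is a minimal sketch, not Amorphic's own implementation: it assumes resource definitions are loaded into a Python dict, that dependencies can be inferred from `!DependsOn` references, and it uses a hypothetical `rDomain` definition (the `cicd_domain` value is invented for illustration).

```python
from graphlib import TopologicalSorter


def find_references(node):
    """Yield every '<LogicalId>.<Key>' string stored under a '!DependsOn' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "!DependsOn" and isinstance(value, str):
                yield value
            else:
                yield from find_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_references(item)


def deletion_order(resources):
    """Return logical IDs so that each resource comes before its dependencies."""
    graph = {
        logical_id: {ref.split(".", 1)[0] for ref in find_references(definition)}
        for logical_id, definition in resources.items()
    }
    # static_order() lists dependencies first (a safe creation order);
    # reversing it gives an order in which dependencies are deleted last.
    creation_order = list(TopologicalSorter(graph).static_order())
    return creation_order[::-1]


# Assumption: a trimmed-down dataset plus a hypothetical domain definition.
resources = {
    "rCicdDataset": {
        "Type": "Dataset",
        "Properties": {"Domain": {"!DependsOn": "rDomain.DomainName"}},
    },
    "rDomain": {"Type": "Domain", "Properties": {"DomainName": "cicd_domain"}},
}

print(deletion_order(resources))
```

Here the Dataset is listed before the Domain it references, which matches the rule that a dependency should outlive the resource that uses it.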

Referencing this Resource

Below are the common keys that can be used with the !DependsOn function to retrieve details of this resource.

Supported Keys

| Key | Description |
| --- | --- |
| DatasetName | Returns the DatasetName value of this resource |
| DatasetId | Returns the DatasetId value of this resource |

For additional supported keys, refer to the API definition document for the respective resource type.

Example

The following example shows how to retrieve the dataset name from a dataset template and use it in a job template:

{
  "rPythonJob": {
    "Type": "Job",
    "Properties": {
      "DatasetAccess": {
        "Owner": [
          {
            "DatasetName": {
              "!DependsOn": "rCicdDataset.DatasetName"
            },
            "DatasetId": {
              "!DependsOn": "rCicdDataset.DatasetId"
            }
          }
        ]
      }
    }
  }
}
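To make the reference syntax concrete, here is a small sketch of how a `<LogicalId>.<Key>` reference could be substituted with a value. It only illustrates the idea and does not describe how the CICD pipeline resolves references internally; the `resolve` helper, the `outputs` mapping, and the `example-dataset-id` value are all hypothetical.

```python
def resolve(node, outputs):
    """Return a copy of node with each {'!DependsOn': 'Id.Key'} replaced by its value."""
    if isinstance(node, dict):
        if set(node) == {"!DependsOn"}:
            # Split 'rCicdDataset.DatasetName' into logical ID and key.
            logical_id, key = node["!DependsOn"].split(".", 1)
            return outputs[logical_id][key]
        return {k: resolve(v, outputs) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(item, outputs) for item in node]
    return node


# Hypothetical values that a deployed rCicdDataset might expose.
outputs = {
    "rCicdDataset": {
        "DatasetName": "cicd_dataset",
        "DatasetId": "example-dataset-id",
    }
}

job_properties = {
    "DatasetAccess": {
        "Owner": [
            {
                "DatasetName": {"!DependsOn": "rCicdDataset.DatasetName"},
                "DatasetId": {"!DependsOn": "rCicdDataset.DatasetId"},
            }
        ]
    }
}

resolved = resolve(job_properties, outputs)
print(resolved)
```

A reference to a logical ID or key that is missing from `outputs` raises a `KeyError` in this sketch, which is a convenient way to catch typos in reference strings during review.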

When creating a dataset through CICD, you must provide the schema under the DatasetSchema key.
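A pre-deployment check can catch a missing schema early. The sketch below assumes the definitions are already loaded into a dict; the `schema_problems` helper and the `rNoSchema` resource are hypothetical, and treating `name` and `type` as required per column is an assumption drawn from the sample definition rather than a documented rule.

```python
def schema_problems(resources):
    """Return human-readable problems for Dataset resources lacking a usable schema."""
    problems = []
    for logical_id, definition in resources.items():
        if definition.get("Type") != "Dataset":
            continue
        schema = definition.get("Properties", {}).get("DatasetSchema")
        if not schema:
            problems.append(f"{logical_id}: DatasetSchema is missing or empty")
            continue
        for index, column in enumerate(schema):
            # Assumption: every column needs at least a name and a type.
            for field in ("name", "type"):
                if not column.get(field):
                    problems.append(f"{logical_id}: column {index} has no '{field}'")
    return problems


resources = {
    "rCicdDataset": {
        "Type": "Dataset",
        "Properties": {
            "DatasetName": "cicd_dataset",
            "DatasetSchema": [
                {"name": "FirstName", "description": "", "type": "varchar(256)", "is_not_null": False},
                {"name": "LastName", "description": "", "type": "varchar(256)", "is_not_null": False},
            ],
        },
    },
    # Hypothetical dataset with no schema, to show what the check reports.
    "rNoSchema": {"Type": "Dataset", "Properties": {"DatasetName": "no_schema"}},
}

print(schema_problems(resources))
```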

Info: Schema limitations

CICD has limitations when managing dataset schemas.

Once a dataset has been created, its schema cannot be modified via CICD.

If you update the schema in your configuration and re-run CICD, the pipeline logs may report the update as successful, but the schema in Amorphic remains unchanged.

If schema changes are required after creation, they must be performed directly in the Amorphic console.
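Because a schema edit can appear to succeed without taking effect, it may help to flag such edits before running the pipeline. This is a minimal sketch under the assumption that you can obtain both the previously deployed `DatasetSchema` and the proposed one as lists of column dicts; the `schema_changes` helper and the `Email` column are hypothetical.

```python
def schema_changes(deployed_schema, proposed_schema):
    """List differences between two DatasetSchema lists, compared by column name and type."""
    deployed = {column["name"]: column["type"] for column in deployed_schema}
    proposed = {column["name"]: column["type"] for column in proposed_schema}

    changes = []
    for name in sorted(deployed.keys() - proposed.keys()):
        changes.append(f"removed column {name}")
    for name in sorted(proposed.keys() - deployed.keys()):
        changes.append(f"added column {name}")
    for name in sorted(deployed.keys() & proposed.keys()):
        if deployed[name] != proposed[name]:
            changes.append(f"changed type of {name}: {deployed[name]} -> {proposed[name]}")
    return changes


deployed_schema = [
    {"name": "FirstName", "type": "varchar(256)"},
    {"name": "LastName", "type": "varchar(256)"},
]
# Hypothetical edit: one type widened and one column added.
proposed_schema = [
    {"name": "FirstName", "type": "varchar(256)"},
    {"name": "LastName", "type": "varchar(512)"},
    {"name": "Email", "type": "varchar(256)"},
]

for change in schema_changes(deployed_schema, proposed_schema):
    print("Will not be applied by CICD:", change)
```

A non-empty result signals that the change needs to be made in the Amorphic console instead of relying on the pipeline run.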