Datasets
Amorphic supports unstructured, semi-structured, and structured Datasets while also providing comprehensive data lake visibility.
Below is the sample resource definition file for Dataset:
{
    "rCicdDataset": {
        "Type": "Dataset",
        "Properties": {
            "DatasetName": "cicd_dataset",
            "DatasetDescription": "Dataset created from CICD",
            "Domain": {
                "!DependsOn": "rDomain.DomainName"
            },
            "Keywords": ["Owner: johndoe"],
            "DatasourceType": "api",
            "IsDataValidationEnabled": true,
            "SerDe": "OpenCSVSerde",
            "FileDelimiter": ",",
            "FileType": "csv",
            "IsDataCleanupEnabled": false,
            "IsDataProfilingEnabled": true,
            "LifeCyclePolicyStatus": "Disabled",
            "TargetLocation": "s3athena",
            "SkipFileHeader": true,
            "SkipRowCount": { "header": 1, "footer": 0 },
            "SkipLZProcess": false,
            "TableUpdate": "append",
            "DataMetricsCollectionOptions": { "IsMetricsCollectionEnabled": false },
            "DatasetType": "internal",
            "DatasetSchema": [
                {
                    "name": "FirstName",
                    "description": "",
                    "type": "varchar(256)",
                    "is_not_null": false
                },
                {
                    "name": "LastName",
                    "description": "",
                    "type": "varchar(256)",
                    "is_not_null": false
                }
            ]
        }
    }
}
A Dataset depends on the Domain and Tag resources.
The resources a Dataset depends on should not be deleted before the Dataset itself; attempting to do so may lead to failures or inconsistencies during the deletion process.
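The deletion-order rule can be illustrated with a small script. This is only a sketch under stated assumptions, not how Amorphic CICD orders deletions internally: it assumes every dependency is expressed through a !DependsOn reference of the form resource.key, and the helper names referenced_resources and deletion_order, as well as the placeholder rDomain definition, are hypothetical.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def referenced_resources(node):
    """Return the logical names referenced through "!DependsOn" anywhere under node."""
    names = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "!DependsOn" and isinstance(value, str):
                # "rDomain.DomainName" -> "rDomain"
                names.add(value.split(".", 1)[0])
            else:
                names |= referenced_resources(value)
    elif isinstance(node, list):
        for item in node:
            names |= referenced_resources(item)
    return names


def deletion_order(resources):
    """Order resources so that each one comes before the resources it depends on."""
    graph = {name: referenced_resources(body) for name, body in resources.items()}
    # static_order() yields dependencies first (a creation order); reverse it for deletion.
    creation_order = list(TopologicalSorter(graph).static_order())
    return creation_order[::-1]


# Placeholder definitions: the rDomain body is left empty on purpose (assumption).
resources = json.loads("""
{
  "rDomain": {"Type": "Domain", "Properties": {}},
  "rCicdDataset": {
    "Type": "Dataset",
    "Properties": {
      "DatasetName": "cicd_dataset",
      "Domain": {"!DependsOn": "rDomain.DomainName"}
    }
  }
}
""")

print(deletion_order(resources))  # ['rCicdDataset', 'rDomain']
```

In this sketch the Dataset is listed before the Domain it references, which matches the rule above: remove the Dataset first, then the resources it depends on.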
Referencing this Resource
Below are the common keys that can be used with the !DependsOn function to retrieve details of this resource.
Supported Keys
| Key | Description |
|---|---|
| DatasetName | Returns the DatasetName value of this resource |
| DatasetId | Returns the DatasetId value of this resource |
For additional supported keys, refer to the API definition document for the respective resource type.
Example
The following example shows how to retrieve the dataset name and dataset ID from a dataset template and use them in a job template:
{
    "rPythonJob": {
        "Type": "Job",
        "Properties": {
            "DatasetAccess": {
                "Owner": [
                    {
                        "DatasetName": {
                            "!DependsOn": "rCicdDataset.DatasetName"
                        },
                        "DatasetId": {
                            "!DependsOn": "rCicdDataset.DatasetId"
                        }
                    }
                ]
            }
        }
    }
}
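As a mental model of what the references in the job template evaluate to, the following sketch substitutes each !DependsOn object with a value looked up by resource name and key. It is an illustration only and makes no claim about how the CICD pipeline resolves references internally; the function resolve_references, the outputs dictionary, and the DatasetId value are hypothetical.

```python
def resolve_references(node, outputs):
    """Replace every {"!DependsOn": "<resource>.<key>"} object with a looked-up value."""
    if isinstance(node, dict):
        if set(node) == {"!DependsOn"}:
            logical_name, key = node["!DependsOn"].split(".", 1)
            return outputs[logical_name][key]
        return {k: resolve_references(v, outputs) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_references(item, outputs) for item in node]
    return node


# Hypothetical values; a real DatasetId is assigned by Amorphic when the dataset is created.
outputs = {
    "rCicdDataset": {
        "DatasetName": "cicd_dataset",
        "DatasetId": "example-dataset-id",
    }
}

dataset_access = {
    "Owner": [
        {
            "DatasetName": {"!DependsOn": "rCicdDataset.DatasetName"},
            "DatasetId": {"!DependsOn": "rCicdDataset.DatasetId"},
        }
    ]
}

resolved = resolve_references(dataset_access, outputs)
print(resolved)
# {'Owner': [{'DatasetName': 'cicd_dataset', 'DatasetId': 'example-dataset-id'}]}
```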
When creating a dataset through CICD, you must provide the schema under the DatasetSchema key.
CICD cannot modify a dataset's schema once the dataset has been created.
If you change DatasetSchema in your configuration and re-run CICD, the pipeline logs may report the update as successful, but the schema in Amorphic will remain unchanged.
If schema changes are required after creation, they must be performed directly in the Amorphic console.
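Because a schema edit can appear to succeed in the pipeline logs without taking effect, it may help to flag such edits before running CICD. The sketch below compares two versions of a DatasetSchema list and reports differences. It assumes the deployed schema is available from somewhere you control, for example the previously committed definition file; the function schema_changes and the sample column changes are hypothetical and not part of Amorphic.

```python
def schema_changes(deployed_schema, configured_schema):
    """List columns that were added, removed, or changed between two DatasetSchema lists."""
    deployed = {column["name"]: column for column in deployed_schema}
    configured = {column["name"]: column for column in configured_schema}
    changes = []
    for name in sorted(deployed.keys() | configured.keys()):
        if name not in configured:
            changes.append(f"removed: {name}")
        elif name not in deployed:
            changes.append(f"added: {name}")
        elif deployed[name] != configured[name]:
            changes.append(f"changed: {name}")
    return changes


# Schema as originally created (taken from the sample definition above).
deployed_schema = [
    {"name": "FirstName", "description": "", "type": "varchar(256)", "is_not_null": False},
    {"name": "LastName", "description": "", "type": "varchar(256)", "is_not_null": False},
]

# Illustrative edit: one column type changed and one column added.
configured_schema = [
    {"name": "FirstName", "description": "", "type": "varchar(256)", "is_not_null": False},
    {"name": "LastName", "description": "", "type": "varchar(512)", "is_not_null": False},
    {"name": "MiddleName", "description": "", "type": "varchar(256)", "is_not_null": False},
]

changes = schema_changes(deployed_schema, configured_schema)
print(changes)  # ['changed: LastName', 'added: MiddleName']
```

A non-empty result indicates a schema edit that CICD will not apply, so the change would need to be made in the Amorphic console instead.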