Files
The Files section provides a centralized interface for managing files that contain actual data for datasets. It allows users to view, track, and manage files associated with their datasets, along with their respective statuses. This section helps ensure data integrity and transparency by offering insights into file uploads and processing.
When files are uploaded, their data is automatically copied into the tables at the designated target location, such as S3-Athena, Redshift, Lake Formation, or DynamoDB. This seamless integration ensures that datasets remain up to date and accessible for further analysis.
Users can perform several operations on files, including:
Operation | Details |
---|---|
Upload | Users can upload files to datasets based on the selected Update Method. To validate the file data against the dataset schema, enable Data Validation in the advanced configuration settings. |
Truncate | If the target location is S3, S3-Athena, or Lake Formation, users can permanently delete all files within the dataset. |
Delete | When a file is deleted from a dataset, it moves to the Deleted Files section, where it remains for four weeks before being permanently removed. |
Restore | Users can restore deleted files if the target location is S3, S3-Athena, or Lake Formation. |
Restore Files from Archive | Files stored in archival storage classes such as Glacier and Deep Archive can be restored either temporarily or permanently. |
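As a rough illustration, the availability rules described on this page can be expressed as a small lookup. This is only a sketch: the function and constant names are hypothetical and do not correspond to any platform API. It assumes that Upload applies to every target location, and that the remaining operations are limited to S3, S3-Athena, and Lake Formation, as stated in their respective sections.

```python
# Target locations for which this page documents the file-management operations.
FILE_STORE_TARGETS = {"S3", "S3-Athena", "Lake Formation"}

# Operations that this page limits to the targets above.
RESTRICTED_OPERATIONS = {
    "Truncate",
    "Delete",
    "Restore",
    "Restore Files from Archive",
    "Permanently Delete Files",
}


def is_operation_available(operation: str, target_location: str) -> bool:
    """Return True if this page documents the operation for the given target."""
    if operation == "Upload":
        # Assumption: uploads are possible for any target location.
        return True
    if operation in RESTRICTED_OPERATIONS:
        return target_location in FILE_STORE_TARGETS
    raise ValueError(f"Unknown operation: {operation}")
```

For example, `is_operation_available("Restore", "Redshift")` returns `False`, while `is_operation_available("Truncate", "S3-Athena")` returns `True`.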
Upload
Users can add files based on the Update Method selected while creating a dataset. For example, datasets with the Redshift target location have three available update methods, while datasets with the S3 or S3-Athena target location have two.
Reload
- Newly uploaded files replace existing files in the dataset.
- Users can choose to process or discard the uploaded files.
- Supported for: Redshift, S3, S3-Athena, Lake Formation, Iceberg (Lake Formation & S3-Athena), DynamoDB.
Append
- Newly added files are appended to the existing dataset without replacing older files.
- Supported for: Redshift, S3, S3-Athena, Normal Lake Formation, Lake Formation (HUDI, Delta Lake), Iceberg (Lake Formation & S3-Athena), DynamoDB.
LatestRecord
- Similar to Append, but ensures that only the latest version of each record is retained.
- Supported for: Redshift.
Overwrite
- Replaces the existing dataset with newly uploaded files, ensuring a fresh dataset each time.
- Supported for: Lake Formation (HUDI, Delta Lake).
Change Data Capture (CDC)
- Tracks incremental changes in the dataset without modifying existing structures.
- Supported for: Iceberg (Lake Formation & S3-Athena).
- Files uploaded here are processed immediately, even if data throttling is enabled. However, files uploaded through jobs, ingestion, and similar channels still go through the normal data load throttling procedures.
- Users can also custom partition the data based on specific attributes, enabling better organization, faster query performance, and optimized data retrieval. To learn more, refer to Dataset Custom Partitioning.
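The update-method support listed above can be summarized as a lookup table. The following is a minimal sketch that only restates the "Supported for" lists. The names `SUPPORTED_TARGETS` and `update_methods_for` are hypothetical and are not part of the platform.

```python
# Update method -> target locations, copied from the "Supported for" lists above.
SUPPORTED_TARGETS = {
    "Reload": {
        "Redshift", "S3", "S3-Athena", "Lake Formation",
        "Iceberg (Lake Formation & S3-Athena)", "DynamoDB",
    },
    "Append": {
        "Redshift", "S3", "S3-Athena", "Normal Lake Formation",
        "Lake Formation (HUDI, Delta Lake)",
        "Iceberg (Lake Formation & S3-Athena)", "DynamoDB",
    },
    "LatestRecord": {"Redshift"},
    "Overwrite": {"Lake Formation (HUDI, Delta Lake)"},
    "Change Data Capture (CDC)": {"Iceberg (Lake Formation & S3-Athena)"},
}


def update_methods_for(target_location: str) -> list:
    """Return the update methods listed for a target location, sorted by name."""
    return sorted(
        method
        for method, targets in SUPPORTED_TARGETS.items()
        if target_location in targets
    )


print(update_methods_for("Redshift"))  # ['Append', 'LatestRecord', 'Reload']
print(update_methods_for("S3"))        # ['Append', 'Reload']
```

The output matches the counts mentioned earlier: three update methods for Redshift and two for S3 and S3-Athena.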
Skip LZ (Validation) Process
Enabling the Skip LZ (Validation) Process option (set to True/Yes) allows direct file uploads to the DLZ bucket, bypassing the entire validation process. This streamlines the upload by skipping unnecessary S3 copies and validations, enhancing efficiency.
When activated, this setting also automatically disables Malware Detection and IsDataValidationEnabled for datasets targeting S3-Athena and Lake Formation.
Note: This feature is only available for datasets using the Append and Update methods.
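The effect of this option on related settings can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: `IsDataValidationEnabled` is the only key taken from this page, while `SkipLZProcess`, `UpdateMethod`, `TargetLocation`, and `MalwareDetection` are hypothetical key names chosen for the example.

```python
# Targets for which the page says the two checks are disabled automatically.
AUTO_DISABLE_TARGETS = {"S3-Athena", "Lake Formation"}

# Update methods for which the option is available, per the note above.
ALLOWED_UPDATE_METHODS = {"Append", "Update"}


def effective_settings(settings: dict) -> dict:
    """Return a copy of the settings with the Skip LZ side effects applied."""
    result = dict(settings)
    if not result.get("SkipLZProcess", False):
        # Option is off: nothing changes.
        return result
    if result.get("UpdateMethod") not in ALLOWED_UPDATE_METHODS:
        raise ValueError("Skip LZ is only available for Append and Update")
    if result.get("TargetLocation") in AUTO_DISABLE_TARGETS:
        # Both checks are switched off when validation is bypassed.
        result["MalwareDetection"] = False
        result["IsDataValidationEnabled"] = False
    return result
```

The function returns a new dictionary and leaves the input untouched, so the original settings remain available for comparison.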
File Details
- Clicking on the file name in the interface opens a detailed view panel.
- This panel shows file details such as a preview and other pertinent information.
- Users can apply machine learning models to the data and view analytics results directly.
- If the AI service is active for datasets targeting S3, detailed analytics results are accessible.
- There is an option for users to download files for offline analysis.
Preview supported file types:
- Image Formats: bmp, gif, ico, jpeg, jpg, png, apng, avif, webp, svg
- Document Formats: csv, tsv, txt, json, docx, pdf
- Audio Formats: m4a, mp3, oga, ogg, wav
- Video Formats: mp4, webm, ogv, mov
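A client-side check for whether a file can be previewed might look like the sketch below. The extension sets are copied from the list above. The function name `preview_category` is hypothetical, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions copied from the list of preview-supported file types.
PREVIEW_TYPES = {
    "image": {"bmp", "gif", "ico", "jpeg", "jpg", "png", "apng", "avif", "webp", "svg"},
    "document": {"csv", "tsv", "txt", "json", "docx", "pdf"},
    "audio": {"m4a", "mp3", "oga", "ogg", "wav"},
    "video": {"mp4", "webm", "ogv", "mov"},
}


def preview_category(file_name: str) -> Optional[str]:
    """Return the preview category for a file name, or None if unsupported."""
    # Assumption: the extension is matched case-insensitively.
    extension = PurePosixPath(file_name).suffix.lower().lstrip(".")
    for category, extensions in PREVIEW_TYPES.items():
        if extension in extensions:
            return category
    return None
```

For instance, `preview_category("customers.csv")` returns `"document"`, and a file such as `backup.zip` returns `None` because its extension is not in the list.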
Truncate Dataset
When the Truncate option is selected, all files in the dataset are permanently deleted. This operation cannot be undone, so use it with caution.
Availability:
- This functionality is only available for datasets with target locations in S3, S3-Athena, and Lake Formation.
Below is an illustration of the Truncate Dataset functionality:
Delete
Users can delete files from a dataset. Deleted files can be recovered with the Restore function at any time within the retention period, before they are permanently removed.
Availability:
- This functionality is only available for datasets with target locations in S3, S3-Athena, and Lake Formation.
Below is an illustration of the Delete functionality:
All files marked as deleted are permanently removed after four weeks, with eventual consistency. Permanent removal deletes both the file data and its metadata and cannot be reversed. Users can view the time remaining for each file in the Deleted Files section and take any necessary action before then.
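The four-week window can be illustrated with a short calculation. This sketch computes only the nominal time remaining. Because removal is eventually consistent, the actual removal may happen somewhat later. The function name and the use of a deletion timestamp are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Retention period for deleted files, as stated above.
RETENTION = timedelta(weeks=4)


def time_remaining(deleted_at: datetime, now: datetime) -> timedelta:
    """Return the nominal time left before a deleted file is permanently removed."""
    remaining = deleted_at + RETENTION - now
    # Never report a negative duration once the window has passed.
    return max(remaining, timedelta(0))


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 8, tzinfo=timezone.utc)
print(time_remaining(deleted_at, now))  # 21 days, 0:00:00
```

One week after deletion, three of the four weeks remain, so the file can still be restored.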
Restore
The Restore function allows users to recover deleted files from a dataset before they are permanently removed. This ensures that mistakenly deleted files can be retrieved within the retention period.
Availability:
- This functionality is only available for datasets with target locations in S3, S3-Athena, and Lake Formation.
Below is an illustration of the Restore Files functionality:
Restore Files from Archive
Users can restore files that are stored in archival storage classes such as Glacier and Deep Archive. This allows access to archived data when needed.
Availability:
- This functionality is applicable to datasets with target locations in S3, S3-Athena, and Lake Formation.
Required Attributes for Restoring Archived Files:
Property | Description |
---|---|
File Copy Type | Specifies whether the restored file should be a temporary or permanent copy. |
Temporary | A temporary copy will be available for a specified number of days (Restore Expiration Days) before expiring. |
Permanent | The restored file will be permanently copied to the Standard storage class and will always remain accessible. |
Restore Expiration Days | Defines the duration (in days) for which the temporary copy will be available for use or download. This is only applicable when the File Copy Type is set to Temporary. |
Retrieval Option | Determines the retrieval method for restoring an archived file. The time and cost of restoration depend on the selected option. For more details, refer to the Object Retrieval Options. |
Temporarily restored objects cannot be queried. To learn more, refer to AWS Athena limitations.
Also, users cannot query or create views on transition-related datasets.
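The relationship between the attributes in the table above can be sketched as a simple validation step. The class and function names below are hypothetical and do not reflect the platform's API. The sketch only encodes what the table states: Restore Expiration Days applies to Temporary copies only, and a Retrieval Option must be chosen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArchiveRestoreRequest:
    file_copy_type: str                      # "Temporary" or "Permanent"
    retrieval_option: str                    # value chosen from the retrieval options
    restore_expiration_days: Optional[int] = None


def validate(request: ArchiveRestoreRequest) -> List[str]:
    """Return a list of problems; an empty list means the request is consistent."""
    errors = []
    if request.file_copy_type not in ("Temporary", "Permanent"):
        errors.append("File Copy Type must be Temporary or Permanent")
    if request.file_copy_type == "Temporary":
        days = request.restore_expiration_days
        if days is None or days < 1:
            errors.append("Temporary copies need a positive Restore Expiration Days")
    elif request.file_copy_type == "Permanent":
        if request.restore_expiration_days is not None:
            errors.append("Restore Expiration Days applies only to Temporary copies")
    if not request.retrieval_option:
        errors.append("Retrieval Option is required")
    return errors
```

For example, a Temporary request without an expiration produces one error, while a Permanent request with only a retrieval option passes. The retrieval option value used in any call is an example; the valid values are those listed in the referenced Object Retrieval Options.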
Permanently Delete Files
Users can permanently delete files from a dataset, ensuring they are irreversibly removed and cannot be restored.
Availability:
- This functionality is applicable to datasets with target locations in S3, S3-Athena, and Lake Formation.
Below is an illustration of the Permanent File Deletion functionality:
Files Use Case
A company stores customer information in various files and manages them efficiently using different dataset operations.
How the Company Manages Files:
- Uploading New Files: The company uploads new files containing updated customer information. The Skip LZ (Validation) Process is set to 'No', ensuring that the data is validated before being added to the tables.
- Appending New Data: Newly added files are appended to the existing dataset, ensuring that the latest customer records are always available.
- Reloading Old Files: The company reloads files, replacing older ones to maintain an up-to-date dataset.
- Truncating the Dataset: If necessary, the company can truncate the dataset to permanently delete all files.
- Restoring Deleted Files: If a file is accidentally deleted, it can be restored from the Deleted Files section. If the file was archived, it can be restored from the archive (e.g., Glacier, Deep Archive).
By using these functionalities, the company ensures its customer dataset is accurate, up-to-date, and well-managed.