Datasets
S3 Athena Datasets
1. Empty Data for Numeric or Non-String Columns
Error message: Data parsing failed, empty field data found for non-string column
Issue description: When a user tries to load empty or null values into non-string columns, the load process fails with the data parsing error above.
Explanation: This issue arises because the file parser does not currently support null or empty values for non-string fields. As per the documentation, one workaround is to import the data as string columns and create dataset views on top of them, casting the columns to the required data types.
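For example, if numeric and timestamp columns were imported as strings, a view can cast them back and return NULL for empty values. Below is a minimal sketch that issues the view-creation SQL through the Athena API directly; the database, table, view, and column names are hypothetical, and in Amorphic the same SELECT would typically go into the dataset view definition instead.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names: orders_raw was loaded with all columns as strings.
create_view_sql = """
CREATE OR REPLACE VIEW sales_db.orders_typed AS
SELECT
    order_id,
    -- TRY_CAST returns NULL instead of failing on empty or invalid values
    TRY_CAST(NULLIF(quantity, '') AS INTEGER)   AS quantity,
    TRY_CAST(NULLIF(amount, '')   AS DOUBLE)    AS amount,
    TRY_CAST(NULLIF(order_ts, '') AS TIMESTAMP) AS order_ts
FROM sales_db.orders_raw
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```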
2. File parsing:
Error message: Data validation failed with message, new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Issue description: This is one of the data validation errors that can occur while loading data into a dataset.
Explanation: This error is thrown by the file parser, which currently does not support embedded line breaks in CSV/TSV/XLSX files; refer to the documentation for the supported formats. A possible solution is to perform a regex replace on the stray newline or carriage return characters in the file before uploading it.
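A minimal sketch of such a cleanup in Python, assuming a CSV file where some quoted fields contain embedded line breaks (the file names are hypothetical):

```python
import csv
import re

# Rewrite the CSV so line breaks inside quoted fields are replaced with spaces
# before uploading the file to the dataset.
with open("input.csv", newline="", encoding="utf-8") as src, \
     open("cleaned.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Replace any carriage return / line feed characters found inside a field
        writer.writerow([re.sub(r"[\r\n]+", " ", field) for field in row])
```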
3. Field validations:
Error message: N/A
Issue description: Validations not available for all data types
Explanation: Currently, data type validations are limited to the primitive types String/Varchar, Integer, Double, Boolean, Date, and Timestamp. Support for complex structures is yet to be added. Moreover, for data types like Date and Timestamp, value formats are not strictly validated since these values can come in multiple formats.
4. Data Profiling:
Error message: Data Profiling failing with "Failed to create any executor tasks" error
Issue description: This error occurs when there are insufficient IP addresses available in the subnet for the Data Profiling Glue job execution.
Explanation: The job fails to start if the number of required IP addresses exceeds those available in the subnet. Re-triggering the Data Profiling Job for that dataset after some time can resolve this issue, as IP addresses used by other job executions will become available again shortly after completion.
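To judge when to re-trigger, you can check how many free IP addresses the job's subnet currently has; a minimal sketch using the EC2 API (the subnet ID below is hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical subnet ID of the subnet used by the Data Profiling Glue job.
resp = ec2.describe_subnets(SubnetIds=["subnet-0123456789abcdef0"])
available = resp["Subnets"][0]["AvailableIpAddressCount"]

# Each Glue worker needs one IP address; re-trigger the job only once enough are free.
print(f"Available IP addresses in subnet: {available}")
```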
Dataset Views
- Issue Description: Unable to create a dataset view using another dataset view as its source when the dataset type is set to 'view'.
Affected Versions: 2.6, 2.6.1
Explanation: When attempting to create a dataset view with an existing dataset view as its source, the operation results in a 'GE-1008 - Could not complete the request. Please try again.' error.
Workaround: Initially create the dataset view with a SQL statement that references an invalid or non-existent table or dataset view. The dataset view will be created but placed in a failed or invalid state due to the missing source. Once the dataset view metadata is in place, update it by modifying the SQL statement to reference the correct source dataset view. This allows the dataset view to be successfully updated and function as expected. For materialized dataset views, the process is similar, but you will have to use the API to update the dataset view, as updating a materialized dataset view is not supported from the UI.
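A rough outline of the two-step workaround as API calls; the base URL, endpoint paths, and payload field names below are purely illustrative placeholders, so consult your deployment's API reference for the actual contract:

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical Amorphic API endpoint
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# Step 1: create the view against a non-existent source so the metadata gets
# registered; the view lands in a failed/invalid state.
create_payload = {
    "DatasetName": "my_view_on_view",          # hypothetical field names
    "DatasetType": "view",
    "SqlStatement": "SELECT * FROM nonexistent_placeholder_table",
}
resp = requests.post(f"{BASE_URL}/datasets", json=create_payload, headers=HEADERS)
view_id = resp.json()["DatasetId"]

# Step 2: update the SQL to reference the real source dataset view.
update_payload = {"SqlStatement": "SELECT * FROM existing_source_dataset_view"}
requests.put(f"{BASE_URL}/datasets/{view_id}", json=update_payload, headers=HEADERS)
```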
- Issue Description: Users lose access to dataset views after making a dataset views listing call.
Affected Versions: 2.6, 2.6.1
Explanation: This issue arises from a metadata consistency check performed during the dataset view listing call, where user-accessible dataset views are retrieved from the groups-dataset-views table and validated against the dataset views table. If a dataset view is present in the groups-dataset-views table but not found in the dataset views table, it is flagged and removed. Due to DynamoDB limitations, such as the 16 MB data return limit for a single BatchGetItem request and the 1 MB per-partition read limit, retrieval may be truncated, resulting in an incomplete item set. This can lead to dataset views being incorrectly identified as missing and subsequently removed from the groups-dataset-views table, causing users to lose access to them. This situation is more likely if users have a large number of dataset views with very long SQL statements (around 400 KB each).
Workaround: Contact Amorphic support to get a fix for this issue.
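For context, the DynamoDB behavior behind this limitation is that BatchGetItem may return only part of the requested items and report the rest under UnprocessedKeys, so a caller that ignores that field sees a truncated result set. A minimal sketch of the retry pattern (table and key names are hypothetical; this is not Amorphic's internal code):

```python
import boto3

dynamodb = boto3.client("dynamodb")

def batch_get_all(table_name, keys):
    """Fetch all requested items, retrying UnprocessedKeys until none remain."""
    # BatchGetItem accepts at most 100 keys per call; larger sets need chunking.
    items = []
    request = {table_name: {"Keys": keys}}
    while request:
        resp = dynamodb.batch_get_item(RequestItems=request)
        items.extend(resp["Responses"].get(table_name, []))
        # Items skipped due to the 16 MB / per-partition read limits come back here.
        request = resp.get("UnprocessedKeys", {})
    return items

# Example key shape (hypothetical attribute name):
# batch_get_all("groups-dataset-views", [{"DatasetViewId": {"S": "view-123"}}])
```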
- Issue Description: Dataset View listing call fails when trying to retrieve a large number of dataset views on one page.
Affected Versions: 2.6, 2.6.1
Explanation: The dataset view listing call fails when attempting to retrieve more items than the default limit in a single call. For example, with a default page limit of 12 items, users attempting to fetch 24, 50, or 100 items per page, especially with dataset views that contain large SQL statements, may experience a failure. This occurs because the system cannot return response payloads larger than 6 MB.
Workaround: Reset the application and log in again. Then, try listing fewer items per page to avoid exceeding the payload size limit.
- Issue Description: Unable to update dataset views when the dataset view has a large SQL statement (greater than 256 KB).
Affected Versions: 2.6, 2.6.1, 2.7
Explanation: Updating a dataset view with a large SQL statement fails with the error message GE-1008 - Could not complete the request. Please try again. This happens because the backend processes the update via an asynchronous Lambda invocation, which supports a maximum payload size of 256 KB. If the SQL statement exceeds this limit, the update fails.
Workaround: Create a new dataset view with the updated SQL statement.
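The 256 KB ceiling comes from the AWS limit on asynchronous Lambda invocation payloads. A rough way to check whether a view definition will hit it before attempting an update (the function name and payload shape below are hypothetical):

```python
import json
import boto3

ASYNC_PAYLOAD_LIMIT = 256 * 1024  # AWS limit for asynchronous (Event) invocations

payload = {
    "Action": "update_dataset_view",              # hypothetical payload fields
    "SqlStatement": open("view_definition.sql").read(),
}
body = json.dumps(payload).encode("utf-8")

if len(body) > ASYNC_PAYLOAD_LIMIT:
    print(f"SQL statement too large for an async update ({len(body)} bytes); "
          "create a new dataset view instead.")
else:
    boto3.client("lambda").invoke(
        FunctionName="dataset-view-update",       # hypothetical function name
        InvocationType="Event",
        Payload=body,
    )
```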
Redshift Datasets
1. Parquet File Processing Fails After Column Data Type Changes
Error message: Various errors, including data type conversion failures or data being loaded into incorrect columns
Issue description: When uploading parquet files to a Redshift dataset after modifying column data types (such as changing from INTEGER to VARCHAR), the file processing may fail or data may be loaded incorrectly.
Affected Versions: 2.7, 3.0
Explanation: Parquet files store data in a column-based format that relies on the order of columns. When you modify a column's data type in Redshift, the internal structure of the table changes, which can cause misalignment between the columns in your parquet file and the columns in the dataset table. This is specific to parquet files - other file formats like CSV, TSV, and Excel files are not affected by this issue.
Workaround:
- Create a new dataset with the proper schema structure: Instead of modifying column data types in an existing Redshift dataset, create a new dataset from scratch with the correct column data types that match your parquet files. This ensures proper alignment between your file structure and the dataset schema.
- Contact Amorphic support for assistance: If you have complex data migration requirements or need help with transferring large amounts of historical data, reach out to Amorphic support for guidance on the best approach for your specific use case.
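Because parquet loading is positional, it can also help to inspect a file's schema before uploading and confirm that the column order and types match the dataset you created. A minimal sketch using pyarrow (the file name and expected schema are hypothetical):

```python
import pyarrow.parquet as pq

# Expected (name, type) pairs of the Redshift dataset, in column order.
expected = [("id", "int64"), ("customer_name", "string"), ("amount", "double")]

schema = pq.read_schema("orders.parquet")
actual = [(field.name, str(field.type)) for field in schema]

if actual != expected:
    print("Schema mismatch - fix the parquet file or recreate the dataset:")
    for exp, act in zip(expected, actual):
        print(f"  expected {exp}, found {act}")
```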
Dataset Schema Extraction
Error message: Schema is extracted successfully (displayed even when extraction fails)
Issue description: UI issues related to schema extraction from JSON files and schema definition workflows.
Affected Versions: 3.1
Explanation: There are two main issues with the dataset schema extraction process:
- JSON File Upload with Sample File Option: When using the sample file upload option with JSON files, the extraction fails as expected since only newline-delimited JSON files are currently supported. However, the UI displays "Schema is extracted successfully" regardless of the actual API response, which is misleading.
- Schema Definition JSON File Workflow: When creating a schema from a schema definition JSON file, the DatasetSchema is not being added to the payload unless the user first verifies and saves the schema. This causes issues when users directly click "Publish Schema" after a successful extraction, as the schema data is missing from the request. The same workflow works correctly for sample file uploads.
Workaround:
- For JSON files: Use newline-delimited JSON format (a conversion sketch follows this list) or verify the actual extraction status despite the success message
- For schema definition workflow: Always verify and save the schema before clicking "Publish Schema" to ensure the DatasetSchema is properly included in the payload
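A minimal sketch for converting a regular JSON array file into newline-delimited JSON so the sample-file extraction can read it (the file names are hypothetical):

```python
import json

# Load a file containing a top-level JSON array of records.
with open("records.json", encoding="utf-8") as src:
    records = json.load(src)

# Write one JSON object per line (NDJSON), the format the extractor supports.
with open("records.ndjson", "w", encoding="utf-8") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")
```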
Data Quality Checks Schedules
Error message: Schedule execution cannot be stopped in RUNNING state.
Issue description: When a schedule created for a data quality check is running, users cannot stop the schedule.
Explanation: When stopping an execution, the system performs validation checks based on the JobType. If the JobType is neither "data-pipelines" nor "glue", the system falls back to the generic error message "Schedule execution cannot be stopped in <state> state." Additionally, the code was missing functionality to properly stop DQ (Data Quality) executions for V2 DQ checks.
Affected Versions: 2.5, 2.6, 2.7, 3.0
Workaround: Users can stop the data quality check execution by navigating to the Data Quality Check Execution page and stopping it from there.
Dataset Metadata Updates
Error message: System error occurs when trying to update dataset
Issue Description: Users cannot update a dataset if its associated datasource has been deleted from the system.
Explanation: When users delete a datasource that is still being used by one or more datasets, the system allows the deletion to proceed but doesn't remove the connection between the dataset and the deleted datasource. This creates a broken link that causes problems later.
What happens:
- Users delete a datasource that is connected to datasets
- The datasource is removed from the system, but datasets still reference it
- When users try to update any of these datasets later, the operation fails because the system cannot find the datasource information it needs
Affected Versions: 3.0, 3.1
Workaround:
- Prevention: Before deleting a datasource, make sure no datasets are using it. Check datasets to see which ones reference the datasource users want to delete
- If already affected: Contact Amorphic support to help resolve datasets that have broken datasource connections
- Best practice: Only delete datasources that are no longer needed by any datasets