Parquet file processing fails in Redshift data loads when column data types are modified
Parquet file uploads fail when you change column data types in Redshift datasets.
Affected Versions: 2.7, 3.0, 3.0.3, 3.0.6
Fix Version: 3.1
Root cause(s)
When you change a column's data type in a Redshift dataset (for example, from INTEGER to VARCHAR), Redshift restructures the table internally by creating a new column, copying the data into it, and dropping the old column. This moves the modified column to the end of the table, changing the dataset's column order.
Parquet loads map values to columns by position, so after the data type change the file's column order no longer matches the dataset's restructured schema. The load then either fails with a conversion error or writes values into the wrong columns.
Other file formats like CSV, Excel, and TSV use column names instead of position, so they work correctly even after data type changes.
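The difference between the two mapping strategies can be sketched in Python. This is an illustration of the failure mode, not the Amorphic implementation; the schema and values are made up:

```python
# Original dataset schema: column order as created.
original_schema = ["id", "age", "name"]

# After changing age's type (e.g. INTEGER -> VARCHAR), Redshift
# drops and re-adds the column, pushing it to the end of the table.
restructured_schema = ["id", "name", "age"]

# A Parquet row written against the original column order.
parquet_row = [101, 34, "alice"]

def load_by_position(row, target_schema):
    """Parquet-style mapping: the i-th value lands in the i-th column."""
    return dict(zip(target_schema, row))

def load_by_name(row, source_schema, target_schema):
    """CSV/Excel/TSV-style mapping: values follow column names."""
    named = dict(zip(source_schema, row))
    return {col: named[col] for col in target_schema}

# Position-based load against the restructured table misplaces data:
print(load_by_position(parquet_row, restructured_schema))
# -> {'id': 101, 'name': 34, 'age': 'alice'}  (age/name swapped)

# Name-based load stays correct regardless of column order:
print(load_by_name(parquet_row, original_schema, restructured_schema))
# -> {'id': 101, 'name': 'alice', 'age': 34}
```

The positional load silently swaps the `name` and `age` values, which is exactly why the failure can surface either as a type-conversion error or as data landing in the wrong columns.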
Impact
- Parquet file uploads fail with data type conversion errors after column data type modifications
- Data may be loaded into incorrect columns without clear error messages
- Users are unable to upload Parquet files to datasets where column data types have been modified
- Inconsistent behavior between Parquet files and other supported file formats
Mitigation
The fix is available in Amorphic version 3.1. Upgrade to version 3.1 or later to resolve this issue.
Timeline
- 2024-07-24: Bug reported/identified (CLOUD-5965)
- 2024-07-24: Bug triaged and documented
- 2024-07-28: Root cause analysis, fix development and testing completed
- 2024-07-29: Solution merged and Version 3.1 released with the fix