Files get re-ingested during S3 ingestion to append-type datasets
During consecutive runs of S3 ingestion into append-type datasets, files that already exist in the dataset are erroneously re-ingested.
Affected Versions: 2.3, 2.2, 2.1, 2.0
Fix Version: 2.4
Root cause(s)
The system compared the ETags of S3 files to determine whether a file had already been ingested. However, for larger files the ETag is not the MD5 hash of the file content, so the same file could yield different ETags. The comparison then failed, and the file was ingested again as a duplicate.
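To illustrate why an ETag comparison can fail, the following minimal Python sketch contrasts a plain MD5 digest with a multipart-style ETag. It assumes the commonly observed behavior for S3 multipart uploads (without extra encryption or checksum settings), where the ETag is the MD5 of the concatenated per-part MD5 digests followed by the part count. The function names and part sizes are hypothetical and chosen only for illustration; this is not Amorphic's implementation.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """MD5 hex digest of the whole payload (ETag of a simple, single-part upload)."""
    return hashlib.md5(data).hexdigest()


def multipart_style_etag(data: bytes, part_size: int) -> str:
    """Assumed multipart scheme: MD5 over the per-part MD5 digests, plus '-<parts>'."""
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


# Tiny part sizes keep the example fast; real S3 parts are far larger.
payload = b"x" * (20 * 1024)

plain = single_part_etag(payload)
multi_a = multipart_style_etag(payload, part_size=8 * 1024)  # 3 parts
multi_b = multipart_style_etag(payload, part_size=5 * 1024)  # 4 parts

print(plain)
print(multi_a)
print(multi_b)
```

The same bytes produce three different values here, so an equality check on ETags alone would treat them as three different files.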
Impact
Previously ingested files are not recognized as such and are ingested again on subsequent runs. The resulting duplicate files can compromise data integrity and reduce overall system efficiency.
Mitigation
A fix is available in Amorphic v2.4. Upgrade to v2.4 or later to resolve this issue.
Timeline
gantt
title Timeline
dateFormat YYYY-MM-DD
tickInterval 5day
axisFormat %b-%d
todayMarker off
section Tracker
%% update the ticket number and date of bug report
CLOUD-3937: done, 2023-09-11, 0d
section Identification
Reported : crit, 2023-09-11, 1d
section Mitigation
%% Update number of days took for each step below
Bug fixed: milestone, 2023-10-05, 1d
section Delivery
%% update the date of each step below
testing complete: milestone, 2023-10-06, 1d
- 2023-09-11: Bug reported/identified (CLOUD-3937)
- 2023-09-11: Bug triaged
- 2023-10-05: Bug fixed
- 2023-10-06: Testing completed and fix is available