Data Profiling
Data Profiling is the process of analyzing an existing data source to gather statistics and generate summaries about the data. It helps identify anomalies, assess data quality, and gain insights into the dataset’s structure and characteristics.
Key insights provided by data profiling include the total number of rows and columns, minimum and maximum values for each column, data distribution to detect outliers, identification of missing or null values, and an overview of data types and their occurrences.
This functionality is available in the Profile section of the Dataset details page.
The image above showcases a data profile for a sales dataset, derived from a sample of grocery data.
Enable Data Profiling
Users can enable or disable Data Profiling for a dataset at any time.
This feature is available exclusively for structured datasets, including datasets whose target location is S3-Athena, Redshift, Lake Formation, or DynamoDB.
The following properties provide key insights into the dataset’s structure, quality, and metadata:
Property | Description |
---|---|
Files | The total number of files in the dataset. |
Dataset Size | The total size of the dataset. For datasets with S3-Athena as the target location, it represents the size of stored files in S3. For Redshift, it reflects the dataset size within the data warehouse. |
Rows | The total number of rows present in the dataset. |
Duplicate Rows | The count of non-unique rows within the dataset. |
Columns | The total number of columns in the dataset. |
Missing Values | The number of empty or null values across the dataset. |
Last Modified | The timestamp of the most recent user-initiated edit to the dataset. |
Last Profiled | The timestamp when the data profile was last generated. |
Data Type | The data type of each column, as inferred at the time of dataset registration. |
Min Value | The smallest value recorded for each column. |
Max Value | The largest value recorded for each column. |
Sample Rows | A randomly selected sample of 10 rows from the dataset for quick review. |
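To make the properties in the table concrete, the following is a minimal Python sketch that computes a few of them (Rows, Columns, Duplicate Rows, Missing Values, Data Type, Min Value, Max Value, and Sample Rows) for a small in-memory dataset. It is an illustration only, not the platform's implementation: the function name `profile_rows`, the grocery-style sample data, and the choices to treat `None` and empty strings as missing and to count every copy of a repeated row as a duplicate are all assumptions of this sketch.

```python
import random
from collections import Counter


def is_missing(value):
    # Assumption: None and empty strings both count as missing values.
    return value is None or value == ""


def profile_rows(rows, sample_size=10, seed=None):
    """Compute profile-style statistics for a list of dict rows (hypothetical helper)."""
    # Preserve first-seen column order across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))

    # Assumption: every row whose full content appears more than once is a duplicate.
    signatures = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(n for n in signatures.values() if n > 1)

    missing_values = 0
    column_stats = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if not is_missing(v)]
        missing_values += len(values) - len(present)
        column_stats[col] = {
            # Type is taken from the first non-missing value in the column.
            "data_type": type(present[0]).__name__ if present else None,
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }

    # Draw up to sample_size rows at random, without replacement.
    rng = random.Random(seed)
    sample_rows = rng.sample(rows, min(sample_size, len(rows)))

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "column_stats": column_stats,
        "sample_rows": sample_rows,
    }


# Hypothetical sample data for illustration.
rows = [
    {"item": "apple", "qty": 3, "price": 1.2},
    {"item": "bread", "qty": 1, "price": 2.5},
    {"item": "apple", "qty": 3, "price": 1.2},
    {"item": "milk", "qty": None, "price": 0.99},
]
profile = profile_rows(rows, seed=0)
print(profile["rows"], profile["columns"])
print(profile["duplicate_rows"], profile["missing_values"])
print(profile["column_stats"]["qty"])
```

The sketch assumes each column holds values of a single comparable type; mixed-type columns would need type-aware handling before `min` and `max` can be computed.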
Data Profiling Update Frequency and Concurrency
Data profiling jobs are scheduled to run daily at 12 AM UTC. If data profiling is enabled for a dataset and new data has been added to it within the last 24 hours, its profile is updated. If no new files were added, the profile is left unchanged, which avoids unnecessary resource usage.
Currently, all data profiling jobs run with a concurrency factor of 5, meaning up to five datasets are profiled at the same time. For example, if 100 datasets require profiling and each takes approximately 3 minutes, the total update time is about 100 ÷ 5 × 3 = 60 minutes, so all profiles are refreshed by about 1:00 AM UTC.
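The scheduling rule and the timing estimate described here can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the scheduler's actual logic: the function names `needs_refresh` and `estimated_refresh_minutes` are hypothetical, and the estimate assumes every job takes the same time and that jobs run in full batches of the concurrency factor.

```python
import math
from datetime import datetime, timedelta, timezone

CONCURRENCY = 5  # concurrency factor stated in the text


def needs_refresh(profiling_enabled, last_data_added, run_time):
    """Return True if the profile should be updated in this run (hypothetical helper)."""
    # A profile is refreshed only when profiling is enabled and
    # data arrived within the 24 hours before the scheduled run.
    window_start = run_time - timedelta(hours=24)
    return profiling_enabled and window_start <= last_data_added <= run_time


def estimated_refresh_minutes(dataset_count, minutes_per_dataset, concurrency=CONCURRENCY):
    """Estimate wall-clock minutes, assuming equal job durations and full batches."""
    batches = math.ceil(dataset_count / concurrency)
    return batches * minutes_per_dataset


# Example from the text: 100 datasets, about 3 minutes each, starting at 12 AM UTC.
run_time = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
total_minutes = estimated_refresh_minutes(100, 3)
finish_time = run_time + timedelta(minutes=total_minutes)
print(total_minutes)            # 60
print(finish_time.isoformat())  # 2024-01-02T01:00:00+00:00
```

In practice job durations vary with dataset size, so the real finish time may differ from this batch-based estimate.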
From version 2.2, encryption (in-flight and at-rest) has been enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.