Version: v3.0

Data Profiling

Data Profiling is the process of analyzing an existing data source to gather statistics and generate summaries about the data. It helps identify anomalies, assess data quality, and gain insights into the dataset’s structure and characteristics.

Key insights provided by data profiling include the total number of rows and columns, minimum and maximum values for each column, data distribution to detect outliers, identification of missing or null values, and an overview of data types and their occurrences.

This functionality is available in the Profile section of the Dataset details page.

Data Profiling

The image above showcases a data profile for a sales dataset, derived from a sample of grocery data.

Enable Data Profiling

Users can enable or disable Data Profiling for a dataset at any time. The feature is available only for structured datasets, including datasets whose target is S3-Athena, Redshift, Lake Formation, or DynamoDB.

The following properties provide key insights into the dataset’s structure, quality, and metadata:

Property	Description
Files	The total number of files in the dataset.
Dataset Size	The total size of the dataset. For datasets with S3-Athena as the target location, it represents the size of stored files in S3. For Redshift, it reflects the dataset size within the data warehouse.
Rows	The total number of rows present in the dataset.
Duplicate Rows	The count of non-unique rows within the dataset.
Columns	The total number of columns in the dataset.
Missing Values	The number of empty or null values across the dataset.
Last Modified	The timestamp of the most recent user-initiated edit to the dataset.
Last Profiled	The timestamp when the data profile was last generated.
Data Type	The data type inferred for each column at the time of dataset registration.
Min Value	The smallest value recorded for each column.
Max Value	The largest value recorded for each column.
Sample Rows	A randomly selected sample of 10 rows from the dataset for quick review.
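
To make a few of these properties concrete, the following minimal Python sketch computes Rows, Columns, Duplicate Rows, Missing Values, Min Value, and Max Value for a small in-memory table. It is illustrative only and is not the platform's implementation. The `profile_rows` function and the sample data are hypothetical, and two points are assumptions: "Duplicate Rows" is read as every occurrence of a row after its first, and both `None` and empty strings are treated as missing.

```python
from collections import Counter


def profile_rows(columns, rows):
    """Compute a few profile statistics for a list of row tuples.

    Assumption: None and empty strings both count as missing values.
    """
    def is_missing(value):
        return value is None or value == ""

    # Assumed definition: every occurrence after the first is a duplicate.
    row_counts = Counter(rows)
    duplicate_rows = sum(count - 1 for count in row_counts.values())

    # Count missing cells across the whole table.
    missing_values = sum(is_missing(value) for row in rows for value in row)

    # Min and max per column, ignoring missing cells.
    min_values, max_values = {}, {}
    for index, name in enumerate(columns):
        present = [row[index] for row in rows if not is_missing(row[index])]
        min_values[name] = min(present) if present else None
        max_values[name] = max(present) if present else None

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "min_value": min_values,
        "max_value": max_values,
    }


# Hypothetical grocery sales sample.
columns = ["item", "quantity", "price"]
rows = [
    ("apples", 3, 1.20),
    ("bread", 1, 2.50),
    ("apples", 3, 1.20),   # exact repeat of the first row
    ("milk", None, 0.99),  # missing quantity
]

profile = profile_rows(columns, rows)
print(profile)
```

For this sample the sketch reports 4 rows, 3 columns, 1 duplicate row, and 1 missing value. A different reading of "non-unique rows" (for example, counting every row that has at least one twin) would give a different duplicate count.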

Data Profiling Update Frequency and Concurrency

Data profiling jobs are scheduled to run daily at 12:00 AM UTC. If data profiling is enabled for a dataset and new data has been added to it within the last 24 hours, its profile is updated. If no new files have been added, the profile is left unchanged, which avoids unnecessary resource usage.

Currently, all data profiling jobs run with a concurrency factor of 5, meaning up to five datasets are profiled at the same time. For example, if 100 datasets require profiling and each takes approximately 3 minutes, the jobs run in 20 batches of 5 (100 / 5 = 20), so the total update time is about 20 × 3 = 60 minutes and all profiles are refreshed by about 1:00 AM UTC.
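
The timing arithmetic in this example can be sketched in Python as follows. The `estimate_total_minutes` function is hypothetical and is not part of the product. It assumes every dataset takes the same time to profile and that datasets are processed in batches of up to the concurrency factor, so actual run times may differ.

```python
import math


def estimate_total_minutes(dataset_count, minutes_per_dataset, concurrency=5):
    """Estimate wall-clock minutes for profiling, assuming uniform job
    duration and batch-style scheduling (both are simplifying assumptions)."""
    # Number of batches needed; a partial final batch still takes a full slot.
    batches = math.ceil(dataset_count / concurrency)
    return batches * minutes_per_dataset


# The example from the text: 100 datasets, about 3 minutes each, concurrency 5.
total = estimate_total_minutes(100, 3)
print(total)  # 60
```

Under the same assumptions, 101 datasets would need 21 batches and about 63 minutes, because the last dataset occupies a batch of its own.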

info

From version 2.2, encryption (in flight and at rest) is enabled for all jobs and the catalog. All existing jobs, both user-created and system-created, were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.
