Version: v3.0

Data Profiling

Data Profiling is the process of analyzing an existing data source to gather statistics and generate summaries about the data. It helps identify anomalies, assess data quality, and gain insights into the dataset’s structure and characteristics.

Key insights provided by data profiling include the total number of rows and columns, minimum and maximum values for each column, data distribution to detect outliers, identification of missing or null values, and an overview of data types and their occurrences.

This functionality is available in the Profile section of the Dataset details page.

[Image: Data Profiling]

The image above showcases a data profile for a sales dataset, derived from a sample of grocery data.

Enable Data Profiling

Users can enable or disable Data Profiling at any time. This feature is available exclusively for structured datasets, including datasets whose target location is S3-Athena, Redshift, Lake Formation, or DynamoDB.

The following properties provide key insights into the dataset’s structure, quality, and metadata:

| Property | Description |
| --- | --- |
| Files | The total number of files in the dataset. |
| Dataset Size | The total size of the dataset. For datasets with S3-Athena as the target location, it represents the size of stored files in S3. For Redshift, it reflects the dataset size within the data warehouse. |
| Rows | The total number of rows present in the dataset. |
| Duplicate Rows | The count of non-unique rows within the dataset. |
| Columns | The total number of columns in the dataset. |
| Missing Values | The number of empty or null values across the dataset. |
| Last Modified | The timestamp of the most recent user-initiated edit to the dataset. |
| Last Profiled | The timestamp when the data profile was last generated. |
| Data Type | The inferred data types of each column at the time of dataset registration. |
| Min Value | The smallest value recorded for each column. |
| Max Value | The largest value recorded for each column. |
| Sample Rows | A randomly selected sample of 10 rows from the dataset for quick review. |
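
To make a few of these properties concrete, the following is a minimal sketch of how Rows, Columns, Duplicate Rows, Missing Values, and per-column Min/Max values could be computed for a small in-memory sample. It is an illustration only, not the product's implementation: the function name `profile_rows`, the sample data, and the rules that `None` and empty strings count as missing and that every row whose content occurs more than once counts as a duplicate are all assumptions.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Compute a small profile for rows given as dicts keyed by column name."""

    def is_missing(value):
        # Assumption: None and empty strings count as missing values.
        return value is None or value == ""

    # Count identical rows. Assumption: a row is "non-unique" when its
    # content occurs more than once, and every such occurrence is counted.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(n for n in counts.values() if n > 1)

    # Missing values are counted across every cell of the dataset.
    missing_values = sum(
        is_missing(row.get(col)) for row in rows for col in columns
    )

    # Min and max per column, ignoring missing cells. This assumes the
    # values within a column are mutually comparable.
    min_max = {}
    for col in columns:
        present = [row.get(col) for row in rows if not is_missing(row.get(col))]
        min_max[col] = (min(present), max(present)) if present else (None, None)

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "min_max": min_max,
    }


# Hypothetical grocery sample, used only for illustration.
sample = [
    {"item": "apples", "price": 1.20},
    {"item": "bread", "price": 2.50},
    {"item": "apples", "price": 1.20},
    {"item": "milk", "price": None},
]
profile = profile_rows(sample, ["item", "price"])
```

For this sample, the sketch reports 4 rows, 2 columns, 2 duplicate rows (the two identical "apples" rows), and 1 missing value.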

Data Profiling Update Frequency and Concurrency

Data profiling jobs are scheduled to run daily at 12:00 AM UTC. If data profiling is enabled and new data has been added to a dataset within the last 24 hours, its profile is updated. If no new files have been added, the profile is left unchanged, which avoids unnecessary resource usage.

Currently, all data profiling jobs operate with a concurrency factor of 5, meaning up to five datasets are profiled simultaneously. For example, if 100 datasets require profiling and each takes approximately 3 minutes, processing them five at a time takes about 100 / 5 × 3 = 60 minutes, so all profiles are refreshed by about 1:00 AM UTC.
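
The timing estimate above can be reproduced with a short calculation. This is a rough sketch that assumes every dataset takes the same time to profile and that datasets are processed in waves of at most five. The function name `estimate_profiling_window` and the calendar date are hypothetical, and the real scheduler may behave differently.

```python
import math
from datetime import datetime, timedelta, timezone


def estimate_profiling_window(num_datasets, minutes_per_dataset,
                              concurrency=5, start=None):
    """Return (total_minutes, finish_time) for a profiling run."""
    if start is None:
        # Hypothetical date; only the 12:00 AM UTC start time comes from the text.
        start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)

    # With equal durations, datasets complete in waves of `concurrency`.
    waves = math.ceil(num_datasets / concurrency)
    total_minutes = waves * minutes_per_dataset
    return total_minutes, start + timedelta(minutes=total_minutes)


total_minutes, finish = estimate_profiling_window(100, 3)
print(total_minutes, finish.strftime("%H:%M %Z"))
```

With 100 datasets at 3 minutes each, this gives 20 waves and 60 minutes in total, which matches the 1:00 AM UTC finish time in the example.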

info

From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs, both user-created and system-created, were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.