Data Quality Checks
Amorphic provides data quality checks to help detect errors in data before it is utilized by other systems or machine learning algorithms. Users can create rules for columns in structured datasets and run checks to identify rule violations.
If any rule is broken, the entire check fails. In Amorphic Datasets, users can create new checks, view details and executions, and edit existing checks.
How to Create a Data Quality Check?
To create a new data quality check:
- Click on Create Data Quality Check
- Fill in the following fields:
Property | Description |
---|---|
Auto-Constraint Suggestions Enabled | Enables or disables automatic constraint suggestions. Finding suitable constraints manually can be challenging for large, complex datasets that combine information from multiple sources; enabling this feature helps users discover constraints that fit their data (see the sketch after this table). |
Constraints | Defines the constraints to be applied to dataset columns. |
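The constraint names used throughout this page mirror those of the open-source Deequ library, so a PyDeequ sketch can illustrate what automatic constraint suggestion does conceptually. This is an illustration only, not Amorphic's internal implementation or configuration format; the Spark session setup, sample dataset, and column names below are assumptions made for the example.

```python
# Illustration only: Deequ-style automatic constraint suggestion (not the Amorphic API).
# Requires the SPARK_VERSION environment variable and the Deequ jar, per PyDeequ's setup docs.
import json

from pyspark.sql import SparkSession
import pydeequ
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

# A small stand-in for a structured dataset registered in Amorphic.
df = spark.createDataFrame(
    [("c-001", "jane@example.com", 120.50), ("c-002", "john@example.com", 35.00)],
    ["customer_id", "email", "purchase_amount"],
)

# Profile the data and propose candidate constraints for each column.
suggestions = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(DEFAULT())
    .run()
)
print(json.dumps(suggestions, indent=2))
```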
Edit Data Quality Check
Users can enable or disable Auto-Constraint Suggestions or modify, add, or remove constraints in a data quality check’s metadata using the "Edit Data Quality Check" button.
Execute a Data Quality Check
Data quality checks can be executed on demand or on a schedule. Once an execution completes, users receive an email and a push notification with the results.
Stop Data Quality Check execution
To stop an ongoing execution, use the "Stop Execution" option under Actions.
View Data Quality Check executions
Users can view the results of a particular execution. The report displays both the constraints that passed and those that failed.
To view auto constraint suggestions, click on View Auto Constraint Suggestions.
Constraint Definitions
Name of the constraint | Definition of the constraint |
---|---|
hasMax | Creates a constraint that asserts on the maximum value of a column. The column must contain a long, int, or float datatype. |
hasMin | Creates a constraint that asserts on the minimum value of a column. The column must contain a long, int, or float datatype. |
hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column. |
hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column. |
hasMean | Creates a constraint that asserts on the mean of the column. |
hasSum | Creates a constraint that asserts on the sum of the column. |
hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column. |
hasApproxCountDistinct | Creates a constraint that asserts on the approximate count distinct of the given column. |
isComplete | Creates a constraint that asserts on a column's completeness. |
isUnique | Creates a constraint that asserts on a column's uniqueness. |
containsCreditCardNumber | Checks a column's compliance against a credit card number pattern. |
containsEmail | Checks a column's compliance against an e-mail pattern. |
containsURL | Checks a column's compliance against a URL pattern. |
isPositive | Creates a constraint which asserts that a column contains only positive values (greater than 0). |
containsSocialSecurityNumber | Checks a column's compliance against the US Social Security number pattern. |
isNonNegative | Creates a constraint which asserts that a column contains no negative values. |
hasCompleteness | Creates a constraint that asserts on a column's completeness. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
hasEntropy | Creates a constraint that asserts on a column entropy. Entropy is a measure of the level of information contained in a message. |
hasMutualInformation | Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another. |
hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns. |
isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB. |
isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB. |
isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB. |
isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB. |
hasUniqueness | Creates a constraint that asserts on uniqueness in a single or combined set of key columns. Uniqueness is the fraction of a column's (or column combination's) values that occur exactly once. |
hasDistinctness | Creates a constraint on the distinctness of a single or combined set of key columns. Distinctness is the fraction of distinct values in a column (or combination of columns). |
hasUniqueValueRatio | Creates a constraint on the unique value ratio in a single or combined set of key columns. |
haveCompleteness | Creates a constraint that asserts on completeness in a combined set of columns. |
haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns. |
areComplete | Creates a constraint that asserts completion in a combined set of columns. |
areAnyComplete | Creates a constraint that asserts any completion in the combined set of columns. |
isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values. |
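Because these constraint names match the open-source Deequ library, the sketch below uses PyDeequ to show how a handful of them evaluate against a dataset and how the per-constraint results are reported. It is a minimal sketch under that assumption, not Amorphic's configuration format; the DataFrame and column names (order_id, country, order_total, discount) are hypothetical, and it reuses the `spark` session from the earlier example.

```python
# Illustration only: evaluating a few Deequ-style constraints from the table above.
# Reuses the SparkSession `spark` from the earlier sketch; columns are hypothetical.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

orders = spark.createDataFrame(
    [("o-1", "US", 100.0, 10.0), ("o-2", "CA", 80.0, 5.0), ("o-3", "MX", 50.0, 60.0)],
    ["order_id", "country", "order_total", "discount"],
)

check = (
    Check(spark, CheckLevel.Error, "Constraint examples")
    .hasMin("order_total", lambda v: v >= 0)           # hasMin
    .hasMax("discount", lambda v: v <= 100)            # hasMax
    .isContainedIn("country", ["US", "CA", "MX"])      # isContainedIn
    .isLessThanOrEqualTo("discount", "order_total")    # isLessThanOrEqualTo (fails for o-3)
    .hasUniqueness(["order_id"], lambda f: f == 1.0)   # hasUniqueness
)

result = VerificationSuite(spark).onData(orders).addCheck(check).run()

# One row per constraint, with constraint_status (Success/Failure) and a message for failures.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```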
Data Quality check use case
A retail company has a large database of customer information, including name, address, email, and purchase history. Before running any data analysis or machine learning algorithms on this data, the company wants to ensure the quality of the data by checking for errors and inconsistencies.
To do this, the company sets up a data quality check in Amorphic, with constraints such as:
- The email column must contain a valid email address format.
- The address column must contain a valid postal code.
- The purchase history column must contain only positive numbers.

The company runs the data quality check, which reads the entire database and evaluates these constraints for each record. If any constraint fails, the data quality check execution is considered a failure, and the report details which constraint failed and for which particular records.
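A minimal sketch of what those constraints could look like, again expressed with PyDeequ as a stand-in for Amorphic's Deequ-style constraints. The column names (email, address, purchase_amount) are hypothetical, and the "valid postal code" rule is simplified to a completeness check on the address column because no postal-code pattern constraint appears in the table above.

```python
# Illustration only: the retail company's checks expressed as Deequ-style constraints.
# Reuses the SparkSession `spark` from the earlier sketches; column names are hypothetical.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

customers = spark.createDataFrame(
    [
        ("c-001", "jane@example.com", "221B Baker Street, 10001", 120.50),
        ("c-002", "not-an-email", None, -5.00),  # fails the e-mail, completeness, and positivity checks
    ],
    ["customer_id", "email", "address", "purchase_amount"],
)

check = (
    Check(spark, CheckLevel.Error, "Retail customer data quality")
    .containsEmail("email")        # email column must match an e-mail pattern
    .isComplete("address")         # simplified stand-in for "must contain a valid postal code"
    .isPositive("purchase_amount") # purchase history must contain only positive numbers
)

result = VerificationSuite(spark).onData(customers).addCheck(check).run()

# Failed constraints appear with constraint_status = "Failure" and an explanatory message.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```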
The company can then use this information to correct the errors in the database and ensure that the data is of high quality before running any further data analysis or machine learning algorithms on it.