Version: v3.0

Data Quality Checks

Amorphic provides data quality checks to help detect errors in data before it is utilized by other systems or machine learning algorithms. Users can create rules for columns in structured datasets and run checks to identify rule violations.

If any rule is broken, the entire check fails. In Amorphic Datasets, users can create new checks, view details and executions, and edit existing checks.

How to Create a Data Quality Check

To create a new data quality check:

  1. Click on Create Data Quality Check
  2. Fill in the following fields:

| Property | Description |
| --- | --- |
| Auto-Constraint Suggestions Enabled | Enables or disables automatic constraint suggestions. Choosing constraints manually can be challenging for large and complex datasets that contain information from multiple sources. Enabling this functionality helps users find suitable constraints for their data. |
| Constraints | Defines the constraints to be applied to dataset columns. |

Edit Data Quality Check

Users can enable or disable Auto-Constraint Suggestions or modify, add, or remove constraints in a data quality check’s metadata using the "Edit Data Quality Check" button.

Execute a Data Quality Check

Data quality checks can be executed either on-demand or scheduled. Once execution is complete, users receive an email and a push notification with the results.

Stop Data Quality Check execution

To stop an ongoing execution, use the "Stop Execution" option under Actions.

View Data Quality Check executions

Users can view the results of a particular execution. The report lists both the constraints that passed and the constraints that failed.

To view auto constraint suggestions, click on View Auto Constraint Suggestions.

Constraint Definitions

| Name of the constraint | Definition of the constraint |
| --- | --- |
| hasMax | Creates a constraint that asserts on the maximum value of a column. The column must contain a long, int, or float datatype. |
| hasMin | Creates a constraint that asserts on the minimum value of a column. The column must contain a long, int, or float datatype. |
| hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column. |
| hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column. |
| hasMean | Creates a constraint that asserts on the mean of the column. |
| hasSum | Creates a constraint that asserts on the sum of the column. |
| hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column. |
| hasApproxCountDistinct | Creates a constraint that asserts on the approximate count distinct of the given column. |
| isComplete | Creates a constraint that asserts on a column's completion. |
| isUnique | Creates a constraint that asserts on a column's uniqueness. |
| containsCreditCardNumber | Checks the compliance of a column against a credit card number pattern. |
| containsEmail | Checks the compliance of a column against an e-mail pattern. |
| containsURL | Checks the compliance of a column against a URL pattern. |
| isPositive | Creates a constraint which asserts that a column contains only values greater than 0. |
| containsSocialSecurityNumber | Checks the compliance of a column against the US Social Security number pattern. |
| isNonNegative | Creates a constraint which asserts that a column contains no negative values. |
| hasCompleteness | Creates a constraint that asserts on column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
| hasEntropy | Creates a constraint that asserts on a column's entropy. Entropy is a measure of the level of information contained in a message. |
| hasMutualInformation | Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another. |
| hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns. |
| isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB. |
| isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB. |
| isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB. |
| isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB. |
| hasUniqueness | Creates a constraint that asserts on uniqueness in a single or combined set of key columns. Uniqueness is the fraction of values in the column(s) that occur exactly once. |
| hasDistinctness | Creates a constraint on the distinctness in a single or combined set of key columns. Distinctness is the fraction of distinct values in the column(s). |
| hasUniqueValueRatio | Creates a constraint on the unique value ratio in a single or combined set of key columns. |
| haveCompleteness | Creates a constraint that asserts on column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
| haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns. |
| areComplete | Creates a constraint that asserts on completion in the combined set of columns. |
| areAnyComplete | Creates a constraint that asserts on any completion in the combined set of columns. |
| isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values. |
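
The ratio-based metrics are easy to confuse, so the following minimal Python sketch illustrates them on a small column. It is not Amorphic's implementation or API. The function names are hypothetical, the input is assumed to be a non-empty list, and the unique value ratio (which the definitions do not spell out) is assumed here to be the number of values occurring exactly once divided by the number of distinct values.

```python
from collections import Counter


def completeness(values):
    """Fraction of rows whose value is not None."""
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Values that occur exactly once, divided by the number of rows."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def distinctness(values):
    """Number of distinct values, divided by the number of rows."""
    return len(set(values)) / len(values)


def unique_value_ratio(values):
    """Assumed definition: values occurring once / number of distinct values."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


column = ["a", "a", "b", "c"]
print(uniqueness(column))          # 0.5  ("b" and "c" occur once, 4 rows)
print(distinctness(column))        # 0.75 (3 distinct values, 4 rows)
print(unique_value_ratio(column))  # 2 of the 3 distinct values occur once
print(completeness(["a", None, "b", "c"]))  # 0.75
```

A constraint such as hasUniqueness or hasDistinctness would then assert on the resulting fraction, for example requiring it to be at least some threshold.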

Data Quality Check Use Case

A retail company has a large database of customer information, including name, address, email, and purchase history. Before running any data analysis or machine learning algorithms on this data, the company wants to ensure the quality of the data by checking for errors and inconsistencies.

To do this, the company sets up a data quality check in Amorphic, with constraints such as:

  - The email column must contain a valid email address format.
  - The address column must contain a valid postal code.
  - The purchase history column must contain only positive numbers.

The company runs the data quality check, which reads the entire database and performs these checks for each record. If any of the constraints fail, the data quality check execution is considered a failure, and the report provides details on which constraint failed and for which record.

The company can then use this information to correct the errors in the database and ensure that the data is of high quality before running any further data analysis or machine learning algorithms on it.
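
The logic of this use case can be sketched in plain Python. This only illustrates the all-or-nothing behavior and the per-record failure report; it does not use Amorphic's API. The field names, the simplified email pattern, the US ZIP code pattern, and the `run_check` helper are all assumptions made for the example.

```python
import re

# Simplified, hypothetical patterns; production validators are stricter.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# Each constraint maps a readable name to a rule evaluated per record.
CONSTRAINTS = {
    "email has a valid format": lambda r: bool(EMAIL.match(r["email"])),
    "address contains a postal code": lambda r: bool(US_ZIP.search(r["address"])),
    "purchase amount is positive": lambda r: r["purchase_amount"] > 0,
}


def run_check(records):
    """Return (passed, failures); failures holds (record index, constraint name)."""
    failures = [
        (index, name)
        for index, record in enumerate(records)
        for name, rule in CONSTRAINTS.items()
        if not rule(record)
    ]
    # The whole check passes only if no constraint failed on any record.
    return (not failures, failures)


records = [
    {"email": "ana@example.com", "address": "1 Main St, Springfield 12345",
     "purchase_amount": 19.99},
    {"email": "not-an-email", "address": "2 Oak Ave, Shelbyville 54321",
     "purchase_amount": -5.0},
]

passed, failures = run_check(records)
print(passed)  # False
for index, name in failures:
    print(index, name)
```

Here the second record violates two constraints, so the whole check is reported as failed even though the first record is clean.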