Data Quality Checks
Amorphic provides data quality checks to help detect errors in data before it is utilized by other systems or machine learning algorithms. Users can create rules for columns in structured datasets and run checks to identify rule violations.
If any rule is broken, the entire check fails. In Amorphic Datasets, users can create new checks, view details and executions, and edit existing checks.
How to Create a Data Quality Check?
To create a new data quality check:
- Click on Create Data Quality Check
- Fill in the following fields:
Property | Description |
---|---|
Auto-Constraint Suggestions Enabled | Enables or disables automatic constraint suggestions. Finding suitable constraints manually can be challenging for large, complex datasets that combine information from multiple sources; enabling this feature helps users discover constraints that fit their data (see the sketch after this table). |
Constraints | Defines the constraints to be applied to dataset columns. |
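The constraint names used throughout this page mirror those of the open-source Deequ library, so a PyDeequ sketch can illustrate what automatic constraint suggestion does conceptually. This is an illustration only, not Amorphic's internal implementation or configuration format; the Spark session setup, sample dataset, and column names below are assumptions made for the example.

```python
# Illustration only: Deequ-style automatic constraint suggestion (not the Amorphic API).
# Requires the SPARK_VERSION environment variable and the Deequ jar, per PyDeequ's setup docs.
import json

from pyspark.sql import SparkSession
import pydeequ
from pydeequ.suggestions import ConstraintSuggestionRunner, DEFAULT

spark = (
    SparkSession.builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

# A small stand-in for a structured dataset registered in Amorphic.
df = spark.createDataFrame(
    [("c-001", "jane@example.com", 120.50), ("c-002", "john@example.com", 35.00)],
    ["customer_id", "email", "purchase_amount"],
)

# Profile the data and propose candidate constraints for each column.
suggestions = (
    ConstraintSuggestionRunner(spark)
    .onData(df)
    .addConstraintRule(DEFAULT())
    .run()
)
print(json.dumps(suggestions, indent=2))
```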
Edit Data Quality Check
Users can enable or disable Auto-Constraint Suggestions or modify, add, or remove constraints in a data quality check’s metadata using the "Edit Data Quality Check" button.
Execute a Data Quality Check
Data quality checks can be executed on demand or on a schedule. Once an execution completes, users receive an email and a push notification with the results.
Stop Data Quality Check execution
To stop an ongoing execution, use the "Stop Execution" option under Actions.
View Data Quality Check executions
Users can view the results of a particular execution. The report displays both the constraints that passed and those that failed.
To view auto constraint suggestions, click on View Auto Constraint Suggestions.
Constraint Definitions
Name of the constraint | Definition of the constraint |
---|---|
hasMax | Creates a constraint that asserts on the maximum value of a column. The column must contain a long, int, or float datatype. |
hasMin | Creates a constraint that asserts on the minimum value of a column. The column must contain a long, int, or float datatype. |
hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column. |
hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column. |
hasMean | Creates a constraint that asserts on the mean of the column. |
hasSum | Creates a constraint that asserts on the sum of the column. |
hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column. |
hasApproxCountDistinct | Creates a constraint that asserts on the approximate count distinct of the given column. |
isComplete | Creates a constraint that asserts on a column's completeness. |
isUnique | Creates a constraint that asserts on a column's uniqueness. |
containsCreditCardNumber | Checks a column's compliance against a credit card number pattern. |
containsEmail | Checks a column's compliance against an e-mail pattern. |
containsURL | Checks a column's compliance against a URL pattern. |
isPositive | Creates a constraint which asserts that a column contains only positive values (greater than 0). |
containsSocialSecurityNumber | Checks a column's compliance against the US Social Security number pattern. |
isNonNegative | Creates a constraint which asserts that a column contains no negative values. |
hasCompleteness | Creates a constraint that asserts on a column's completeness. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
hasEntropy | Creates a constraint that asserts on a column entropy. Entropy is a measure of the level of information contained in a message. |
hasMutualInformation | Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another. |
hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns. |
isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB. |
isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB. |
isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB. |
isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB. |
hasUniqueness | Creates a constraint that asserts on uniqueness in a single or combined set of key columns. Uniqueness is the fraction of a column's (or column combination's) values that occur exactly once. |
hasDistinctness | Creates a constraint on the distinctness of a single or combined set of key columns. Distinctness is the fraction of distinct values in a column (or combination of columns). |
hasUniqueValueRatio | Creates a constraint on the unique value ratio in a single or combined set of key columns. |
haveCompleteness | Creates a constraint that asserts on completeness in a combined set of columns. |
haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns. |
areComplete | Creates a constraint that asserts completion in a combined set of columns. |
areAnyComplete | Creates a constraint that asserts any completion in the combined set of columns. |
isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values. |
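Because these constraint names match the open-source Deequ library, the sketch below uses PyDeequ to show how a handful of them evaluate against a dataset and how the per-constraint results are reported. It is a minimal sketch under that assumption, not Amorphic's configuration format; the DataFrame and column names (order_id, country, order_total, discount) are hypothetical, and it reuses the `spark` session from the earlier example.

```python
# Illustration only: evaluating a few Deequ-style constraints from the table above.
# Reuses the SparkSession `spark` from the earlier sketch; columns are hypothetical.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

orders = spark.createDataFrame(
    [("o-1", "US", 100.0, 10.0), ("o-2", "CA", 80.0, 5.0), ("o-3", "MX", 50.0, 60.0)],
    ["order_id", "country", "order_total", "discount"],
)

check = (
    Check(spark, CheckLevel.Error, "Constraint examples")
    .hasMin("order_total", lambda v: v >= 0)           # hasMin
    .hasMax("discount", lambda v: v <= 100)            # hasMax
    .isContainedIn("country", ["US", "CA", "MX"])      # isContainedIn
    .isLessThanOrEqualTo("discount", "order_total")    # isLessThanOrEqualTo (fails for o-3)
    .hasUniqueness(["order_id"], lambda f: f == 1.0)   # hasUniqueness
)

result = VerificationSuite(spark).onData(orders).addCheck(check).run()

# One row per constraint, with constraint_status (Success/Failure) and a message for failures.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```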
Data Quality check use case
A retail company has a large database of customer information, including name, address, email, and purchase history. Before running any data analysis or machine learning algorithms on this data, the company wants to ensure the quality of the data by checking for errors and inconsistencies.
To do this, the company sets up a data quality check in Amorphic, with constraints such as:
- The email column must contain a valid email address format.
- The address column must contain a valid postal code.
- The purchase history column must contain only positive numbers.

The company runs the data quality check, which reads the entire database and evaluates these constraints for each record. If any constraint fails, the data quality check execution is considered a failure, and the report details which constraint failed and for which particular records.
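A minimal sketch of what those constraints could look like, again expressed with PyDeequ as a stand-in for Amorphic's Deequ-style constraints. The column names (email, address, purchase_amount) are hypothetical, and the "valid postal code" rule is simplified to a completeness check on the address column because no postal-code pattern constraint appears in the table above.

```python
# Illustration only: the retail company's checks expressed as Deequ-style constraints.
# Reuses the SparkSession `spark` from the earlier sketches; column names are hypothetical.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

customers = spark.createDataFrame(
    [
        ("c-001", "jane@example.com", "221B Baker Street, 10001", 120.50),
        ("c-002", "not-an-email", None, -5.00),  # fails the e-mail, completeness, and positivity checks
    ],
    ["customer_id", "email", "address", "purchase_amount"],
)

check = (
    Check(spark, CheckLevel.Error, "Retail customer data quality")
    .containsEmail("email")        # email column must match an e-mail pattern
    .isComplete("address")         # simplified stand-in for "must contain a valid postal code"
    .isPositive("purchase_amount") # purchase history must contain only positive numbers
)

result = VerificationSuite(spark).onData(customers).addCheck(check).run()

# Failed constraints appear with constraint_status = "Failure" and an explanatory message.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```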
The company can then use this information to correct the errors in the database and ensure that the data is of high quality before running any further data analysis or machine learning algorithms on it.