Getting started

This page guides you through some basic use cases where Deirokay can make your life easier when you need a data quality tool. By the end, you will be familiar with most of the features that come with Deirokay.

Note

Here are the artifacts if you wish to follow this guide while practicing:

Facing data quality issues

In any data-driven company, there comes a time when data processing and validation become a bottleneck for data products. It is common for a company to have many data sources, such as databases, file systems, object storage systems, data generated by certain processes, and so on.

Let’s start simple, with a single CSV file, to show how Deirokay can help you.

Suppose it has the following content:

name	age	is_married
John	55	True
Mariah	44	yes
Carl		false
Josh	12	yes
Anna	32	no
Josh	12	true

Maybe some of the rules that you want your data to follow are:

there can’t be any duplicate name in the table;
a minimum of 3 is_married values should be True (or truthy);
there can’t be any null age value;

Even before we can apply these rules, we need to be sure that our data is being correctly parsed and converted to the right data types. Take the is_married column for example: although it semantically represents a boolean value, its values are not consistent, since we have True, yes and true values to identify the same piece of information.

Thus, we need to guarantee that all data is correct and free of inconsistencies, which would otherwise lead to errors in the next steps of your data pipeline.

Let’s consider another problem that arises when you work specifically with pandas for your data analysis. A known issue with integer values is that they are parsed as float when there is a null value in the same column (the null is converted to NaN), and the column keeps this float type when you save or export the data:

>>> import pandas
>>> pandas.read_csv('example.csv')
    name    age     is_married
0   John    55.0    True
1   Mariah  44.0    yes
2   Carl    NaN     false
3   Josh    12.0    yes
4   Anna    32.0    no
5   Josh    12.0    true
>>> pandas.read_csv('example.csv').dtypes
name           object
age           float64
is_married     object
dtype: object

Even though strings are correctly parsed, the integer column became float.

Depending on the dataset you are working with, you will have to remember to manually fix all sorts of issues caused by pandas’ built-in data type inference. If you decide to read the file forcing every column to string, you get rid of part of the problem, but you still need to convert your columns manually to the right data types.
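For reference, this is roughly what such a manual fix looks like in plain pandas, assuming the tab-separated sample shown above. The nullable Int64 dtype and the str.lower().map(...) idiom are standard pandas features; the inline CSV_TEXT string and the BOOL_MAP dictionary are illustrative names introduced only for this sketch.

```python
import io

import pandas as pd

# Inline copy of the sample file (tab-separated); Carl has no age.
CSV_TEXT = (
    "name\tage\tis_married\n"
    "John\t55\tTrue\n"
    "Mariah\t44\tyes\n"
    "Carl\t\tfalse\n"
    "Josh\t12\tyes\n"
    "Anna\t32\tno\n"
    "Josh\t12\ttrue\n"
)

# Hypothetical mapping from the spellings found in the sample to booleans.
BOOL_MAP = {"true": True, "yes": True, "false": False, "no": False}

# Ask for a nullable integer column and keep is_married as raw text.
df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    sep="\t",
    dtype={"age": "Int64", "is_married": str},
)

# Normalise the boolean spellings by hand.
df["is_married"] = df["is_married"].str.lower().map(BOOL_MAP).astype("boolean")

print(df.dtypes)
```

Every column that needs this treatment has to be listed and converted explicitly, which is the repetitive work the paragraph above describes.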

As a solution, Deirokay comes with the data_reader feature, which makes it easier for you to customize how your data must be parsed.

Parsing data with Deirokay Data Reader

For the Data Reader to work, you need to pass it a Deirokay Options Document as an argument. This document contains your instructions for how the data should be parsed. An example of such a document is shown below:

{
    "columns": {
        "name": {
            "dtype": "string"
        },
        "age": {
            "dtype": "integer"
        },
        "is_married": {
            "dtype": "boolean",
            "truthies": ["yes", "true", "True"],
            "falsies": ["no", "false", "False"]
        }
    }
}

To use this options document, import data_reader from Deirokay and pass it the file and the options. The result is a pandas DataFrame that doesn’t have the initial problems:

>>> from deirokay import data_reader
>>> from deirokay.backend import Backend
>>> data_reader('example.csv', options='options.json', backend=Backend.PANDAS)
     name   age  is_married
0    John    55        True
1  Mariah    44        True
2    Carl  <NA>       False
3    Josh    12        True
4    Anna    32       False
5    Josh    12        True

The options argument also accepts YAML files or dict objects directly. When passing Deirokay file options as a dict, you may optionally import the available data types from the deirokay.enums.DTypes enumeration class to prevent typos.
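As a sketch of the dict form, the options document from this section can be written as a plain Python dict (string dtype names are used here rather than the DTypes enumeration). The parse_value helper below is hypothetical: it only illustrates what the truthies, falsies and integer settings mean for each cell, and it is not Deirokay's parsing code.

```python
import csv
import io

# The same options document as above, written as a Python dict.
options = {
    "columns": {
        "name": {"dtype": "string"},
        "age": {"dtype": "integer"},
        "is_married": {
            "dtype": "boolean",
            "truthies": ["yes", "true", "True"],
            "falsies": ["no", "false", "False"],
        },
    }
}


def parse_value(raw, column_options):
    """Illustrative stand-in for per-column parsing; not Deirokay's code."""
    if raw == "":
        return None  # an empty cell is treated as null
    dtype = column_options["dtype"]
    if dtype == "integer":
        return int(raw)
    if dtype == "boolean":
        if raw in column_options["truthies"]:
            return True
        if raw in column_options["falsies"]:
            return False
        raise ValueError(f"unexpected boolean spelling: {raw!r}")
    return raw  # "string" (and any other dtype in this sketch) stays as text


# Two rows of the sample file, tab-separated.
SAMPLE = "name\tage\tis_married\nJohn\t55\tTrue\nCarl\t\tfalse\n"
rows = [
    {col: parse_value(val, options["columns"][col]) for col, val in row.items()}
    for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
]
print(rows)
```

With the real library, a dict like options would be passed as the options argument of data_reader in place of the 'options.json' path.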

Validating data with Deirokay Validator

The next step is to translate each rule we want into a Deirokay Validation Item:

there can’t be any duplicate name in the table;

{
  "scope": "name",
  "statements": [
    {
      "type": "unique"
    }
  ]
}

a minimum of 3 is_married values should be True (or truthy);

{
  "scope": "is_married",
  "statements": [
    {
      "type": "contain",
      "rule": "all",
      "values": [ true ],
      "parser": { "dtype": "boolean" },
      "min_occurrences": 3
    }
  ]
}

there can’t be any null age value;

{
  "scope": "age",
  "statements": [
    {
      "type": "not_null"
    }
  ]
}

Below you can find our final Validation Document:

{
    "name": "example",
    "description": "Getting started with Deirokay",
    "items": [
        {
          "scope": "name",
          "statements": [
            {
              "type": "unique"
            }
          ]
        },
        {
          "scope": "is_married",
          "statements": [
            {
              "type": "contain",
              "rule": "all",
              "values": [ true ],
              "parser": { "dtype": "boolean" },
              "min_occurrences": 3
            }
          ]
        },
        {
          "scope": "age",
          "statements": [
            {
              "type": "not_null"
            }
          ]
        }
    ]
}
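To make the intent of these three items concrete, here is a plain-Python sketch that checks the same rules against the parsed sample rows. It follows the rules as stated in prose earlier on this page and is only an assumption about what each statement measures; it is not how Deirokay evaluates statements internally.

```python
from collections import Counter

# Parsed sample rows: (name, age, is_married); None marks a null age.
rows = [
    ("John", 55, True),
    ("Mariah", 44, True),
    ("Carl", None, False),
    ("Josh", 12, True),
    ("Anna", 32, False),
    ("Josh", 12, True),
]

name_counts = Counter(name for name, _, _ in rows)
# Rows whose name appears exactly once in the table.
unique_rows = sum(1 for name, _, _ in rows if name_counts[name] == 1)
# Rows where is_married is True.
true_count = sum(1 for _, _, married in rows if married is True)
# Rows where age is null.
null_ages = sum(1 for _, age, _ in rows if age is None)

checks = {
    "name is unique": unique_rows == len(rows),
    "is_married has at least 3 True": true_count >= 3,
    "age has no nulls": null_ages == 0,
}
for rule, passed in checks.items():
    print(f"{rule}: {'pass' if passed else 'fail'}")
```

The two Josh rows break the uniqueness rule and Carl's missing age breaks the null rule, while four truthy is_married values satisfy the minimum of three.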

Finally, to test your dataset against the validation document:

from deirokay import data_reader, validate
from deirokay.backend import Backend

df = data_reader('example.csv',
                 options='options.json',
                 backend=Backend.PANDAS)
validation_result_document = validate(df,
                                      against='assertions.json',
                                      raise_exception=False)

The resulting validation document presents the report for each statement, as well as its final result: pass or fail. You will probably want to save your validation result document by passing a path to a folder (local or in S3) as the save_to argument to validate. By default, the validation result document is saved in the same file format as the original validation document (you may specify another format, either json or yaml, in the save_format argument).

Here is the resulting document in JSON format:

{
  "name": "example",
  "description": "Getting started with Deirokay",
  "items": [
    {
      "scope": "name",
      "statements": [
        {
          "type": "unique",
          "report": {
            "detail": {
              "unique_rows": 4,
              "unique_rows_%": 66.66666666666667
            },
            "result": "fail"
          }
        }
      ]
    },
    {
      "scope": "is_married",
      "statements": [
        {
          "type": "contain",
          "rule": "all",
          "values": [
            true
          ],
          "parser": {
            "dtype": "boolean"
          },
          "min_occurrences": 3,
          "report": {
            "detail": {
              "values": [
                {
                  "value": true,
                  "count": 4,
                  "perc": 66.66666666666667
                },
                {
                  "value": false,
                  "count": 2,
                  "perc": 33.333333333333336
                }
              ]
            },
            "result": "pass"
          }
        }
      ]
    },
    {
      "scope": "age",
      "statements": [
        {
          "type": "not_null",
          "report": {
            "detail": {
              "null_rows": 1,
              "null_rows_%": 16.666666666666668,
              "not_null_rows": 5,
              "not_null_rows_%": 83.33333333333333
            },
            "result": "fail"
          }
        }
      ]
    }
  ]
}
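Since the result document is ordinary JSON, it can be inspected with the standard library alone. The sketch below uses a trimmed copy of the document above (report details omitted) and a hypothetical failed_statements helper to list the statements whose result is "fail"; it assumes the items/statements/report layout shown in this example.

```python
import json

# Trimmed copy of the result document above (report details omitted).
RESULT_JSON = """
{
  "name": "example",
  "items": [
    {"scope": "name",
     "statements": [{"type": "unique", "report": {"result": "fail"}}]},
    {"scope": "is_married",
     "statements": [{"type": "contain", "report": {"result": "pass"}}]},
    {"scope": "age",
     "statements": [{"type": "not_null", "report": {"result": "fail"}}]}
  ]
}
"""


def failed_statements(document):
    """Return (scope, statement type) pairs whose report result is 'fail'."""
    return [
        (item["scope"], statement["type"])
        for item in document["items"]
        for statement in item["statements"]
        if statement["report"]["result"] == "fail"
    ]


failures = failed_statements(json.loads(RESULT_JSON))
print(failures)
```

A helper like this could feed an alert or a pipeline gate when validate is called with raise_exception=False.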