Jinja Templating and Statements based on Past Validation Data

Some validation statements may present dynamic behaviour, maybe folowing a natural uptrend or downtrend movement of your data. Suppose you expect the number of rows of your data file to possibly fluctuate +/- 3% around a 7-day moving average. Deirokay allows you to refer to past validation data by means of a special function called series.

To use it, replace a static value for your statement parameter by a templated argument, such as the following:

name: VENDAS
items:
- scope:
  - WERKS01
  - PROD_VENDA
  #  When the scope has more than one column or has special characters,
  #  you should provide an `alias` string to refer to this item.
  alias: werks_prod
  statements:
  - type: row_count
    min: >
      '{{ 0.97 * (series("VENDAS", 7).werks_prod.row_count.rows.mean()
      |default(19, true)) | float }}'
    max: >
      '{{ 1.03 * (series("VENDAS", 7).werks_prod.row_count.rows.mean()
      |default(21, true)) | float }}'

The YAML validation document above presents some new features. A templated Jinja argument is identified by a pair of double curly braces ({{ }}) surrounding its content. Deirokay has a special method named series that you can use to retrieve past data as a pandas.Series object. When declared, series receives a validation document name and a number of logs to look behind. Next, you should provide a path to a statement report value, following the sequence: scope (possibly aliased) => statement type name => statement result metric name. This returns a pandas.Series object you can take any calculation from (mean, min, max, etc.).

To illustrate this path, take the templated argument from the YAML document above as an example:

series("VENDAS", 7).werks_prod.row_count.rows.mean()
  • series(“VENDAS”, 7): Retrieve the 7 last validation logs for “VENDAS”;

  • werks_prod: Consider the werks_prod-aliased item (could be the scope value itself if it is a string for a single column name);

  • row_count: Use statistics from row_count statement;

  • rows: Use the rows metric reported by the selected statement (you should know the statistics reported by your statement).

  • mean(): Since the previous object is already a pandas Series, this call gets its mean value.

When your validation results have no history, this call returns None. In Jinja, you may provide a default value to replace a null variable by using |default(<default value>, true).

Finally, here is an example of validation result log. Notice that the templates are already rendered before being saved:

name: VENDAS
items:
- scope:
  - WERKS01
  - PROD_VENDA
  alias: werks_prod
  statements:
  - type: row_count
    min: 1490.0
    max: 1510.0
    report:
      detail:
        rows: 1500
      result: pass