deirokay.statements.builtin.row_count.RowCount

class deirokay.statements.builtin.row_count.RowCount(*args, **kwargs)

Bases: BaseStatement

Check if the number of rows (or the number of distinct rows) in a scope is between a minimum and maximum value.

The available options are:

  • min: The minimum number of rows. If None, no minimum is enforced. Default: None.

  • max: The maximum number of rows. If None, no maximum is enforced. Default: None.

  • distinct: If True, check the number of distinct rows instead of the total number of rows. Default: False.

If neither min nor max is provided, the statement acts only as a logger for its statistics.

When counting the total number of rows (distinct=False), this statement may be applied to any scope of your DataFrame, since every column would have the same number of rows. By convention, you should apply it to a scope containing all the columns of your DataFrame.

To count the number of (not-)null rows, you should use the not_null statement instead. To count the number of unique rows, use the unique statement.
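
The semantics of min, max and distinct described above can be sketched in plain Python. This is an illustration only, not Deirokay's implementation: rows are modelled as tuples instead of a DataFrame, the function names and the parameter names min_rows and max_rows are hypothetical, the "rows" report key is an assumption, and both bounds are assumed to be inclusive.

```python
from typing import Hashable, Optional, Sequence, Tuple

Row = Tuple[Hashable, ...]


def report(rows: Sequence[Row]) -> dict:
    """Collect the statistics needed by the check (key names are illustrative)."""
    return {
        "rows": len(rows),
        "distinct_rows": len(set(rows)),
    }


def result(
    stats: dict,
    min_rows: Optional[int] = None,
    max_rows: Optional[int] = None,
    distinct: bool = False,
) -> bool:
    """Return True when the selected count lies within the given bounds."""
    count = stats["distinct_rows"] if distinct else stats["rows"]
    # A bound set to None is not enforced (assumption: bounds are inclusive).
    if min_rows is not None and count < min_rows:
        return False
    if max_rows is not None and count > max_rows:
        return False
    return True


rows = [("north", 10), ("south", 7), ("north", 10)]
stats = report(rows)
print(stats)
print(result(stats, min_rows=3))                 # total row count is checked
print(result(stats, min_rows=3, distinct=True))  # distinct row count is checked
print(result(stats))                             # no bounds: statistics only
```

With a pandas DataFrame, the same two counts could be obtained with len(df) and len(df.drop_duplicates()).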

Examples

  • After some historical analysis of your data, you found that the number of rows is always greater than or equal to 42. You can declare the following validation item to represent this rule:

{
    "scope": ["foo", "bar"],
    "statements": [
        {
            "name": "row_count",
            "min": 42
        }
    ]
}
  • You have a table of daily transactions from all branches of a company. Not all branches have transactions for every day, and new branches may be added at any time. You want to ensure that the number of branches that appear in your data does not drop sharply (more than 5% below its 7-day historical average), which could be a sign of failure to receive transactions from some branches. You can declare the following validation item (in YAML format) to check this rule:

scope: branch_name
statements:
- name: row_count
  distinct: True
  min: >
    {{ 0.95 * (
      series("transactions", 7).branch_name.row_count.distinct_rows.mean()
      | default(19, true)
      | float
    ) }}

There are many things going on here:

  • In YAML, the “>” indicator folds a multi-line string into a single line. In JSON you would have to put everything on the same line.

  • The “{{}}” braces indicate that the enclosed expression is a Jinja2 template.

  • The series function is a built-in Deirokay method used to get the 7-day historical validations. Further down, the mean function is used to compute the 7-day average of the distinct_rows metric returned by the row_count statement in the branch_name scope.

  • The “|” operator inside the Jinja2 template is used to apply a function to the result of the previous expression, such that in the end we obtain a float.

  • The default filter supplies a fallback value (19 here). Its second argument, true, makes the fallback apply whenever the previous expression evaluates to false, which includes None.

  • The float function is used to convert the result of the previous expression, which is a numpy.float64, to a float, which can be properly serialized in JSON or YAML format when the validation logs are generated.
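
The computation performed by the template can be mirrored in plain Python, as a rough sketch. The function name min_distinct_rows and its parameters are hypothetical, and the sketch assumes that the historical mean is missing (None) when no validation logs exist yet; it does not call Deirokay or Jinja2.

```python
from statistics import mean
from typing import Optional, Sequence


def min_distinct_rows(
    history: Sequence[float],
    fallback: float = 19,
    factor: float = 0.95,
) -> float:
    """Mirror the template: factor * float(mean(history) or fallback)."""
    # Assumption: without any history there is no mean to use.
    average: Optional[float] = mean(history) if history else None
    # Equivalent of `| default(19, true)`: replace a missing or falsy value.
    baseline = average if average else fallback
    # Equivalent of `| float`, followed by the 0.95 scaling.
    return factor * float(baseline)


# Seven days of distinct branch counts, then the first run with no history.
print(min_distinct_rows([20, 22, 18, 20, 21, 19, 20]))
print(min_distinct_rows([]))
```

The fallback matters on the first runs: until logs have been saved, the threshold is derived from the fixed value 19 instead of an average.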

For this example to work, you need to pass the save_to parameter to your deirokay.validate call, so that the validation logs are saved and can later be used for historical analysis.

from deirokay import validate

validate(df, against=assertions, save_to='logs')

Methods

attach_backend

Generate a subclass that concretizes multibackend backend methods into their intended name.

get_backend

Get current active backend for this class.

profile

Given a template data table, generate a statement dict from it.

register_backend_method

Proxy for register_backend_method to register an existing function as a backend-specific method.

report

Receive a DataFrame containing only the columns in the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.

result

Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).

Attributes

expected_parameters

Parameters expected for this statement.

name

Statement name when referred to in Validation Documents (only valid for Deirokay built-in statements).

supported_backends

Backends supported by this resource.

__call__(df: DeirokayDataSource) -> dict

Run statement instance.

classmethod __init_subclass__() -> None

Validate subclassed statement.

classmethod __post_attach_backend__()

This classmethod can be optionally overwritten to serve as a callback function for when the attach_backend() method is called.

classmethod attach_backend(backend: Backend) -> Type[_AnyMultiBackendClass]

Generate a subclass that concretizes multibackend backend methods into their intended name. The methods marked with the given backend will compose the returned class.

Parameters
  • cls (type) – Class to be subclassed with the given backend.

  • backend (Backend) – Backend to be selected.

Returns

Subclass of the current class with methods filtered for the given backend.

Return type

Type[MultiBackendMixin]

expected_parameters: List[str] = ['min', 'max', 'distinct']

Parameters expected for this statement.

Type

List[str]

classmethod get_backend() -> Backend

Get current active backend for this class.

Returns

The current active backend.

Return type

Backend

Raises

InvalidBackend – Backend not set or not a valid execution class.

name: str = 'row_count'

Statement name when referred to in Validation Documents (only valid for Deirokay built-in statements).

Type

str

static profile(df: DeirokayDataSource) -> Dict[str, Any]

Given a template data table, generate a statement dict from it.

Parameters

df (DataFrame) – The DataFrame to be used as template.

Returns

Statement dict.

Return type

dict

Raises

NotImplementedError – If this method is not implemented by the subclass or the profile generation for this statement was intentionally skipped.

classmethod register_backend_method(alias_for: str, func: Callable[[...], Any], backend: Backend) -> None

Proxy for register_backend_method to register an existing function as a backend-specific method.

Parameters
  • alias_for (str) – The name of the method to be substituted with a backend-specific version.

  • func (AnyCallable) – Existing function to be registered as a method.

  • backend (Backend) – Backend for the method.

report(df: DeirokayDataSource) -> dict

Receive a DataFrame containing only the columns in the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.

Parameters

df (DataFrame) – The scoped DataFrame columns to be analysed in this report by this statement.

Returns

A dictionary of useful statistics about the target columns.

Return type

dict

result(report: dict) -> bool

Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).

Parameters

report (dict) – Report generated by the report method. It should contain all the statistics necessary to evaluate the statement's validity.

Returns

Whether or not this statement passed.

Return type

bool

supported_backends: List[Backend] = [<Backend.PANDAS: 'pandas'>, <Backend.DASK: 'dask'>]

Backends supported by this resource.

Type

List[Backend]