deirokay.statements.builtin.contain.Contain

class deirokay.statements.builtin.contain.Contain(*args, **kwargs)

Bases: BaseStatement

Checks if the given scope contains specific values. You may also check the number of their occurrences by specifying a minimum and maximum value of frequency.

The available parameters for this statement are:

  • rule (required): One of all, only or all_and_only.

  • values (required): A list of values to which the rule applies.

  • multicolumn: A boolean indicating whether the statement should treat each of the values as a tuple spanning multiple columns or as a single value. When set to False and evaluated over a scope containing more than one column, the rule is applied to the values of all the scoped columns pooled together, as if they formed a single column. Default: False.

  • parser: The parser (or a list of parsers) used to parse the values. Corresponds to the parser parameter of the treater function (see the deirokay.data_reader method). Either parser or parsers must be declared.

  • parsers: An alias for parser, recommended when multicolumn is set to True. Either parser or parsers must be declared.

  • min_occurrences: a global minimum number of occurrences for each of the values. Default: 1 for all and all_and_only rules, 0 for only.

  • max_occurrences: a global maximum number of occurrences for each of the values. Default: inf (unbounded).

  • occurrences_per_value: a list of dictionaries overriding the global boundaries. Each dictionary may have the following keys:

    • values (required): a list of values to which the occurrence bounds below apply.

    • min_occurrences: a minimum number of occurrences for these values. Default: global min_occurrences parameter.

    • max_occurrences: a maximum number of occurrences for these values. Default: global max_occurrences parameter.

    Global parameters apply to all values not present in any of the dictionaries in occurrences_per_value (but still present in the main values list).

  • report_limit: if set to a positive integer, limit the number of items generated in the statement report. Default: 32.
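The way per-value bounds override the global ones can be sketched in plain Python. This is only an illustration of the documented precedence, not Deirokay's implementation; the helper name resolve_bounds is hypothetical.

```python
import math


def resolve_bounds(values, rule, min_occurrences=None, max_occurrences=None,
                   occurrences_per_value=()):
    """Return {value: (min, max)} for every declared value.

    Hypothetical helper: it only mirrors the precedence described above.
    Values must be hashable (use tuples for multicolumn values).
    """
    # Documented defaults: min is 1 for 'all' and 'all_and_only', 0 for 'only';
    # max is unbounded.
    default_min = 0 if rule == "only" else 1
    global_min = default_min if min_occurrences is None else min_occurrences
    global_max = math.inf if max_occurrences is None else max_occurrences

    # Start with the global bounds for every declared value...
    bounds = {value: (global_min, global_max) for value in values}

    # ...then let each dictionary override the values it lists, falling back
    # to the global bound for any key it leaves out.
    for override in occurrences_per_value:
        for value in override["values"]:
            bounds[value] = (
                override.get("min_occurrences", global_min),
                override.get("max_occurrences", global_max),
            )
    return bounds


bounds = resolve_bounds(
    values=["master", "slave"],
    rule="only",
    occurrences_per_value=[
        {"values": ["master"], "min_occurrences": 1, "max_occurrences": 1}
    ],
)
print(bounds)  # {'master': (1, 1), 'slave': (0, inf)}
```

Here master gets its own bounds, while slave, which appears in no override dictionary, keeps the global ones.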

The all rule checks if all the values declared are present in the scope (possibly tolerating other values not declared in values). Use it when you want to be sure that your data contains at least all the values you declare, also setting min_occurrences and max_occurrences when necessary. You may also check for “zero” occurrences of a set of values by setting max_occurrences to 0.

The only rule ensures that the values are the only possible values in the scope (possibly not containing them at all). Use it when you want to enumerate the admitted values for the scope, as in an enumeration.

The all_and_only rule checks both that all the declared values are present in the scope and that only they are present (values not declared are not tolerated). Use it when you know all the possible values for the scope and you are sure that they will always be present.

The min_occurrences and max_occurrences parameters are applied to all the declared values, and only to these. This means you cannot (yet) specify boundaries for values you didn’t declare.

Null values are considered valid for the purpose of the statement evaluation and must be explicitly passed in values if you wish to allow them (or not).

You may also notice that, by tweaking the expected number of occurrences, you may end up with the very same behaviour regardless of the rule you choose. In this case, choose the rule that best matches your intent semantically, so that your final validation document is more readable and easier to understand.
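The three rules and the occurrence bounds can be sketched with a plain counter. The function below is a hypothetical illustration, not Deirokay's code. It assumes that each pair in the default tables listed among the class attributes reads as (bound for declared values, bound for undeclared values); that reading is an inference from the rule descriptions above.

```python
import math
from collections import Counter

# Assumed reading: (bound for declared values, bound for undeclared values).
DEFAULT_MIN = {"all": (1, 0), "all_and_only": (1, 0), "only": (0, 0)}
DEFAULT_MAX = {
    "all": (math.inf, math.inf),
    "all_and_only": (math.inf, 0),
    "only": (math.inf, 0),
}


def contain_passes(column, values, rule,
                   min_occurrences=None, max_occurrences=None):
    """Hypothetical checker mirroring the rule descriptions above."""
    counts = Counter(column)
    declared_min, other_min = DEFAULT_MIN[rule]
    declared_max, other_max = DEFAULT_MAX[rule]
    # Global bounds only affect declared values.
    if min_occurrences is not None:
        declared_min = min_occurrences
    if max_occurrences is not None:
        declared_max = max_occurrences

    declared = set(values)
    # Every declared value must fall within its bounds (a missing value counts 0).
    for value in declared:
        if not declared_min <= counts[value] <= declared_max:
            return False
    # Undeclared values are tolerated only if the rule's bounds allow them.
    for value, count in counts.items():
        if value not in declared and not other_min <= count <= other_max:
            return False
    return True


roles = ["master", "slave", "slave"]
print(contain_passes(roles, ["master"], "all", 1, 1))   # True
print(contain_passes(roles, ["master"], "only"))        # False: 'slave' undeclared

# Lowering min_occurrences to 0 makes all_and_only behave like only.
no_master = ["slave", "slave"]
print(contain_passes(no_master, ["master", "slave"], "all_and_only"))     # False
print(contain_passes(no_master, ["master", "slave"], "all_and_only", 0))  # True
print(contain_passes(no_master, ["master", "slave"], "only"))             # True

# Nulls count like any other value: declare them to allow them.
print(contain_passes(["master", None], ["master"], "only"))        # False
print(contain_passes(["master", None], ["master", None], "only"))  # True
```

The last two pairs of calls show the points made above: different rules can coincide once the bounds are adjusted, and a null value is rejected by only unless it is listed in values.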

Examples

  • You have a table of users containing a column handedness only admitting the values: right-handed, left-handed and ambidextrous. You know that some of these values may not appear in the data, but you don’t want other values to be present.

{
    "scope": "handedness",
    "statements": [
        {
            "type": "contain",
            "rule": "only",
            "values": ["right-handed", "left-handed", "ambidextrous"],
            "parser": {"dtype": "string"}
        }
    ]
}
  • You have a table of servers containing a column role which may contain the values master and slave. You want to be sure that there is always one and only one master server in the data.

{
    "scope": "role",
    "statements": [
        {
            "type": "contain",
            "rule": "all",
            "values": ["master"],
            "parser": {"dtype": "string"},
            "min_occurrences": 1,
            "max_occurrences": 1
        }
    ]
}

You may also extend the previous example by making some adjustments to ensure that there is no value other than master and slave in the data. Note that although the rule below is changed to only, the statement above is still covered by the occurrences_per_value parameter in the following validation item:

{
    "scope": "role",
    "statements": [
        {
            "type": "contain",
            "rule": "only",
            "values": ["master", "slave"],
            "parser": {"dtype": "string"},
            "occurrences_per_value": [
                {
                    "values": ["master"],
                    "min_occurrences": 1,
                    "max_occurrences": 1
                }
            ]
        }
    ]
}
  • You have a table of transactions containing details about transactions in all the branches of a company. You expect that there should always be at least one transaction per branch.

{
    "scope": ["branch_name"],
    "statements": [
        {
            "type": "contain",
            "rule": "all_and_only",
            "values": [
                "Albany", "Utica", "Scranton", "Akron",
                "Nashua", "Buffalo", "Rochester"
            ],
            "parser": {"dtype": "string"}
        }
    ]
}
  • You have a table for the logs of user accesses to a website which contains an IP column. You want to be sure that blacklisted IPs are not present in the data. The following validation item in YAML format checks for the absence of blacklisted IPs:

scope: IP
statements:
- type: contain
  rule: all
  max_occurrences: 0
  values: # blacklisted IPs
  - 3.48.48.135
  - 3.48.48.136
  parser: {dtype: string}
  • You want the pair ‘San Diego’ and ‘2022’ to appear at most twice in your dataset (for any reason).

scope: [city, year]
statements:
- type: contain
  rule: all
  min_occurrences: 0
  max_occurrences: 2
  multicolumn: true
  values:
  - ['San Diego', 2022]
  parsers:
  - {dtype: string}
  - {dtype: integer}
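
The difference between matching tuples and pooling single values over a multi-column scope can be sketched as follows. This is plain Python over a hypothetical list of rows, meant only to illustrate the multicolumn description given in the parameter list, not how Deirokay evaluates it.

```python
from collections import Counter

# Hypothetical data for the scope [city, year].
rows = [
    {"city": "San Diego", "year": 2022},
    {"city": "San Diego", "year": 2022},
    {"city": "San Diego", "year": 2021},
    {"city": "Austin", "year": 2022},
]
scope = ["city", "year"]

# multicolumn set to True: each row of the scope is one tuple-valued item.
tuple_counts = Counter(tuple(row[col] for col in scope) for row in rows)

# multicolumn set to False: values from every scoped column are pooled together.
pooled_counts = Counter(row[col] for row in rows for col in scope)

pair = ("San Diego", 2022)
pair_ok = 0 <= tuple_counts[pair] <= 2  # "at most twice"

print(tuple_counts[pair])          # 2
print(pair_ok)                     # True
print(pooled_counts["San Diego"])  # 3
print(pooled_counts[2022])         # 3
```

Counted as a tuple, the pair appears twice and satisfies the bound. Pooled, 'San Diego' and 2022 are separate values that each appear three times.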

Methods

attach_backend

Generate a subclass that concretizes multibackend backend methods into their intended name.

get_backend

Get current active backend for this class.

profile

Given a template data table, generate a statement dict from it.

register_backend_method

Proxy for register_backend_method to register an existing function as a backend-specific method.

report

Receive a DataFrame containing only the columns in the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.

result

Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).

Attributes

DEFAULT_MAX_OCCURRENCES

DEFAULT_MIN_OCCURRENCES

DEFAULT_REPORT_LIMIT

expected_parameters

Parameters expected for this statement.

name

Statement name when referred in Validation Documents (only valid for Deirokay built-in statements).

supported_backends

Backends supported by this resource.

DEFAULT_MAX_OCCURRENCES = {'all': (inf, inf), 'all_and_only': (inf, 0), 'only': (inf, 0)}
DEFAULT_MIN_OCCURRENCES = {'all': (1, 0), 'all_and_only': (1, 0), 'only': (0, 0)}
DEFAULT_REPORT_LIMIT = 32
__call__(df: DeirokayDataSource) → dict

Run statement instance.

classmethod __init_subclass__() → None

Validate subclassed statement.

classmethod __post_attach_backend__()

This classmethod can be optionally overwritten to serve as a callback function for when the attach_backend() method is called.

classmethod attach_backend(backend: Backend) → Type[_AnyMultiBackendClass]

Generate a subclass that concretizes multibackend backend methods into their intended name. The methods marked with the given backend will compose the returned class.

Parameters
  • cls (type) – Class to be subclassed with the given backend.

  • backend (Backend) – Backend to be selected.

Returns

Subclass of the current class with methods filtered for the given backend.

Return type

Type[MultiBackendMixin]

expected_parameters: List[str] = ['rule', 'values', 'multicolumn', 'parser', 'parsers', 'min_occurrences', 'max_occurrences', 'occurrences_per_value', 'report_limit']

Parameters expected for this statement.

Type

List[str]

classmethod get_backend() → Backend

Get current active backend for this class.

Returns

The current active backend.

Return type

Backend

Raises

InvalidBackend – Backend not set or not a valid execution class.

name: str = 'contain'

Statement name when referred in Validation Documents (only valid for Deirokay built-in statements).

Type

str

static profile(df: DeirokayDataSource) → Dict[str, Any]

Given a template data table, generate a statement dict from it.

Parameters

df (DataFrame) – The DataFrame to be used as template.

Returns

Statement dict.

Return type

dict

Raises

NotImplementedError – If this method is not implemented by the subclass or the profile generation for this statement was intentionally skipped.

classmethod register_backend_method(alias_for: str, func: Callable[[...], Any], backend: Backend) → None

Proxy for register_backend_method to register an existing function as a backend-specific method.

Parameters
  • alias_for (str) – The name of the method to be substituted with a backend-specific version.

  • func (AnyCallable) – Existing function to be registered as a method.

  • backend (Backend) – Backend for the method.

report(df: DeirokayDataSource) → dict

Receive a DataFrame containing only the columns in the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.

Parameters

df (DataFrame) – The scoped DataFrame columns to be analysed in this report by this statement.

Returns

A dictionary of useful statistics about the target columns.

Return type

dict

result(report: dict) → bool

Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).

Parameters

report (dict) – Report generated by report method. Should ideally contain all statistics necessary to evaluate the statement validity.

Returns

Whether or not this statement passed.

Return type

bool
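
The two-step flow of report followed by result can be illustrated with a toy class. ToyContain and its report layout are hypothetical and much simpler than the real statement. The sketch only shows the idea that result decides from the report alone, without touching the data again.

```python
from collections import Counter


class ToyContain:
    """Hypothetical stand-in, NOT Deirokay's Contain class."""

    def __init__(self, values):
        self.values = list(values)

    def report(self, column):
        # Step 1: summarise the scoped data into plain statistics.
        counts = Counter(column)
        return {
            "values": [
                {"value": value, "count": count}
                for value, count in counts.items()
            ]
        }

    def result(self, report):
        # Step 2: decide pass/fail from the report only.
        seen = {item["value"] for item in report["values"]}
        return all(value in seen for value in self.values)


statement = ToyContain(values=["a", "b"])
report = statement.report(["a", "b", "a"])
passed = statement.result(report)
print(passed)  # True
```

Keeping the decision separate from the statistics means the same report can be stored or inspected after the validation has run.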

supported_backends: List[Backend] = [<Backend.PANDAS: 'pandas'>, <Backend.DASK: 'dask'>]

Backends supported by this resource.

Type

List[Backend]