deirokay.statements.builtin.contain.Contain
- class deirokay.statements.builtin.contain.Contain(*args, **kwargs)
Bases: BaseStatement
Checks if the given scope contains specific values. You may also check the number of their occurrences by specifying a minimum and maximum value of frequency.
The available parameters for this statement are:
rule (required): One of all, only or all_and_only.
values (required): A list of values to which the rule applies.
multicolumn: A boolean indicating whether the statement should consider each of the values as a tuple of multiple columns or a single value. When set to False and evaluated over a scope containing more than one column, the rule will be applied over all the values from the original columns as in a single column. Default: False.
parser: The parser (or a list of parsers) to be used to parse the values. Corresponds to the parser parameter of the treater function (see the deirokay.data_reader method). Either parser or parsers must be declared.
parsers: An alias for parser, recommended when multicolumn is set to True. Either parser or parsers must be declared.
min_occurrences: a global minimum number of occurrences for each of the values. Default: 1 for all and all_and_only rules, 0 for only.
max_occurrences: a global maximum number of occurrences for each of the values. Default: inf (unbounded).
occurrences_per_value: a list of dictionaries overriding the global boundaries. Each dictionary may have the following keys:
values (required): a list of values to which the occurrence bounds below apply.
min_occurrences: a minimum number of occurrences for these values. Default: global min_occurrences parameter.
max_occurrences: a maximum number of occurrences for these values. Default: global max_occurrences parameter.
Global parameters apply to all values not present in any of the dictionaries in occurrences_per_value (but still present in the main values list).
report_limit: if set to a positive integer, limit the number of items generated in the statement report. Default: 32.
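The way the global bounds and occurrences_per_value combine can be sketched in plain Python. This is only an illustration of the parameter descriptions above, not Deirokay's implementation; the helper name resolve_bounds and the DEFAULT_MIN table are hypothetical.

```python
import math

# Illustrative only: mirrors the parameter descriptions above,
# not Deirokay's actual code. Names here are hypothetical.
DEFAULT_MIN = {"all": 1, "all_and_only": 1, "only": 0}


def resolve_bounds(rule, values, min_occurrences=None, max_occurrences=None,
                   occurrences_per_value=()):
    """Return {value: (min, max)} for every declared value."""
    global_min = DEFAULT_MIN[rule] if min_occurrences is None else min_occurrences
    global_max = math.inf if max_occurrences is None else max_occurrences
    # Start with the global bounds for every declared value...
    bounds = {value: (global_min, global_max) for value in values}
    # ...then let each override dictionary replace them for its own values.
    for override in occurrences_per_value:
        for value in override["values"]:
            bounds[value] = (
                override.get("min_occurrences", global_min),
                override.get("max_occurrences", global_max),
            )
    return bounds


bounds = resolve_bounds(
    rule="only",
    values=["master", "slave"],
    occurrences_per_value=[
        {"values": ["master"], "min_occurrences": 1, "max_occurrences": 1}
    ],
)
print(bounds)  # {'master': (1, 1), 'slave': (0, inf)}
```

Here master picks up the override, while slave falls back to the global defaults for the only rule.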
The all rule checks if all the values declared are present in the scope (possibly tolerating other values not declared in values). Use it when you want to be sure that your data contains at least all the values you declare, also setting min_occurrences and max_occurrences when necessary. You may also check for “zero” occurrences of a set of values by setting max_occurrences to 0.
The only rule ensures that the values are the only possible values in the scope (possibly not containing them at all). Use it when you want to enumerate the admitted values for the scope, as in an enumeration.
The all_and_only rule checks both if all the values declared are present in the scope and if only they are present (not tolerating values not declared). Use it when you know all the possible values for the scope and you are sure that they will always be present.
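As a rough mental model of the three rules, the toy function below counts values with collections.Counter and applies the documented defaults (a minimum of 1 for all and all_and_only, 0 for only). The function check_contain is hypothetical and works on a plain Python list; it is not how Deirokay evaluates the statement internally.

```python
from collections import Counter
import math


def check_contain(column, rule, values, min_occurrences=None,
                  max_occurrences=math.inf):
    """Toy model of the three rules; not Deirokay's implementation."""
    if min_occurrences is None:
        min_occurrences = 0 if rule == "only" else 1
    counts = Counter(column)
    # Every declared value must occur within [min, max] times.
    declared_ok = all(
        min_occurrences <= counts[value] <= max_occurrences for value in values
    )
    # Values found in the data but not declared in `values`.
    undeclared = set(counts) - set(values)
    if rule == "all":
        return declared_ok  # undeclared values are tolerated
    return declared_ok and not undeclared  # "only" and "all_and_only"


roles = ["master", "slave", "slave"]
print(check_contain(roles, "all", ["master"]))                    # True
print(check_contain(roles, "only", ["master"]))                   # False
print(check_contain(roles, "only", ["master", "slave", "spare"]))  # True
print(check_contain(roles, "all_and_only", ["master", "slave"]))  # True
```

The third call passes because only tolerates declared values that never appear, whereas all_and_only with the same list would fail.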
The min_occurrences and max_occurrences parameters are applied to all the values declared, and only these. This means you cannot (yet) specify boundaries for values you didn't declare.
Null values are considered valid for the purpose of the statement evaluation and must be explicitly passed in values if you wish to allow them (or not).
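One way to read the note on nulls is that a null is treated like any other value: under the only rule it counts as undeclared unless it is listed. The snippet below illustrates that reading with a hypothetical helper, using None to stand in for a null; it is an assumption about the semantics, not a confirmed description of Deirokay's behaviour.

```python
def undeclared_values(column, values):
    """Values present in the data but missing from the declared list."""
    return set(column) - set(values)


column = ["left-handed", None, "right-handed"]
allowed = ["left-handed", "right-handed", "ambidextrous"]

# Without None in the declared values, the null shows up as undeclared.
print(undeclared_values(column, allowed))           # {None}
# Declaring None explicitly makes the column acceptable for "only".
print(undeclared_values(column, allowed + [None]))  # set()
```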
You may also notice that, by tweaking the expected number of occurrences, you may end up with the very same behaviour regardless of the rule you choose. In this case, choose the rule that semantically best matches your intent, so that your final validation document is more readable and easier to understand.
Examples
You have a table of users containing a column handedness only admitting the values: right-handed, left-handed and ambidextrous. You know that some of these values may not appear in the data, but you don’t want other values to be present.
{
  "scope": "handedness",
  "statements": [
    {
      "type": "contain",
      "rule": "only",
      "values": ["right-handed", "left-handed", "ambidextrous"],
      "parser": {"dtype": "string"}
    }
  ]
}
You have a table of servers containing a column role which may contain the values master and slave. You want to be sure that there is always one and only one master server in the data.
{
  "scope": "role",
  "statements": [
    {
      "type": "contain",
      "rule": "all",
      "values": ["master"],
      "parser": {"dtype": "string"},
      "min_occurrences": 1,
      "max_occurrences": 1
    }
  ]
}
You may also extend the previous example to ensure that there is no value other than master and slave in the data. Note that although the rule below is changed to only, the previous check (exactly one master) is still enforced by the occurrences_per_value parameter in the following validation item:
{
  "scope": "role",
  "statements": [
    {
      "type": "contain",
      "rule": "only",
      "values": ["master", "slave"],
      "parser": {"dtype": "string"},
      "occurrences_per_value": [
        {
          "values": ["master"],
          "min_occurrences": 1,
          "max_occurrences": 1
        }
      ]
    }
  ]
}
You have a table of transactions containing details about transactions in all the branches of a company. You expect that there should always be at least one transaction per branch.
{
  "scope": ["branch_name"],
  "statements": [
    {
      "type": "contain",
      "rule": "all_and_only",
      "values": [
        "Albany", "Utica", "Scranton", "Akron",
        "Nashua", "Buffalo", "Rochester"
      ],
      "parser": {"dtype": "string"}
    }
  ]
}
You have a table for the logs of user accesses to a website which contains an IP column. You want to be sure that blacklisted IPs are not present in the data. The following validation item in YAML format checks for the absence of blacklisted IPs:
scope: IP
statements:
  - type: contain
    rule: all
    max_occurrences: 0
    values:
      # blacklisted IPs
      - 3.48.48.135
      - 3.48.48.136
    parser: {dtype: string}
You want the pair ‘San Diego’ and ‘2022’ to appear at most twice in your dataset (for any reason).
scope: [city, year]
statements:
  - type: contain
    rule: all
    min_occurrences: 0
    max_occurrences: 2
    values:
      - ['San Diego', 2022]
    parsers:
      - {dtype: string}
      - {dtype: integer}
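The difference between counting tuples and pooling values over a multi-column scope, as described for the multicolumn parameter, can be illustrated with plain Python. The sample rows are made up for this sketch, and the counting shown here is only a model of the documented behaviour, not Deirokay's code.

```python
from collections import Counter

# Hypothetical sample data for the scope [city, year].
rows = [
    {"city": "San Diego", "year": 2022},
    {"city": "San Diego", "year": 2022},
    {"city": "Austin", "year": 2022},
]
scope = ["city", "year"]

# Tuple view (multicolumn set to True): each row of the scope is one value.
tuple_counts = Counter(tuple(row[col] for col in scope) for row in rows)

# Pooled view (multicolumn set to False): values from all scoped columns
# are treated as if they came from a single column.
pooled_counts = Counter(row[col] for col in scope for row in rows)

print(tuple_counts[("San Diego", 2022)])  # 2
print(pooled_counts[2022])                # 3
print(pooled_counts["San Diego"])         # 2
```

With the tuple view, the pair ('San Diego', 2022) occurs twice, which satisfies a max_occurrences of 2.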
Methods
attach_backend: Generate a subclass that concretizes multibackend backend methods into their intended name.
get_backend: Get current active backend for this class.
profile: Given a template data table, generate a statement dict from it.
register_backend_method: Proxy for register_backend_method to register an existing function as a backend-specific method.
report: Receive a DataFrame containing only columns on the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.
result: Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).
Attributes
expected_parameters: Parameters expected for this statement.
name: Statement name when referred in Validation Documents (only valid for Deirokay built-in statements).
Backends supported by this resource.
- DEFAULT_MAX_OCCURRENCES = {'all': (inf, inf), 'all_and_only': (inf, 0), 'only': (inf, 0)}
- DEFAULT_MIN_OCCURRENCES = {'all': (1, 0), 'all_and_only': (1, 0), 'only': (0, 0)}
- DEFAULT_REPORT_LIMIT = 32
- __call__(df: DeirokayDataSource) -> dict
Run statement instance.
- classmethod __init_subclass__() -> None
Validate subclassed statement.
- classmethod __post_attach_backend__()
This classmethod can optionally be overridden to serve as a callback function for when the attach_backend() method is called.
- classmethod attach_backend(backend: Backend) -> Type[_AnyMultiBackendClass]
Generate a subclass that concretizes multibackend backend methods into their intended name. The methods marked with the given backend will compose the returned class.
- Parameters
cls (type) – Class to be subclassed with the given backend.
backend (Backend) – Backend to be selected.
- Returns
Subclass of the current class with methods filtered for the given backend.
- Return type
Type[MultiBackendMixin]
- expected_parameters: List[str] = ['rule', 'values', 'multicolumn', 'parser', 'parsers', 'min_occurrences', 'max_occurrences', 'occurrences_per_value', 'report_limit']
Parameters expected for this statement.
- Type
List[str]
- classmethod get_backend() -> Backend
Get current active backend for this class.
- Returns
The current active backend.
- Return type
Backend
- Raises
InvalidBackend – Backend not set or not a valid execution class.
- name: str = 'contain'
Statement name when referred in Validation Documents (only valid for Deirokay built-in statements).
- Type
str
- static profile(df: DeirokayDataSource) -> Dict[str, Any]
Given a template data table, generate a statement dict from it.
- Parameters
df (DataFrame) – The DataFrame to be used as template.
- Returns
Statement dict.
- Return type
dict
- Raises
NotImplementedError – If this method is not implemented by the subclass or the profile generation for this statement was intentionally skipped.
- classmethod register_backend_method(alias_for: str, func: Callable[[...], Any], backend: Backend) -> None
Proxy for register_backend_method to register an existing function as a backend-specific method.
- Parameters
alias_for (str) – The name of the method to be substituted with a backend-specific version.
func (AnyCallable) – Existing function to be registered as a method.
backend (Backend) – Backend for the method.
- report(df: DeirokayDataSource) -> dict
Receive a DataFrame containing only columns on the scope of validation and return a report of related metrics that can be used later to declare this Statement as fulfilled or failed.
- Parameters
df (DataFrame) – The scoped DataFrame columns to be analysed in this report by this statement.
- Returns
A dictionary of useful statistics about the target columns.
- Return type
dict
- result(report: dict) -> bool
Receive the report previously generated and declare this statement as either fulfilled (True) or failed (False).
- Parameters
report (dict) – Report generated by report method. Should ideally contain all statistics necessary to evaluate the statement validity.
- Returns
Whether or not this statement passed.
- Return type
bool
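The split between report and result (first collect statistics, then judge them) can be sketched with a small stand-in class. ToyContain is hypothetical: it takes a plain list instead of a DataFrame, supports only a maximum bound, and its report layout is invented for the example rather than taken from Deirokay.

```python
from collections import Counter


class ToyContain:
    """Hypothetical stand-in for the report/result pattern, not Deirokay's class."""

    def __init__(self, values, max_occurrences):
        self.values = values
        self.max_occurrences = max_occurrences

    def report(self, column):
        # Phase 1: gather statistics only; no pass/fail decision here.
        counts = Counter(column)
        return {"occurrences": {value: counts[value] for value in self.values}}

    def result(self, report):
        # Phase 2: decide from the report alone, without touching the data.
        return all(
            count <= self.max_occurrences
            for count in report["occurrences"].values()
        )


statement = ToyContain(values=["3.48.48.135", "3.48.48.136"], max_occurrences=0)
report = statement.report(["10.0.0.1", "3.48.48.135"])
print(report)                    # {'occurrences': {'3.48.48.135': 1, '3.48.48.136': 0}}
print(statement.result(report))  # False
```

Keeping the decision in result means the report can be stored or inspected on its own, which matches the description of result receiving "the report previously generated".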