Updating and merging data check docs #4293

Merged
merged 2 commits into from
Sep 13, 2023
43 changes: 0 additions & 43 deletions docs/cookbooks/adding_data_checks.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/mkdocs.yml
@@ -40,7 +40,6 @@ nav:
- Common workflows: cookbooks/common_workflows.md
- Creating a derived dataset: cookbooks/creating_a_derived_dataset.md
- Testing: cookbooks/testing.md
- Adding Data Checks: cookbooks/adding_data_checks.md
- Reference:
- bqetl CLI: bqetl.md
- Recommended practices: reference/recommended_practices.md
66 changes: 61 additions & 5 deletions docs/reference/data_checks.md
@@ -1,20 +1,75 @@
# bqetl data checks
# bqetl Data Checks

> Instructions on how to add data checks can be found under the [Adding data checks](../cookbooks/adding_data_checks.md) cookbook.

## Background

To create more confidence and trust in our data, it is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. Examples include checks for null columns, range/size properties, duplicates, and table grain.

## bqetl data checks to the rescue
## bqetl Data Checks to the Rescue

bqetl data checks aim to offer this ability through a simple interface for specifying our "assumptions" about the data the query should produce and checking them against the actual result.

This simple interface is achieved through a number of Jinja templates that provide "out-of-the-box" logic for common checks, so the logic does not have to be rewritten each time. For example, checking if any nulls are present in a specific column. These templates can be found [here](../../tests/checks/) and are available as Jinja macros inside the `checks.sql` files. This makes it possible to "configure" the logic by passing details relevant to our specific dataset. Check templates get rendered as raw SQL expressions. Take a look at the examples below for practical usage.
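To make the "template rendered as raw SQL" idea concrete, here is a rough sketch of how a short, parameterized call could expand into a SQL assertion. It uses plain Python string formatting as a stand-in for Jinja, and `render_not_null`, its `table` parameter, and the generated SQL are all hypothetical; the real macros in `tests/checks` may render quite different SQL.

```python
def render_not_null(columns, where, table="project.dataset.table_v1"):
    """Hypothetical stand-in for a not_null-style macro (not the real template).

    Returns a SQL assertion that passes only when none of the given
    columns contain NULL values in the rows selected by `where`.
    """
    # Accept a single column name as well as a list of names.
    if isinstance(columns, str):
        columns = [columns]

    # One COUNTIF per column; the sum is zero only if no column has NULLs.
    null_counts = " + ".join(f"COUNTIF({col} IS NULL)" for col in columns)
    names = ", ".join(columns)

    return (
        "ASSERT (\n"
        f"  (SELECT {null_counts} FROM `{table}` WHERE {where}) = 0\n"
        f") AS 'Found NULL values in: {names}';"
    )


if __name__ == "__main__":
    print(render_not_null(["submission_date", "os"], "submission_date = @submission_date"))
```

The point of the sketch is only the shape of the workflow: the dataset owner supplies column names and a filter, and the template supplies the repetitive SQL.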

It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.

## data checks available with examples
## Adding Data Checks

### Create checks.sql

Inside the query directory, which usually contains `query.sql` or `query.py`, `metadata.yaml` and `schema.yaml`, create a new file called `checks.sql` (unless it already exists).

Once checks have been added, we need to `regenerate the DAG` responsible for scheduling the query.

### Update checks.sql

If `checks.sql` already exists for the query, you can add additional checks by appending them to the list of already defined checks.

When adding additional checks, there should be no need to regenerate the DAG responsible for scheduling the query, as all checks are executed using a single Airflow task.

### Removing checks.sql

All checks can be removed by deleting the `checks.sql` file and regenerating the DAG responsible for scheduling the query.

Alternatively, specific checks can be removed by deleting them from the `checks.sql` file.

### Example checks.sql

Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in [`tests/checks`](https://github.com/mozilla/bigquery-etl/tree/main/tests/checks) which may take different parameters used to generate the SQL check expression.

Each check needs to have a specific marker set. Available markers are:
* `#fail`: If the check fails and an assertion is raised, a notification is sent and all downstream dependencies are blocked from running. Use this marker for checks that indicate a serious data issue; these checks act as circuit breakers.
* `#warn`: If the check fails, task owners are notified but downstream dependencies are not blocked from running. These types of checks can be used to flag _potential_ issues that might require manual investigation.
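As an illustration of how markers partition a `checks.sql` file, the following sketch groups each check under the marker line that precedes it. This is not bqetl's actual parser; `group_checks_by_marker` and the `SAMPLE` text are hypothetical, and the sketch assumes every check is introduced by a `#fail` or `#warn` line on its own.

```python
import re

# A marker is a line consisting solely of "#fail" or "#warn".
MARKER_RE = re.compile(r"#(fail|warn)")


def group_checks_by_marker(checks_sql):
    """Return {marker: [check_text, ...]} for a checks.sql-style string."""
    grouped = {}
    marker = None
    for line in checks_sql.splitlines():
        stripped = line.strip()
        match = MARKER_RE.fullmatch(stripped)
        if match:
            # A marker line starts a new (initially empty) check.
            marker = match.group(1)
            grouped.setdefault(marker, []).append([])
        elif marker and stripped and not stripped.startswith("--"):
            # Non-empty, non-comment lines belong to the current check.
            grouped[marker][-1].append(stripped)
    return {m: ["\n".join(c) for c in checks if c] for m, checks in grouped.items()}


SAMPLE = """\
-- macro checks
#fail
{{ not_null(["submission_date", "os"], "submission_date = @submission_date") }}

#warn
{{ min_rows(1, "submission_date = @submission_date") }}

#fail
{{ is_unique(["submission_date", "os", "country"], "submission_date = @submission_date") }}
"""

if __name__ == "__main__":
    for marker, checks in group_checks_by_marker(SAMPLE).items():
        print(marker, len(checks))
```

Grouping by marker in this way is what makes it possible to run only the blocking checks, or only the warning checks, from the same file.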

Example of what a `checks.sql` may look like:

```sql
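-- raw SQL checks
-- ASSERT raises an error when its expression is FALSE,
-- so the expression states the healthy condition.
#fail
ASSERT (
  (
    SELECT
      COUNTIF(country IS NULL) / COUNT(*)
    FROM telemetry.table_v1
    WHERE submission_date = @submission_date
  ) <= 0.2
) AS "More than 20% of clients have country set to NULL";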

-- macro checks
#fail
{{ not_null(["submission_date", "os"], "submission_date = @submission_date") }}

#warn
{{ min_rows(1, "submission_date = @submission_date") }}

#fail
{{ is_unique(["submission_date", "os", "country"], "submission_date = @submission_date")}}

#warn
{{ in_range(["non_ssl_loads", "ssl_loads", "reporting_ratio"], 0, none, "submission_date = @submission_date") }}
```

## Data Checks Available with Examples

### in_range ([source](../../tests/checks/in_range.jinja))

@@ -176,15 +231,16 @@ Options:
--sql_dir, --sql-dir DIRECTORY Path to directory which contains queries.
--dry_run, --dry-run To dry run the query to make sure it is
valid
--marker TEXT Marker to filter checks.
--help Show this message and exit.
```

### Examples

```shell
# to run checks for a specific dataset
$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01
$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn

# to only dry_run the checks
$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01
$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail
```
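The invocations above can also be assembled programmatically, for example from a script that runs checks for several tables. A minimal sketch, assuming Python 3.8+ (for `shlex.join`): `build_check_command` is a hypothetical helper, not part of bqetl, and it only uses the flags that appear in the help output and examples above. It builds the argument list without executing anything.

```python
import shlex


def build_check_command(dataset_table, parameters=(), markers=(), dry_run=False):
    """Build the argv list for a `./bqetl check run` invocation (hypothetical helper)."""
    argv = ["./bqetl", "check", "run"]
    if dry_run:
        argv.append("--dry_run")
    argv.append(dataset_table)
    # Each parameter uses the name:TYPE:value form shown in the examples.
    argv += [f"--parameter={param}" for param in parameters]
    # --marker may be repeated to select several markers.
    argv += [f"--marker={marker}" for marker in markers]
    return argv


if __name__ == "__main__":
    cmd = build_check_command(
        "ga_derived.downloads_with_attribution_v2",
        parameters=["download_date:DATE:2023-05-01"],
        markers=["fail"],
        dry_run=True,
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```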