Testing Data

Within dbt, there’s the concept of testing data – tests that are effectively assertions of the expectations you have of the data. But when it comes to testing, we need to define exactly what we’re trying to test. Are we trying to test the veracity of the data itself; the veracity of the query we’re applying to the data; the veracity of the pipelines the data goes through; or the veracity of the end-to-end process?

For this blog post, we’re going to start with the easiest one first: testing the veracity of the query we’re applying to the data, with the ambition that it’ll also answer the first testing question: the veracity of the data itself.

The other testing questions are much more aligned with integration and regression testing, so we’ll save them for another time.

Testing the veracity of the query we’re applying to the data

Testing in dbt falls neatly within the second type of test: the veracity of the query we’re applying to the data. That’s largely because dbt is focused on the analysis side of data rather than looking at it holistically. How dbt approaches testing is actually very useful, differentiating between singular tests – one-off, specific assertions – and generic tests that can be applied repeatedly to different datasets.

How it’s done in dbt is documented on the dbt website, but we also covered it in the Getting Started with dbt post. By now we’re familiar with how dbt uses Jinja and SQL templates and the compiled artefacts it produces: SQL scripts.

The singular tests are still the same SQL you wrote to define them, just wrapped in a subquery that interacts with different bits of the dbt framework to produce and report on the test results.

But where dbt really comes into its own is the ease with which you can declare and apply generic tests: you just list the generic tests you want to apply to a column in the schema file for the model, and at build and run time dbt compiles and executes the SQL.
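As a rough sketch of what that declaration looks like (the exact layout can vary slightly between dbt versions), the schema file entry for our model would be along these lines:

version: 2

models:
  - name: gold_delays
    columns:
      - name: TripId
        tests:
          - unique
          - not_null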

In our schema file, we’ve declared the `unique` and `not_null` tests against the TripId column, which dbt compiles into the following SQL:

Unique

select
    TripId as unique_field,
    count(*) as n_records
from default.gold_delays
where TripId is not null
group by TripId
having count(*) > 1

Not_null

select TripId
from default.gold_delays
where TripId is null

It’s a really simple, scalable way of applying tests to data.

Now, let’s see how we could potentially do it without using dbt.

Testing data without dbt

An alternative to dbt for this use case is Great Expectations: an open-source data validation and documentation tool for ensuring the quality of your data, designed to work with a variety of data sources and processing engines.

With Databricks, you can install the Great Expectations Python library on your clusters, import it into your notebooks, and define the expectations against your dataset(s). The setup for Great Expectations is a lot more involved than dbt – where you just define which tests you want to perform on the data in a schema file – but with a metadata-driven framework you could configure Great Expectations to be just as simple as dbt when you run your pipelines.
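As a minimal sketch of what that looks like in a Databricks notebook – this uses the legacy SparkDFDataset wrapper, so the exact API will depend on the Great Expectations version you install – the same two checks on TripId might look like this:

# A minimal sketch, assuming the great_expectations library is installed on the cluster.
# This uses the legacy SparkDFDataset wrapper; the exact API depends on your GE version.
from great_expectations.dataset import SparkDFDataset

# Wrap the Delta table in a Great Expectations dataset (spark is provided by Databricks)
gold_delays = SparkDFDataset(spark.table("default.gold_delays"))

# Declare the same expectations we used in dbt
gold_delays.expect_column_values_to_be_unique("TripId")
gold_delays.expect_column_values_to_not_be_null("TripId")

# Run the validation and check whether everything passed
results = gold_delays.validate()
print(results.success)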

Get started with Great Expectations to understand more about the different features.

Great Expectations is more extensible than dbt in the range of tests that can be applied to data, with a comprehensive list of expectations available on their documentation website.

With both dbt and Great Expectations, by default, poor data still makes its way into your important layers – which isn’t great.

Enter Delta Live Tables (DLT)!

DLT is a Databricks-native framework for building and orchestrating data processing pipelines. It has a huge amount of functionality natively, without needing to stitch together different frameworks. It isn’t suitable for very complex or heavy engineering workloads – but it is a good alternative to, or a nice complement for, dbt – so, as with all things, use it if you think it’ll work for you.

DLT uses EXPECT constraints on the table definition to flag or reject data which falls outside your expectations. There’s a great post by Alex Ott which goes into the details behind implementing tests in DLT.
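As a minimal sketch using the Python DLT API – the silver_delays source table and DelayMinutes column are assumptions for illustration – expect just flags failing rows in the pipeline metrics, while expect_or_drop rejects them outright:

# A minimal sketch – runs as part of a Delta Live Tables pipeline on Databricks.
# silver_delays and DelayMinutes are illustrative names, not from the original post.
import dlt

@dlt.table(name="gold_delays", comment="Trips with quality expectations applied")
@dlt.expect_or_drop("trip_id_not_null", "TripId IS NOT NULL")  # reject rows that fail
@dlt.expect("delay_not_negative", "DelayMinutes >= 0")         # keep rows, but flag failures
def gold_delays():
    return dlt.read("silver_delays")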

And… if none of those take your fancy, we’ve incorporated, as part of our Hydr8 Framework, a series of checks which can be applied to data as it moves between the data layers to either logically reject (flagging that the quality isn’t right) or physically reject (quarantining the data in a separate layer) rows that don’t meet standards. I covered the approach to logically or physically rejecting rows during my session at SQL Bits in 2022 on Implementing a Data Quality Framework using Purview and Databricks.
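To make that distinction concrete, here’s a generic PySpark sketch of the idea – illustrative only, not the Hydr8 implementation, and the table and column names are made up:

# Generic sketch of logical vs physical rejection – not the Hydr8 implementation.
# Table names (silver_delays, quarantine_delays) and the IsValid flag are illustrative.
from pyspark.sql import functions as F

def logically_reject(df, check):
    # Keep every row, but flag whether it passed the quality check
    return df.withColumn("IsValid", check)

def physically_reject(df, check, quarantine_table):
    # Quarantine failing rows in a separate layer and pass only valid rows through
    df.filter(~check).write.format("delta").mode("append").saveAsTable(quarantine_table)
    return df.filter(check)

silver = spark.table("default.silver_delays")
check = F.col("TripId").isNotNull()

# Either flag the bad rows...
gold = logically_reject(silver, check)
# ...or quarantine them and keep only the clean rows:
# gold = physically_reject(silver, check, "default.quarantine_delays")

gold.write.format("delta").mode("overwrite").saveAsTable("default.gold_delays")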

Conclusion

We’ve seen how dbt handles testing, which I have to say is very useful, as well as an alternative in Great Expectations – which feels more powerful.

We’ve also taken a look at DLT and its functionality for logically or physically rejecting data which falls outside the expectations you set.

Finally, we’ve taken a brief look at how to logically or physically reject data based on a more comprehensive engineering framework like Hydr8. If you want to know more about the data quality aspects of Hydr8, or more about Hydr8 in general, do get in contact.