
Getting Started with dbt

What is dbt?

dbt stands for data build tool. It is primarily a SQL-based transformation workflow, supported by YAML, that allows teams to collaborate on analytics code whilst implementing software engineering best practices like modularity, portability, CI/CD, testing, and documentation.

dbt is available as a CLI in the form of dbt Core, or as a paid-for SaaS product in the form of dbt Cloud.

dbt Core provides a CLI for creating analytical models and tests, whereas dbt Cloud adds features such as auto-generated documentation at run time and scheduled workflows.

dbt materializes the results of queries as physical tables or views, taking care of the table maintenance (managing schemas, transactions and other such activities) that would otherwise occupy data teams. You can decompose data models into reusable, modular components shared across different models, as well as leverage metadata for optimisation.

SQL elements in dbt can utilise Jinja, which helps to reduce repetitive work through macros and packages. dbt’s Directed Acyclic Graph (DAG) determines the order in which objects are executed and propagates changes to a model through to its dependents. Because the files that dbt uses are largely SQL and YAML files, they are easy to source control. The power of dbt comes from the build engine itself, which creates the DAG, runs data quality tests on the data, hosts embedded documentation of the models and fields, and makes it easy to define snapshots and maintain seed values and files for repeatable testing.
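As a small illustration (this macro is not part of the project we build below; the name and conversion are our own), a macro that converts a distance in miles to kilometres could be saved in the macros folder as miles_to_km.sql:

{% macro miles_to_km(column_name) %}
    round({{ column_name }} * 1.60934, 1)
{% endmacro %}

Any model can then call {{ miles_to_km('Distance') }} instead of repeating the formula.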

It can connect to and use data from multiple systems such as Postgres, Snowflake, Databricks and many others. For the purposes of this blog post, we’ll use Databricks as our source and dbt Core as our dbt instance.

We’re assuming you already have a Databricks workspace available and are familiar with Git and tools like VS Code.

Installing dbt locally

For local development, dbt Core is installed and configured using Python’s pip. Running

pip install dbt-core

will install dbt Core without any connection configuration, whereas something like

pip install dbt-databricks

will install dbt Core along with the connection configuration (adapter) for Databricks; adapters for other platforms are also available.
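You can check the installation, and see which adapter plugins were installed alongside it, with:

dbt --version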

Initialise dbt

Create a new repo (or re-use an existing one) and clone it locally using VS Code.

Within your project repo, open a new terminal window and use the following command to initialize it with the base dbt structure for a given project:

dbt init {project_name}

There’ll be a series of prompts in the terminal asking for the Databricks connection details. You’ll need to provide the host, the HTTP path, a personal access token and the schema that objects will be deployed to.
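These details are written to a profiles.yml file (by default under ~/.dbt). For Databricks the profile will look roughly like the following; the host, path and schema values here are placeholders:

{project_name}:
  target: dev
  outputs:
    dev:
      type: databricks
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      schema: dbt_flights
      token: <personal access token>
      threads: 1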

You’ll then be greeted with a folder structure that looks like:
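(the exact scaffold varies slightly between dbt versions, but it will be along these lines)

{project_name}/
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
├── tests/
├── .gitignore
├── dbt_project.yml
└── README.md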

In the terminal, navigate to your new project folder using

cd {project_name}

And now the real fun begins – creating data models!

Creating your first model

For this, we’re going to use the Flights sample data that comes with Databricks. In a Databricks notebook, read the data and register it as a Delta table:

# Read the departure delays CSV and register it as a Delta table
delaysDF = spark.read.option("header", True).csv('dbfs:/databricks-datasets/flights/departuredelays.csv')
delaysDF.write.saveAsTable("Flight_Delays")

# Read the tab-delimited airport codes file and register it as a Delta table
airportCodes = spark.read.option("header", True).option("delimiter", "\t").csv('dbfs:/databricks-datasets/flights/airport-codes-na.txt')
airportCodes.write.saveAsTable("Airport_Codes")

Within the models folder of the project, we’re going to add a subfolder called flights and three sub-folders underneath that called bronze, silver and gold – referencing the medallion architecture.
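Giving a layout like this:

models/
└── flights/
    ├── bronze/
    ├── silver/
    └── gold/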

We’re then going to add the following files to the bronze folder:

bronze_airports.sql

SELECT * FROM Airport_Codes

bronze_delays.sql

SELECT * FROM Flight_Delays

The silver folder:

silver_airports.sql

SELECT
    IATA AS AirportCode,
    City,
    State,
    Country
FROM {{ ref('bronze_airports') }}

silver_delays.sql

SELECT
    CAST(date AS int) AS TripId,
    CAST(concat('2014-', LEFT(date, 2), '-', RIGHT(LEFT(date, 4), 2), ' ', LEFT(RIGHT(date, 4), 2), ':', RIGHT(date, 2)) AS timestamp) AS LocaleDate,
    CAST(delay AS int) AS Delay,
    CAST(distance AS int) AS Distance,
    origin AS OriginAirport,
    destination AS DestinationAirport
FROM {{ ref('bronze_delays') }}

And the gold folder:

gold_airports.sql

SELECT
    monotonically_increasing_id() AS AirportKey,
    AirportCode,
    City,
    State,
    Country
FROM {{ ref('silver_airports') }}

gold_delays.sql

SELECT
    delays.TripId,
    delays.LocaleDate,
    delays.Delay,
    delays.Distance,
    origin.AirportKey AS OriginAirportKey,
    destination.AirportKey AS DestinationAirportKey
FROM {{ ref('silver_delays') }} AS delays
JOIN {{ ref('gold_airports') }} AS origin
    ON origin.AirportCode = delays.OriginAirport
JOIN {{ ref('gold_airports') }} AS destination
    ON destination.AirportCode = delays.DestinationAirport

In the terminal, if we execute:

dbt run

dbt will now compile the scripts and, by default, execute them as views in the correct order. But how does it determine that order? Through the DAG. In the scripts, the {{ ref('another_fileName') }} references tell dbt that the referenced model is a dependency of the current script.
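Because dbt knows the DAG, you can also build just a subset of it using dbt’s node selection syntax; for example, the following runs gold_delays together with all of its upstream dependencies:

dbt run --select +gold_delays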

To get the queries materialized as tables instead, we can add the materialization type to the dbt_project.yml:

models:
  {project_name}:
    +materialized: table
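Materializations can also be set per folder; as a sketch, and assuming the flights/bronze/silver/gold structure created above, something like the following keeps the bronze models as views while building the gold models as tables:

models:
  {project_name}:
    flights:
      bronze:
        +materialized: view
      gold:
        +materialized: table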

Or it can be controlled per script using a config macro at the top of the file:

{{ config(materialized='table') }}

Documenting your model

Documentation is typically something that data professionals are not very good at, as it usually lives in a separate system from the code and therefore away from where we work. dbt makes creating and maintaining documentation easier by treating it as a code artefact.

This sits alongside the technical code in a schema.yml file, which is also where tests are authored (more on that below).

version: 2

models:
  - name: gold_delays
    description: Contains the average delay for each flight and airport in 2014
    columns:
      - name: TripId
        description: Primary key
      - name: LocaleDate
        description: Local date of the flight
      - name: Delay
        description: Delay of the flight
      - name: Distance
        description: Distance of the flight
      - name: OriginAirportKey
        description: Foreign Key for the Origin airport
      - name: DestinationAirportKey
        description: Foreign Key for the Destination airport

As we build out our model, we can easily add descriptions for each of its components.

If we run dbt docs generate in the terminal, dbt generates the documentation for the project as a JSON file, based on the contents of the schema file as well as the model code itself.

Running dbt docs serve will then launch the documentation as a local website.

Testing your model

For those of you familiar with Great Expectations or even table constraints, dbt’s tests run against the data itself rather than testing the processes that produce it.

Tests are added to the schema file and are associated with a column.

models:
  - name: gold_delays
    description: Contains the average delay for each flight and airport in 2014
    columns:
      - name: TripId
        description: Primary key
        tests:
          - unique
          - not_null

The schema approach uses dbt’s built-in generic tests; alongside unique and not_null, dbt also ships with accepted_values and relationships.
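As a sketch, adding a relationships test to the OriginAirportKey column in the schema file would check that every value in gold_delays exists in gold_airports:

      - name: OriginAirportKey
        description: Foreign Key for the Origin airport
        tests:
          - relationships:
              to: ref('gold_airports')
              field: AirportKey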

You can also create more specific, singular tests by writing SQL queries that return the rows which fail the check, and storing those scripts in the tests folder. For example, the following asserts that no trip has a negative total distance:

SELECT
    TripId,
    SUM(Distance) AS TotalDistance
FROM {{ ref('gold_delays') }}
GROUP BY TripId
HAVING NOT (SUM(Distance) >= 0)

Executing dbt test will run those tests and produce an output in the terminal.

Our TripId uniqueness test failed, which means we’ll need to revisit the logic that produces the column and make sure its values are unique, so that the test passes.

Summary

Using the Flights data, we’ve created a basic model, documented it and run some tests against it.

What we haven’t done is explore some of dbt’s more complex features, such as snapshots, incremental loading, deployment, scheduling, and seed files.

For what we’ve demonstrated in this blog post, dbt has made the process of creating, documenting, and testing a data model really easy. We’ll be exploring how well it handles more complex use cases, namely incremental loading and snapshots, in a follow-up series of posts.

Hope you’ve enjoyed reading and following along – the full code for this post can be found on GitHub.

Ust Oldfield