
This article on how to build an open data lakehouse was originally published on Medium. The blog has been republished with the author’s credit and consent.

Data lakes represent the first step towards gaining insights from ever-growing data. In many cases, a data lake is the first place collected data lands in a data system.

A data lake creates two challenges:

  • Data quality and governance. Everything is just a file/object in a data lake.
  • Performance. Limited support for query optimisation, such as metadata and indexing.

On the other hand, a data warehouse is often the final destination of analytical data. Data in a data warehouse may come from the data lake or directly from the sources. Data warehouse challenges include:

  • Limited support for unstructured data
  • Performance for machine learning. SQL over ODBC/JDBC is not efficient for ML. ML needs direct access to data in an open format.

Enter the data lakehouse. A data lakehouse system tries to solve these challenges by combining the strengths of data lakes and data warehouses. Key features of a data lakehouse include:

  • Designed for both SQL and machine learning workloads
  • ACID transactions
  • Partition evolution
  • Schema evolution
  • Time-travel query
  • Near data warehouse performance

Example data lakehouse system design from the paper by Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia.

Implementing a Data Lakehouse

Key components in a data lakehouse implementation include:

  • Leveraging an existing data lake and open data format. Table data is typically stored as Parquet or ORC files in HDFS or an S3 data lake.
  • Adding metadata layers for data management. Popular open source choices include Delta Lake, Apache Iceberg, and Apache Hudi. They typically store metadata in the same data lake, in JSON or Avro format, and keep a catalog pointer to the current metadata.
  • Having an analytics engine that supports the data lakehouse spec. Apache Spark, Trino, and Dremio are among the most popular ones.

Below, I will explain my process of implementing a simple data lakehouse system using open source software. This implementation can run with cloud data lakes like Amazon S3, or on-premises ones such as Pure Storage® FlashBlade® with S3.

Getting Ready for a Data Lakehouse

To implement a data lakehouse system, we first need to have the data lake and data warehouse ready. In my case, I have already set up my FlashBlade S3 data lake, Spark, and Trino data warehouse.

My data lake and warehouse setup.

Refer to my previous blog posts for details of the above setup.

Adding Open Data Lakehouse Metadata Management

One thing that is missing in my previous setup is the metadata management layer for the data lakehouse. I chose Delta Lake for this because it is easy to get started with, has fewer dependencies on Hadoop and Hive, and is well documented.

Delta Lake is implemented as a set of Java libraries. Only four jars are required to add Delta Lake to an existing Spark environment: delta-core, delta-storage, antlr4-runtime, and jackson-core-asl. Download these jars from the Maven repository and add them under the $SPARK_HOME/jars directory. Because I run Spark on Kubernetes, I add these jars to my Spark container image.
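
A minimal sketch of that image build step is shown below. The base image, jar versions, and the Spark user ID are assumptions; match them to your own Spark and Scala build.

    # Dockerfile sketch: bake the Delta Lake jars into an existing Spark image.
    # Base image, jar versions, and uid/gid 185 (the Spark user) are assumptions.
    FROM apache/spark:3.3.0
    ADD --chown=185:185 https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.1.0/delta-core_2.12-2.1.0.jar /opt/spark/jars/
    ADD --chown=185:185 https://repo1.maven.org/maven2/io/delta/delta-storage/2.1.0/delta-storage-2.1.0.jar /opt/spark/jars/
    ADD --chown=185:185 https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.8/antlr4-runtime-4.8.jar /opt/spark/jars/
    ADD --chown=185:185 https://repo1.maven.org/maven2/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar /opt/spark/jars/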

Next, I add the following configurations to my Spark session, so that Spark will use the Delta catalog and its SQL extension.
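
The two settings that matter are the Delta SQL extension and the Delta catalog. A minimal PySpark sketch is below; the S3A endpoint and credentials are placeholders for my FlashBlade data lake.

    # Build a SparkSession with the Delta Lake SQL extension and catalog enabled.
    # The S3A endpoint and credentials below are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-lakehouse")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("spark.hadoop.fs.s3a.endpoint", "https://flashblade-data-vip")
        .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
        .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )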

That’s all we need for Delta Lake and Spark integration.

Data Lakehouse with Delta Lake and Spark

Now let’s demonstrate some data lakehouse features with Delta Lake and Spark.

Save a Spark dataframe in Delta format:
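
Something like the following, where the bucket and table path are placeholders:

    # Write a dataframe as a Delta table to a FlashBlade S3 bucket (path is a placeholder).
    # `spark` is the SparkSession created earlier.
    df = spark.range(0, 5).withColumnRenamed("id", "value")
    df.write.format("delta").mode("overwrite").save("s3a://deltalake/demo-table")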

This writes both the Parquet data files and Delta Lake metadata (JSON) in the same FlashBlade S3 bucket.

Delta Lake data and metadata in FlashBlade S3.

To read back Delta Lake data into Spark dataframes:
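
For example, using the same placeholder path:

    # Read the Delta table back into a Spark dataframe.
    df = spark.read.format("delta").load("s3a://deltalake/demo-table")
    df.show()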

Delta Lake provides programmatic APIs for conditionally updating, deleting, and merging (upserting) data into tables.
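
The sketch below shows these APIs, assuming the Delta Lake Python bindings (the delta-spark package) are available; the table path and column names are placeholders.

    # Conditional update, delete, and merge (upsert) with the DeltaTable API.
    from delta.tables import DeltaTable
    from pyspark.sql.functions import expr

    dt = DeltaTable.forPath(spark, "s3a://deltalake/demo-table")

    # Update rows that match a condition.
    dt.update(condition=expr("value = 1"), set={"value": expr("value + 100")})

    # Delete rows that match a condition.
    dt.delete("value < 1")

    # Merge (upsert) a source dataframe into the Delta table.
    updates = spark.range(3, 8).withColumnRenamed("id", "value")
    (
        dt.alias("t")
        .merge(updates.alias("s"), "t.value = s.value")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )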

Transactions are difficult, if not impossible, in a plain data lake, but they are built into a data lakehouse. Transactions create snapshots. I can query previous snapshots of my Delta table by using time travel queries. If I want to access the data that has been overwritten, I can query a snapshot of the table before I overwrote the first set of data using the versionAsOf option.
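
For example (the table path is again a placeholder):

    # Time travel: read version 0 of the table, before the data was overwritten.
    df_v0 = (
        spark.read.format("delta")
        .option("versionAsOf", 0)
        .load("s3a://deltalake/demo-table")
    )
    df_v0.show()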

I can also retrieve a Delta table history like this:
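
A sketch with the same placeholder path:

    # Show the table history: one row per version/commit.
    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "s3a://deltalake/demo-table")
    dt.history().select("version", "timestamp", "operation").show(truncate=False)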

Delta table versions.

SQL on Data Lakehouse

While Spark is great for general ETL with its DataFrame APIs, SQL is preferred for advanced analytics and business intelligence. Below, I add data lakehouse support to my existing Trino data warehouse using the Trino Delta Lake connector.

To configure the Delta Lake connector, add the following to the catalog/delta.properties file, and restart Trino.
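
A sketch of that file is below. The exact property names depend on the Trino version (older releases use connector.name=delta-lake, newer ones delta_lake), and the metastore URI, S3 endpoint, and credentials are placeholders; check the Trino Delta Lake connector documentation for your release.

    # catalog/delta.properties (sketch; values are placeholders)
    connector.name=delta-lake
    hive.metastore.uri=thrift://hive-metastore:9083
    hive.s3.endpoint=https://flashblade-data-vip
    hive.s3.aws-access-key=<ACCESS_KEY>
    hive.s3.aws-secret-key=<SECRET_KEY>
    hive.s3.path-style-access=true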

With this, Trino can understand the Delta spec, and can query and update the Delta format output that Spark wrote above.

Connect to Trino Delta catalog:
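
For example, with the Trino CLI (the server URL is a placeholder):

    trino --server http://trino.example.com:8080 --catalog delta --schema default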

Create a Delta table in Trino, and query the data.
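
A sketch of these statements follows; the schema, table, and S3 location are placeholders.

    -- Create a Delta table (the location is a placeholder bucket/prefix).
    CREATE TABLE delta.default.people (
        id BIGINT,
        name VARCHAR
    )
    WITH (location = 's3a://deltalake/people');

    INSERT INTO delta.default.people VALUES (1, 'alice'), (2, 'bob');

    SELECT * FROM delta.default.people;

    -- Updates work because the Delta table supports transactions.
    UPDATE delta.default.people SET name = 'carol' WHERE id = 2;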

Note that updates, or transactions in general, are not supported in classic Trino tables on S3. To update even just one row in a table, we would need to repopulate the entire partition or table. With transaction support in a Trino Delta table, this becomes much easier, as shown above. Transactions are first stored in the _delta_log directory and later merged back into the base Parquet files in the backend.

An example Delta log including a delete operation.

I also want to stress that, although ACID transactions are normally a built-in feature of data lakehouse systems, they are not meant for general OLTP purposes. Transactions in a data lakehouse system should be infrequent.

As of this writing, Trino’s Delta Lake connector supports common Delta/Trino SQL type mappings and common queries, including select, update, and so on. Advanced Delta features, such as time travel queries, are not supported yet. For those features, use the APIs from the Delta Lake library.

Conclusion

In this blog post, I explained how to build an open data lakehouse on top of an existing data lake and warehouse system. The technologies I use here are either open source or based on open standards, so they can be deployed anywhere.

As data lakehouse architectures become more and more popular, I have started to hear from customers asking about data lakehouse performance. I will write about this, and about why fast object storage like FlashBlade with S3 is important in a data lakehouse, in a follow-up blog post.