Over the weekend I joined an ongoing question about what is a Data Lake / Lakehouse and are we reinventing new solutions for the same problems we have always had with analytical systems? You can argue that most of these problems have been resolved in modern RDBMS (Relational Database Management Systems) and if so, why are we bothering with a new concept? The crux of the reason for the Lakehouse approach is simply scale. Any single server idea has an upper limit (CPU/Memory/IO). When we meet this limit the only other option is to scale out. To scale out we need a scaled storage layer, such as a Data Lake, to which the architecture is typically based on Hadoop, HDFS.

The sheer volume and variety of data being generated today has made it increasingly difficult for traditional RDBMS to keep up. Data lakes and lakehouses provide a new approach to data management that allows organizations to store and process large amounts of data, regardless of its structure or format. They also allow for more cost-effective and scalable solutions, as data can be stored in its raw format and processed on-demand. Additionally, they support both batch and real-time processing and integration with a wide variety of data sources and tools.

Despite all this, a user suggested that in “9/10” cases a relational database is loaded with a subset of data, as a serving layer between a lake and an analytics software such as Tableau or Power BI. At Advancing Analytics we do not agree with this idea. Seldom we find the need for a relational system in a Data Lakehouse architecture - now I will caveat and say that is there is regularly need for a semantic layer, but that can take many different appearances. When pushed to understand why 90% of solutions require a database, the author cited Referential Integrity. Now I have some strong feelings when it comes to Database Theory. I have always been fascinated by the intersection of academia and industry. Through my studies and research, I have become well-versed in the theories and concepts of relational theory, as well as the latest developments in non-relational databases. This has led me to question whether multi-dimensional modeling is truly a relational concept. In my opinion, it is not. I will explain my reasoning in the following discussion.

Microsoft - Understand star schema and the importance for Power BI - Power BI | Microsoft Learn

Multi-dimensional modeling is a powerful technique for organizing and analyzing data in a way that is intuitive and easy to understand. It is often used in data warehousing and business intelligence applications to provide a consistent, reliable view of data across different dimensions and hierarchies. But is multi-dimensional modeling a relational concept or not? Let's explore this question in more detail.

One of the key characteristics of multi-dimensional modeling is that it is based on the concept of a "fact" table and one or more "dimension" tables. The fact table contains the measures or metrics that are being analyzed, while the dimension tables provide context and additional information about the data in the fact table. You can learn more about multi-dimensional modelling from Ralph Kimball. Ralph Kimball is a well-known data warehousing expert and author. He is best known for his work in the field of dimensional modeling, a technique used to design data warehouses and business intelligence systems. He is the co-author of several books on data warehousing and dimensional modeling, including "The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling" and "The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence".

Fact and Dimension tables are related to each other through primary key-foreign key relationships, similar to relational databases. However, there is a fundamental difference between multi-dimensional modeling and relational modeling. In a relational model, the focus is on maintaining data consistency and accuracy through the use of referential integrity constraints and relationships. In contrast, multi-dimensional modeling focuses on providing a consistent and reliable view of data across different dimensions and hierarchies, which is optimized for querying and analyzing large amounts of data.

Does Referential Integrity (RI) belong in a multi-dimensional system? No. (RI) is a key concept in database management that ensures data consistency and accuracy by enforcing relationships between tables in a database. In transactional systems, where data is constantly being updated and modified, RI is essential to maintain data integrity and prevent errors. However, in analytical systems, the focus is on querying and aggregating large amounts of data, rather than maintaining data consistency. In this article, we will explore the reasons why RI may not be necessary in analytical systems.

In this example, we have created two tables: "orders" and "customers". The "orders" table has a foreign key column "customer_id" that references the "customer_id" column in the "customers" table. The "ALTER TABLE" statement is used to add the foreign key constraint to the "orders" table.In this example, any insert, update or delete statement that would cause the value of the "customer_id" column in the "orders" table to reference a non-existent "customer_id" in the "customers" table will be rejected. This ensures that any order is only created if there's a valid customer. It's worth noting that some databases have a different syntax to define a foreign key constraint, but the main idea remains the same.

In PySpark, referential integrity (RI) can be enforced through the use of DataFrame operations and user-defined functions (UDFs). One approach to implementing RI in PySpark is to use the "join" operation to combine the data from two or more tables, and then use a UDF to validate the data and ensure that the relationships between the tables are maintained. For example, let's say we have two DataFrames: "orders" and "customers". The "orders" DataFrame has a column "customer_id" that references the "customer_id" column in the "customers" DataFrame. To enforce RI between these two DataFrames, we can use the following code:

In this example, the UDF "validate_customer_id" checks if the value of the "customer_id" column in the "orders" DataFrame exists in the "customers" DataFrame. If it does not, the UDF returns "False" and the "filter" operation is used to exclude that row from the final DataFrame "valid_data". This approach is not as efficient as using foreign key constraints in a relational database, as it requires additional processing, and it may not be as performant when dealing with large datasets. A better approach would be to validate the data upon import.

These systems are typically built to support complex queries that involve joins across multiple tables. Because the data in analytical systems is typically not updated in real-time, enforcing RI can become a bottleneck and impede performance. In many cases, the performance benefits of not enforcing RI outweigh the benefits of maintaining data consistency. Additionally, in many analytical systems, data is often loaded from multiple sources with different schemas and data types. Enforcing RI in these cases would be infeasible, as the data may not be compatible with the relationships defined in the database. This would require additional effort to map the data to the correct relationships, which could be time-consuming and error-prone.

Furthermore, in some cases, the data in analytical systems may not be accurate or complete. Enforcing RI would only mask these issues and give a false sense of data consistency. Instead, data quality checks and validation should be performed to ensure that the data is reliable. One of the main advantages of multi-dimensional modeling is that it allows for efficient querying and aggregation of data. The multi-dimensional structure of the data is optimized for reporting and analysis, which makes it easier to extract insights and identify trends. Additionally, multi-dimensional modeling makes it easy to drill-down and slice-and-dice the data to get a more detailed view of the data.

While multi-dimensional modeling shares some similarities with relational modeling, it is not a relational concept. Multi-dimensional modeling is optimized for querying and analyzing large amounts of data, while relational modeling focuses on maintaining data consistency and accuracy. Both approaches have their own strengths and weaknesses, and the choice between them will depend on the specific requirements and constraints of the analytical system. It's important to evaluate the use-cases and design the schema accordingly.

You want to use the best tool for the job. In an OLTP workload that will mostly be a RDBMS, although Polyglot persistence would argue to use the best solution for part of the problem, and I would agree. I would also argue that a Lake is a fine serving layer and that a multi-dimensional model when curated using an appropriate file format such as Delta can be served to an analytical software with little issues. There are alternative approaches to many relational concepts, but this is nothing new.

A fun idea to debate. Let me know your thoughts in the comments.

Here are a few references that may be relevant to the blog topic of multi-dimensional modeling and referential integrity in analytical systems:

Kimball, Ralph. "The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling." John Wiley & Sons, 2002.
Inmon, Bill. "Building the Data Warehouse." John Wiley & Sons, 2005.
Chaudhuri, Surajit, and Umeshwar Dayal. "An overview of data warehousing and OLAP technology." ACM SIGMOD record 26.1 (1997): 65-74.
Kimball, Ralph, and Margy Ross. "The data warehouse lifecycle toolkit: expert methods for designing, developing, and deploying data warehouses." John Wiley & Sons, 2008.
Gray, Jim, et al. "Transaction processing concepts and techniques." Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
Stonebraker, Michael. "The case for shared nothing." Communications of the ACM 53.6 (2010): 88-98.
Abadi, Daniel J., and David J. DeWitt. "The design and implementation of modern column-family systems." Proceedings of the VLDB Endowment 5.12 (2012): 1751-1762.
Stonebraker, Michael. "The future of DBMS." Communications of the ACM 55.11 (2012): 86-92.

These references provide an in-depth look at the concepts and techniques of multi-dimensional modeling, data warehousing, and OLAP technology, as well as the challenges and limitations of using traditional RDBMS in analytical systems.

Blog

Exploring the Differences: Multi-dimensional Modeling vs Relational Modeling in Analytical Systems