Building an End-to-End Open-Source Modern Data Platform

A detailed guide to help you navigate the modern data stack and build your own platform using open-source technologies.

Mahdi Karabiben
Towards Data Science


Photo by Christopher Burns on Unsplash

In the past few years, the data engineering field’s meteoric rise in importance has opened the door for accelerated innovation and progress, since today more people than ever are thinking about data resources and how to better leverage them. This progress, in turn, has led to a third wave of data technologies.

The first wave consisted of ETL, OLAP, and relational data warehouses as the cornerstones of a Business Intelligence (BI) ecosystem that couldn’t cope with the exponential increase in the four Vs of Big Data. Due to the limited potential of the BI-oriented stack, we then witnessed the second wave: an era of scalability in storage and compute, thanks to the Hadoop ecosystem (which allowed companies to horizontally scale their data platforms) and Apache Spark (which opened the door for efficient in-memory data processing at scale).

What we can call the “third wave” is this era in which we’re no longer worrying about scalability or distributed storage. Instead, we’re building bricks and new features on top of mature and distributed data platforms. We now have the possibility to think about topics like metadata management, data discovery at scale, and data reliability. We’re in the era of interchangeable SaaS modules, cloud-based platforms, ELT, and democratized data access. Welcome to the modern data stack wave.

In this article, we will build an end-to-end modern data platform from scratch, relying solely on open-source technologies and the resources offered by cloud providers. The article is also accompanied by a GitHub repo containing the necessary code and infrastructure-as-code (IaC) scripts required to build the platform.

The platform will consist of the following components:

  • Data warehouse: This is the most important component of our design, since an inefficient data warehouse would cause us issues no matter how sophisticated the other components get. Fundamentally, the 40-year-old concepts and paradigms behind the data warehouse are still relevant to this day, but they’re now combined with the horizontal scalability that came with the “second wave”, allowing for a highly efficient ELT architecture.
  • Data integration: Unsurprisingly, we actually need to get the data into our platform. What was before a tedious task of configuring and implementing connectors is now a solved problem thanks to the modern data stack.
  • Data transformation: Once we get the data into our warehouse (meaning we’re done with the EL part of our ELT architecture), we’ll need to build pipelines on top of it that transform the data into something we can directly consume to extract value and insights. This process is the T in our ELT. Until recently, it usually consisted of badly managed, large SQL queries or complex Spark scripts, but in this “third wave” we now have the necessary tools to better manage data transformation.
  • Orchestration (optional — for now): We’d still need to orchestrate the execution of our pipelines to ensure that the data is available as soon as possible and that the data lifecycle runs smoothly from one component to the next. This is marked as optional because some of the tools we’ll use offer a scheduling feature out of the box, so you shouldn’t really need a dedicated orchestration component during the first stages of your platform’s lifecycle (it would add unnecessary complexity). We’ll discuss orchestration options in this article nonetheless, since you’ll eventually need to add one to your platform.
  • Data monitoring (optional — for now): More data means more potential quality issues. To be able to trust our data we’d need to monitor it and make sure that we’re generating accurate insights based on it. This is marked as optional because when getting started the most efficient option would be to leverage the data testing capabilities of other components, but we’ll discuss data monitoring tools in this article nonetheless.
  • Data visualization: This is where we actually get to explore the data and generate value from it in the form of different data products, like dashboards and reports. One of the main advantages of this era is that we now have mature open-source data visualization platforms that we can deploy in a streamlined manner.
  • Metadata management: Most of the features of our platform (like data discovery and data governance) rely on metadata, and so we need to ensure that the metadata is centralized and leveraged throughout the platform.
The architecture of the platform (image by author)

Finally, please keep in mind that even though the technologies and tools that we’ll discuss are open-source, we will be building the platform on a cloud environment and so the resources that we’ll use (for compute, storage, etc.) aren’t free per se, but we won’t surpass the $300 budget offered by the GCP free trial.

If you want to avoid setting up a cloud environment, you can still experiment with the different tools locally; you simply need to replace the data warehouse (BigQuery in our case) with an open-source alternative (an RDBMS like PostgreSQL would do).
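If you do go the local route, a single Docker command is enough to spin up a stand-in warehouse. Here’s a minimal sketch in which the container name, password, and PostgreSQL version are arbitrary placeholders:

# Spin up a local PostgreSQL instance to act as the data warehouse.
# Container name, password, and version are placeholders.
docker run --name local-dwh \
  -e POSTGRES_PASSWORD=change-me \
  -e POSTGRES_DB=warehouse \
  -p 5432:5432 \
  -d postgres:14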

And now, without further ado, let’s build our modern data platform.

First, let’s talk data

To build a sample data platform, the first step of the process would be to pick one or many datasets to work with. This is an opportunity to explore one of the multiple open datasets available online, and I recommend searching for one that interests you personally — this would make the process more enjoyable since you’d be genuinely intrigued by the data. If you want some inspiration, feel free to get started with one of the following datasets:

  • Formula 1 World Championship (1950–2021): This dataset, which you can download from Kaggle or retrieve directly from the Ergast HTTP API, contains all the available data points on Formula 1 races, drivers, constructors, qualifying sessions, circuits, lap times, pit stops, and championships from 1950 to 2021. If you’re an F1 fan like myself, this dataset can give you many interesting insights about the sport.
  • World Development Indicators (1960–2020): This dataset offered by the World Bank is undoubtedly one of the richest open datasets you can find online. It contains around 1500 development indicators that you can get lost in.

The data warehouse: BigQuery

As mentioned above, choosing the right data warehouse for your use case is the most important piece of our puzzle. The three main options are Snowflake, BigQuery, and Redshift. And yes, none of them is open-source, but they all offer a serverless option, which means that we get to leverage the features of a sophisticated modern data warehouse while paying only for the resources we consume, with storage and compute billed separately.

The good news is that the serverless option is exactly what we’re looking for at this stage, even if the product isn’t open-source. That’s because we want something that can scale in terms of storage and query performance without necessitating dedicated maintenance efforts. And so the ideal option when getting started is a serverless managed offering — this is true for all of our components that necessitate elasticity, not just the data warehouse.

BigQuery suits this requirement nicely for a variety of reasons, out of which we can mention the following two:

  • First, it’s serverless by nature. The design behind it allows for increased efficiency thanks to the decoupling of storage and compute, making it a very reliable option for all types of use cases. Redshift’s serverless offering, on the other hand, is still in beta.
  • Second, it’s part of the offering of a cloud provider, and so there’s already a seamless integration with all the components of the GCP ecosystem. This simplifies our architecture even further because it minimizes the configuration effort.

And so it makes sense for us to leverage BigQuery as the data warehouse for this platform, but the choice doesn’t generalize: in other scenarios, a different option may be a better fit. When picking your data warehouse, take into account factors like pricing, scalability, and performance, and then choose the option that best fits your use case.

To get started, we just need to create a dataset, but feel free to familiarize yourself with some of BigQuery’s more advanced concepts, like partitioning and materialized views.
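You can create the dataset via the GCP console, or from the command line with the bq CLI. In the sketch below, the project ID, dataset name, and location are placeholders:

# Create a BigQuery dataset that will hold the raw data.
# Project ID, dataset name, and location are placeholders.
bq --location=EU mk --dataset your-project-id:raw_data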

In an ELT architecture, the data warehouse is used to store all of our data layers. This means that we won’t just use it to store the data or query it for analytical use cases, but we’ll also leverage it as our execution engine for the different transformations.

Now that we have chosen our data warehouse, the architecture looks as follows:

Current state of the architecture (image by author)

Before we move to the next component, let’s store the BigQuery audit logs in a dedicated dataset (GCP provides additional instructions for this), since this information will come in handy when we set up our metadata management component.
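One way to do this, assuming the dedicated dataset already exists, is a Cloud Logging sink that routes BigQuery audit logs into it. The sink name, project ID, dataset name, and log filter below are illustrative placeholders; the exact filter depends on which audit log format you want to capture:

# Route BigQuery audit logs into a dedicated dataset via a logging sink.
# Sink name, project ID, dataset name, and log filter are placeholders.
gcloud logging sinks create bigquery-audit-logs \
  bigquery.googleapis.com/projects/your-project-id/datasets/bigquery_audit_logs \
  --log-filter='resource.type="bigquery_resource"'

After creating the sink, you’d also need to grant its service account write access to the destination dataset.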

Getting the data in: Airbyte

When considering data integration products within the modern data stack, you’ll find a handful of companies (with closed-source products) competing to add the biggest number of connectors in the shortest amount of time. This sadly means slower innovation (since there are fewer people contributing to each product) and fewer possibilities to customize existing solutions.

The clear outlier here is Airbyte, the only major company in this space that opted to open-source its core product from the get-go. This allowed it to rapidly build a large community of contributors and to offer more than 120 connectors less than a year after its inception.

Deploying Airbyte is a breeze on all cloud providers. On GCP, we’ll use a Compute Engine instance with sufficient resources. Ideally, you’d configure your deployment via IaC, which makes it easier to manage versioning and to automate the process. (A sample Terraform configuration is available in the accompanying repo.)
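If you’d rather set things up manually, the steps boil down to installing Docker on the instance and running Airbyte via Docker Compose. This is a rough sketch, and the exact procedure may differ depending on the Airbyte version you deploy:

# On the Compute Engine instance (assumes Docker and Docker Compose are installed):
# fetch Airbyte and start it via Docker Compose.
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up -d
# The UI listens on port 8000; you can reach it through an SSH tunnel,
# similar to the gcloud commands used later in this article.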

Once it’s up and running, we simply need to add a connection by defining:

  • A source: You can either choose the “File” source type in the UI and configure it based on your dataset and the location where you uploaded the data, or, if you’re feeling more adventurous, you can leverage Airbyte’s Python CDK to build a new HTTP API source that fetches the data from the API you want to use.
  • A destination: Here you just need to specify the settings needed to interact with the data warehouse (“BigQuery” in our case).
Sample Airbyte connection (image by author)

It’s worth noting that Airbyte, for now, is designed solely for batch data ingestion (the EL in ELT), so it sadly won’t be among your options if you’re building an event-driven platform. If you have such a use case, your best option would be Jitsu, the open-source alternative to Segment.

And so now that we have Airbyte up and running and the data ingestion is taken care of, our platform looks as follows:

Current state of the architecture (image by author)

Managing the T in our ELT: dbt

When we think of the modern data stack, dbt is probably the first tool that comes to mind. The project, started in 2016 (and open-source from the get-go), tackled an issue that was widespread at the time: data pipelines were badly versioned, poorly documented, and didn’t follow the best practices of software engineering.

dbt is the ideal poster child for the third wave of data technologies because it represents the main aim behind this wave: adding features and functionalities to better manage existing data platforms with less effort and extract yet more value from the underlying data. Thanks to dbt, data pipelines (the T in our ELT) can be divided into a set of SELECT queries (called “models”) that can be authored directly by data analysts or analytics engineers. This capability then opens the door for multiple features like data lineage, versioning, data testing, and documentation.

Setting up your dbt environment can be done in two different ways:

  • dbt Cloud: This is the web-based Integrated Development Environment (IDE) hosted by dbt Labs. It’s the option that requires the least effort, yet it offers additional features like job scheduling, CI/CD, and alerting. The icing on the cake is that it’s actually free via the Developer plan.
  • dbt CLI: This option allows you to interact directly with dbt Core, whether by installing it locally using pip or by running a Docker image on Google Compute Engine, as we did for Airbyte. Using the CLI, you get to experiment with the different dbt commands and work in the IDE of your choice.

To allow dbt to interact with your BigQuery data warehouse, you need to generate the required credentials (you can create a service account with the necessary roles) and then indicate the project-specific information in your profiles.yml file. This is all very well explained in dbt Labs’ “Getting Started” tutorial, which introduces all the concepts that you need to familiarize yourself with.
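If you go the CLI route, a minimal setup might look like the following sketch. The project name is a placeholder, and the BigQuery-specific details of profiles.yml are covered in the tutorial mentioned above:

# Install the BigQuery adapter (which pulls in dbt Core), scaffold a project,
# and verify that dbt can reach the warehouse. The project name is a placeholder.
pip install dbt-bigquery
dbt init f1_platform
cd f1_platform
dbt debug   # validates profiles.yml and the connection to BigQuery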

Expected output after running dbt debug (image by author)

After this initial effort, it’s time to have fun with the data: you can use dbt to define models and the dependencies between them. For example, for the F1 dataset, you can generate a championship_winners model that contains the data of each season’s champion (total number of points, average pit-stop time per race, number of fastest laps throughout the season, average qualifying position, and so on). Whichever dataset you’re working on, you’ll find that the number of questions the data can answer is mind-boggling, and exploring them is a great exercise that will make you more confident when working on new datasets.
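As an illustration, a heavily simplified version of such a model might look like the sketch below. The staging models it references (stg_driver_standings and stg_races) and the column names are assumptions based on the Kaggle version of the dataset, so adjust them to match your own project:

# A hypothetical, simplified dbt model: one row per season with the champion's
# driver ID and total points. Model, ref, and column names are assumptions.
cat > models/championship_winners.sql <<'SQL'
with ranked_standings as (
    select
        races.year as season,
        standings.driver_id,
        standings.points,
        row_number() over (
            partition by races.year
            order by standings.points desc
        ) as season_rank
    from {{ ref('stg_driver_standings') }} as standings
    join {{ ref('stg_races') }} as races
        on standings.race_id = races.race_id
)
select season, driver_id, points
from ranked_standings
where season_rank = 1
SQL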

After working on your models, you can execute the command dbt docs generate to generate your project’s documentation (the catalog and manifest files).

At this stage, after putting in place our dbt setup, we now have the components that can handle the three steps of our ELT process, and the architecture looks as follows:

Current state of the architecture (image by author)

When we first introduced the architecture, we said that both orchestration and data monitoring/testing could be handled by another component for now, and you probably already guessed that this component is dbt. With dbt Cloud we can manage the scheduling of our pipelines and define different execution triggers (via webhooks, for example), while dbt also offers robust SQL-based testing functionality that we can leverage to make sure data quality issues don’t go undetected.
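As a sketch of what this looks like in practice, generic tests are declared in a YAML properties file next to your models and then executed via dbt test. The model and column names below are the hypothetical ones from the earlier example:

# Declare generic dbt tests for the hypothetical championship_winners model,
# then run the test suite for that model.
cat > models/schema.yml <<'YAML'
version: 2
models:
  - name: championship_winners
    columns:
      - name: season
        tests:
          - unique
          - not_null
      - name: driver_id
        tests:
          - not_null
YAML
dbt test --select championship_winners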

Sample output after running dbt test (image by author)

Where the magic happens: Apache Superset

Now that we have worked on our data and generated different views and tables that would offer insights, it’s time to actually visualize these insights via a set of data products. (If you’re unfamiliar with the term, this great article by Simon O'Regan offers an exhaustive overview of the different types of data products.)

Our aim at this stage is to build dashboards and charts (whether for analytics or monitoring, depending on your dataset) that can be directly accessed by our end-users, just like in the good old days of Business Intelligence (BI).

BI is one of the few areas that weren’t disrupted by the “second wave” of data technologies, mainly because the Hadoop ecosystem focused on processing data at scale without impacting how end-users consume it. This meant that for a long time the BI and data visualization space was dominated by proprietary tools (Tableau, Power BI, and more recently Looker), with only a few underwhelming open-source projects serving niche use cases.

Then came Apache Superset. When it was first open-sourced by Airbnb back in 2016, it represented the first true open-source alternative to existing BI tools, offering all of the functionalities required at an enterprise level. Today, it’s one of the leading technologies of the “third wave” (alongside alternatives like Metabase and Looker), thanks to its large open-source community.

A production Superset deployment consists of multiple components (like a dedicated metadata database, a cache layer, authentication, and potentially async queries support) and so for the sake of simplicity, we will rely on a very basic setup.

We’ll leverage Google Compute Engine again to spin up a Superset instance on which we’ll run a container via Docker Compose. The necessary Terraform and init scripts are provided in the article’s accompanying repo.
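If you want to reproduce that setup by hand, the official repository ships a Docker Compose file intended for exactly this kind of non-development trial. This is a rough sketch, and the compose file name may differ depending on the Superset release you check out:

# On the Compute Engine instance (assumes Docker and Docker Compose are installed):
# fetch Superset and start it via the non-development Docker Compose file.
git clone https://github.com/apache/superset.git
cd superset
docker-compose -f docker-compose-non-dev.yml up -d
# Superset then listens on port 8088, which matches the SSH tunnel below.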

Once Superset is up and running, you can connect to the instance via the following command:

gcloud --project=your-project-id beta compute ssh superset-instance -- -L 8088:localhost:8088 -N

After logging in to the Superset instance (via the steps provided in the official documentation), you simply need to connect it to BigQuery so that it can start interacting with your different datasets.

Expected result after adding a BigQuery connection (image by author)

After establishing the connection, you can experiment with the different chart types, build dashboards, and even leverage the built-in SQL editor to submit queries to your BigQuery instance.

Querying the BigQuery data via the SQL editor (image by author)

Now that we can offer our end-users direct access to the data via Superset, our data platform looks as follows:

Current state of the architecture (image by author)

You’ll be pleased to know that we only scratched the surface when it comes to Superset’s capabilities. You can manage access roles, leverage caching, build your own custom viz plugins, work with its rich APIs, and even enforce row-level access policies. Additionally, via Preset, you can opt for a managed version that relieves you of thinking about the deployment.

The cornerstone of the stack: OpenMetadata

Metadata management is probably the area where there’s the most disagreement within the data community on how to proceed. It’s a very fragmented space (25 tools and counting) in which different tools have taken drastically different approaches to the problem.

I strongly believe that OpenMetadata, the product open-sourced by the team behind Uber’s data platform, took the right approach in this space. By focusing on offering a horizontal metadata product instead of being yet another brick in the architecture, it makes a centralized metadata store an attainable goal. It has very rich APIs, enforces metadata schemas, and already offers a long list of connectors.

Other products implement their own way of managing metadata, and they do so behind closed doors, which would cause unnecessary overhead when adding them to our platform. OpenMetadata, on the other hand, focuses on providing a single source of truth for metadata that other products can interact with via its APIs. By adding it to the architecture, data discovery and governance become a given, since it already has all the features necessary to attain these goals. If you want to see its capabilities before adding it to our platform, you can first explore its sandbox.

Like Airbyte and Superset, we will deploy OpenMetadata via a Google Compute Engine instance (as usual, the Terraform and init scripts are provided in the accompanying repo). Once the deployment is complete, you’ll notice that there are actually four containers running on the VM for the following purposes:

  • Storing the metadata catalog on MySQL
  • Maintaining the metadata index via Elasticsearch
  • Orchestrating metadata ingestion via Airflow
  • Running the OpenMetadata UI and API server
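If you’d rather start these containers by hand instead of relying on the provided init scripts, a minimal sketch would be to download the docker-compose.yml shipped with the OpenMetadata release you want to run (see the project’s deployment documentation) and bring the stack up with Docker Compose:

# Start the OpenMetadata stack (MySQL, Elasticsearch, Airflow, and the
# OpenMetadata server) from the release's docker-compose.yml, then list
# the running containers.
docker compose -f docker-compose.yml up -d
docker compose ps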
Expected output after starting OpenMetadata (image by author)

OpenMetadata dutifully manages these components behind the scenes without requiring any configuration on your end, so we can immediately start leveraging it like any other product. Once it’s up and running, you can first connect to the Airflow port via this command:

gcloud --project=your-project beta compute ssh openmetadata-instance -- -L 8080:localhost:8080 -N

Then you can access the Airflow UI at http://localhost:8080/ (username: admin, password: admin). You’ll notice that a few DAGs already ran to load and index some sample data. After that, let’s connect to the OpenMetadata UI via the following command (the UI can then be accessed at http://localhost:8585/):

gcloud --project=your-project beta compute ssh openmetadata-instance -- -L 8585:localhost:8585 -N

Now you can simply log in to the GCE instance via SSH and connect OpenMetadata to BigQuery, BigQuery usage data, dbt, and Superset. After that, you can explore its different features and capabilities like data discovery and lineage.

OpenMetadata UI after connecting to BigQuery (image by author)

Now that we have added OpenMetadata to the platform, let’s take a look at our finalized architecture:

The finalized platform architecture (image by author)

Taking things to the next level: optional components

At the start of the article, we mentioned two optional components: orchestration and data monitoring. On a theoretical level, these are two very important functionalities for our data platform, but as we’ve seen, dbt can fulfill them both capably at this stage. Nonetheless, let’s discuss how these two components could be integrated when they become needed.

Orchestrating our pipelines: Apache Airflow

When the platform matures further and we start integrating new tools and orchestrating complex workflows, dbt scheduling will eventually become insufficient for our use case. One simple scenario is invalidating the Superset cache whenever a specific dbt model is updated, something we can’t achieve with dbt Cloud’s scheduling alone.

Airflow has been the go-to tool in the data workflow orchestration space since it was open-sourced by Airbnb in 2015. This allowed it to become an integral part of large data platforms at multiple tech companies, ensuring a large and very active open-source community around it — which in turn helped it remain the standard when it comes to orchestration, even in the “third wave”.

The reason why you should delay thinking about Airflow (or its alternatives) is the additional complexity that comes with a dedicated orchestration tool. Airflow sees the world in its own way, and for you to be able to fully leverage it, you’ll need to make compromises and adjust your workflows to match its characteristics.

When integrating an orchestration tool, you should also consider how you want to trigger your pipelines/workflows. Airflow does support event-based triggers (via sensors), but issues can quickly arise, forcing you to adapt your needs to the tool instead of letting the tool serve your needs.

Helping us sleep better at night: Soda SQL

Just like orchestration, data monitoring (a stepping stone toward data observability) is a need that dbt will eventually no longer cover sufficiently for our platform.

Instead of just validating the data of our dbt models, we’d want to track data issues throughout our whole platform, so that we can immediately identify the source of a specific issue and fix it accordingly.

Like data integration, data observability is a space where companies still go for the closed-source approach, which will inevitably slow innovation and progress. On the other hand, two open-source products cover most of our needs in this area: Soda SQL and Great Expectations.

Soda SQL is a great place to start because it doesn’t require much of an upfront investment and offers multiple handy features. You basically need just a couple of YAML files to get up and running, after which you can define custom tests and orchestrated scans.
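As an illustration of how lightweight the setup is, a first scan might look roughly like the sketch below. The package name, table name, and commands reflect Soda SQL at the time of writing and may differ across versions, so check the Soda SQL documentation for the exact workflow:

# Install the BigQuery flavor of Soda SQL. After filling in warehouse.yml with
# your BigQuery connection details (per the Soda SQL docs), let Soda generate
# default scan YAML files for your tables, then run a scan against one of them.
# The table name is a placeholder.
pip install soda-sql-bigquery
soda analyze
soda scan warehouse.yml tables/championship_winners.yml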

What comes next?

This has been a long ride. We went through different technologies — some of which are the products of the “third wave” that we’re witnessing, and others are time-tested “second wave” veterans. The main takeaway at this point is that building a fully functioning data platform is easier than ever — if you followed along with the implementation, you’ll find yourself with a ready-to-use modern data platform that you built in less than an hour.

The modern data stack is still fragmented, sure, and betting on some of the technologies that we discussed can be a risky decision. Other than dbt, there is no modern data stack tool that’s a clear winner at what it’s doing, and so the ecosystem will keep changing throughout the upcoming years via consolidation and competition. What is certain, though, is that exciting times are ahead of us.

For more data engineering content you can subscribe to my bi-weekly newsletter, Data Espresso, in which I’ll discuss various topics related to data engineering and technology in general:
