Writing design docs for data pipelines

Exploring the what, why, and how of design docs for data components.

Mahdi Karabiben
Towards Data Science


Over the past few years, adopting software engineering best practices has become a common theme within the data engineering space. From dbt’s software-engineering-inspired capabilities to the rise of data observability, data engineers are getting increasingly accustomed to the software engineer’s toolset and principles.

This shift had a major impact on how we design and build data pipelines. It made data pipelines more robust (since we moved away from hard-coded business logic and complex SQL queries to modular dbt models and macros) and drastically lowered the number of “Hey can you check this table?” Slack messages (via automated data quality monitoring and alerting).

These changes helped us move the industry in the right direction, but there are still areas in which we have much to learn from our software engineering counterparts.

Based on recent data released by dbt Labs, around 20% of dbt projects have more than 1,000 models (and 5% have more than 5,000). These numbers highlight one fundamental problem in our data pipelines: we’re not intentional enough with how we design them. We stack layers upon layers of ad-hoc models and use-case-specific transformations and end up with ten models that represent the same logical entity “with subtle differences”.

In this article, we’ll discuss one artifact that can help us design (and build) robust foundations for our data platforms: design docs.

What are design docs?

In the software engineering world, two main documents are usually created during the design phase of a software component to help engineers collaborate and document decisions:

  • The design doc or Request for Comments (RFC) should contain all the relevant information related to the current state, why something needs to be done, and the possible designs/solutions. The content then evolves based on feedback and inputs from different people/teams until a consensus is reached.
  • The Architectural Decision Record (ADR) documents a specific decision. This document is more of a written snapshot containing the core factors that shaped a technical decision and the different aspects of the decision/design itself.

Companies usually have an internal template for both documents (or a variation of them) that different teams can leverage to standardize how technical decisions are made. Gergely Orosz wrote a great post on the topic with multiple examples, and I also recommend going through this public overview of the design doc process at Google.

Why do we need design docs for data pipelines?

One of the unintended consequences of the dbt-based approach to building data pipelines is that it made adding more nodes/models to the graph too easy. This pushed data engineering teams into the cycle of building ad-hoc data pipelines: instead of treating data pipelines as complex pieces of software, they’re built without a proper design phase or a governed big picture, resulting in this very common dbt lineage graph:

Sample dbt graph with an ad-hoc approach: increasing complexity and lack of a governed foundational layer (image by author)

The main problem with this approach is that nodes keep getting added to the graph without much oversight. This results in an endless spiral of complexity that leads to unnecessary costs and makes finding the right consumption table to use an impossible task.

To combat this anti-pattern, my recommendation is to always treat data pipelines with the same rigor with which software engineering teams treat complex software components. This means that before writing dbt models, we should define (and agree on) some foundational blocks:

  • The scope of the pipeline/data assets
  • The changes to the global consumption layer
  • The characteristics of the assets that will be delivered
  • The design of the pipeline itself

As you can probably tell, my suggestion is to gather this information (and more) in a design doc and ensure that all concerned stakeholders are in agreement before you start the implementation.

Pipeline vs. Component

Before we start discussing the details of the design doc, it should be noted that the notion of scope, the overall design of the data platform, and how the data engineering team(s) are organized will vary from one company to another based on a multitude of factors. This means that the concepts introduced in this article should be adjusted based on your own data platform. For example, at Zendesk, we think about our data assets within data domains, so the scope of a design doc would be a given data domain (from its data sources to the foundational and highly-governed consumption layer).

This also applies to the definition of “data pipeline”; the term is used throughout this article since it’s a common logical component within the data engineering space, but a more accurate umbrella term would be “data component” (which can encompass a set of data products: pipelines, tables, or other artifacts).

What should my design doc look like?

Now that we’ve seen why design docs matter, the next question is how to adapt the concept to data pipelines (or, more accurately, data components). So what are the main sections that should be present in your data component’s design doc?

1. Design metadata

The design doc represents a golden opportunity to capture all the relevant metadata for a given component. This metadata can include the following:

  • Ownership: The component should have different owners (technical, business, etc.) who are accountable for the design and implementation.
  • High-level description: It’s important to have a very brief description of the component, why we want (or preferably, need) to build it, and which business problems or use cases it’s solving. This allows potential contributors from different teams to have the necessary context before reading further.
  • Technical metadata: This will vary based on your platform, but it’s essential to include the start date (when will the data start being available?) and the proposed history backfill (what’s the oldest date that’ll be available?).
  • Reviewers and review state (optional): This information can be managed outside of the design doc (via a separate ADR document, for example), but for the sake of simplicity and to streamline the process, this section can also include reviewers’ information (preferably part of a central team that can assess how the component fits within the global design) and the current state of the design (whether it’s approved or in an intermediate state).

You can’t have “too much” metadata in this section. If there’s a piece of information worth capturing, then it definitely has a place here.
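
To make this more tangible, here’s a minimal sketch of what such a header could look like if your design docs support a structured, machine-readable front matter. It’s purely illustrative: the component, owners, dates, and field names are all assumptions rather than a prescribed template.

```yaml
# Hypothetical design-doc header for a "subscriptions" data component.
# Every field name and value is illustrative, not a prescribed standard.
component: subscriptions
owners:
  technical: data-engineering-billing@company.example
  business: revenue-analytics@company.example
description: >
  Foundational tables covering the subscription lifecycle, built to replace
  the ad-hoc revenue models currently scattered across teams.
technical_metadata:
  start_date: 2024-01-01              # when the data starts being available
  history_backfill_from: 2020-01-01   # oldest date that'll be available
review:
  reviewers: [data-platform-core]
  state: draft                        # draft | in-review | approved
```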

2. Existing resources

This section aims to capture all the relevant resources related to this component. These can include documentation about the existing design or links to the business use cases or problems this component will solve.

The idea here is that anyone who reads the design doc should have the full context related to this component and the resources related to it.

3. Core downstream metrics and use cases

A key problem with ad-hoc data pipelines is that they’re built without a clear objective. Instead, tables are added to answer a vague, ill-defined use case, and then they’re forgotten.

For this reason, I recommend focusing on the metrics and downstream use cases before we even start thinking about the actual pipelines. This section should provide detailed context on the following items (one possible way to capture them is sketched after the list):

  • What are the main use cases that will rely on this data component?
  • What are the core metrics that will be calculated using the tables that are part of this component?
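
As mentioned above, one lightweight way to capture these items is a structured list that pairs each use case and metric with an owner and a definition. The YAML sketch below is purely illustrative; the component, metric names, owners, and definitions are assumptions made up for the example.

```yaml
# Hypothetical use cases and core metrics for a "subscriptions" component.
# All names, owners, and definitions are illustrative.
use_cases:
  - name: churn_reporting
    owner: revenue-analytics
    description: Weekly churn dashboards for the leadership team
  - name: finance_reconciliation
    owner: finance-systems
    description: Monthly reconciliation of billed vs. recognized revenue

core_metrics:
  - name: monthly_recurring_revenue
    definition: Sum of active subscription amounts, normalized to a monthly value
    grain: [account, month]
    used_by: [churn_reporting, finance_reconciliation]
  - name: gross_churn_rate
    definition: MRR lost to cancellations divided by MRR at the start of the month
    grain: [month]
    used_by: [churn_reporting]
```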

4. The consumption data model (or how it will change)

Now that we’ve discussed the core metrics and use cases we’re designing this component for, let’s decide on how the consumption layer should look.

Since we’ve already listed the downstream metrics, the data modeling phase becomes less tricky: it consists of defining the foundational data model that can serve all downstream usage scenarios. Here again, the specific design will depend on factors outside the scope of the design doc itself; you can leverage dimensional data modeling or other techniques as long as you maintain consistency across the different components and avoid creating silos or duplicate entities.

How the new data component will update the global consumption data model (image by author)

Up to this point, we still haven’t talked about data pipelines or dbt models. Instead, the focus is on defining how we want to present data assets that can satisfy all of the expected downstream use cases and metrics.
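
As an illustration of what the outcome of this step could look like, here’s a hypothetical sketch of a consumption-layer data model for the same example component. The dimensional approach, table names, and grains are assumptions, not a recommendation for your specific platform.

```yaml
# Hypothetical consumption-layer data model for a "subscriptions" component.
# Table names, grains, and the dimensional approach are illustrative.
consumption_layer:
  facts:
    - name: fct_subscription_events
      grain: one row per subscription lifecycle event (created, upgraded, cancelled)
      serves: [monthly_recurring_revenue, gross_churn_rate]
  dimensions:
    - name: dim_accounts
      grain: one row per account
      conformed: true   # shared with other components rather than duplicated
    - name: dim_plans
      grain: one row per subscription plan
```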

5. Metadata at the table (and column) level

After defining the logical data model, we should ideally also provide context at different levels regarding the consumption assets that’ll be built:

  • What information will each table contain? (table-level description)
  • For every table, what are the columns that we plan to provide? What information will each column contain? Are we making any data quality commitments?
  • For every table, what will be the data update frequency? Are we committing to any SLAs?
  • For every table, what’s the cost estimation for the initial backfill and every execution?

Our aim here is to ensure that all stakeholders agree on the assets (tables) that will be built and the different commitments being made. Depending on your data development standards, this section may also include additional information like the business criticality of every table or its certification.
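
In a dbt-based platform, much of this table- and column-level context can eventually live in the models’ YAML files, so it can help to draft it in that shape inside the design doc. The sketch below is a hypothetical example: the table, columns, and numbers are made up; descriptions and tests map to native dbt features, while the SLA and cost fields are custom entries under `meta` that dbt stores but doesn’t enforce.

```yaml
# Hypothetical dbt-style YAML for one consumption table of the component.
# Descriptions and tests are native dbt features; the SLA and cost fields
# are custom metadata under `meta` (not enforced by dbt).
version: 2

models:
  - name: fct_subscription_events
    description: One row per subscription lifecycle event (created, upgraded, cancelled).
    meta:
      update_frequency: hourly
      freshness_sla: data available within 2 hours of the source event
      estimated_backfill_cost_usd: 350   # rough estimate, to be validated
      estimated_cost_per_run_usd: 4
    columns:
      - name: subscription_event_id
        description: Surrogate key for the event.
        tests:
          - unique
          - not_null
      - name: account_id
        description: Account that owns the subscription.
        tests:
          - not_null
      - name: event_type
        description: Lifecycle event type (created, upgraded, downgraded, cancelled).
```

Drafting this section in that shape also keeps the design doc and the eventual dbt project close to each other, which makes the agreed-upon commitments easier to enforce during implementation.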

6. The pipeline design

The last section is where I recommend finally focusing on the data pipelines. Having identified all the core downstream use cases and expected commitments, we can now design how we want to get there with much more clarity.

This section would contain a list of all the source/raw tables that we aim to use as part of this data component (and whether they’re already part of the platform or we’d need to bring the data in via an Extract-Load process), as well as the dbt models (or data pipelines) we want to build to reach the desired state of the consumption layer.

Sample design of data pipelines (or dbt models) to build as part of a data component (image by author)
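
To make this section easy to review, it can end with a compact summary of the planned graph: which sources feed the component, which ones still need to be ingested, and which models we intend to build. The sketch below is hypothetical; the source names, layers, and model names are assumptions that follow a common staging/intermediate/consumption layout.

```yaml
# Hypothetical pipeline design summary for the "subscriptions" component.
# Source names, layers, and model names are illustrative.
sources:
  - name: billing_db.subscriptions
    status: available          # already ingested into the platform
  - name: crm.accounts
    status: to_be_ingested     # needs a new Extract-Load job
planned_models:
  staging:
    - stg_billing__subscriptions
    - stg_crm__accounts
  intermediate:
    - int_subscription_events_unioned
  consumption:
    - fct_subscription_events
    - dim_accounts
    - dim_plans
```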

But can it scale?

One initial reaction to this approach might be that it’ll be too difficult to scale, but that’s only true if we have a skewed definition of scalability.

Scalability shouldn’t consist of answering one thousand use cases with one thousand dbt models (because that approach eventually won’t be scalable). Instead, the scalability we’re looking for consists of efficiently answering ten thousand use cases without falling into the endless complexity loop. In that regard, this approach (which focuses on building a robust foundation fine-tuned for the core downstream use cases and with inputs from all relevant stakeholders) is definitely scalable.

Downstream of this foundational, highly-governed layer, you can adopt different approaches to tackle specific use cases or customized scenarios: either by using an “automated” semantic layer or by opening the door to denormalized, metric-oriented tables built on top of the foundational layer.

Conclusion

In this article, we covered the concept of design docs and how they can be valuable artifacts that allow us to design and build more robust data platforms and scalable data components.

The sections presented in this article don’t all need to be present in your own version of design docs. Instead, the most important thing to remember is to be intentional when building data pipelines and components.

For more data engineering content, you can subscribe to my newsletter, Data Espresso, in which I discuss various topics related to data engineering and technology in general.
