Pull Requests in Data Engineering are Full of Surprises

There’s a counterintuitive difference between writing code for software and writing code for data: in software, you are surprised far less often. Why counterintuitive? Because you’d expect SQL not to surprise you that much, but oftentimes it’s the opposite:

It’s order-sensitive
Query optimization matters a lot
Many tools are involved, and SQL behaves differently across them

Most importantly, when writing code for data, the data layer plays a huge role in how your code behaves, yet it is usually not visible while the code is being written. That’s a big deal.

Another way to think about this: in software, a code commit almost entirely describes what’s changing. This creates predictability, and one obvious outcome is that rolling back a faulty version to the previous one is straightforward. In data, rolling back a faulty commit requires a code change but also carefully replaying the data, which typically makes it a lot more complex.

What’s different about writing code for data?

There are certainly several ways to look at this, but one reason code for data is different is the process it goes through when being deployed. Let’s think about this using a simple example: what happens if a data engineer changes a field’s type in a table? Would this cause a problem? The answer, of course, is sometimes. For example, if that field is used in a comparison and its type is now boolean, the comparison may no longer match anything, which would cause a problem. The question is, would *anything* in the build process flag this? In the vast majority of cases, the answer is no. This is because most data frameworks check very little about the code, sometimes only basic syntax and nothing more.
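The field-type example can be reproduced in a few lines. This is a minimal sketch that uses Python’s built-in sqlite3 module as a stand-in for a warehouse; the users table, the is_active column, and the query are hypothetical, and other engines may coerce types differently or raise an error instead. The point is only that the downstream query runs without complaint in both cases.

```python
import sqlite3

# A downstream query written when is_active was stored as text.
QUERY = "SELECT COUNT(*) FROM users WHERE is_active = 'true'"


def count_active(column_type, values):
    """Build a throwaway users table and run the downstream query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE users (id INTEGER, is_active {column_type})")
    conn.executemany("INSERT INTO users VALUES (?, ?)", list(enumerate(values)))
    (count,) = conn.execute(QUERY).fetchone()
    conn.close()
    return count


# Before the change: the field is text, and the comparison matches two rows.
before = count_active("TEXT", ["true", "false", "true"])

# After the change: the field is boolean (stored as 1/0). The query still
# parses and runs, but the comparison no longer matches any row.
after = count_active("BOOLEAN", [1, 0, 1])

print(f"active users before: {before}, after: {after}")
```

Nothing fails here: the same query goes from counting active users to silently counting none, which is the kind of surprise a syntax-only build check cannot catch.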

Another aspect is that in data, the different projects, each representing its own tools and environments, are isolated from each other. For example, a person pushing new code to a dbt project has no mechanism by default to constrain what’s allowed and what’s not when that new code affects downstream dashboards in Looker or Tableau. These dependencies exist outside the boundaries where the dbt engineer is working, and there’s no exposure to them when writing, building, and deploying the code. The dbt project will build correctly, and the code will get deployed.

There are some mechanisms that you could work with, for example Exposures in dbt, SQLFluff, and others, but the simple examples above would still pass.

What can we do?

Of course, one can argue that bugs exist in software too. That is definitely the case, but in data engineering a surprising number of simple code changes cause devastating effects. Even a straightforward change, such as renaming a column in a single table, is something most data teams avoid unless absolutely necessary.

Why is that the case? Because renaming a column is potentially a breaking change for downstream queries, similar to renaming a field in an API. We have to assume that it’s a change that needs to be carefully thought out, which means understanding the actual dependencies and determining how they would be impacted. Another way to think about this is to treat these dependencies as implied contracts that data engineers are continuously creating by adding new queries, jobs, and pipelines. To avoid breaking changes that violate these data contracts, we need to analyze the actual code change, understand all of its downstream as well as upstream dependencies, determine the implied contracts, and check whether any of them are violated. Ideally, we can do this at the time of the change, before it’s merged and can break dashboards or cause a data incident.
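To make the idea of implied contracts concrete, here is a rough sketch that scans downstream SQL for references to a column that is about to be renamed. Everything in it is an assumption for illustration: the find_broken_consumers helper, the consumer names, and the queries are hypothetical, and matching identifiers with a regular expression is deliberately naive. A real analysis would need to parse the SQL and resolve aliases, views, and tool-specific definitions.

```python
import re


def find_broken_consumers(renamed_column, consumers):
    """Return the names of downstream queries that mention the renamed column.

    consumers maps a consumer name to the SQL text it runs.
    """
    # Whole-identifier match, so signup_date does not match signup_date_utc.
    pattern = re.compile(rf"\b{re.escape(renamed_column)}\b", re.IGNORECASE)
    return sorted(name for name, sql in consumers.items() if pattern.search(sql))


# Hypothetical downstream queries spread across tools.
consumers = {
    "looker:signup_funnel": "SELECT signup_date, COUNT(*) FROM users GROUP BY signup_date",
    "tableau:active_users": "SELECT id FROM users WHERE is_active = 'true'",
    "job:weekly_report": "SELECT u.signup_date FROM users u JOIN orders o ON o.user_id = u.id",
}

broken = find_broken_consumers("signup_date", consumers)
print(broken)
```

Each query that mentions the column is an implied contract: nobody wrote it down, but renaming signup_date would break it.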

However, doing this in a typical data stack is not easy:

Data engineers often don’t know what (or who) is using the fields they are changing, especially when multiple platforms and tools are involved, for example Snowflake and Tableau
The people who are affected are often on a different team
Data engineers, like any developer really, would like to move fast, and rightfully so

What this means is that we need a combination of technology, together with process:

The technology should understand, automatically and at build time, what all the implied contracts are, who is affected, and what the impact is
The process should allow for continuous development and deployment of changes, which is really what CI/CD is all about, but applied to data engineering
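As a rough illustration of how the two pieces could meet in a pull request check, the sketch below diffs a table’s schema before and after a change and fails the check only when a changed column has known downstream consumers. The schemas, the consumers mapping, and the diff_schema and ci_gate helpers are all hypothetical. In practice, the hard part is discovering the consumers automatically, which this sketch simply takes as input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakingChange:
    column: str
    reason: str


def diff_schema(old, new):
    """Compare two {column: type} mappings and list changes that can break readers."""
    changes = []
    for column, old_type in old.items():
        if column not in new:
            changes.append(BreakingChange(column, "removed or renamed"))
        elif new[column] != old_type:
            changes.append(
                BreakingChange(column, f"type changed from {old_type} to {new[column]}")
            )
    return changes


def ci_gate(old, new, consumers):
    """Return (passed, report); consumers maps a column to its downstream assets."""
    report = []
    for change in diff_schema(old, new):
        affected = consumers.get(change.column, [])
        if affected:
            report.append(
                f"{change.column}: {change.reason}; affects {', '.join(affected)}"
            )
    return (not report, report)


old_schema = {"id": "INTEGER", "is_active": "TEXT", "signup_date": "DATE"}
new_schema = {"id": "INTEGER", "is_active": "BOOLEAN", "created_at": "DATE"}
consumers = {
    "is_active": ["tableau:active_users"],
    "signup_date": ["looker:signup_funnel", "job:weekly_report"],
}

passed, report = ci_gate(old_schema, new_schema, consumers)
for line in report:
    print(line)
```

Changes that only add columns, or that touch columns nobody reads, pass untouched, so the check does not get in the way of moving fast.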

One encouraging aspect of this framing is that we already have a strong parallel for the process: we know what software engineering and CI/CD look like in modern software development. However, we still need to understand what supporting technology is needed to make it work seamlessly with data.

Chat with us

At Foundational, we are solving extremely complex problems that data teams face on a day-to-day basis. Identifying issues in pending pull requests is only one aspect of it. Connect with us to learn more.
