If you want to implement data lineage for your business, you’ve come to the right place. In this beginner’s guide, you’ll learn what data lineage is, why it is important, and the various use cases it can solve. Additionally, you'll discover how to choose a great data lineage tool, implement data lineage effectively, and maintain it over time. We will also address common challenges associated with data lineage.
Let’s get started!
You’ve probably seen many definitions of data lineage — data flow, data provenance, data lifecycle, data tracking, and even a data family tree. Most of them describe data lineage as the process of recording data movement in your business from point of entry to point of exit.
But in reality, it’s much more than that.
Think of data lineage as a living and breathing map of data movement within your business.
For example, this is how we show data lineage in Foundational:
Data lineage can help your business with:
For example, you can get answers to the following questions (grouped by data movement direction):
And so on.
Here is how you’d answer the question “Where is this data coming from?” in Foundational:
Let’s take a look at a couple of use cases.
Let’s say you’re a small B2B software company with about 200 employees. You’re building out your data team with 1-2 data engineers, 2 analysts, a data scientist, and a team lead, and you’re using Snowflake to store your data.
You want to create a marketing dashboard with an ad campaign report.
What do you do?
How can data lineage help?
Data lineage can show you what data is going into which table and where it’s going after.
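To make that concrete, here is a minimal sketch of lineage as a directed graph, with one helper for "where does this come from?" and one for "where does it go?". The table and dashboard names are made up for illustration, and a real lineage tool would build this graph for you rather than having you type it in.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.ad_spend": ["staging.ad_campaigns"],
    "raw.ad_clicks": ["staging.ad_campaigns"],
    "staging.ad_campaigns": ["marts.campaign_report"],
    "marts.campaign_report": ["dashboard.marketing"],
}


def downstream(graph, start):
    """Return every asset reachable from `start` by following edges forward."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, start):
    """Return every asset that feeds `start`, directly or indirectly."""
    # Flip every edge, then reuse the forward traversal on the flipped graph.
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return downstream(reversed_graph, start)


# What goes into the campaign report, and what depends on the raw ad spend?
print(sorted(upstream(LINEAGE, "marts.campaign_report")))
print(sorted(downstream(LINEAGE, "raw.ad_spend")))
```

Before you build the ad campaign report, the upstream call tells you which raw tables it depends on. The downstream call tells you what would break if one of those tables changed.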
Let’s say you’re a mid-sized B2C e-commerce enterprise with millions of customers and a team of 1,500-2,000 employees. The complexity of your data ecosystem is growing every day. You have a few dozen data engineers, analysts, data scientists, and data platform engineers, all responsible for ingesting, transforming, and making the data usable for your business. And you have a data lake or even multiple data lakes.
You want to reduce the cost of data storage.
What do you do?
How can data lineage help?
Data lineage can show you the paths to all parts of your data stack so you can understand how it all fits together.
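For the storage cost question, one way to use those paths is to look for data that never reaches anything people actually use. The sketch below assumes you already have a lineage graph and a list of "end uses" (dashboards, reports, exports). All names are hypothetical, and the result is a list of candidates to review, not a list of things that are safe to delete.

```python
# Hypothetical lineage: asset -> assets that read from it.
LINEAGE = {
    "lake.orders": ["warehouse.orders_clean"],
    "lake.orders_backup_2021": [],
    "lake.clickstream": ["warehouse.sessions"],
    "warehouse.orders_clean": ["dashboard.revenue"],
    "warehouse.sessions": [],
    "dashboard.revenue": [],
}

# Hypothetical set of assets that people consume directly.
END_USES = {"dashboard.revenue"}


def feeds_an_end_use(graph, asset, end_uses, seen=None):
    """True if `asset` is an end use or leads to one through the graph."""
    seen = set() if seen is None else seen
    if asset in end_uses:
        return True
    seen.add(asset)
    return any(
        feeds_an_end_use(graph, child, end_uses, seen)
        for child in graph.get(asset, [])
        if child not in seen
    )


def cleanup_candidates(graph, end_uses):
    """List assets that no end use depends on, as candidates for review."""
    return sorted(
        asset for asset in graph
        if not feeds_an_end_use(graph, asset, end_uses)
    )


print(cleanup_candidates(LINEAGE, END_USES))
```

In this toy example the old backup and the clickstream branch show up as candidates, because nothing downstream of them ends in a dashboard.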
You’ve decided to implement a data lineage tool. Congratulations! But how do you choose the right one from so many?
There are tools that are okay, and tools that are great.
An okay data lineage tool will map only some of the data sets in your data stack. It’ll check the box on compliance and basic usage, but it won’t be easy to adopt, so your data team won’t use it much. And it won’t address areas like data management and improving data quality.
A great data lineage tool will map out all your data sets across your entire data stack (dbt, Snowflake, BigQuery, etc.) and all their origins, transformations, incidents, resting stops, and endpoints. It’ll help you increase your data coverage from 60% to 90%, make critical business decisions, ensure data compliance, and improve data management.
Look for a tool that has:
Metadata management: Lets you annotate your data assets with descriptions, tags, and custom attributes so you can search them, run impact analysis, and see how changes will affect other data assets.
Here's how to implement a data lineage tool step-by-step:
First, you need to prioritize what you need your data lineage for. Is it security? Compliance? Data quality? Data development? And which one is more important to your business?
Let’s go through them one by one.
Security is important to track the movement and transformation of data across your systems and processes to spot any potential weaknesses and protect sensitive information.
Compliance is required to follow regulations like GDPR, HIPAA, and PCI-DSS.
Data quality means tracking where your data is coming from, how it changes, and if there are any issues with its accuracy, completeness, or consistency.
Data development is nearly impossible without data lineage.
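For the security and compliance priorities above, a common pattern is to tag the sources that hold sensitive data and let lineage carry that tag downstream. Here is a rough sketch of the idea, using made-up table names. It works at the table level, so it will over-flag tables that drop the sensitive columns along the way.

```python
# Hypothetical lineage: asset -> assets built from it.
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "raw.products": ["marts.catalog"],
    "staging.customers": ["marts.customer_orders"],
    "marts.customer_orders": ["dashboard.sales"],
}

# Hypothetical starting point: sources known to hold personal data.
SENSITIVE_SOURCES = {"raw.customers"}


def propagate_sensitivity(graph, sources):
    """Mark every asset derived from a sensitive source as sensitive too."""
    tagged = set(sources)
    stack = list(sources)
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in tagged:
                tagged.add(child)
                stack.append(child)
    return tagged


print(sorted(propagate_sensitivity(LINEAGE, SENSITIVE_SOURCES)))
```

Everything derived from the customers table gets flagged, while the product catalog branch stays untagged.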
Next, determine the scope. What are the must-haves that your data lineage should cover? And what are the nice-to-haves? Do you need it for your databases? Your BI tools? Your data lake?
Here are some examples of must-haves and nice-to-haves.
Must-haves:
Nice-to-haves:
Next, you need to figure out your implementation strategy.
Do you want to buy a tool, or do you want to build one yourself?
Ultimately, there is no single right data lineage tool for every job. Some tools will be more mature than others depending on your needs and scope.
For example, if you’re a bank, there is probably a tool that can help you with your databases. But why do you need it and what for? Answering these questions can help you with your implementation strategy.
You chose a data lineage tool and implemented it. But now you need to determine your ongoing strategy for keeping data lineage up to date (both technical and process).
Let’s say you created a process where everyone who makes a new dashboard needs to document it. That’s a data lineage process. Most companies don’t do that. Under such a process, every time someone creates a new dashboard, they manually check to make sure the description is accurate.
Make sure you create a process for every change and always follow it.
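A process is easier to follow when part of it is checked automatically. As a small illustration, the sketch below flags dashboards that are missing a description or a list of source tables. The `Dashboard` record and its fields are assumptions made for this example. In practice you would pull this information from your BI tool or catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Dashboard:
    # Hypothetical record; field names are illustrative only.
    name: str
    description: str = ""
    sources: list = field(default_factory=list)


def undocumented(dashboards):
    """Return names of dashboards missing a description or a source list."""
    return [
        d.name for d in dashboards
        if not d.description.strip() or not d.sources
    ]


dashboards = [
    Dashboard("marketing", "Ad campaign performance by channel",
              ["marts.campaign_report"]),
    Dashboard("revenue", "", ["marts.orders"]),
    Dashboard("churn", "Monthly churn rate"),
]

print(undocumented(dashboards))
```

A check like this could run on a schedule or as part of a review step, so that missing documentation gets caught when the dashboard is created rather than months later.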
Finally, you need to think about how you’ll be publishing your data lineage information.
How will you make it accessible for a variety of use cases?
Make sure you can tie it back to the use cases—to show people how they should consume and use your data lineage information.
As soon as you implement data lineage, you’ll need to maintain it. That means regularly monitoring it and updating it, to keep it accurate.
If you use Foundational, your updates will be fully automated and easy to embed into your existing workflow.
With Foundational, you can:
Finally, let’s talk about the three most common data lineage challenges.
Let’s say you have a legacy database. The problem is, no one in your organization knows how the data sets in that database are being used or who owns what. But you found some code, and you see that it’s using the data somehow. You won’t delete the data, but you need to understand all the different assets.
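If the code you found contains plain SQL, a rough first pass is to pull out the table names it reads from. The sketch below does this with a simple regular expression. It is a deliberately crude illustration: it only looks at `FROM` and `JOIN` clauses, and it will be fooled by comments, quoted identifiers, and CTE names. A proper SQL parser or a lineage tool will do this far more reliably.

```python
import re

# Matches the identifier that follows FROM or JOIN, e.g. "legacy.orders".
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql):
    """Return the sorted, de-duplicated table names found in a SQL string."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql)})


# Hypothetical query found in legacy code.
legacy_sql = """
SELECT o.id, c.email
FROM legacy.orders o
JOIN legacy.customers c ON c.id = o.customer_id
"""

print(referenced_tables(legacy_sql))
```

Running this over every script you find gives you a first list of which legacy tables are still referenced, and by which code.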
Here is how you can understand what’s happening:
You do something with your data, but you realize that its accuracy is poor. It doesn’t show up correctly.
Here is how to improve it:
Implementing data lineage may take anywhere from six to twelve months. Because it takes so long, some businesses just give up. But even after you have successfully implemented it, the whole thing goes out of date very quickly. You need to put in lots of processes to keep it up to date. It’s surprisingly manual, because the technology for it hasn’t changed for years.
Here is how you can keep it updated:
There are some clear advantages to using a great data lineage tool for your business. It’s worth evaluating the many data lineage tools out there to choose the right one for you.
Want to learn more about data lineage for businesses and how we can help you with implementing, automating, and updating it?
Chat with us! We’d love to hear from you.