Data Lineage Concepts, Hierarchies, Visualizations and Roles in the Age of Big Data

Published in Dev Genius

8 min read · Jun 18, 2022

Data lineage is an important part of data governance and deserves special attention. In this article, let’s take a closer look at the concepts, hierarchies, visualizations and roles of data lineage.

The Concept of Data Lineage

In human society, lineage refers to the relationships that arise from marriage or childbirth, such as parent-child and sibling relationships and the other kinship relationships derived from them. It is an innate human relationship that has existed since the beginning of human society and is the earliest social relationship to form.

In the era of big data, data is growing explosively, and massive amounts of data of many types are generated rapidly. Through fusion, transformation and circulation, this huge and complex body of data generates new data and gathers into an ocean of data.

Data is generated, processed and integrated, circulated, and eventually dies, and relationships naturally form between data along the way. By analogy with lineage in human society, we call this the data lineage relationship. Unlike lineage in human society, the data lineage relationship has some characteristics of its own:

Attribution. Generally speaking, specific data belongs to a specific organization or individual, so data has attribution.
Multisource. The same data can have multiple sources. One piece of data can be generated by processing several other pieces of data, and there can be several such processing steps.
Traceability. The data lineage relationship reflects the life cycle of data, the entire process from generation to demise, so it is traceable.
Hierarchical. Data lineage relationships are hierarchical. Descriptive information about data, such as its classification, induction, and summary, forms new data, and descriptions at different degrees of abstraction form the levels of the data.

The Hierarchies of Data Lineage

There are subtle differences in the hierarchy of lineage relationships for different types of data.

Generally speaking, data belongs to a certain organization or person; that is, data has an owner. As data flows and merges between different owners, it forms a relationship between those owners. This is one kind of data lineage relationship, and it sits at the top level of the hierarchy. It clearly shows the provider and the demander of the data.

Databases, tables and fields are storage structures for data. Different types of data have different storage structures, and the storage structure determines the hierarchy of lineage relationships. This is why the lineage hierarchy differs somewhat between types of data.

The lineage relationship of data at different levels reflects different meanings. The owner level reflects the provider and demander of the data, and the other levels reflect the ins and outs of the data. Through the lineage relationship at different levels, the migration and circulation of data can be clearly understood, which provides a basis for data value evaluation and data management.

The Visualization of Data Lineage

Visualization, from a technical point of view, is the theory, method and technology of using computer graphics and image processing to convert data into graphics or images, display them on screen, and process them interactively. Its significance lies in the rapid transmission of information and the intuitive display of data and its relationships, which makes it easier for users to discuss the data, explore its essence, and discover problems.

For data lineage, visualization is especially important, because only through visualization can data lineage be clearly displayed to users. Based on the characteristics of data lineage, we designed a visualization graph for it. The graph includes five kinds of visual elements with different meanings, distributed in different positions of the graph. The visual elements are:

1. Information Node

The information node represents the owner of the data together with the data hierarchy information or terminal information. The information shown depends on the level of the lineage relationship: the owner level has only owner information, while the other levels include owner information plus data level information or terminal information. For example, in the lineage relationship between fields in a relational database, the description of a node is: owner.database.table.field.
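As a minimal sketch of the node description above, the levels can be modeled as optional parts of one identifier. The class name `NodeInfo`, its attribute names, and the sample values are assumptions made for illustration; the article does not prescribe a data model.

```python
from dataclasses import dataclass
from itertools import takewhile
from typing import Optional


@dataclass(frozen=True)
class NodeInfo:
    """Description of an information node; levels below the owner are optional."""
    owner: str
    database: Optional[str] = None
    table: Optional[str] = None
    field: Optional[str] = None

    def label(self) -> str:
        # Join the levels from the owner downward, stopping at the first missing one.
        levels = [self.owner, self.database, self.table, self.field]
        return ".".join(takewhile(lambda part: part is not None, levels))


# Hypothetical names: an owner-level node and a field-level node.
print(NodeInfo("finance").label())
print(NodeInfo("finance", "erp", "orders", "amount").label())
```

An owner-level node carries only the owner, while a field-level node fills in every level, which mirrors the distinction drawn in the paragraph above.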

There are three types of information nodes: the main node, data inflow nodes, and data outflow nodes.

There is only one main node, which is located in the middle of the whole graph and is the core node of the visual graph. The lineage relationship displayed by the graph is the lineage relationship of this node, and other lineage relationships unrelated to this node are not displayed on the graph to ensure the simplicity and clarity of the graph.

There can be more than one data inflow node, which is the parent node of the main node, representing the data source, and is located on the left side of the entire graph.

There can also be multiple data outflow nodes, which are child nodes of the main node, indicate the destination of the data, and are located on the right side of the entire graph. A terminal node is a special data outflow node from which the data no longer flows downward. This kind of data is generally used for visual display.
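The three node types can be sketched as a view over a directed graph centered on one main node. This is an illustrative sketch only: `LineageGraph`, `add_flow`, `view`, and the dataset names are hypothetical, and a terminal node is approximated here as an outflow node with no recorded downstream flow.

```python
from collections import defaultdict


class LineageGraph:
    """Directed graph of data flows, stored as source -> target edges."""

    def __init__(self):
        self.downstream = defaultdict(set)
        self.upstream = defaultdict(set)

    def add_flow(self, source: str, target: str) -> None:
        self.downstream[source].add(target)
        self.upstream[target].add(source)

    def view(self, main: str) -> dict:
        """Return only the nodes directly related to the main node."""
        inflow = sorted(self.upstream.get(main, ()))     # parents, drawn on the left
        outflow = sorted(self.downstream.get(main, ()))  # children, drawn on the right
        # Assumption: an outflow node with no further downstream flow is terminal.
        terminal = [n for n in outflow if not self.downstream.get(n)]
        return {"main": main, "inflow": inflow, "outflow": outflow, "terminal": terminal}


graph = LineageGraph()
graph.add_flow("crm.customers", "dw.customers")
graph.add_flow("erp.orders", "dw.customers")
graph.add_flow("dw.customers", "mart.sales")
graph.add_flow("dw.customers", "dashboard.revenue")
graph.add_flow("mart.sales", "report.q2")
print(graph.view("dw.customers"))
```

Restricting the view to the neighbors of the main node reflects the point made above: lineage unrelated to the main node is left out to keep the graph simple.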

2. Data Flow Line

The data flow line represents the flow path of data, which runs from left to right. The data flow lines converge from the data inflow nodes on the main node, and then spread out from the main node to the data outflow nodes.

The data flow line shows three dimensions of information, namely direction, data update magnitude, and data update frequency. There is no special design for the expression of direction, and it flows from left to right by default. The magnitude of the data update is represented by the thickness of the line. Thicker lines indicate larger data magnitudes, and thinner lines indicate smaller data magnitudes.

The frequency of data update is represented by the length of the dashes in the line. The shorter the dashes, the higher the update frequency; the longer the dashes, the lower the update frequency. A solid line means the data was transferred only once.
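The encoding of magnitude and frequency described above could be sketched as a small styling function. The function name `flow_line_style`, the logarithmic scaling, and the pixel ranges are all assumptions chosen for illustration; the article only states that thicker means larger updates, shorter segments mean more frequent updates, and a solid line means a single transfer.

```python
import math
from typing import Optional


def flow_line_style(rows_per_update: int, updates_per_day: Optional[float]) -> dict:
    """Map update magnitude and frequency to a line width and dash length (in px)."""
    # Thickness grows with the order of magnitude of the update, capped at 8 px.
    width = min(8.0, 1.0 + math.log10(max(rows_per_update, 1)))
    if updates_per_day is None:
        # A one-time transfer is drawn as a solid line (no dash pattern).
        return {"width": width, "dash": None}
    if updates_per_day <= 0:
        raise ValueError("updates_per_day must be positive, or None for a one-time transfer")
    # More frequent updates give shorter dashes, clamped to the range 2..40 px.
    dash = max(2.0, min(40.0, 40.0 / updates_per_day))
    return {"width": width, "dash": dash}


hourly_bulk = flow_line_style(rows_per_update=1_000_000, updates_per_day=24)
daily_small = flow_line_style(rows_per_update=100, updates_per_day=1)
one_time = flow_line_style(rows_per_update=100, updates_per_day=None)
print(hourly_bulk, daily_small, one_time)
```

A renderer could pass `width` and `dash` to whatever drawing library it uses; that part is deliberately left out here.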

3. Cleaning Rule Node

The cleaning rule node represents the filtering criteria applied during the data flow process. A large amount of data is distributed in different places, and each place has different requirements for data quality. The data recipient filters the incoming data according to its own requirements. These requirements form data standards, and the data is cleaned according to these standards.

Cleaning rules vary. For example, a value may be required to be non-null, or to conform to a certain format. On the visual graph, the cleaning rules are represented by a circle marked with a capital letter “E”, which stands in for the individual rules to keep the graph simple and clear. Viewing the content of the rules is also simple: moving the mouse over the circle marked with the capital letter “E” automatically displays the list of cleaning standards.

The cleaning rule marker sits on the data flow line, indicating that data flowing along the line can continue only if it meets these standards.
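A cleaning rule node can be thought of as a named list of predicates that every value must pass before it continues along the flow line. The sketch below assumes two rules taken from the examples above (non-null, and a format check); the rule names, the ISO date pattern, and the helper names `not_null`, `matches`, and `clean` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Rule = Callable[[Optional[str]], bool]


def not_null(value: Optional[str]) -> bool:
    """Reject missing and empty values."""
    return value is not None and value != ""


def matches(pattern: str) -> Rule:
    """Build a rule that requires the whole value to match the given pattern."""
    compiled = re.compile(pattern)
    return lambda value: value is not None and compiled.fullmatch(value) is not None


def clean(values: List[Optional[str]], rules: Dict[str, Rule]) -> Tuple[list, list]:
    """Split values into those that pass every rule and those that fail at least one."""
    passed, rejected = [], []
    for value in values:
        failed = [name for name, rule in rules.items() if not rule(value)]
        if failed:
            rejected.append((value, failed))
        else:
            passed.append(value)
    return passed, rejected


# The rule names double as the list shown when hovering over the "E" marker.
RULES = {"not null": not_null, "ISO date": matches(r"\d{4}-\d{2}-\d{2}")}
passed, rejected = clean(["2022-06-18", None, "18/06/2022"], RULES)
print(passed)
print(rejected)
```

Keeping the names of the failed rules alongside each rejected value makes it possible to explain why a record did not continue downstream.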

4. Transformation Rule Node

The transformation rule node is similar in appearance to the cleaning rule node, represented by a circle marked with a capital “T”. Located on the data flow line, it is used to represent the changes and transformations that occur during the data flow process.

Data from the data provider sometimes needs special processing before it can be delivered to the data demander. This processing may be relatively simple, for example keeping only the first four digits of the source data, or it can be very complex and require special formulas. On the visual graph, the rules are collapsed into a single marker to keep the graph simple and clear. It is still easy to check which transformation rules the data has undergone: move the mouse over the circle marked with a capital letter “T”, and a list of transformation rules is displayed automatically.
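A transformation rule node can be sketched as an ordered list of named functions applied in sequence. The example uses the "first four digits" case mentioned above; the preceding trim step, the helper names, and the sample value are assumptions added for illustration.

```python
from typing import Callable, List, Tuple

Transformation = Tuple[str, Callable[[str], str]]


def first_four(value: str) -> str:
    """Keep only the first four characters of the source value."""
    return value[:4]


def apply_transformations(value: str, rules: List[Transformation]) -> str:
    """Apply each named transformation in order and return the final value."""
    for _name, transform in rules:
        value = transform(value)
    return value


def rule_names(rules: List[Transformation]) -> List[str]:
    """The list a tooltip on the "T" marker might display."""
    return [name for name, _transform in rules]


RULES: List[Transformation] = [
    ("trim whitespace", str.strip),
    ("keep first four characters", first_four),
]
print(apply_transformations("  20220618 ", RULES))
print(rule_names(RULES))
```

Storing a name with each function keeps the executable rule and its human-readable description together, so the displayed list cannot drift from what actually runs.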

5. Data Archive Destruction Rule Node

We believe that data has a life cycle. When data no longer has use value, its life is over, and it is either archived or directly destroyed.

It is very difficult to judge whether the data still has use value. Some conditions need to be designed. When these conditions are met, the data is considered to have no use value and can be archived or destroyed.

On the visual graph, we designed a circle marked with a capital letter “R” to briefly represent the data archiving and destruction rules. Moving the mouse over the circle marked with a capital “R” automatically displays a list of archiving and destruction rules.
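The article does not say which conditions the archiving and destruction rules contain, so the following is only a sketch under invented assumptions: data is kept while it has consumers or was read recently, archived after a hypothetical one year of inactivity, and destroyed after a hypothetical seven years. The names `DatasetStatus` and `retention_decision` are likewise made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetStatus:
    name: str
    last_read: date   # last time any consumer read the data
    consumers: int    # number of downstream data demanders


def retention_decision(
    status: DatasetStatus,
    today: date,
    archive_after_days: int = 365,       # assumed threshold
    destroy_after_days: int = 365 * 7,   # assumed threshold
) -> str:
    """Return "keep", "archive", or "destroy" for one dataset."""
    idle_days = (today - status.last_read).days
    if status.consumers > 0 or idle_days < archive_after_days:
        return "keep"
    if idle_days < destroy_after_days:
        return "archive"
    return "destroy"


today = date(2022, 6, 18)
print(retention_decision(DatasetStatus("dw.orders", date(2022, 6, 1), 3), today))
print(retention_decision(DatasetStatus("tmp.export_2020", date(2020, 3, 1), 0), today))
```

Real rules would come from the organization's retention policy; the value of writing them as explicit conditions is that the same list can be evaluated by code and shown on the “R” marker.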

The Role of Data Lineage

The role of data lineage can be summed up in the following aspects:

1. Data Traceability

Traceability refers to the search for the root and source of things. The data we analyze and process may come from a wide range of sources, including government data, Internet data, data obtained from third parties through data transactions, and data we own ourselves. Data from different sources varies in quality and affects the results of analysis and processing differently. When data anomalies occur, we need to be able to track down the cause of the anomaly and control the risk to an appropriate level.

Data lineage reflects where data comes from and where it goes, which helps us trace the source of data and the processing it has gone through. On the data lineage visualization graph, the data source nodes sit to the left of the main node and are clear at a glance. The transformations the data has undergone can also be seen on the graph, which is very helpful for analyzing the causes of abnormal data.
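Tracing an anomaly back to its origins amounts to walking the lineage graph upstream until nodes with no parents are reached. The sketch below uses a breadth-first search over a plain list of (source, target) edges; the function name `trace_sources` and the dataset names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def trace_sources(edges: Iterable[Tuple[str, str]], node: str) -> List[str]:
    """Return the root sources (upstream nodes with no parents) that feed `node`."""
    upstream: Dict[str, Set[str]] = {}
    for source, target in edges:
        upstream.setdefault(target, set()).add(source)

    roots: Set[str] = set()
    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        parents = upstream.get(current, set())
        if not parents and current != node:
            roots.add(current)  # nothing flows into this node, so it is an origin
        for parent in parents:
            if parent not in seen:  # the seen set also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return sorted(roots)


EDGES = [
    ("gov.census", "dw.population"),
    ("vendor.geo", "dw.population"),
    ("dw.population", "mart.regions"),
    ("internal.sales", "mart.regions"),
]
print(trace_sources(EDGES, "mart.regions"))
```

With the sample edges, the sources of `mart.regions` include both direct inputs and the inputs of its intermediate table, which is the kind of multi-hop question that is hard to answer without recorded lineage.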

2. Assess Data Value

The value of data is very important in the field of data transactions, involving the pricing of data. To evaluate the value of data, you need a basis. The data lineage relationship can provide a basis for the evaluation of data value from several aspects:

1). Data audience. On the lineage relationship diagram, the data outflow node on the right represents the audience, that is, the data demander. The more data demanders, the greater the data value.

2). Data update magnitude. In the data lineage relationship diagram, the thicker the data flow line, the greater the magnitude of the data update, which reflects the value of the data to a certain extent.

3). Data update frequency. The more frequently the data is updated, the fresher the data and the higher its value. On the lineage relationship diagram, the shorter the dashes of the data flow line, the more frequently the data is updated.
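The three aspects above can be combined into a rough indicator for comparing datasets. The formula below is purely an assumption made for illustration: the article says only that each factor raises the value, not how the factors should be weighted, and `value_indicator` is a hypothetical name.

```python
import math


def value_indicator(consumers: int, rows_per_update: int, updates_per_day: float) -> float:
    """Relative indicator that grows with audience, update magnitude, and update frequency.

    Assumed formula: audience count times the log-scaled magnitude and frequency.
    Only meaningful for ranking datasets against each other, not as a price.
    """
    magnitude = math.log10(1 + rows_per_update)
    freshness = math.log10(1 + updates_per_day)
    return consumers * magnitude * freshness


popular = value_indicator(consumers=12, rows_per_update=500_000, updates_per_day=24)
niche = value_indicator(consumers=1, rows_per_update=500_000, updates_per_day=24)
unused = value_indicator(consumers=0, rows_per_update=500_000, updates_per_day=24)
print(popular > niche > unused)
```

Multiplying by the number of consumers means data with no audience scores zero, which is consistent with the later point that such data has lost its use value.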

3. Data Quality Assessment

From the data lineage relationship diagram, you can easily see the standard list of data cleaning, which reflects the requirements for data quality.

4. Reference for Data Archiving and Destruction

If data has no audience, it loses its use value. On the data lineage diagram, if there are no data outflow nodes on the far right, that is a signal to evaluate whether the data represented by the main node should be archived or destroyed.
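Finding data without an audience can be sketched as a search for nodes that never appear as the source of a flow. Terminal nodes used for display also have no outflow, so this sketch assumes they are known and excludes them; the function name, the `display_nodes` parameter, and the dataset names are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple


def archive_review_candidates(
    edges: Iterable[Tuple[str, str]],
    display_nodes: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return nodes with no outflow that are not known display endpoints."""
    edges = list(edges)
    all_nodes = {node for edge in edges for node in edge}
    with_audience = {source for source, _target in edges}
    return sorted(all_nodes - with_audience - set(display_nodes))


EDGES = [
    ("erp.orders", "dw.orders"),
    ("dw.orders", "dashboard.revenue"),
    ("erp.orders", "staging.orders_copy"),
]
print(archive_review_candidates(EDGES, display_nodes=frozenset({"dashboard.revenue"})))
```

The result is a list of candidates for review rather than a decision: whether to archive or destroy still depends on the rules attached to the data.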

Conclusion

Right now, we live in an age of seemingly endless data. Data has invaded our lives. We rely on data for a variety of tasks, from fueling economic development and advancing science to recording our health information. There is no doubt that we have entered the era of big data.