What is data versioning and 3 ways to implement it

Reproducibility of data is a key requirement in any data platform. The technique used to achieve it was once called data historization and is nowadays known as data versioning. This article covers what it is and three ways to implement it.

Petrica Leuca
Dev Genius

--

To exemplify data versioning, I imagine that I work as a data engineer in an e-commerce company. For simplicity’s sake, the e-commerce app saves its data into a Postgres database and the model is reduced to 2 tables: user and order.

To buy products, a user needs to create an account and provide personal information: name, address, etc. This kind of information is saved in a dimension entity (user), due to its descriptive and mutable nature. When the user places an order, this is called an event and is stored in an event/fact entity (order), due to its immutable nature.

Never store the entire address in a single field :)
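The figure above showed the reduced model; a minimal sketch of it, using SQLite in place of Postgres for portability, with column names assumed to match the psql output shown later:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# user is a dimension: descriptive, mutable attributes
conn.execute("""
    CREATE TABLE ecommerce_user_table (
        user_id         INTEGER PRIMARY KEY,
        user_name       TEXT,
        user_address    TEXT,
        creation_date   TEXT,
        insert_datetime TEXT,
        update_datetime TEXT
    )
""")
# order is a fact: immutable events referencing the user
conn.execute("""
    CREATE TABLE ecommerce_order_table (
        order_id   INTEGER PRIMARY KEY,
        user_id    INTEGER REFERENCES ecommerce_user_table (user_id),
        product    TEXT,
        order_date TEXT
    )
""")
conn.execute(
    "INSERT INTO ecommerce_user_table VALUES "
    "(1, 'demo-user-1', 'Street Unknown, No 10', '2021-10-23', "
    "'2021-10-23 13:54:02', '2021-10-23 13:54:02')"
)
conn.executemany(
    "INSERT INTO ecommerce_order_table VALUES (?, ?, ?, ?)",
    [(1, 1, "product-a", "2021-10-23"), (2, 1, "product-b", "2021-10-23")],
)
```

The order table's column names (product, order_date) are illustrative assumptions; the article only fixes the user table's layout.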

Based on the above data, we know that user demo-user-1 has ordered 2 products and should receive them at the Street Unknown, No 10 address. One month from today, demo-user-1 changes their address. The first idea that comes to mind is to overwrite the existing information and forget about the previous address. By doing so, however, important information is deleted. Due to legal requirements (audit, compliance, lawsuits), this kind of information needs to be retrievable, and this is where data versioning plays its main role. This requirement is called reproducibility, and it means that the result of a query for a certain point in time does not change over time, even though the data does.

What is Change Data Capture?

Our user updated their address in the application and our data now looks like:

ecommerce=# select * from ecommerce_user_table;
-[ RECORD 1 ]---+---------------------------
user_id | 1
user_name | demo-user-1
user_address | Street Unknown, No 20
creation_date | 2021-10-23
insert_datetime | 2021-10-23 13:54:02.815599
update_datetime | 2021-11-23 14:24:20.560314

In our example, the e-commerce application itself is interested in the current address of the user in order to send the products to the correct address. This is why the record in the user table is overwritten. The event of updating the address produces a change in the data and with change data capture we are able to log the changes in the data platform of the e-commerce company:

  • Either we store all previous versions of the user in a separate table (the history approach):
ecommerce=# select * from ecommerce_user_table_hist ;
-[ RECORD 1 ]---+---------------------------
user_id | 1
user_name | demo-user-1
user_address | Street Unknown, No 10
creation_date | 2021-10-23
insert_datetime | 2021-10-23 13:54:02.815599
update_datetime | 2021-10-23 13:54:02.815599
  • Or events are logged with what was changed (the event approach)
With CDC we are able to record the changes across time
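One way to implement the history approach on the database side is a trigger that copies the old record into the history table before each update. A minimal sketch under that assumption, again in SQLite, with a reduced set of columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ecommerce_user_table (
        user_id INTEGER PRIMARY KEY, user_address TEXT, update_datetime TEXT);
    CREATE TABLE ecommerce_user_table_hist (
        user_id INTEGER, user_address TEXT, update_datetime TEXT);

    -- the history approach: log the previous version before it is overwritten
    CREATE TRIGGER user_cdc BEFORE UPDATE ON ecommerce_user_table
    BEGIN
        INSERT INTO ecommerce_user_table_hist
        VALUES (OLD.user_id, OLD.user_address, OLD.update_datetime);
    END;

    INSERT INTO ecommerce_user_table
    VALUES (1, 'Street Unknown, No 10', '2021-10-23 13:54:02');

    UPDATE ecommerce_user_table
       SET user_address = 'Street Unknown, No 20',
           update_datetime = '2021-11-23 14:24:20'
     WHERE user_id = 1;
""")
```

In production, CDC is more often done by reading the database's write-ahead log (e.g. with Debezium) rather than with triggers, to avoid the extra write load on the application database.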

Now that we know what change data capture is, what is data versioning? It is nothing more than the technique of storing the versions of the data in such a way that they can be easily queried.

Data versioning, the Kimball approach

The most common way of doing data versioning is the Slowly Changing Dimension (SCD) approach, described by Ralph Kimball in The Data Warehouse Toolkit. There are around 7 types of SCD, but I will display the SCD type 6 approach.

valid_to_date can be defaulted to a date in the far future

For each new, different record, a validity interval is built up based on the unique natural key (user_id in this case). Each record gets a user_key, a surrogate key generated from the natural key and the change date.

With the user_key, the version valid at order_date is retrieved; with user_id, any version.
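A sketch of the interval lookup, assuming a hypothetical user_scd table with the surrogate key, validity columns and far-future default described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_scd (
        user_key        TEXT PRIMARY KEY,  -- surrogate: natural key + change date
        user_id         INTEGER,           -- natural key
        user_address    TEXT,
        valid_from_date TEXT,
        valid_to_date   TEXT               -- defaulted to a date in the far future
    );
    INSERT INTO user_scd VALUES
        ('1#2021-10-23', 1, 'Street Unknown, No 10', '2021-10-23', '2021-11-23'),
        ('1#2021-11-23', 1, 'Street Unknown, No 20', '2021-11-23', '9999-12-31');
""")

# retrieve the version valid at a given order_date via the validity interval
order_date = "2021-10-23"
row = conn.execute("""
    SELECT user_address
    FROM user_scd
    WHERE user_id = 1
      AND valid_from_date <= ?
      AND ? < valid_to_date
""", (order_date, order_date)).fetchone()
```

The half-open interval (valid_from inclusive, valid_to exclusive) guarantees that exactly one version matches any given date.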

Data versioning, the functional approach

Storage is cheap, and since the Hive metastore was introduced into the data landscape, a functional paradigm has grown around data engineering. With this in mind, we do not need to invest in building complex ETLs to be able to reproduce a certain version. We can simply create a daily version of the user:

Just an example.
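The example above showed daily copies of the user table; a minimal sketch of such a snapshot, with a hypothetical snapshot_date column standing in for a date partition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ecommerce_user_table (user_id INTEGER, user_address TEXT);
    -- same schema as the source, plus the partition column
    CREATE TABLE user_snapshot (
        snapshot_date TEXT, user_id INTEGER, user_address TEXT);
    INSERT INTO ecommerce_user_table VALUES (1, 'Street Unknown, No 10');
""")

def take_snapshot(snapshot_date: str) -> None:
    """Copy the full current state into the daily partition."""
    conn.execute(
        "INSERT INTO user_snapshot "
        "SELECT ?, user_id, user_address FROM ecommerce_user_table",
        (snapshot_date,),
    )

take_snapshot("2021-10-23")
conn.execute("UPDATE ecommerce_user_table SET user_address = 'Street Unknown, No 20'")
take_snapshot("2021-10-24")
```

Reproducing any version is then a simple filter on snapshot_date, at the cost of duplicating unchanged records every day.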

I consider this approach not efficient enough when I think of:

  • the need to answer whether something changed (complex querying needed);
  • non-daily load frequencies: a higher frequency duplicates the data even more, a lower frequency creates artificial data;
  • incremental loads: if the data is incremental at input, artificial data needs to be created in order to build a snapshot for a certain moment.

Data versioning, the event approach

From the CDC chapter, we know that we can do CDC with events. The events can be stored in either a database or a file, and be queried with window functions:

With window functions we can retrieve a certain version for a moment in time.
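A sketch of such a point-in-time query over a hypothetical user_event log: ROW_NUMBER() picks the latest event at or before the requested moment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_event (
        user_id INTEGER, user_address TEXT, event_datetime TEXT);
    INSERT INTO user_event VALUES
        (1, 'Street Unknown, No 10', '2021-10-23 13:54:02'),
        (1, 'Street Unknown, No 20', '2021-11-23 14:24:20');
""")

# the version valid at a moment in time = the latest event at or before it
moment = "2021-11-01 00:00:00"
row = conn.execute("""
    SELECT user_address FROM (
        SELECT user_address,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY event_datetime DESC) AS rn
        FROM user_event
        WHERE event_datetime <= ?
    )
    WHERE rn = 1
""", (moment,)).fetchone()
```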

If we look at the data closely, we see that with window functions we can generate SCD time intervals:

The fact table would contain only user_id and additional filters would be applied on validity intervals.
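The SCD validity intervals can be generated from the same hypothetical event log with LEAD(), each event remaining valid until the next event for the same user, with COALESCE supplying the far-future default for the open interval:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_event (
        user_id INTEGER, user_address TEXT, event_datetime TEXT);
    INSERT INTO user_event VALUES
        (1, 'Street Unknown, No 10', '2021-10-23 13:54:02'),
        (1, 'Street Unknown, No 20', '2021-11-23 14:24:20');
""")

# each version is valid from its event until the next event for the same user
intervals = conn.execute("""
    SELECT user_id, user_address,
           event_datetime AS valid_from,
           COALESCE(LEAD(event_datetime) OVER (PARTITION BY user_id
                                               ORDER BY event_datetime),
                    '9999-12-31 00:00:00') AS valid_to
    FROM user_event
    ORDER BY event_datetime
""").fetchall()
```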

While the event approach reduces the need to implement the complex SCD ETL, there are 2 relevant points to consider:

  • if an event is generated on attributes which are of no interest, a version nearly identical to the previous one might be created;
  • window functions are computationally heavy and it is still recommended to process them in ETL jobs.

In this article I have summarized what data versioning is, the role change data capture plays in it and how to implement it. Depending on the amount of data and the underlying technology of the data platform, any of the 3 ways might be the best choice.

Notes:

  • CDC, event approach: Kafka made it possible. You can stream change events at column level and create your unique view of the data.
  • While in some cases change behavior might be a key factor in an analytics product, I have not yet encountered a project where it was more important than the strain (and cost) put on the system by generating and integrating this kind of events.
  • By now, you might know that I am a huge fan of Docker. This is how you can run Postgres with docker-compose:
petra1$ docker-compose up 
demo-postgres-python_postgres_db_1 is up-to-date
Attaching to demo-postgres-python_postgres_db_1
postgres_db_1 | The files belonging to this database system will be owned by user "postgres".
postgres_db_1 | This user must also own the server process.
postgres_db_1 |
postgres_db_1 | The database cluster will be initialized with locale "en_US.utf8".
postgres_db_1 | The default database encoding has accordingly been set to "UTF8".
petra1$ docker ps
CONTAINER ID IMAGE COMMAND
....
10009cb5e923 postgres:14.0 "docker-entrypoint.s…"
petra1$ docker exec -it 10009cb5e923 psql ecommerce -U ecommerce_user
psql (14.0 (Debian 14.0-1.pgdg110+1))
Type "help" for help.

ecommerce=#

--


Well-rounded engineer, bringing data at your fingertips. I am not affiliated with any of the tools or products I write about.