In their daily activities, Data Engineers and Data Analysts are required to improve data ingestion processes. Beyond the usual unit tests, it may be interesting to easily and quickly compare two datasets (i.e. tables, views, queries, etc) for different purposes such as impact analysis or non-regression testing. Also, discrepancies identification between two points of view of a snapshot table is quite useful for adhoc analysis or debugging.
With this in mind and for SQL practitioners, a pre-built and reusable script makes sense, hence the purpose of this article.
Ready-to-use SQL template explained
Let’s go to the heart of the matter with the expected query:
Now, let’s explain it:
Note: obviously, having two datasets with comparable fields is a prerequisite to use the above query
Illustrated example
Enough with theory, now to practice! Suppose a supermarket wants to retrieve all products eligible for a promotion in a single table according to the week promotion conditions.
As an initial statement, we consider below products in stock:
According to the promotion conditions in week 1:
the dataset_1 is:
In week 2, the promotion conditions evolve to:
The data pipeline evolution generates the dataset_2:
But… are we really confident with the results? Typical questions are:
Note: even if it seems easy to answer the above questions intuitively due to a trivial example on a small dataset, in real use cases we usually face plenty of rows and columns from complex queries (transformations, joins, aggregates, windows functions, …) letting this query take on its full meaning
Let’s try to answer the 3 questions interpreting the comparison query results:
Observations:
Finally, everything looks good:
Tips and tricks
In CTE, even if you can copy-paste your ingestion SQL query before/after evolution, it could be a good idea to store the results in temporary tables to simplify the comparison, improve the query performances and benefit from caching if you run it multiple times.
In real-world, divergences between two datasets can be messy. Then, sorting results by key (the granularity of the datasets) and flag can hugely help interpreting and comparing equivalent rows coming from both datasets.
To ease the investigations and identify the gap origin, the suspicious datasets common fields can be removed (i.e. commented in CTE) from comparison: if there is no result, it means all compared fields are equal. You can then focus on remaining suspicious fields only for the next comparison and make it step-by-step.
Note: an equivalent analysis could have been performed using LEFT JOIN strategy but would have been much more difficult to maintain (NULLs & fields comparison management) and less efficient as set operators are more powerful than joins.
Summary
This templated query therefore makes it easier for developers to quickly validate changes on complex data pipelines. It is useful as a complement to more traditional unit tests, and can even be considered in a more general way when comparing any two datasets with similar structures. Last but not least, the logic of this query is even easy to learn by heart!
Thank you for reading, I hope it was clear and I would be glad to hear your feedback :)