Slowly changing dimensions for software engineers

TLDR: When adding a column to a database (often a low-cardinality column, one that takes only a handful of values like red, yellow, green), ask yourself whether there are good reporting questions that will be asked about the historical states of this column (for audit, debugging, or product insight). If there are, consider a strategy for capturing that history (CDC, database triggers, system-versioned tables, or immutable rows with timestamps). In the next post I'll discuss a real example with a special sort of database.

A full stack developer used to mean frontend + backend. The term now seems to mean product + UI design + frontend + backend + ops / CX support + reporting analytics (oh, and you should also do sales and marketing - you'll learn a ton about what to actually build). I think learning each of these areas to a basic level of competence enriches the approach you take to the others.

The single most practical takeaway from reporting analytics that I think software engineers could benefit from is recognizing when a column in their database is a slowly changing dimension (SCD) for reporting purposes.

Here's an example to illustrate. Suppose you run a restaurant where customers can set their favorite food item. In the database you might model this as:

customer_id - favorite item
1 - chicken
2 - pasta
3 - bread
4 - pizza

This works: the next time a customer comes in, you know their favorite item and can suggest a new version of it. However, there are other questions you may want to ask. How many other favorite items did they have before (maybe new customers start out loving the pasta but then switch to pizza)? How long do people keep saying they love the chicken (if not for long, maybe it was good but quality has since slipped)? With the above schema you cannot answer these sorts of questions, because every update overwrites the previous value and the history is gone.
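One way to keep that history is the "immutable rows with timestamps" approach: never update a row, only append a new one each time the favorite changes. Below is a minimal sketch using Python's built-in sqlite3 module. The table name favorite_item_events, the changed_on column, the helper functions, and the sample dates are all made up for illustration.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE favorite_item_events (
        customer_id   INTEGER NOT NULL,
        favorite_item TEXT NOT NULL,
        changed_on    TEXT NOT NULL  -- ISO date; rows are only ever inserted
    )
""")

def set_favorite(customer_id, item, changed_on):
    """Record a change by appending a row instead of overwriting one."""
    conn.execute(
        "INSERT INTO favorite_item_events VALUES (?, ?, ?)",
        (customer_id, item, changed_on),
    )

def history(customer_id):
    """Return [(item, date), ...] for one customer, oldest first."""
    rows = conn.execute(
        "SELECT favorite_item, changed_on FROM favorite_item_events "
        "WHERE customer_id = ? ORDER BY changed_on",
        (customer_id,),
    ).fetchall()
    return [(item, date.fromisoformat(day)) for item, day in rows]

# Hypothetical data: customer 2 starts on pasta, later switches to pizza.
set_favorite(2, "pasta", "2021-01-10")
set_favorite(2, "pizza", "2021-06-01")

events = history(2)
current_favorite = events[-1][0]                        # latest row wins
previous_favorites = [item for item, _ in events[:-1]]  # everything before it
days_on_pasta = (events[1][1] - events[0][1]).days      # time spent in a state
```

The current favorite is still easy to read (it is the latest row), and the two reporting questions above become simple queries over the same table. The trade-off is that every read of the current state has to pick the latest row per customer.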

The favorite item column is a slowly changing dimension of interest from an analytics standpoint. A good illustration is the following example (I admire the author's writing style; however, the ending feels cruelly poetic in that there was no person called John in the database): en.wikipedia.org/wiki/Temporal_database#Exa..

Another good example is a detective taking notes on a mystery case: marklogic.com/blog/bitemporal. It illustrates how "In some industries— financial services, insurance, healthcare, intelligence, law enforcement— keeping track of these different times is extremely important. Understanding when information was known, and being able to recreate that historical record in the case of an audit or to perform analytics after the fact, is critical."

Let's examine some other slowly changing dimensions, where engineers design for SCD upfront because the product need is clear:

Healthcare - a patient isn't simply assigned a primary diagnosis once; you'd want to know the entire history of past diagnoses, so you ledger this in an events table.
LinkedIn - you don't just want to know the current job but all past jobs and when the changes occurred, so you can visualize things appropriately.
Payroll - you cannot simply have a column that says hourly rate, as you'd want to know past hourly rates and the time spent at each one to plan reviews.

However, there are many cases where SCD seems unneeded from a product standpoint at first but is useful for the analytics team:

Healthcare - the doctor currently assigned to a member. You'd actually want to know the history of assignments, not just for treating the member but for doing longitudinal research.
LinkedIn - the last profile update to the title. Do people often change their primary headline when changing roles, or do they forget to? How long do titles stay fixed? Is this an opportunity for an engagement reminder?
Payroll - the address of the individual, for reporting reasons (this changes much more slowly and has fewer eyes on it than outright hourly rates).

These should hopefully get you thinking about slowly changing dimensions. You can always make MySQL or Postgres capture this temporal history by adding things like CDC, database triggers, or row-level logging columns, but there are databases and tools that do this automatically so the developer doesn't have to go to these lengths (either by replacing the whole database with one that supports AS OF SQL queries, or by supercharging your existing database through tooling such as CDC).
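To make the database trigger option concrete, here is a small sketch that keeps the original "current state" table and lets a trigger write every change to a side table. It uses SQLite through Python's sqlite3 module only because it runs anywhere; trigger syntax differs in MySQL and Postgres. The table and trigger names (customer, customer_favorite_audit, log_favorite_change) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The table the application already reads and writes.
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        favorite_item TEXT NOT NULL
    );

    -- Side table that accumulates the history.
    CREATE TABLE customer_favorite_audit (
        customer_id INTEGER NOT NULL,
        old_item    TEXT NOT NULL,
        new_item    TEXT NOT NULL,
        changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Fires only when the favorite actually changes value.
    CREATE TRIGGER log_favorite_change
    AFTER UPDATE OF favorite_item ON customer
    WHEN OLD.favorite_item <> NEW.favorite_item
    BEGIN
        INSERT INTO customer_favorite_audit (customer_id, old_item, new_item)
        VALUES (OLD.customer_id, OLD.favorite_item, NEW.favorite_item);
    END;
""")

# Application code stays unchanged: it still just overwrites the column.
conn.execute("INSERT INTO customer VALUES (2, 'pasta')")
conn.execute("UPDATE customer SET favorite_item = 'pizza' WHERE customer_id = 2")

audit_rows = conn.execute(
    "SELECT customer_id, old_item, new_item FROM customer_favorite_audit"
).fetchall()
```

The appeal of this design is that existing application queries do not change at all; the cost is extra write work on every update and one more piece of schema to maintain.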

Standard databases (vanilla Postgres or MySQL) answer one question: what does the data look like right now? Temporal databases add one or more time axes. A uni-temporal database tracks a single axis, either valid time (when a fact was true in the real world) or transaction time (when the database recorded it), so it can answer what the data looked like at a point in time. A bi-temporal database tracks both axes, so it can also answer what the data looked like at a point in time as it was known at some earlier point in time, even after later corrections. There are also tri-temporal databases, which add decision time: en.wikipedia.org/wiki/Temporal_database
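The difference between the two time axes is easier to see with a toy example. The sketch below is plain Python, not a real temporal database, and it simplifies heavily (no end dates, no deletions). The Fact class, the favorite_as_of function, and the dates are invented for illustration: the restaurant records a switch to pizza on 2021-06-01, then later learns the customer had actually switched on 2021-04-01.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass(frozen=True)
class Fact:
    favorite_item: str
    valid_from: date   # when it became true in the real world (valid time)
    recorded_on: date  # when the database learned it (transaction time)

FACTS = [
    Fact("pasta", date(2021, 1, 10), date(2021, 1, 10)),
    Fact("pizza", date(2021, 6, 1), date(2021, 6, 1)),
    # Backdated correction: recorded in September, true since April.
    Fact("pizza", date(2021, 4, 1), date(2021, 9, 15)),
]

def favorite_as_of(facts: List[Fact], valid_on: date, known_on: date) -> Optional[str]:
    """What was the favorite on `valid_on`, using only what was recorded by `known_on`?"""
    visible = [
        f for f in facts
        if f.recorded_on <= known_on and f.valid_from <= valid_on
    ]
    if not visible:
        return None
    # Most recent real-world change wins; ties go to the latest recording.
    best = max(visible, key=lambda f: (f.valid_from, f.recorded_on))
    return best.favorite_item

# Same real-world date, two different "as known on" dates.
believed_in_july = favorite_as_of(FACTS, date(2021, 5, 1), known_on=date(2021, 7, 1))
believed_in_december = favorite_as_of(FACTS, date(2021, 5, 1), known_on=date(2021, 12, 1))
```

Here believed_in_july is "pasta" and believed_in_december is "pizza": the real-world date being asked about is the same, but the answer depends on what had been recorded at the time. That is the question a bi-temporal model can answer and a single timestamp column cannot.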

Now back to the beginning: when adding a column (often a low-cardinality one), ask yourself whether there are good reporting questions that should be answerable about this column's history (for audit, debugging, or product insight). If there are, consider a strategy for capturing that history (CDC, database triggers, system-versioned tables, or immutable rows with timestamps). In the next post I'll discuss a real example with a special sort of database.