Photo by Egor Myznik on Unsplash
Updates, Order, & the Binlog
A theoretical exercise to realize the purpose of the binlog
Imagine you had two separate databases, a read-write primary and a read-only replica, and you wanted to keep a table between them synchronized. Writes would occur on the primary table and should eventually appear on the replica table. To synchronize the databases, you could just write a cron job which diffs the two tables and updates the replica so it properly reflects the primary. Simple enough.
Let's start with a table populated in primary but not in replica (the id is your primary key):
primary_tbl
----------
id, letter
1 A
2 B
3 C
replica_tbl
----------
id, letter
So the script runs some INSERT .... ON DUPLICATE KEY UPDATE
type statement to update or insert the missing rows. You will end up with this:
primary_tbl
----------
id, letter
1 A
2 B
3 C
replica_tbl
----------
id, letter
1 A
2 B
3 C
Now you run UPDATE primary_tbl SET letter = 'A' WHERE id = 3 LIMIT
to get:
primary_tbl
----------
id, letter
1 A
2 B
3 A
replica_tbl
----------
id, letter
1 A
2 B
3 C
Then when the sync runs:
primary_tbl
----------
id, letter
1 A
2 B
3 A
replica_tbl
----------
id, letter
1 A
2 B
3 A
No problem.
Now lets go back to before the update statement where our tables looked like:
primary_tbl
----------
id, letter
1 A
2 B
3 C
replica_tbl
----------
id, letter
1 A
2 B
3 C
Now let's introduce a very normal complication - a uniqueness constraint forcing letter to be unique. With our new constraint, UPDATE primary_tbl SET letter = 'A' WHERE id = 3 LIMIT
would now fail. But we can perform a series of updates to achieve the same result.
UPDATE primary_tbl SET letter = 'D' WHERE id = 3 LIMIT
UPDATE primary_tbl SET letter = 'C' WHERE id = 1 LIMIT
UPDATE primary_tbl SET letter = 'A' WHERE id = 3 LIMIT
Now we end up with a primary_tbl which looks like:
primary_tbl
----------
id, letter
1 A̷ C
2 B
3 C̷ D̷ A
Now when the syncer goes to update the replica. It tries to update row 1 to C - this would throw an error since C already exists in row 3.
To allow row 1 to update to C, you have to update row 3. So let's do the update on row 3 first, then row 1. However, row 3 cannot be updated to A since that will also throw a uniqueness error
And now you realize the crux of the issue. Without a uniqueness constraints our syncer could mindlessly update/insert replica rows such that they matched the primary. But, now with the uniqueness constraint, such operations cannot occurs.
To circumvent this issue, perhaps you could use a transaction, and do all the updates at the same time to prevent any duplicate values, but languages like MySQL do not defer foreign key constraints (see here).
If somehow the syncer knew row 3 was first changed to D then row 1 to C, and then row 3 to A then the updates could happen. That is exactly what a binlog is - it records every transaction and then the replica just runs through the transaction in the same order that occurred on the primary!
Running through this theoretical exercise made me realize why the binlog actually exits and how diffing two tables is not enough. Hopefully, you realized the same!