Hashing large tables
Diving into the hashing stage of our hash-based database comparison and replication.
Our hash-based data replication lets you compare and replicate very large tables even when the bandwidth between the source and target databases is very limited.
Using hashes to determine differences
Naively, we might expect the database itself to provide hashing functions we can use to determine row identity. While decent relational databases do provide hashing functions, support for them varies wildly, and we also want to replicate changes in non-relational datasets such as CSV or Excel files. So we compute the hashes ourselves, using SHA1 rather than MD5, since MD5 is banned in FIPS environments.
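To make this concrete, here is a minimal sketch of how a row hash can be computed client-side. The normalization and separator rules shown are illustrative assumptions, not our exact wire format:

```python
import hashlib

def row_hash(values):
    """Compute a SHA1 digest over a row's column values.

    Values are normalized to strings and joined with a separator byte,
    so the same logical row always hashes to the same digest no matter
    which database (or CSV file) it came from.
    """
    h = hashlib.sha1()
    for value in values:
        if value is None:
            h.update(b"\x00NULL\x00")  # distinguish NULL from empty string
        else:
            h.update(str(value).encode("utf-8"))
        h.update(b"\x00")  # column separator
    return h.digest()  # 20 bytes

# The same row produces the same 20-byte digest everywhere.
print(row_hash([42, "Alice", None]).hex())
```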
Client-side hashing
We read all of the data from all tables. However, we do so with clients running very close to the databases. This means that:
a) hashing is consistent, because our code does it for every database
b) the bulk of the data travels only across the LAN
c) only PK values and row hashes travel across the Internet (a sketch of the resulting comparison follows this list)
d) we can encrypt the data that travels across the Internet
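Assuming each side has produced a map of PK to row hash, the comparison itself is cheap; the function and sample values below are purely illustrative:

```python
def diff_hashes(source, target):
    """Compare {pk: row_hash} maps collected on each side.

    Only PKs and 20-byte hashes cross the Internet; the full rows stay
    on the LAN until we know which ones actually need to move.
    """
    inserts = [pk for pk in source if pk not in target]
    deletes = [pk for pk in target if pk not in source]
    updates = [pk for pk in source
               if pk in target and source[pk] != target[pk]]
    return inserts, updates, deletes

# Hypothetical example maps:
src = {1: b"a" * 20, 2: b"b" * 20, 3: b"c" * 20}
dst = {1: b"a" * 20, 2: b"x" * 20, 4: b"d" * 20}
print(diff_hashes(src, dst))  # ([3], [2], [4])
```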
No compression for hashes
Unfortunately, good hashes are by design effectively incompressible: their output is indistinguishable from random data. We could compress the key column values, but with a typical PK at 4 bytes against 20 bytes of SHA1, the savings aren't worth it.
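A quick experiment shows why: compressing a batch of SHA1 digests barely changes its size. Python's standard zlib is used here purely for illustration, and the exact byte counts will vary slightly:

```python
import hashlib
import zlib

# Concatenate 100,000 SHA1 digests (20 bytes each) and try to compress them.
digests = b"".join(hashlib.sha1(str(i).encode()).digest() for i in range(100_000))
compressed = zlib.compress(digests, level=9)

print(len(digests))     # 2,000,000 bytes
print(len(compressed))  # roughly the same size: hash output looks like
                        # random data, so compression buys us nothing
```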
Storing hashes
Today, tables with millions of rows are very common, and we are seeing more and more customers with billions of rows. All those rows didn't end up in their databases overnight, and our tiny SQLite state database certainly can't ingest them quickly, challenged as it is by multiple writers. Therefore, we store blobs containing the data for 100k rows at once.
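As a sketch of what such a blob could look like, assuming 4-byte integer PKs and a hypothetical row_hashes state table (the schema and names are illustrative, not our actual layout):

```python
import hashlib
import sqlite3
import struct

BATCH_SIZE = 100_000  # rows per stored blob

def pack_batch(rows):
    # Pack (pk, sha1_digest) pairs into one blob: a 4-byte big-endian PK
    # plus a 20-byte digest keeps each row at 24 bytes, so a 100k-row
    # batch becomes a single blob of roughly 2.4 MB.
    return b"".join(struct.pack(">I", pk) + digest for pk, digest in rows)

# Hypothetical state table and column names, for illustration only.
conn = sqlite3.connect("state.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS row_hashes "
    "(table_name TEXT, batch_no INTEGER, data BLOB)"
)

rows = [(i, hashlib.sha1(str(i).encode()).digest()) for i in range(BATCH_SIZE)]
conn.execute(
    "INSERT INTO row_hashes VALUES (?, ?, ?)",
    ("orders", 0, pack_batch(rows)),
)
conn.commit()
```

Writing one 2.4 MB blob per batch means a billion-row table needs about 10,000 inserts into the state database instead of a billion, which keeps SQLite comfortably within its limits.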