What is change data capture?

I never told you what I do for living. My team makes software that seamlessly copies databases even if source and target are made by different vendors (heterogeneous database migration).

There are quite a few scenarios and complexities involved, but one of the hardest problems is called change data capture, or CDC.

If your database is small, you can copy it over by whipping up a few scripts of your own (and hope you don't mess something up). For larger databases you need to approach things more methodically.

There are a few ways to go about it:

1. Polling for changes

Periodically scan source tables, then copy the changes since the last scan to the target. This is easy to implement, but very inefficient.

2. Triggers and change tables

Sprinkle triggers all over the source database, intercepting DML calls. Create additional tables where your triggers will write full log of changes.

3. Database-specific change APIs

For databases allowing that, hook into their existing change tracking mechanisms. Periodically poll for changes. This is pretty decent for most scenarios.

4. Database-specific low level log file sniffing

This is a hard-core hacking. You don't want to poll, ask, or beg. Instead, you write a native agent which reads the database transaction log.

Yes, it's an undocumented binary file format. Yes, it can change whenever a new database version is released. You reverse-engineer it and do everything your self. Very low-level, very dangerous, very fast. We are talking near-realtime.

Want to know more? I will be going into more details for each approach in upcoming posts.