Safe Database Migration Pattern Without Downtime
I’ve been doing a continuous delivery talk for a while now and during my talk I describe a pattern of how to safely migrating one database to another database without downtime. Since many people contacted me and asked for more details about it, I will describe it here in more details as promised.
You can use this pattern to migrate between two different databases, for instance between MySql and MongoDB or between two schemas in the same database.
The idea of this pattern is to do a lazy database migration using feature toggles to control the behaviour of your application and progressing through the phases of the migration.
Let’s assume two databases you want to migrate from the “old” database to the “new” database.
Step 1
Build and deploy the “new” database schema onto production. In this phase your system stays the same, nothing changes other than the fact that you have deployed a new database which you can start using when ready.
Step 2
Add a new DAO to your app that writes to the “new” database. You may need to refactor your application to have a single (or very few) point(s) in which you access the database. At the points you access the database or DAO you add a multi-state feature toggle that will control the flow of writing to the database. The first state of this feature toggle is “use old database”. In this state your code ignores the “new” database and simply uses the “old” one as always.
Step 3
Start writing to the “new” database but use the “old” one as primary. We are now getting into the distributed transaction world because you can never be 100% sure that writing to 2 databases can succeed of fail at the same time. When your code performs a write operation it first writes to the “old” database and if it succeeds it writes to the “new” database as well. Notice that in this step the “old” database is in a consistent state while the “new” database can potentially be inconsistent since the writes to it can fail while the “old” database write succeeded.
It is important to let this step run for a while (several days or even weeks) before moving to the next step. This will give you the confidence that the write path of your new code works as expected and that the “new” database is configured correctly with all the replications in place.
At any time you decide that something is not working you can simply change the feature toggle back to the previous state and stop writing to the “new” database. You can make modification to the new schema or even drop it if you need as all the data is still in the “old” database and in a consistent state.
Step 4
Enable the read path. Change the feature toggle to enable reading from both databases. In this step the it is important to remember that “old” database is the consistent one and should still be treated as the authoritative data.
Since there are many read patterns I’ll describe just a couple here but you can adjust it to your own use case.
In case you have immutable data and you know the record id you first read from the “new” database and in case you did not find the record you need to fall back to the “old” database and look for the record there. Only if both databases don’t have the record you return a “not found” to the client. Otherwise if the record is found you return the result preferring the “new” database.
If your data is mutable you’ll need to perform the read operation from both databases and prefer the “new” one only if the timestamp is equal to the record in the “old” database. Remember in this phase only the “old” database is considered consistent.
If you don’t know the record id and need to fetch unknown number of records you basically need to query both databases and merge the results coming from both DBs.
Whatever your read pattern is, remember that in this case the consistent database is the “old” one, but in this phase you need to read and use the “new” database read path as much as you can, in order to test your application and your new DAO on a real production environment. In this phase you may find out that you are missing some indices or need more read replicas.
Let this phase run for a while before moving to the next phase. Like in the previous phase you can always turn the feature toggle back to the previous states without a fear of data loss.
Another thing to note that since you are reading data from two schemas you will probably need to maintain backward and forward compatibility for the two data sets.
Step 5
Making the “new” database the primary one. Change the feature toggle to first write to the new database (you still read from both but now prefer the new DB). This is a very important step. In this step you already run the write and read path of your code for a while now and when you feel comfortable you now switch roles and making the “new” database the consistent one and the “old” as a not consistent. Instead of first writing to the “old” database first you now write to the “new” database first and do a “best effort” writing to the old database. This phase also requires you to change the read priority. Up until now we considered the “old” database as having the authoritative data but now you would prefer the data in the “new” database (of course you need to consider the record timestamp).
This is also the point where you should try as much as you can to avoid switching back the feature toggle to the previous state as you’ll need to run a manual migration script to compare the two databases as writes to the “old” one may not have succeeded (remember distributed transaction). I call this “the point of no return“.
Step 6
Stop writing to the “old” database (read from both). Change the feature toggle again to now stop writing to the “old” database having only a write path with the “new” database. Since the “new” database still does not have all the records you will still need to read from the “old” database as well as from the new and merge the data coming from both.
This is an important step as it basically transforms the “old” database to a “read-only” database.
If you feel comfortable enough you can do step 5 and 6 in one go.
Step 7
Eagerly migrate data from the “old” database to the “new” one. Now that the “old” database is in a “read-only” mode it is very easy to write a migration script to migrate all the records from the “old” database that are not present in the “new” database.
Step 8
Delete the “old” DAO. This is the last step where all the data is migrated to the “new” database you can now safely remove the old DAO from your code and leave only the new DAO that uses the new database. You now of course stop reading from the “old” DB and remove the data merging code that handle merging data from both DAOs.
This is it you are done and safely migrated the data between two databases without downtime.
Side note: At Wix we usually run steps 3 and 4 for at least 2 weeks each and sometimes even a month before moving on to the next step. Examples for issues we had encounter during these steps were:
On the write path we were holding large objects in memory which caused GC storms during peak traffic. Replications were not configured/working properly. Missing proper monitoring.
On the read path we had issues like missing index. Inefficient data model that caused poor performance which let us to rethink our data model for better read performance.
Reference: | Safe Database Migration Pattern Without Downtime from our JCG partner Aviran Mordo at the Aviran’s Place blog. |