I dunno... people seem to like zero-downtime migrations, but really, for most companies a little downtime or degraded functionality doesn't matter. Most applications aren't even large enough for the downtime to be more than a couple of minutes.
Having some scheduled downtime saves you a lot of complexity of writing, monitoring, and finalizing these migrations. It also makes it a lot easier to have a consistent state in terms of code+data.
The article doesn't mention how to deal with different data models, constraints, etc.
Even if it takes a little more time to do the migration online, it's way less stressful. If something goes wrong during any stage, just roll back and figure it out tomorrow. If something goes wrong when you have turned everything off, you pretty much have to solve the issue now and keep going forward, or do the entire rollback to get to a working state.
Sure, but 99% of databases are small enough to tolerate some degradation/downtime/exceptions. If something goes wrong, just roll back.
There's no real difference there. "Zero-downtime migrations" basically only cover adding columns.
Let's say you change a relation from n:1 to n:m; this is not going to save you. You need to deploy a new version of the code. If you want to roll back, you might lose data, or some code won't work. It's just a mess: it takes more time and is more error-prone.
Author here, and a few things I'd like to clear up.
> 99% of databases are small enough to tolerate some degradation/downtime/exceptions
I agree that most DBs are small enough to perform the migration operation in a single transaction. However, the choice to take downtime isn't solely an engineering question; it's also a product/business consideration.
> Let's say you change a relation from n:1 to n:m; this is not going to save you. You need to deploy a new version of the code. If you want to roll back, you might lose data, or some code won't work. It's just a mess: it takes more time and is more error-prone.
Agreed. This article isn't meant to cover every possible migration, but to be a good starting point for most of them. It gives a framework for thinking about how to implement an n:1 -> n:m change.
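As a very rough illustration of how that framework maps onto n:1 -> n:m (the posts/post_authors names and the five-phase split are invented for this example, assuming a MySQL-ish schema, not taken from the article):

```python
# Hypothetical expand/contract plan for turning posts.author_id (n:1)
# into a post_authors join table (n:m). All names are invented;
# each phase ships as its own deploy.

# Phase 1 (expand): create the new structure alongside the old column.
EXPAND = """
CREATE TABLE post_authors (
    post_id   BIGINT NOT NULL,
    author_id BIGINT NOT NULL,
    PRIMARY KEY (post_id, author_id)
)
"""

# Phase 2 is a code deploy: dual-write to posts.author_id AND post_authors.

# Phase 3 (backfill): copy existing rows; INSERT IGNORE makes it safe to re-run.
BACKFILL = """
INSERT IGNORE INTO post_authors (post_id, author_id)
SELECT id, author_id FROM posts WHERE author_id IS NOT NULL
"""

# Phase 4 is a code deploy: read from post_authors only.

# Phase 5 (contract): drop the old column once nothing reads it.
CONTRACT = "ALTER TABLE posts DROP COLUMN author_id"

def run(conn, sql):
    """Apply one phase using any DB-API connection."""
    with conn.cursor() as cur:
        cur.execute(sql)
    conn.commit()
```

Each phase is individually deployable and reversible, which is the whole point of doing it incrementally.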
> Most companies are not "big tech".
I'm not working in big tech; I'd consider myself firmly within small tech. And these techniques would apply exactly the same if we had exactly one API server and one small database instance.
I prefer the online approach regardless of downtime. Breaking the migration into incremental, backwards-compatible steps is less stressful, with smaller tail-risk exposure, than trying to do it as a single state transition. The parallel change approach seems more complicated at first, but it ends up taking less work in the long run because of the added safety and incrementalism.
It's simpler than that: "zero downtime" isn't free, and has a direct cost to implement - one you potentially pay on every single migration, and one you don't necessarily make enough (or any) money back on.
In particular, most SaaS providers with a subscription model would be hard-pressed to care: they're not selling ads per click, so provided you don't lose users over it, there's zero value in avoiding downtime. In fact it's probably more valuable to take the downtime and use the savings in dev time and effort to ensure you have an expedient rollback and recovery strategy, because a failed recovery is what will actually cost you users.
At least for Rails there are several gems available (e.g. https://github.com/WeTransfer/ghost_adapter or https://github.com/departurerb/departure) that seamlessly hook into the existing migration system and run all eligible migrations through gh-ost or pt-osc as needed. You're right that it's not free, but it isn't all that far off either.
That said, online schema migrations are a specialized tool designed for very big tables that take hours to run an ALTER TABLE on. If all your tables are small enough that alterations take less than a second or so, don't bother and just block the table for a bit. It's fine.
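As a rough sketch of that cutoff (the one-million-row threshold is an arbitrary assumption, and TABLE_ROWS is only MySQL's estimate, so treat this as illustrative):

```python
# Sketch: run a plain blocking ALTER on small tables, and defer to
# gh-ost/pt-osc on big ones. Assumes a MySQL DB-API connection;
# the row threshold is an arbitrary assumption.

SMALL_TABLE_ROWS = 1_000_000

def estimated_rows(conn, table):
    with conn.cursor() as cur:
        # TABLE_ROWS is an estimate, not an exact count.
        cur.execute(
            "SELECT TABLE_ROWS FROM information_schema.TABLES "
            "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = %s",
            (table,),
        )
        row = cur.fetchone()
    return (row[0] or 0) if row else 0

def alter_table(conn, table, alter_clause):
    if estimated_rows(conn, table) > SMALL_TABLE_ROWS:
        raise RuntimeError(f"{table} is big; run this through gh-ost/pt-osc")
    with conn.cursor() as cur:
        # Small table: the lock only lasts a moment.
        cur.execute(f"ALTER TABLE {table} {alter_clause}")
```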
I used to believe this too, but the more online migrations I have done, the more I think it's actually the same amount of work in most cases, less work in the worst cases, and almost never more work.
All the steps you have to take here, you have to do anyway: write to new, read from new, how to translate, deleting old code. It's all the same. The only difference is you do it in chunks vs. all at once. It's perception. But all the components and architecting happen anyway.
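For instance, the dual-write/dual-read shim you'd write is the same either way; a minimal sketch (all names invented, assuming a MySQL DB-API connection and the posts/post_authors example from upthread):

```python
# Sketch of "write to new, read from new" behind a flag.
# All table and function names are invented for illustration.

READ_FROM_NEW = False  # flipped in a later deploy, once backfill is done

def save_author(conn, post_id, author_id):
    with conn.cursor() as cur:
        # Dual-write: keep the old column and the new join table in sync.
        cur.execute(
            "UPDATE posts SET author_id = %s WHERE id = %s",
            (author_id, post_id),
        )
        cur.execute(
            "INSERT IGNORE INTO post_authors (post_id, author_id) "
            "VALUES (%s, %s)",
            (post_id, author_id),
        )
    conn.commit()

def load_authors(conn, post_id):
    with conn.cursor() as cur:
        if READ_FROM_NEW:
            cur.execute(
                "SELECT author_id FROM post_authors WHERE post_id = %s",
                (post_id,),
            )
            return [r[0] for r in cur.fetchall()]
        # Old path still works, so a code rollback is always safe.
        cur.execute("SELECT author_id FROM posts WHERE id = %s", (post_id,))
        row = cur.fetchone()
        return [row[0]] if row and row[0] is not None else []
```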
Sure, you pay the cost of watching more deployments, but you also gain every step being automated, whereas offline migrations are often run once and never committed.
Offline migrations are, like online ones, not free. If you have to take downtime, it's usually not during business hours. You have probably quoted a downtime range to customers, so you have a window of time. You will be stressed. You will "practice"; you will write a "plan". Even if everything goes well, those aren't free.
But in the worst case, now you have more problems. Let's say something went wrong: some customer data wasn't what you expected, and you have to bring services back up in 20 minutes. Do you scramble and try to fix it, late at night with limited staff? Or do you roll back? If you roll back, you have to do this all over again, but you probably need to wait at least a week, because no one wants two planned outages back to back.
With online, every step is "safe". So if you hit a bug, no worries: the old path is still working! Maybe roll back the code, but there's no need to roll back the migration; just leave it in its current state. Take your time, fix it, and don't move on until it's working.
But even if that doesn't convince you, here's the number one reason to do online migrations:
No more late-night planned outages. Do everything during business hours. My employer doesn't get to intentionally make me work when I should be sleeping. If that means fewer features shipped, then so be it.