Metastable Failures Across Systems
As software engineers, optimization is central to how we work. We optimize load times by putting caches in front of static assets, or use PXE boot[1] so machines come up quickly. The downside we usually talk about is premature optimization, where programmers spend too much time optimizing the wrong thing. But can you over-optimize without optimizing prematurely?
Take this example:
1. You're running a website that gets a lot of page hits off of a few servers. To reduce the load on them, you decide to cache static assets in Varnish[2]. Now 90% of your load goes to the cache. Great, you've shed a ton of load!
2. Your traffic grows and grows, but the cache holds up fine since most of your assets are static, so you scale up the Varnish instances and keep the original few application servers.
3. At some point your cache goes down (networking, machine crash, whatever), and all the old traffic from step 1 plus all the new traffic from step 2 lands on those few application servers.
4. The application servers can't handle that traffic, so they go down too. As you try to bring them back up, your assets in Varnish have expired under the cache policy, so now you have to rehydrate the cache and keep the application servers alive at the same time.
In this specific case, this is a thundering herd problem, where a sudden wave of load takes your system down. In the more general case, it's called a metastable failure: the system gets hit with load that pushes it into a bad state, and the bad state persists even after the triggering load is removed[3]. Alternatively, this sometimes gets called grey failure, when the system keeps limping along in a degraded state instead of failing completely[4].
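To make the feedback loop concrete, here's a toy simulation of the scenario above (plain Python, with invented numbers): the cache outage is the trigger, and client retries plus a cold cache keep the application tier saturated long after the outage itself is over.

```python
# Toy model of the scenario above: a brief cache outage (the trigger)
# overloads the app tier, and retries plus a cold cache then keep it
# overloaded even though incoming traffic never changed. All numbers
# are invented for illustration.

APP_CAPACITY = 1_000      # requests/sec the application servers can serve
TRAFFIC = 5_000           # steady incoming requests/sec from users
WARM_HIT_RATE = 0.9       # a warm cache absorbs 90% of the traffic

hit_rate = WARM_HIT_RATE
retries = 0               # failed requests that come back next second

for t in range(15):
    if t == 5:
        hit_rate = 0.0    # trigger: the cache crashes and restarts empty

    offered = TRAFFIC * (1 - hit_rate) + retries
    served = min(offered, APP_CAPACITY)
    retries = offered - served           # everything unserved is retried

    # The cache can only re-warm from successful origin responses, so while
    # the app tier is saturated the hit rate stays pinned at zero.
    if t > 5 and retries == 0:
        hit_rate = min(WARM_HIT_RATE, hit_rate + 0.3)

    print(f"t={t:2d}s offered={offered:7.0f} served={served:6.0f} retrying={retries:7.0f}")
```

Even after the trigger at t=5 is long gone, the offered load never drops back under capacity, which is exactly the "stuck in a bad state" shape of a metastable failure.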
Let's take another example:
- You have a data center, where all of your servers PXE boot, so they require a live box to boot off of.
- This is extremely fast (you're just copying state from another server), requires little setup, and lets you scale up very easily.
- Something happens, and now all of your servers go down.
- How does the first machine start? If all your machines PXE boot, who is alive for the first server to boot off of?
This is known as the black start problem, and it has happened to cloud providers before. In this case, the trigger was operator error: a command rebooted every box in a DC. Once the trigger had passed, there was a real risk that no machine would be able to start at all. Luckily for Joyent, their DC came back up fine, but there was no guarantee that things were going to work out.
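As a sketch of how you might spot this failure mode ahead of time (hypothetical hostnames, and assuming PXE dependencies can be modeled as a simple "boots from" relation), you can ask whether anything in the fleet can boot with nothing else running:

```python
# Minimal sketch with hypothetical hostnames: model "X netboots from Y" as
# a dependency map and check whether the fleet could come back from a total
# outage, i.e. whether at least one machine can boot with nothing else up.

from collections import deque

# None means "boots from local disk/media"; anything else is a PXE
# dependency on that machine being alive first.
boot_source = {
    "web-01": "boot-01",
    "web-02": "boot-01",
    "boot-01": "boot-02",
    "boot-02": "boot-01",   # oops: the boot servers netboot off each other
}

def cold_start_order(boot_source):
    """Return a bring-up order from a cold start, or None if we're stuck."""
    up = {host for host, src in boot_source.items() if src is None}
    if not up:
        return None                       # black start: nothing can self-boot
    order = list(up)
    queue = deque(order)
    while queue:
        alive = queue.popleft()
        for host, src in boot_source.items():
            if src == alive and host not in up:
                up.add(host)
                order.append(host)
                queue.append(host)
    return order if len(up) == len(boot_source) else None

print(cold_start_order(boot_source))      # None -> this fleet can't black start
```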
Metastable failures occur when we over-optimize for the common case. Adding cache servers without scaling the application servers, or relying only on PXE boot, optimizes for when your system is on the happy path; these designs assume an initial healthy state and don't provide a way to recover when that assumption breaks.
Metastable failures aren't limited to computer systems either. The chip supply-chain crunch can be seen as a metastable failure: chip fabrication was planned around stable lead times, and once COVID disrupted demand, foundries couldn't cope with the unstable lead times, leaving everyone waiting 18+ months for chips.
These types of failures have been well studied in theory, but they're easy to introduce by accident. Software engineers who design scalable systems often think about coordination problems between nodes during failure recovery, but it's just as important to remember that cascading failures happen when a system can't quickly return to a good state. If your system can reach a metastable state, the chances of a cascading failure go up dramatically.
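One small, concrete way to help a system fall back toward a good state is to make clients back off instead of hammering an overloaded dependency with immediate retries. A minimal sketch of capped exponential backoff with jitter, where `fetch` stands in for whatever call your application actually makes:

```python
# One defensive pattern: capped exponential backoff with full jitter, so a
# herd of clients doesn't retry a struggling dependency in lockstep and it
# gets a chance to return to a good state. `fetch` is a stand-in for
# whatever call your application actually makes.

import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.1, cap=10.0):
    """Call fetch(), retrying with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to the exponential cap ("full jitter").
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```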