Superb Supervisors. Designing for Failure
Supervisors are “have you tried turning it off and on again?” turned into a programming strategy. We’ve all made that trade-off: should I keep debugging what’s wrong, or just reboot? Often, the reboot fixes it and we can just move on.
But supervisors can also be used to design systems that go down and stay down. In this post we’re going to talk about when we’d want to design this kind of system and how exactly to do it.
The Problem
We’re going to design a system of RabbitMQ consumers that fail at the first sign of trouble. Do not pass go, do not collect 200 dollars.
Let’s talk about what’s going on here. Our application starts a ConsumerGroup. This is a Supervisor that starts two complementary processes: a ConsumerSupervisor, responsible for starting our consumers, and a ConsumerMonitor. We want our consumer monitor to… monitor consumers. At the first sign of danger, it will instruct the ConsumerSupervisor to stop the presses and kill all of its children.
After we’ve fixed the problem, we’ll bring everything back online. OK, let’s get started.
Some Homework
Since we covered how to set up Producer and Consumer pools in a previous post, I won’t go into too much detail here. After setting them up, our supervision tree should look like this.
Great, on to our supervisors.
Bottom Up
Let’s start at the base of our Supervision tree, the Consumers. We’re going to use ExRabbitPool’s Consumer module to save us some boilerplate, and we’ll customize our restart strategy to support our “Burn the world” approach.
The biggest difference here is that we set the restart option to :temporary. A supervised process can set one of three restart options.
1. :permanent (default): If it dies, bring it back to life no matter what.
2. :transient: Only bring me back if I die under suspicious conditions. If I die with a “normal” reason, then it’s fine. Get me nice flowers.
3. :temporary: If I die at all, leave me dead.
My friends, what are we if not temporary processes, trying to handle the right messages, lest we be killed by one destined for someone else?
:temporary works for us here since we want to stop consumers from potentially doing more damage. If it dies, let it die.
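Here is a minimal sketch of what :temporary buys us. It assumes a plain GenServer as a stand-in for the real consumer (SketchConsumer is a hypothetical name, and nothing here touches ExRabbitPool or RabbitMQ), so it only illustrates the restart option itself.

```elixir
defmodule SketchConsumer do
  # Hypothetical stand-in for a consumer. The restart option given to
  # `use GenServer` ends up in the generated child_spec/1.
  use GenServer, restart: :temporary

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}
end

{:ok, sup} = Supervisor.start_link([SketchConsumer], strategy: :one_for_one)
[{_id, pid, _type, _mods}] = Supervisor.which_children(sup)

# Kill the consumer and wait until it is really gone.
ref = Process.monitor(pid)
Process.exit(pid, :kill)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
end

# The supervisor leaves the dead consumer dead, so nothing is active.
IO.inspect(Supervisor.count_children(sup).active, label: "active consumers")
```

Swap :temporary for :permanent in the sketch and the supervisor would bring the consumer straight back, which is exactly what we don’t want here.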
The ToxicityConsumer looks pretty similar, except it has a different exchange and queue.
On to the ConsumerSupervisor. Since it’s supervising processes that have the :temporary restart option, the strategy doesn’t really matter. We’re going to leave it with the default :one_for_one strategy.
We’ve added a couple of additional functions that we need to support our ConsumerMonitor process. The first one returns a list of all of the pids this supervisor is managing. The second function terminates all of the children.
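Those two helpers could look roughly like this. This is a sketch under assumptions, not the post’s original module: SketchWorker, SketchConsumerSupervisor, child_pids/1 and terminate_children/1 are hypothetical names, and the workers are bare GenServers rather than real consumers. Only Supervisor.which_children/1 and Supervisor.terminate_child/2 come from the standard library.

```elixir
defmodule SketchWorker do
  # Hypothetical stand-in for a consumer.
  use GenServer, restart: :temporary

  def start_link(name), do: GenServer.start_link(__MODULE__, name)

  @impl true
  def init(name), do: {:ok, name}
end

defmodule SketchConsumerSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    children = [
      Supervisor.child_spec({SketchWorker, :first}, id: :first),
      Supervisor.child_spec({SketchWorker, :second}, id: :second)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  # Pids of every child that is currently running.
  def child_pids(sup) do
    for {_id, pid, _type, _mods} <- Supervisor.which_children(sup), is_pid(pid), do: pid
  end

  # Stop every child this supervisor still knows about.
  def terminate_children(sup) do
    for {id, _pid, _type, _mods} <- Supervisor.which_children(sup) do
      Supervisor.terminate_child(sup, id)
    end

    :ok
  end
end

{:ok, sup} = SketchConsumerSupervisor.start_link([])
IO.inspect(SketchConsumerSupervisor.child_pids(sup), label: "before")
:ok = SketchConsumerSupervisor.terminate_children(sup)
IO.inspect(SketchConsumerSupervisor.child_pids(sup), label: "after")
```

Because the children are :temporary, terminating them also drops them from the supervisor, so the second listing comes back empty.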
The ConsumerMonitor will do well by its namesake, but take a look at its restart option.
We’re setting its restart option to :transient. The reason: if this puppy dies for ANY reason other than what’s on line 21, I want it alive. Notice we pass in a supervisor as the argument to start_link/1.
On start, we monitor each of the supervisor’s pids and just wait… As soon as a process dies, we instruct the supervisor to execute order 66.
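As a rough sketch of that flow, assuming the monitor talks to the supervisor through the standard Supervisor functions instead of the post’s own helpers: SketchConsumerMonitor is a hypothetical name, and two throwaway Agents play the part of the consumers.

```elixir
defmodule SketchConsumerMonitor do
  # :transient makes the normal stop below final, while a crash of the
  # monitor itself gets it restarted by whoever supervises it.
  use GenServer, restart: :transient

  def start_link(supervisor), do: GenServer.start_link(__MODULE__, supervisor)

  @impl true
  def init(supervisor) do
    # Monitor every child that is running right now.
    for {_id, pid, _type, _mods} <- Supervisor.which_children(supervisor), is_pid(pid) do
      Process.monitor(pid)
    end

    {:ok, supervisor}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, supervisor) do
    # One consumer died: stop all the others, then exit normally.
    for {id, _pid, _type, _mods} <- Supervisor.which_children(supervisor) do
      Supervisor.terminate_child(supervisor, id)
    end

    {:stop, :normal, supervisor}
  end
end

# Two throwaway Agents stand in for consumers.
children =
  for id <- [:a, :b] do
    Supervisor.child_spec({Agent, fn -> id end}, id: id, restart: :temporary)
  end

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)
{:ok, monitor} = SketchConsumerMonitor.start_link(sup)

# Kill one consumer and watch the monitor react.
[{_id, victim, _type, _mods} | _rest] = Supervisor.which_children(sup)
ref = Process.monitor(monitor)
Process.exit(victim, :kill)

receive do
  {:DOWN, ^ref, :process, ^monitor, reason} -> IO.inspect(reason, label: "monitor exit")
after
  1_000 -> IO.puts("monitor is still running")
end
```

Stopping with :normal is the important design choice: a :transient process that exits normally is not restarted, so the monitor stays down along with the consumers.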
The ConsumerGroupSupervisor ties it all together. Pay special attention to the strategy option.
A supervisor starts its children sequentially. We completely start the ConsumerSupervisor before we start the ConsumerMonitor. This ensures the pids are started and ready to be monitored. The :rest_for_one strategy allows the monitor to fail and recover without disturbing the consumers, but it will still allow us to heal the system: when the ConsumerSupervisor restarts, the monitor that was started after it is restarted too. More on that later.
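To see the :rest_for_one behavior in isolation, here is a small sketch. It is not the post’s ConsumerGroupSupervisor: SketchGroup and pid_of/2 are hypothetical names, and two Agents stand in for the ConsumerSupervisor and the ConsumerMonitor so the example stays self-contained.

```elixir
defmodule SketchGroup do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    children = [
      # Started first, so it is fully up before the monitor starts.
      Supervisor.child_spec({Agent, fn -> :consumers end}, id: :consumer_supervisor),
      # Started second. If it crashes, only it gets restarted.
      Supervisor.child_spec({Agent, fn -> :monitor end},
        id: :consumer_monitor,
        restart: :transient
      )
    ]

    # A crash restarts the crashed child and every child started after it.
    Supervisor.init(children, strategy: :rest_for_one)
  end

  # Look up the current pid of a child by its id.
  def pid_of(group, id) do
    {^id, pid, _type, _mods} = List.keyfind(Supervisor.which_children(group), id, 0)
    pid
  end
end

{:ok, group} = SketchGroup.start_link()
IO.inspect(SketchGroup.pid_of(group, :consumer_supervisor), label: "consumer supervisor")
IO.inspect(SketchGroup.pid_of(group, :consumer_monitor), label: "consumer monitor")
```

Killing the :consumer_monitor child leaves the :consumer_supervisor pid untouched, while killing :consumer_supervisor gives both children fresh pids. Order in the children list is what decides who gets dragged along.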
Let’s add this to our application and take it out for a spin.
Let’s take a look at what this looks like in observer.
Supervisors are awesome
Our tree dies when it’s supposed to, and comes back up in a fresh state when needed. LOVE IT!
Copyright © 2021 Steven Nuñez - HostileDeveloper