A few years ago Werner Vogels coined the phrase 'you build it, you run it.' It's a noble sentiment: if you are responsible for building a system, you should at least take part in running it in production. It tries to move beyond the era of throwing software over the wall to operations, especially after it got bruised being thrown over the wall to test. If you know the system will wake you up at night, you might make different decisions about its reliability.
Since then, loads of places I've worked have been pushing to get developers and testers on call. But they are rarely ready for it: the level of quality, and of understanding of the application they are being asked to support, is too low, and on call becomes a nightmare. As a tester, when I am asked, I apply the following heuristics to decide:
- If it's hard to test, it will be hard to support. If you can't observe what's going on, control system state, isolate dependencies, or keep an eye on side effects, production support will suck.
- If we can't keep our test environments running, similar problems may exist in production. Or the environments differ so much that testing tells us little.
- If it took me weeks to onboard (do some testing, push some changes), then the information needed to support the system is not in a fit state.
- If I am still falling over null pointers, unhandled exceptions and the like when testing, rather than finding the subtle, painful bugs.
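The observability heuristic above is the easiest one to act on early: structured logs carrying a correlation ID let you trace a single request across services, which helps both testing and 3 a.m. support. A minimal sketch in Python (the service name and field names are illustrative, not from any particular system):

```python
import json
import logging
import uuid

# Structured JSON log lines with a correlation ID make it possible to follow
# one request through the system -- the difference between observing what's
# going on and grepping blind during an incident.

def make_log_record(event: str, correlation_id: str, **fields) -> str:
    """Render one structured log line as JSON, with stable key order."""
    record = {"event": event, "correlation_id": correlation_id, **fields}
    return json.dumps(record, sort_keys=True)

logger = logging.getLogger("payments")  # hypothetical service name

def handle_request(payload: dict) -> None:
    # Reuse the caller's correlation ID if present, otherwise mint one.
    correlation_id = payload.get("correlation_id") or str(uuid.uuid4())
    logger.info(make_log_record("request.received", correlation_id,
                                amount=payload.get("amount")))
```

Shipping those lines to centralised logging (see the minimum requirements below) is a small step, but it is exactly the kind of observability that makes a system both testable and supportable.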
If any of the above are true, I wouldn't be dashing towards being on call; there is still a way to go before the team is in a position to both build and run. You need some or all of the following:
- First, it is a request and not a mandate. Ask the teams; don't announce it as policy from nowhere. I know operations teams have had this foisted upon them for decades, but we can break this cycle of violence.
- You are getting paid for on call, and you have the terms confirmed in your contract. I cannot stress this enough: coercion and guilt are not a strategy for ownership.
- If there are people who cannot commit to it (caring responsibilities, health conditions), it is not expected or demanded. Declining it should not be detrimental to current or future career prospects.
- You have the autonomy and time to build something that is supportable. I have worked on so many systems that are unsupportable nonsense with no particular plan to resolve it, and a product roadmap bursting with features. Make sure you have a commitment to improvement: set aside at least 30% of your team's capacity to make your world better.
- As a minimum, in my opinion, you should have:
- A safe, repeatable deployment process,
- A blend of testing (exploratory, automation, performance, whatever you need) in your pipeline,
- A way to observe the system (centralised logging at the least),
- A way to throttle traffic, toggle problematic features and isolate dependencies.
- Make sure you commit to supporting only the things that you wholly own, rather than systems in a strange state of multiple ownership. Fix that before committing. Also make sure you have good contracts in place with other teams, other systems and third parties.
- On call is not used as a proxy for quality. As in: we need to build some crap for a deadline, so can you nurse the system through its inevitable lurching early life? I've seen this done to some very junior developers who didn't have the experience or confidence to push back.
- Training is provided on how to react to incidents, making clear that the priority is to restore service first and investigate the fix later. How to react is of equal importance to what to do. More generally, support is a set of skills which people may not have; offer training for those too.
- You have a process in place to deal with incidents and the subsequent post-mortems, to update your runbooks, and to automate what you understand well. See being on call as a continuous flow of opportunities to learn and improve.
- Your senior leadership are invested and know it is a learning exercise, not a blame game. This can be tough, as it can mean trying to change a long-established mindset that incidents can be eliminated, rather than accepting that they are going to happen and can be a source of learning.
- The appropriate safeguarding and security for production are in place, giving protection under pressure. If everyone is able to drop tables during an incident, things will go very wrong. Also, the cause of an incident is never one thing, and it's never what you first think it is.
- Your monitoring and alerting is well tested and tuned. It shouldn't fire all the time, waking you up constantly, nor stay silent while everything is on fire. It should behave the same in your test environments as it does in production.
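The toggle/throttle requirement above can start as something very simple: a kill switch checked before calling a problematic dependency, so service can be restored without a deploy. A hedged sketch in Python; the flag name and the in-process flag store are assumptions for illustration (in practice flags would live in a shared store so they can be flipped mid-incident):

```python
# A minimal feature-flag check with a graceful fallback. This in-process dict
# stands in for a real flag store (database, config service) that can be
# changed at runtime without redeploying.
FLAGS = {"recommendations_enabled": True}  # hypothetical flag

def call_recommendation_service(user_id: str) -> list[str]:
    # Stand-in for the real, possibly flaky, downstream call.
    return [f"item-for-{user_id}"]

def fetch_recommendations(user_id: str) -> list[str]:
    """Call the dependency only when the flag allows it."""
    if not FLAGS.get("recommendations_enabled", False):
        return []  # degrade gracefully: an empty list, not an error page
    return call_recommendation_service(user_id)
```

Flipping the flag isolates the dependency while the rest of the system keeps serving, which is exactly the control you want when you are the one being woken up.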
The reality for most systems I've worked on is that you might build it, but everyone runs it. Pointing at development teams for incidents while product stakeholders pressure them for speed is unjust. How you market and sell what you build affects how it runs. All our decisions create the conditions for operability and reliability, or the lack of them.