Things go wrong. Regularly. In our daily work here at Yellow Pencil we run complex systems that span lots of hardware and many vendors in different locations across the planet. Different groups are responsible for maintaining and managing them. These systems get heavy use from teams that need them to do important work every day. Fire departments, mayors, garbage collectors, bus drivers… they all depend on the environments we set up to get their work done. And it’s only a matter of time before something goes wrong.
Part of our work is to plan resilient systems with our customers so that the systems don’t break. But we are a group of humans, working with other humans, and it’s inevitable that something will go wrong, so we also plan how to respond when it does. We have a shorthand in our office chat: explosion.gif. Certain keywords that show up in our chat room trigger the ominous explosion.gif. It’s our way of acknowledging that things go wrong, and it keeps us on our toes. Whenever I see that gif show up in our chat it makes me chuckle, and it also makes me double-check my next plan.
When something goes wrong you have to fix it. I spend time each day troubleshooting a problem. It may be a small amount of time troubleshooting why the printer won’t work (or sadly, a long time). It may be an entire weekend troubleshooting why a client can’t access his website from inside the company network when we can reach it from outside. It may be tracking down why a configuration change in the Great Firewall of China has pummeled our servers with data and requests.
For each problem I have a consistent approach. I want to ask simple questions, collect and share data, and follow up until I have a solution. I’ve collected some important principles of problem solving over the years, and it’s a short list that I go back to whenever I get stuck on a problem.
1. Ask simple questions
Often, a problem comes as a messy collection of issues. Kind of like when my daughter was learning to put ponytails in her doll’s hair: at some point you have to untangle the hair and start again. Sometimes you’re picking out each individual strand of hair, and sometimes you ask a question that connects to many others and, magically, the entire head of doll’s hair flows smooth and silky. It’s a great feeling (solving a problem and fixing doll’s hair, trust me).
Work to break your problem into simple, small questions that you can test. Can I access this website from my desktop and my mobile phone? Yes. So it’s not a problem with our office network connection. Can someone from the client’s office access the website? No. So the site is up for some of us, but not for them.
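If it helps to see that reasoning written down, here is a minimal sketch in Python. The function name `narrow_down` and the wording of its conclusions are my own invention for illustration, not a tool we run. It only takes the yes/no answers you’ve already collected and says where to look next.

```python
def narrow_down(observations: dict[str, bool]) -> str:
    """Turn yes/no answers to "can this place reach the site?" into a rough next step.

    `observations` maps a vantage point (hypothetical labels such as
    "office desktop") to True if the site was reachable from there.
    """
    reachable = [place for place, ok in observations.items() if ok]
    unreachable = [place for place, ok in observations.items() if not ok]

    if not unreachable:
        return "up for everyone tested"
    if not reachable:
        return "down for everyone tested: look at the site itself"
    # Mixed answers: the interesting question is what the failing places have in common.
    return "up for some of us: look at what differs for " + ", ".join(unreachable)


answers = {
    "office desktop": True,
    "mobile phone": True,
    "client office": False,
}
print(narrow_down(answers))
```

Each entry is one simple question with one testable answer, which is the whole point: the conclusion is only as good as the small questions behind it.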
Sometimes simple questions can lead to complex answers. “Do the logs show any errors?” can lead to a complex review of logs across multiple applications and servers, but you should always be looking for something basic that you can report back on.
2. Collect and share data
Whenever you find the answer to a question, write it down or record it somewhere you can share with your problem-solving group. Even if it’s just you working through an issue, it’s always good to write down the problem, your questions, and your data. You may end up bouncing the problem off a colleague, and if the information is all in your head you can’t share it easily. Writing it out also helps you process the information and think it through. You’ll also spend far less time asking the same questions a second time. Or a third.
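A notebook or a shared document does this job perfectly well. If you would rather keep the record in code, a sketch might look like the following. The names `Note` and `TroubleshootingLog` are hypothetical, and the sketch assumes that matching questions by lower-cased text is good enough to catch repeats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Note:
    question: str
    answer: str
    # Record when the answer was found, so the timeline can be shared later.
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class TroubleshootingLog:
    """A shareable record of the problem, the questions asked, and the answers found."""

    def __init__(self, problem: str) -> None:
        self.problem = problem
        self.notes: list[Note] = []

    def record(self, question: str, answer: str) -> Note:
        note = Note(question, answer)
        self.notes.append(note)
        return note

    def already_asked(self, question: str) -> Optional[Note]:
        """Return the earlier note if this question was asked before, else None."""
        wanted = question.strip().lower()
        for note in self.notes:
            if note.question.strip().lower() == wanted:
                return note
        return None

    def summary(self) -> str:
        """Plain-text version to paste into chat or email."""
        lines = [f"Problem: {self.problem}"]
        for note in self.notes:
            lines.append(f"Q: {note.question}")
            lines.append(f"A: {note.answer}")
        return "\n".join(lines)


log = TroubleshootingLog("Client can't reach the website from inside their network")
log.record("Can I reach the site from my desktop?", "Yes")
log.record("Can someone in the client's office reach the site?", "No")
print(log.summary())
```

Checking `already_asked` before chasing a question is the code version of not asking the same thing a second time.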
3. Establish communication protocols
Identify who is in your problem-solving group and when you’re going to connect. I worked through one performance problem that required us to bring in about 20 people from eight different places across North America through phones, shared screens, and video conferencing. We met every hour to get status updates, and spent 50 per cent of our time problem solving and 50 per cent of our time communicating about what we were doing. That may sound like a lot of talking, but the problem was complex and involved five different vendors, so communication was key to the resolution. If it’s you fixing the copy machine, then you just have to decide how frequently your foot will update the side of the machine with your frustration.
Establishing clear expectations about when and how you will provide updates gives you and your team the freedom to focus on the problem and answer questions, and a bit of light at the end of the tunnel if you’re stuck. Sometimes, if you feel like you’re not making progress and have no more paths to chase, talking it out can really help.
4. Review your notes
If you run out of simple questions to ask, and you can’t think of something else to test, read your notes. Review all the data collected. Sometimes reflecting on the summary of what you’ve tested will lead you to the answer, or at least to another question to ask.
5. Provide a summary
When you’ve solved a problem, it’s always good to provide a summary of what happened. We have an incident report format that covers the following:
- Contact info (author, date, phone, email)
- Summary (a high-level overview)
- Details of the incident (describe what happened)
- Notification process (the communication around the incident, who noticed it, when updates happened, etc.)
- Technical details/fix actions (the nitty gritty of what really occurred and how it was fixed)
- Conclusion (wrapping up what happened, identifying if this was a unique, rare incident or likely to happen again)
- Recommendation (identify what you can do better next time, or what needs to change to prevent this from happening again)
- Appendix (reports, details, data from your simple questions)
It’s important to identify what happened, with the plain facts and faults. Often there are parts of an incident I wish had gone better, where we missed something or made a mistake. But wishing never made anything better. Writing out what happened and reviewing it with the whole team is the best way to avoid making the same mistake twice.
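Our actual report is a written document, not code. Purely as an illustration, the sections listed above could be captured in a small Python structure like this one. The class name `IncidentReport`, its field names, and the "(not yet written)" placeholder are assumptions I made up for the sketch.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentReport:
    # One field per section of the report format, in the same order.
    contact_info: str = ""
    summary: str = ""
    details: str = ""
    notification_process: str = ""
    technical_details: str = ""
    conclusion: str = ""
    recommendation: str = ""
    appendix: str = ""

    def missing_sections(self) -> list[str]:
        """Names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Plain-text report, with a placeholder for any unwritten section."""
        parts = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name).strip() or "(not yet written)"
            parts.append(f"{title}\n{body}")
        return "\n\n".join(parts)


report = IncidentReport(
    summary="Client could not reach the website from inside their network.",
    recommendation="Update contact information; our phone numbers were out of date.",
)
print(report.missing_sections())
print(report.render())
```

The `missing_sections` check is a nudge to fill in the parts that are easy to skip, the recommendation especially.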
6. Look for ways to improve
I rarely send through a summary of an event without recommending some way to improve. It can be a simple improvement like, “We should update contact information, as our phone numbers were out of date when we called,” or something significant like, “The traffic to this website is growing each month and the current infrastructure can’t support any more traffic.” I look at each incident as an opportunity to improve communication, planning, or infrastructure. Or all three.
Our planning and maintenance work keeps problems to a minimum. And with proper review, I hope we never see the same problem twice. However, we’re a group of human beings, and no matter what, as we work together each day, we’re going to … explosion.gif.