Art of Clean Code — Error Handling

Surprises are inevitable. Handling these early and gracefully helps both customers and developers.

Mohit Gupta
Dev Genius

This is the second article in the ‘Art of Clean Code’ series. Refer to the first one here.

Error handling means accepting that surprises are inevitable and being prepared to manage any surprise gracefully, with as little impact on customers as possible.

But isn’t error handling something we already do as part of day-to-day coding? We use try/catch and throw the error. Let us look at what more there is to error handling.

Surprises are Inevitable

Back in 2005, I was tasked with implementing a BPEL-compliant workflow engine that could process a large number of requests concurrently without degrading performance. It was a challenging assignment: designing an efficient in-memory data structure and a nimble state machine, and above all keeping it generic so that it could be extended at any time with more nodes, logic, and interfaces.

As soon as we add the word ‘generic’, things become easier for the users of the system, but at the cost of added complexity for the system itself. That is the purpose of creating generic components and utilities: to contain complexity in one place.

I did research for a couple of months, implemented a POC, and then developed the system over the next six months. The workflow engine was ready, following the BPEL specification, with a nice API. Hundreds of test cases had been written, and all of them passed.

I was super excited to showcase the outcome. Soon, I presented it to our chief architect. He praised it with all his heart. Next, he brought me a multi-processor machine and asked me to run the test cases on it (a multi-processor machine was a luxury at that time).

I ran the test cases on the new beast. To the biggest surprise of my coding life, a great many of them failed. I was especially surprised because the test cases were failing at points where I had put a lot of effort into keeping the implementation strong, covering every case I could think of in advance.

I started debugging. I realized that threads behave quite differently from what I had assumed. Thread pre-emption, context switching, and the behavior of sleep and notify when multiple threads are waiting for a turn in the executor queue were all quite different from what I had seen when testing on a single-processor machine.

I faced surprise after surprise in that exploration. Over the next few weeks, we brainstormed the error cases and fixed those boundary cases.

In this whole episode, error handling, logs, and test cases were collectively the saviors. Error handling, however, was the key: it captured errors at the right point, with the right context information. Without that, the surprises could have been much harder to untangle.

Take note of ‘contextual information’. An important aspect of error handling is to collect and dump any information that can help to understand, recreate, and fix the error later. Without this information, understanding the scenario or recreating the case can be very time-consuming and frustrating.

Many of us have witnessed patch after patch in production. It usually happens when we are unable to understand the root cause and try to fix the error at the surface to get past the current hurdle. Eventually, this consumes much more time than we would have invested in proper error handling to capture surprises and log all possible context information for an effective fix.
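
As a minimal sketch of what carrying context can look like (this is not the original engine's code; `WorkflowError`, `advance`, and the node and state names are hypothetical), an exception can hold the values that were in play when it was raised, so that the log line alone is enough to start recreating the case:

```python
class WorkflowError(Exception):
    """Hypothetical exception that carries context alongside the message."""

    def __init__(self, message, **context):
        super().__init__(message)
        self.context = context

    def __str__(self):
        # Render the context in a stable order so log lines are easy to compare.
        details = ", ".join(
            f"{key}={value!r}" for key, value in sorted(self.context.items())
        )
        return f"{self.args[0]} [{details}]"


def advance(node_id, state, allowed_states):
    """Move a workflow node forward, failing with full context if it cannot."""
    if state not in allowed_states:
        raise WorkflowError(
            "node cannot advance from its current state",
            node_id=node_id,
            state=state,
            allowed_states=sorted(allowed_states),
        )
    return "RUNNING"


def describe_failure():
    """Trigger a failure and return the message a log would receive."""
    try:
        advance("approve-order", "SUSPENDED", {"READY", "WAITING"})
    except WorkflowError as err:
        return str(err)
```

The message returned by `describe_failure()` names the node, its actual state, and the states that would have been accepted, which is the kind of detail that is hard to reconstruct after the fact.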

Surprises are inevitable. However, with the right level of error handling, we were able to find the root cause pretty swiftly in the above example. Hence,

Code for surprises.

There are many scenarios that we cannot know at the time of writing code, and there will be more as the system evolves. The more we accept this and code for it, the easier it becomes to capture surprises early and fail fast with the right information to solve them.

The right error handling and assertions are the key to capturing surprises in time and building evolving, resilient systems.
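
One possible shape for this, assuming a hypothetical `reserve_slots` helper for an executor queue, is to check assumptions at the point where they matter and fail immediately with the numbers involved, rather than letting a bad value travel further into the system:

```python
def reserve_slots(queue_capacity, pending, requested):
    """Reserve executor slots, failing fast when an assumption is broken."""
    # Bad input from the caller: reject it immediately with the actual value.
    if requested <= 0:
        raise ValueError(f"requested must be positive, got {requested}")

    # Internal invariant: if this fails, the bug is in our own bookkeeping.
    assert 0 <= pending <= queue_capacity, (
        f"pending={pending} is outside 0..{queue_capacity}"
    )

    free = queue_capacity - pending
    if requested > free:
        # An expected runtime condition, reported with enough data to act on.
        raise RuntimeError(
            f"cannot reserve {requested} slots: only {free} free "
            f"(capacity={queue_capacity}, pending={pending})"
        )
    return pending + requested
```

Note that Python drops `assert` statements when run with the `-O` flag, so in this sketch assertions guard internal invariants only, while conditions that can legitimately occur in production raise ordinary exceptions.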

Manage Surprises Gracefully

Accepting that ‘surprises are inevitable’ is winning half the battle. The other half is to manage them and recover gracefully. Here is another story from one such experience.

In 2011, we were developing a financial management application for a government organization. Its users were government officials in finance departments and payment collectors in departments across the country. They were experts in their domain but mostly new to computers.

Whenever an error occurred in the system, users were directed to log a ticket using the open-source tool ‘Redmine’. Users tried their best to log what they understood about the error and what they had expected, but since they were non-technical, the information was often too incomplete to understand the issue. Many times, users were not comfortable capturing all the relevant information and logging the issue.

We tried creating documentation and steps for logging issues, but many challenges remained because of the very diverse user base, the range of languages, and the difficult terrain that made users hard to reach for training.

The pressure to keep the system up and running was high, as any outage could impact government timelines for collecting taxes and other payments.

Here is what we did to manage it:

  1. Ensured that every error was logged with all possible context information from the failing code block, including the user action, system state, expected state, a snapshot of the session state, and any relevant database information.
  2. Ensured that errors were propagated correctly to the controller layer, where the system could make better decisions about logging or directing the user based on system state (a mostly unchecked exception model was used).
  3. Created a plugin for Redmine (redmine-jconnector) to perform issue creation and listing operations using the Redmine REST API. This plugin was used to add a ‘Report Issue’ function in the UI, which users could simply click to report any issue (if there was no exception, it simply logged the ticket with the current system state and the user's description).
  4. As soon as a user triggered this action to report an error, the system took all the context information already captured in the background through points 1 and 2 above and logged it in Redmine as a ticket, together with the user's description (removing any sensitive information).
  5. This removed much of the dependency on users to collect and log error messages, error codes, system state, etc.
  6. Developers had all the right user and system information to debug the issue quickly. Fixes became much faster. We were able to roll out fixes very frequently for any blocking issues.
  7. We also enhanced the on-screen error messages with a lot of user-centric details in colloquial local language, sometimes even suggesting a workaround for common cases, which could unblock users for the time being. For example, internet connectivity was not good in remote areas, so we enabled a few connection tests and presented relevant information to the user to handle that.
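
The flow in points 1 to 4 can be sketched roughly as follows. This is an illustration only, not the redmine-jconnector code: `ErrorRecorder`, `build_ticket`, the sensitive key names, and the sample values are all assumptions, and the sketch stops at assembling the ticket rather than calling any ticketing API.

```python
import json
import traceback

# Assumed names of fields that must never reach a ticket.
SENSITIVE_KEYS = {"password", "account_number", "session_token"}


def redact(context):
    """Mask the values of sensitive keys before they are stored or sent."""
    return {
        key: "***" if key in SENSITIVE_KEYS else value
        for key, value in context.items()
    }


class ErrorRecorder:
    """Keeps the most recent error and its context for one user session."""

    def __init__(self):
        self.last_error = None

    def record(self, exc, **context):
        self.last_error = {
            "type": type(exc).__name__,
            "message": str(exc),
            "trace": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
            "context": redact(context),
        }


def build_ticket(recorder, user_description, system_state):
    """Assemble a ticket; it also works when no exception was recorded."""
    error = recorder.last_error
    subject = (
        f"{error['type']}: {error['message']}" if error else "User-reported issue"
    )
    body = {
        "user_description": user_description,
        "system_state": redact(system_state),
        "error": error,
    }
    return {
        "subject": subject,
        "description": json.dumps(body, indent=2, default=str),
    }


# Usage: the controller layer records the error, 'Report Issue' builds the ticket.
recorder = ErrorRecorder()
try:
    raise TimeoutError("payment gateway did not respond")
except TimeoutError as exc:
    recorder.record(exc, user_action="submit_payment", session_token="abc123")

ticket = build_ticket(recorder, "Payment page froze", {"screen": "collect-tax"})
```

The user only supplies the description; everything else in the ticket comes from context that was captured when the error happened, with sensitive values masked before they leave the application.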

Error handling is not only about putting the right construct in code to capture the error, log it, wrap it in the right throwable, and throw it. There is more to it:

  • Using this information to handle failures gracefully
  • Using the collected information to design the system and UI in the way most helpful to users, so that they can understand situations easily and can even help engineering teams solve the issue.
  • Using this information to present the failure information and possible workaround to the user, to unblock them if possible.
  • And importantly, presenting all relevant information to developers in an effective manner so they can act quickly to fix it.

Debugging errors is often stressful and needs a lot of effort. If all the relevant system information is available, the whole exercise becomes much easier.

In the above example, we even added a few dev modes to the application which, when enabled, showed all the system state information to developers or admins on the screen itself.

This meant that any developer, QA engineer, product owner, or manager could see the details easily, without digging into multiple servers and databases. Anyone who understood the context could help fix the issue or even suggest a solution.

It is one of the most effective ways to include the system’s users in finding solutions.

This simple act of presenting all the right information to the team had a multiplier effect on solving issues and rolling out fixes quickly.
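
A small sketch of how a user-facing message, a suggested workaround, and a dev mode might fit together is shown here. The names `FailureView`, `USER_GUIDANCE`, and `present_failure`, as well as the message texts, are hypothetical and not taken from the application described above.

```python
from dataclasses import dataclass, field


@dataclass
class FailureView:
    """What the screen shows for one failure (hypothetical structure)."""

    message: str
    workaround: str = ""
    details: dict = field(default_factory=dict)


# Assumed mapping from exception type to user-facing text and workaround.
USER_GUIDANCE = {
    ConnectionError: (
        "We could not reach the server.",
        "Check your internet connection and try again in a few minutes.",
    ),
}
DEFAULT_GUIDANCE = ("Something went wrong on our side.", "")


def present_failure(exc, system_state, dev_mode=False):
    """Build the on-screen view; technical details appear only in dev mode."""
    message, workaround = DEFAULT_GUIDANCE
    for exc_type, guidance in USER_GUIDANCE.items():
        if isinstance(exc, exc_type):
            message, workaround = guidance
            break

    details = {}
    if dev_mode:
        details = {"error": repr(exc), "system_state": dict(system_state)}
    return FailureView(message, workaround, details)
```

Regular users see only the message and the workaround, while a developer or admin with dev mode enabled gets the error and the system state on the same screen.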

It also helped to keep customers aware of the system state and even to educate them about issues, which in turn led to happier customers (and a happier engineering team).

Summary

We are not covering tips on implementing error handling here, i.e. whether to use checked or unchecked exceptions, error codes or extended exceptions, etc. There are many great blogs on these topics.

Instead, the focus of this story is to highlight the importance of proactive planning to manage surprises well, for the benefit of both users and the engineering team.

Ultimately, error handling is there to help the users of the system by keeping it as usable as possible when surprises occur, or at least by handling the error gracefully and presenting all relevant information to the user to avoid frustration.

In turn, it also makes the developer’s life easier, saving weekends and nights from production issues.

Surprises are inevitable in Software Systems. Acceptance of this fact and coding for surprises proactively can bring a big positive change in customer experience and product usability.

So the next time you do error handling, ask yourself: ‘How best can I present all possible relevant information to both users and engineering teams?’

This proactive (and empathic) act can save both from a lot of frustration and many more issues.

Next

Refer to other articles, given below.

Art of Clean Code. Refer here.

Art of Clean Code — Documentation. Refer here.

Art of Clean Code — Logging. Refer here.

More System Design and Engineering Blogs. Refer here.

If you enjoyed reading this, please share, give a clap, and follow for similar stories!

For any suggestions, feel free to reach out to me on LinkedIn: Mohit Gupta

Enjoy building great teams and products. Sharing my experience in software development, personal development, and leadership