Site Reliability Engineering
Imagine you are driving a racecar. Instead of making a pit stop to change tires or refuel, you need to switch cars while driving or change the tires without slowing down. This is a very common analogy used to describe what it’s like to run software in a modern production environment. Now, in addition, imagine doing this without any gauges or indicators. That is what it would be like without observability.
Observability within Diligent is the new effort to provide those gauges and indicators to not only the racecar driver but to everyone involved in the process. This blog post will cover what Observability is and what steps we can take to continuously improve the customer experience.
What is Observability?
There are lots of definitions for observability, but within Diligent, observability can be distilled down to the following:
The ability to estimate the current state of a system based on the information available, and then to feed that information back in to improve the system.
Hashing this out further, for a system or software application to be considered observable, we need to be able to estimate its current state from the information we have available. It also means analyzing this data and using the results to improve our products. This can take many forms, from establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to adding observability to processes like deployments or script executions.
SLI – A carefully defined quantitative measure of some aspect of the level of service that is provided.
SLO – A target value or range of values for a service level that is measured by an SLI.
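To make the two definitions concrete, here is a minimal sketch of an availability SLI checked against an SLO. The request counts, the 99.9% target, and the function names are illustrative assumptions, not figures or code from any Diligent service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in a time window."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the target value the SLI is expected to reach or exceed."""
    return sli >= slo_target


# Hypothetical window: 10,000 requests, 5 of which failed.
sli = availability_sli(good_requests=9_995, total_requests=10_000)
print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The SLI is the measurement and the SLO is the line it is compared against. Keeping them as separate values makes it possible to tighten or relax the target without changing how the service is measured.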
How is Observability Different from Monitoring?
For years we have had monitoring of various types of our systems. How is this different? Isn’t observability just the new “buzzword” for monitoring?
Monitoring is an important subset of observability, but monitoring alone, without observability in place, just provides us with lots of unstructured “noise”. We already have many cases of this within Diligent.
So how might this affect you in a practical way? Let’s walk through a couple of real scenarios:
For years there has been a service that, due to a bug, leaves behind orphaned child processes. The operations team put a script in place to remove these orphans on a regular basis. Without this process, the production instance of this service would be overwhelmed and customers would be impacted. We have Monitoring that alerts us and tells us the orphans have been removed.
Here are the things wrong with this process and mentality and how observability could help us to improve.
- The information on processes that have been removed and handled this way never makes it back to the development team. Because of this, they don’t know that the issue persists.
- Operations gets an email alert every time an orphan is removed, but there is no expected action from this alert. Operations is expected to take an action only if someone notices that the emails have stopped.
- If the orphan removal process stops working, then these servers will be overwhelmed and there will be customer impact. We must hope that someone notices the emails have stopped and that they know how to resolve the problem.
How Can We Make This Better with Observability?
- Add a step to the process to send a metric to SignalFx any time a process is removed by the script. This can be fed into a dashboard that would be visible to the Dev team. Dev can then use this information to make improvements to the service and after future releases review the dashboard to observe improvement.
- Add metrics on the child process showing how long each instance has been running. Even once we can stop running the script, we can alert when orphaned processes run for longer than expected and keep an accessible visual record of their history.
- Stop sending alerts for items that do not require an action!
- Add a heartbeat metric for the process if it is still needed. This way, if the script does not run, there is an actionable alert that needs to be resolved.
As we can see from the above example, monitoring of this is important but is only one piece of the whole puzzle.
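The steps above can be sketched roughly as follows. This is only an illustration: `InMemorySink` is a hypothetical stand-in for whatever metrics client is actually used to send data to SignalFx, and the metric names and the ten-minute threshold are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class InMemorySink:
    """Hypothetical stand-in for a real metrics client."""
    counters: Dict[str, int] = field(default_factory=dict)
    gauges: List[Tuple[str, float, float]] = field(default_factory=list)

    def increment(self, name: str, value: int = 1) -> None:
        self.counters[name] = self.counters.get(name, 0) + value

    def gauge(self, name: str, value: float, timestamp: float) -> None:
        self.gauges.append((name, value, timestamp))


def run_cleanup(sink: InMemorySink, orphan_ages_seconds: List[float], now: float) -> None:
    """Record one counter increment and one age gauge per removed orphan."""
    for age in orphan_ages_seconds:
        sink.increment("orphan_cleanup.removed")
        sink.gauge("orphan_cleanup.orphan_age_seconds", age, now)
    # The heartbeat is sent on every run, even when nothing was removed,
    # so a missing heartbeat means the script itself has stopped running.
    sink.gauge("orphan_cleanup.heartbeat", 1.0, now)


def heartbeat_is_stale(sink: InMemorySink, now: float, max_gap_seconds: float) -> bool:
    """True when no heartbeat has arrived within the allowed gap."""
    beats = [ts for name, _, ts in sink.gauges if name == "orphan_cleanup.heartbeat"]
    return not beats or now - max(beats) > max_gap_seconds


sink = InMemorySink()
run_cleanup(sink, orphan_ages_seconds=[120.0, 30.5], now=1_000.0)
print(sink.counters, heartbeat_is_stale(sink, now=1_200.0, max_gap_seconds=600.0))
```

The design choice worth noting is that the only alert is on the stale heartbeat, which always requires an action. The removal counter and age gauge feed dashboards instead of inboxes.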
Production issues usually start in one of two ways. In the first, a service goes down and an alert is sent to an SRE team. In the second, and worse, a client reports an issue to the Support team and the issue is then raised to an SRE team.
As with the last example, here is what is broken in this situation:
- External monitoring is best effort; the team implementing the production monitoring is not the team that owns the service. No one is checking to make sure that these monitors are valid or consistent, which leaves us with lots of gaps.
- Often, even if the monitors are fully set up and configured, the monitored endpoints don’t give a full picture of the availability of these services. This leads to situations where something is broken but still “looks” healthy.
- Scrum teams that are responsible for these services often find out that there was a production issue very late in the process. Worse, they can find out from client (internal or external) complaints funneled to them directly, before anything has tripped in monitoring.
- Data taken from service issue frequency is never fed into the software application lifecycle. Specific issues are addressed but it’s more of a one-off process when something breaks, rather than continuous improvement.
- Support, Product Owners, and CSMs are often left in the dark as to the status of these services.
So what can we do about this?
- Have the internal and external monitoring owned and defined by the team responsible for the application, and standardize the implementation so it is done consistently and therefore can be observed easily.
- Since monitoring is now owned by the scrum teams, if an issue is found that monitoring did not pick up, they have a direct path to make it better.
- These standardized monitors now feed into “single pane of glass” dashboards. They not only give Dev teams immediate feedback as to the state of their services but also allow the Product Owners / Support / CSMs access and visibility (Observability!) that they never had before.
- By making this data easily available to product owners, it can be fed into the development lifecycle process.
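As a rough illustration of what "standardized" could mean here, the sketch below has every team report check results in one common shape and rolls them up into a per-service status for a shared dashboard. The `CheckResult` fields, the service and team names, and the "healthy"/"degraded" labels are all assumptions made for this example, not an existing Diligent schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckResult:
    service: str      # service being monitored
    owner_team: str   # scrum team that owns both the service and the monitor
    check: str        # what was checked, e.g. "http", "database", "queue"
    healthy: bool


def rollup(results: List[CheckResult]) -> Dict[str, str]:
    """Collapse individual checks into one status per service.

    A service is "degraded" if any of its checks failed, so a healthy
    endpoint cannot hide a broken dependency behind it.
    """
    status: Dict[str, str] = {}
    for result in results:
        if not result.healthy:
            status[result.service] = "degraded"
        else:
            status.setdefault(result.service, "healthy")
    return status


# Hypothetical results from two teams using the same format.
results = [
    CheckResult("service-a", "team-red", "http", True),
    CheckResult("service-a", "team-red", "database", False),
    CheckResult("service-b", "team-blue", "http", True),
]
print(rollup(results))
```

Because every team emits the same record shape, the same rollup can serve Dev teams, Product Owners, Support, and CSMs without per-service dashboard logic.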