Observability in software development is not novel, but the concept has gained renewed currency in recent years thanks to the proliferation of multi-service architectures, distributed systems, containers, and the cloud. It refers to the craft of understanding the internal states of a complex system by ‘observing’ its external outputs. When a system is observable, it is easier to identify the root cause of a performance issue, even without relying on extra testing. Software tools and practices help build observability: the tools aggregate, correlate, and analyze streams of performance data from an application. Combined with effective monitoring, troubleshooting, and debugging, they help teams identify and mitigate application issues quickly, which in turn helps them meet customer experience expectations and service level agreements. Observability also brings a proactive approach to root cause analysis and mitigation, so teams are better prepared to handle performance issues.
Evolution of Observability
In tracing the origin of observability, it appears that Hungarian-American scientist Rudolf E. Kálmán first introduced the concept in a paper titled “On the General Theory of Control Systems.” This was in 1959, and Kálmán’s efforts were unrelated to software or computers as we know them today. Modern computing was still in its early stages at that time.
Instead, the concept as introduced referred to broader “systems” and related more closely to the fields of signal processing and system theory. Observability remained tied to those domains for half a century without wider application, with notable exceptions in the form of efforts at Sun Microsystems that did not go mainstream. (Those efforts are not the subject of this post, so we will move on to the evolution of observability in modern computing.)
It wasn’t until the 2010s that observability made a resurgence in the software world. The Observability Engineering Team at the micro-blogging platform Twitter published a blog post in 2016 titled “Observability at Twitter: technical overview.” In the post, the team discussed its observability efforts at length and named monitoring, alerts, distributed tracing, and log aggregation as essential sources of observability. Also in 2016, Cory Watson, then the Principal Engineer and Observability Lead at the payments software company Stripe, delivered a whole talk on the subject. Titled “Creating a Culture of Observability at Stripe,” it explained how observability works at Stripe.
Around the same time, Google engineers were developing observability platforms for internal use, though by some accounts they may not have expressly called them “observability.”
It is hard to pinpoint one tech company or organization that first jumped on the bandwagon, making observability cool again. However, it is certain that by 2018, observability had indeed become mainstream in computing. The term and its underlying practices were discussed at tech conferences and in tech blogs. Early adoption was seen among microservices communities and Kubernetes adopters.
Cut to the present day, and organizations are investing in observability, bringing specialists on board, and purchasing tools, all with one guiding star: resilient systems and applications.
Three Pillars of Observability
Observability rests on the proper implementation of three telemetry types.
Logs: These are written records of events that describe what happened and when. They contain timestamps, payload information, and complete records of application events. Because logs preserve the context of each event, engineers can refer to them or “replay” them to debug and troubleshoot. Logs come in three types: structured, binary, and plaintext. They are usually the first resource to investigate when a system issue is encountered.
Metrics: These are numerical values that measure various aspects of a system’s state or performance. Metrics have attributes such as timestamps, names, and KPIs, which provide additional context. They are structured, easy to query, and optimized for storage, and they enable developers to track changes over time.
Traces: These are mapped journeys that capture the end-to-end path of a given request from the UI through the whole distributed system and back to the user. Traces encode data for each operation performed in the distributed system to fulfill the request.
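To make the three telemetry types concrete, here is a minimal sketch that emits one of each using only the Python standard library. The in-memory sinks (LOGS, METRICS, SPANS) and the function names are assumptions made for illustration; a production system would send this data to a telemetry backend rather than keep it in lists.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory sinks; a real system would ship these to a backend.
LOGS = []            # structured log records (JSON strings)
METRICS = Counter()  # metric name -> running total
SPANS = []           # finished trace spans


def log_event(message, **context):
    """Log: a timestamped, structured record of one event."""
    record = {"timestamp": time.time(), "message": message, **context}
    LOGS.append(json.dumps(record))


def increment(metric_name, value=1):
    """Metric: a named numerical value that is cheap to store and query."""
    METRICS[metric_name] += value


@contextmanager
def span(name, trace_id, parent_id=None):
    """Trace: one timed operation within the journey of a request."""
    span_id = uuid.uuid4().hex[:8]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def handle_request(user):
    """Handle one request and emit all three telemetry types for it."""
    trace_id = uuid.uuid4().hex
    with span("handle_request", trace_id) as root_id:
        increment("requests_total")
        with span("load_profile", trace_id, parent_id=root_id):
            log_event("profile loaded", user=user, trace_id=trace_id)
    return trace_id
```

Because the log record carries the same trace_id as the spans, the three signals can be joined later, which is what makes them useful together rather than in isolation.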
Why Observability Matters
Observability aligns well with the larger goals of Agile, DevOps, and Site Reliability Engineering practices, each of which aims to deliver error-free software fast. It is easy to track known issues but hard to account for unknown ones. Observability helps find issues not yet recognized as issues, provides additional context, and speeds up root cause analysis and resolution. It can also be configured to scale with requirements, allowing teams to do more with less. Finally, when integrated with automation and machine learning capabilities, observability can foretell issues and resolve at least some of them without human intervention, reserving that intervention for the most complex instances.
Observability vs. Monitoring: Similarities and Differences
Observability is often mischaracterized as monitoring. The two concepts are closely related, and both are crucial in managing complex systems and applications. Both aim to provide insight into system performance and help identify issues.
But they are fundamentally different. Monitoring involves tracking specific, predefined performance metrics and indicators such as CPU usage, response time, and memory consumption. Observability goes well beyond these predefined metrics and allows a fuller exploration of the system’s internal state through ad-hoc queries and analysis. Because monitoring depends on predetermined checks, it might overlook unexpected issues. Observability offers more flexibility, allowing teams to adapt to new, previously unknown performance issues.
Further, because it depends on predefined metrics, monitoring may provide only a surface-level view of a system. Observability goes deeper and provides contextual perspectives that allow engineers to uncover root causes more quickly with the help of detailed traces and logs.
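The difference can be sketched in a few lines of Python. The events, tenant names, and threshold below are invented for illustration. The first function is a monitoring-style check against one predefined metric; the second is an ad-hoc query that can slice the same events by any field.

```python
from collections import defaultdict

# Hypothetical structured events, one per handled request.
EVENTS = [
    {"service": "portal", "tenant": "clinic-a", "status": 200, "latency_ms": 120},
    {"service": "portal", "tenant": "clinic-b", "status": 200, "latency_ms": 135},
    {"service": "mailer", "tenant": "clinic-a", "status": 200, "latency_ms": 90},
    {"service": "mailer", "tenant": "clinic-b", "status": 500, "latency_ms": 2400},
    {"service": "mailer", "tenant": "clinic-b", "status": 500, "latency_ms": 2600},
]


def monitor_average_latency(events, threshold_ms=1500):
    """Monitoring: one predefined check against one predefined metric."""
    average = sum(e["latency_ms"] for e in events) / len(events)
    return average > threshold_ms


def explore(events, group_by, where=lambda e: True):
    """Ad-hoc query: worst latency for matching events, grouped by any fields."""
    groups = defaultdict(list)
    for event in events:
        if where(event):
            key = tuple(event[field] for field in group_by)
            groups[key].append(event["latency_ms"])
    return {key: max(values) for key, values in groups.items()}
```

With this data, the average latency is 1069 ms, so the predefined check stays silent even though one tenant is failing. Calling explore(EVENTS, ["service", "tenant"], where=lambda e: e["status"] >= 500) narrows the problem to the mailer service for clinic-b, a question nobody had to anticipate in advance.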
Observability in Action: A HealthAsyst Story
The client, whose patient-facing application is used by more than 10 million patients (about half the population of New York State) in the US market, had introduced a new patient portal across all practices. The cloud-based multitenant system had a microservices-based distributed architecture. However, many of the client’s customers encountered issues during onboarding, which slowed the process and caused significant dissatisfaction. Patients either did not receive the mandatory invitations to register on the portal or received them after unexplained delays. Similarly, there were delays in the mandatory password reset emails triggered after a new patient was migrated to the portal. Some of these issues had a cascading effect on other priority activities, as the communication channels were choked, and there were problems accessing the CCDs (Continuity of Care Documents) and reports. Without immediate attention, this could have escalated into a regulatory violation, in addition to the threat of losing customers.
The development team had major challenges identifying the root cause of these problems. The problems were not easy to trace in a distributed cloud environment, and the team had to resort to workarounds such as rebooting the servers daily. The team proposed implementing a tool to improve observability and identified Splunk as the ideal solution. The software code, however, was not designed to provide end-to-end traceability. Therefore, a unique request ID was set up to track the end-to-end journey of each request from the UI through the system and back to the user. With this in place, the HealthAsyst team started tracing logs across the microservices, which helped them find the root cause and work towards a swift resolution. Further, the team set up two reports: a weekly executive report for management that listed the outlier transactions and compared them against the defined KPIs, and a daily report for the development team to help quickly trace potential problem areas. In addition, the dev team set up triggers on certain transactions that alerted them immediately when there were violations.
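The request ID technique can be sketched as follows. This is an illustrative example only, not the client’s implementation: the service functions, logger names, and the X-Request-ID header (a common convention rather than a formal standard) are all assumptions. It uses Python’s contextvars and logging modules to stamp every log line with the ID of the request being handled.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines start with the request ID."""
    logger = logging.getLogger("portal")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(request_id)s %(name)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def email_service(logger, headers):
    """Hypothetical downstream service: adopt the ID sent by the caller."""
    token = request_id_var.set(headers["X-Request-ID"])
    try:
        logger.getChild("email").info("invitation email queued")
    finally:
        request_id_var.reset(token)


def ui_service(logger, headers=None):
    """Hypothetical entry point: reuse an incoming ID or mint a new one."""
    headers = dict(headers or {})
    request_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.getChild("ui").info("invitation requested")
        # Pass the same ID downstream, as an HTTP header would.
        email_service(logger, {"X-Request-ID": request_id})
    finally:
        request_id_var.reset(token)
    return request_id
```

Every line logged while handling a request now begins with the same ID, so searching the aggregated logs for that ID reconstructs the request’s journey across services.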
Thus, the team that had struggled with root cause analysis (RCA) was now in control of the situation and could proactively identify potential bottlenecks and fix them in time. The reports helped the team set continuous improvement targets and transition to a highly reliable system.
Had these controls not been in place, the client would have had to either roll back the migration plan or risk losing thousands of customers and facing regulatory headaches.
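The triggers and outlier reports in this story can be illustrated with a short sketch. The transaction names, KPI limits, and message format below are hypothetical, chosen only to show the idea of comparing traced transactions against defined KPIs.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    request_id: str
    name: str
    duration_s: float


# Hypothetical KPI targets: maximum acceptable duration per transaction type.
KPI_LIMITS_S = {"send_invitation": 60.0, "password_reset_email": 30.0}


def find_violations(transactions, limits=KPI_LIMITS_S):
    """Return transactions that exceed the KPI limit for their type."""
    return [
        t for t in transactions
        if t.name in limits and t.duration_s > limits[t.name]
    ]


def daily_report(transactions, limits=KPI_LIMITS_S):
    """Summarize violations per transaction type with the worst offender."""
    summary = {}
    for t in find_violations(transactions, limits):
        entry = summary.setdefault(t.name, {"count": 0, "worst": t})
        entry["count"] += 1
        if t.duration_s > entry["worst"].duration_s:
            entry["worst"] = t
    return summary


def alert_on(transactions, notify, limits=KPI_LIMITS_S):
    """Trigger: call notify() for each violating transaction."""
    for t in find_violations(transactions, limits):
        notify(
            f"KPI violation: {t.name} took {t.duration_s:.0f}s "
            f"(request {t.request_id})"
        )
```

Including the request ID in each alert ties the violation back to the traced logs, so an engineer can move from the alert to the full journey of the offending request.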
Conclusion
At HealthAsyst, our technology teams work hard to keep clients ahead of the game by keeping our processes up to date and state of the art. This is why we keep a keen eye on developments in technology and modern computing and constantly seek ways to leverage new technologies, methods, and practices. To discuss your product engineering needs, please get in touch with us at itservices@healthasyst.com.


