One of the advantages of having a SaaS application is the ability to capture true production telemetry. This telemetry consists of functional and non-functional (performance and security) data points. These data points can and should be put to work by our team, making us better informed about the quality of our product. This by no means implies that live production metrics should be used entirely in lieu of testing; there should be a balance of testing and measurement.
I covered my testing philosophy in one of my earliest blogs, in which I stressed and advocated the need for robust build/test pipelines complete with quality inspection (unit, static, integration and acceptance). There is nothing original or unique about the pipeline I'm proposing. The pipeline is a component of Continuous Integration, in which developers commit early and often. It grows in complexity and maturity in an iterative fashion, day by day, as the team's commits become a robust product or module ready for deployment. Consider this early phase more of an incubation phase, in which the product is nothing more than executable code, not yet deployable or usable. When code is being incubated, teams should place more emphasis on testing and evaluation. This testing is more unit and API testing, not acceptance testing.
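To make that concrete, here is a minimal sketch of what one stage of such a quality gate might look like. It assumes a Python codebase with pytest for unit tests and flake8 for static analysis; the script name, the src/ path and the check list are hypothetical, not a prescription for any particular team.

```python
#!/usr/bin/env python3
"""quality_gate.py -- hypothetical CI quality-gate sketch.

Runs unit tests and static analysis in sequence; a non-zero exit code
fails the build. Assumes pytest and flake8 are installed in the build
environment.
"""
import subprocess
import sys

# Each entry is a named pipeline check and the command that runs it.
CHECKS = [
    ("unit tests", ["pytest", "--maxfail=1", "-q"]),
    ("static analysis", ["flake8", "src/"]),  # src/ is an assumed layout
]

def main() -> int:
    for name, cmd in CHECKS:
        print(f"Running {name}: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Quality gate failed at: {name}", file=sys.stderr)
            return result.returncode
    print("Quality gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

In a real pipeline these stages would be expressed in the CI tool's own configuration, and integration and acceptance suites would be added as the product matures; the point is simply that each commit passes through an automated gate.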
If the product is ready for acceptance testing, then the product is ready for a deployment (synthetic or production). If the product is deployed, then it should be measured with deep telemetry (dynamic analysis) such as RUM (Real User Measurement), APM (Application Performance Management) and ASM (Application Security Management). Artifacts such as log files and live telemetry from component systems (queuing systems, ephemeral caches, RDBMS and non-relational stores) should be captured and used. Why? Because the data is there. Why ignore passive data that can be captured, organized and analyzed in an automated fashion?
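As a small illustration of how that kind of telemetry can be emitted from application code, here is a hedged sketch of a timing decorator that writes structured measurements to a log. The operation name, log fields and checkout function are assumptions for illustration; a real deployment would ship these records to an APM or RUM backend rather than stdout.

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("telemetry")

def measured(operation: str):
    """Record duration and outcome of an operation as a JSON log line.

    A real system would forward these records to an APM/RUM collector;
    emitting them as structured logs is just the simplest stand-in.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                log.info(json.dumps({
                    "operation": operation,
                    "duration_ms": round((time.perf_counter() - start) * 1000, 2),
                    "status": status,
                }))
        return wrapper
    return decorator

@measured("checkout")  # hypothetical business operation
def checkout(order_id: str) -> str:
    time.sleep(0.05)  # stand-in for real work
    return f"processed {order_id}"

if __name__ == "__main__":
    checkout("order-42")
```

The same decorator runs unchanged in a test environment and in production, which is exactly why the production data is there for the taking.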
I can't really explain why this data so often gets ignored. It simply does, because so many development organizations focus on the discrete activities of testing and fail to capture the more meaningful data that comes from telemetry embedded in the development process. The same telemetry that can be captured during testing can be captured from live production systems. It's like a golden egg that gets laid every day; the team has to take advantage of this goldmine of data.
I had the chance to talk with Badri Sridharan from LinkedIn about a year ago. Badri and I have both run Performance Engineering practices in our careers, and we were exchanging perspectives on the current state and future of Performance Engineering. During the call, Badri shared insight into a system called EKG that the Development and Operations teams introduced at LinkedIn. The blog post about EKG was written by the Operations team, so it presents a lot of infrastructure data points visually. If you look toward the bottom of the post, you will see the reference to exception counts and a "variety of other metrics". Those other metrics, as Badri explained, are functional verification data points. Teams at LinkedIn can get live production data for their canary and A/B deployments before they promote code throughout the whole system:
"EKG compares exception counts, network usage, CPU usage, GC performance, service call fanout, and a variety of other metrics between the canary and the control groups, helping to quickly identify any potential issues in the new code."
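LinkedIn has not published EKG's internals in detail, so the following is only a rough sketch of the idea as I understand it: take a metric collected from the canary group, compare it against the same metric from the control group, and flag large relative regressions. The metric names, sample values and threshold below are assumptions for illustration, not EKG's actual implementation.

```python
from statistics import mean

# Hypothetical per-host samples collected over the same time window.
control = {"exception_count": [2, 3, 2, 4], "cpu_pct": [41.0, 39.5, 42.3, 40.1]}
canary  = {"exception_count": [9, 7, 8, 10], "cpu_pct": [43.0, 41.2, 44.5, 42.8]}

THRESHOLD = 0.25  # flag if the canary mean exceeds the control mean by more than 25%

def relative_change(control_samples, canary_samples):
    """Return the relative change of the canary mean versus the control mean."""
    baseline = mean(control_samples)
    if baseline == 0:
        return float("inf") if mean(canary_samples) > 0 else 0.0
    return (mean(canary_samples) - baseline) / baseline

for metric in control:
    change = relative_change(control[metric], canary[metric])
    verdict = "REGRESSION" if change > THRESHOLD else "ok"
    print(f"{metric}: {change:+.0%} vs control -> {verdict}")
```

A production-grade comparison would use proper statistical tests over many metrics and hosts, but even this simple check captures the essence: the canary's live behavior is judged against a control before the code is promoted.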
I'm still learning what telemetry exists in our systems right now. I'm eager to hear from all of our teams about what data is captured, where it is stored, how it's made actionable, and how it's brought back into the development process.