As performance engineers, we often spend our time on forensics after an incident has occurred, and more precisely, after the incident has been recognized by one or more people. But what happens when an issue occurs and nobody reports it? Plenty of issues go unreported within our customer base. It’s amazing when I sit down with customers and they walk me through a scenario. I guess the amazing part is how nonchalant they are when they encounter an issue. Often they ignore issues or work around them. They recognize a defect is a nuisance, but reporting it is often more trouble than it’s worth. We all know that most system administrators don’t have hours of free time to scan log files for clues about issues that go unreported. Imagine cops wandering a busy street looking for a crime that may or may not have actually happened. It simply doesn’t happen.
But who says it can’t happen? In fact, many great software companies and organizations actively seek out issues that go unreported. They build frameworks into their products that notify development and/or support teams when an issue occurs. The issue might not be reported by the user, but it is being reported by the system. That would be a dream for me. Imagine an administrative feature set that sends some form of message whenever an exception occurs. Granted, we might get bombarded with hundreds or thousands of these messages, but we could easily build the capability to aggregate and analyze the data so that issues could be evaluated both individually and as a holistic trend.
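As a rough sketch of that idea (everything here is hypothetical, not a real product API), a process-wide exception hook could capture each unhandled error as a structured record, and a small aggregation step could roll those records up by exception type so they can be evaluated individually or as a trend:

```python
import sys
import traceback
from collections import Counter

# In-memory stand-in for a real reporting backend (hypothetical);
# a real system would ship these records to a collector service.
REPORTED = []

def report_exception(exc_type, exc_value, exc_tb):
    """Capture an exception as a structured record instead of
    letting it vanish into an unread log file."""
    REPORTED.append({
        "type": exc_type.__name__,
        "message": str(exc_value),
        "stack": "".join(traceback.format_tb(exc_tb)),
    })

def aggregate(records):
    """Roll individual reports up into counts per exception type,
    so hundreds of messages become one trend view."""
    return Counter(r["type"] for r in records)

# Install as the hook for unhandled exceptions in this process.
sys.excepthook = report_exception
```

The same record structure works whether the transport is a log shipper, a message queue, or an HTTP POST; the point is that the system reports the issue even when the user never does.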
Building an administrative feature wasn’t my intention for this blog. Rather, I wanted to talk about proactively evaluating issues that may or may not be reported by users. I call this forensic fishing. The idea is to strategically sample a software system with the intent of identifying a performance, scalability, or even a functional issue as part of the sampling effort. The goal is to seek out issues, or catch some fish. It’s not like you go fishing with the intent of going home empty-handed, right?
So why do I use the phrase “strategically sample”? Well, my first strategic purpose would be to sample customers who are early adopters of a release. Let’s reward early adopters by responding to issues before they report them. The second would be to sample a customer who has taken a patch or service pack based on a fix we made. Our team works dozens of issues per release. We verify each fix in our lab, but ideally we’d verify in the field with an even greater degree of confidence. Third, the whole goal of fishing is to find issues. Issues creep from release to release and go unnoticed. We may find an old issue that gets exposed because our magnifying glass happened to pass over it during a sampling period.
Of course, our plan is to fish using a pretty awesome rod, aka Dynatrace. The idea is to build a program with our colleagues in Managed Hosting to selectively pick customers for sampling. We have to decide when we sample and, most importantly, how long we sample. I’m a firm believer that we do not need to sample every aspect of a customer’s system. We could run with a lightweight profile and selectively choose one node, or maybe two. If a customer has 10 nodes, it’s not worth our time to sample them all. We obviously need some low-hanging fruit to initially focus on and react to, though a more sophisticated inspection could and should be performed as well. What would we be looking for? Everything from exceptions to unreleased connections to other forms of software anti-patterns. We would sample over short intervals, say 30 minutes, and use the compare features across profiles.
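The sampling strategy above can be sketched as a small planner. This is a minimal illustration, not Dynatrace tooling: customer names and node lists are invented, and the one-to-two-node limit and 30-minute window simply encode the choices described in this post:

```python
import random

# Hypothetical node inventory per customer; names are made up.
CUSTOMERS = {
    "cust-a": ["node1", "node2", "node3", "node4"],
    "cust-b": ["node1", "node2"],
}

SAMPLE_MINUTES = 30   # short sampling window, as described above
MAX_NODES = 2         # lightweight: never sample every node

def plan_sampling(customers, rng=random):
    """Build a sampling plan: for each customer, pick at most
    MAX_NODES nodes at random and a fixed-length window, rather
    than profiling every node all the time."""
    plan = []
    for customer, nodes in customers.items():
        chosen = rng.sample(nodes, min(MAX_NODES, len(nodes)))
        plan.append({
            "customer": customer,
            "nodes": chosen,
            "duration_min": SAMPLE_MINUTES,
        })
    return plan
```

Running the planner over the inventory yields one entry per customer, each naming the sampled nodes and the window length; successive runs of the same plan are what the profile-comparison step would diff.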