Have you ever wondered why outlying transactions are often discarded or ignored? For a long time I never gave it a second thought. Everything I had read up to a few weeks ago told me to avoid treating outliers as worthy transactions. Even my former mentors and respected peers told me to eliminate outliers. Take a look at this article from Scott Barber and you will see that he, too, discourages focusing on outliers. His article talks about finding outliers, which he defines below, but also about removing them from the result set.
“Outliers are atypical (by definition), infrequent observations; data points which do not appear to follow the characteristic distribution of the rest of the data. These may reflect genuine properties of the underlying phenomenon (variable), or be due to measurement errors or other anomalies which should not be modeled.”
I used to ignore outliers during my performance tests. Now I consider them the most important pieces of data I can find. They are the first thing I look for when reviewing the results of a performance test. Why, you might ask? Well, in my opinion, outliers tell you how bad the application's responsiveness can really get.
Let’s put it in perspective. Assume I am measuring login performance for the application. I have 1,000 independent samples of the login transaction. I review the data and determine that the mean response time is ~1.5s. I also find that the mode response time is 2.2s. These are all fine transaction response times. But what if the maximum response time was 600s? That’s 10 minutes, right? I can’t imagine someone willing to wait 10 minutes, but then again I found myself going to the deli counter at Safeway this weekend when I swore I would never go again. So anything is possible.
600 seconds is a ridiculous outlier, and it could have been even worse. There might have been a few other samples that performed nearly as badly. If we ignored those data points because our average happened to be 1.5s and our mode was 2.2s, we would be more likely to miss a performance issue.
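To make this concrete, here is a minimal sketch in Python using made-up sample data (the numbers and the 3x-the-mode threshold are hypothetical, chosen for illustration) showing how the mean and mode can both look healthy while a single 600-second outlier hides in the maximum:

```python
import statistics

# Hypothetical login response times in seconds: most samples look fine,
# but one outlier takes 600 seconds (10 minutes).
samples = [1.4] * 499 + [2.2] * 500 + [600.0]

mean = statistics.mean(samples)   # ~2.4s: barely nudged by the outlier
mode = statistics.mode(samples)   # most common value: 2.2s
worst = max(samples)              # the outlier itself: 600.0s

print(f"mean: {mean:.2f}s, mode: {mode}s, max: {worst}s")

# A crude outlier check: flag anything more than 3x the mode.
# (3x is an arbitrary threshold picked for this illustration.)
outliers = [s for s in samples if s > 3 * mode]
print(f"outliers: {outliers}")
```

If you only looked at the mean and mode, this test would pass without comment; the max, or even a crude threshold check like the one above, surfaces the user who waited 10 minutes.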
I think it’s important to remember that a performance issue doesn’t need to be a trend in order to be a performance issue. What I mean is that performance is relative to the person experiencing it. So if even one user experiences a cruel amount of latency, then a performance problem exists.
So knowing this, I think it’s in our best interest to figure out the causes of the latency inflicted on an outlier transaction. Let me leave you with this example.
This past Saturday my family moved from our house into a new house. I’ve been dealing with ordering cable television since Friday of last week. It’s been an incredibly frustrating process. When I called on Friday, I waited 25 minutes for someone to speak with me directly. My goal on Friday was to cancel my cable at the house I’m selling. On Monday I called the cable company again, this time to add service. I waited less than 2 minutes to speak with an operator. I called at exactly the same time, 6:15pm, on my way home from work. I was really surprised at how fast my call was answered. So I tried a couple of experiments on Monday and Tuesday, to see whether the cable company purposely made me wait when I wanted to cancel my service but provided quick service when I wanted to add or upgrade service.
So here’s what I did. I called the cable company using two different phone lines when I got home. On one phone I chose the option to cancel service; on the other, the option to add/upgrade service. I did this twice on Monday and once on Tuesday. Here’s what I saw: the three times I chose to cancel service, I waited 25 minutes, 17 minutes, and 11 minutes. The three times I chose to add/upgrade service, I waited 2 minutes, 4 minutes, and 1 minute.
I was concerned that maybe the cable company had different people serving different roles. Maybe they had more people handling new/upgrade requests than cancellations. So the first question I asked was whether the operator answering my call could do either function. In all three cases, the operator said they could do both, and that there was nothing at a system level routing one type of request to a particular operator; it was based entirely on operator availability. I was equally concerned that the time of day I called was a factor. Two things about that. First, I called at the same time using two different phones, so time of day can’t explain the difference between the two queues. Second, I also called at different times in case call volume was an issue. The wait times were nearly identical.
So what do we make of this story? Is it fair to say the cable company was creating an artificial performance issue on purpose? It could be, though it’s very difficult to prove. Does my story have nothing to do with outliers? Actually, it has everything to do with outliers. I was really upset waiting so long. I was so upset that I created my own experiment to see whether there was an actual conspiracy involved in canceling cable versus adding cable service. I’m starting to believe the conspiracy theory is more likely true than not. The big issue with my cancellation experience is that I was so incredibly frustrated that I will forever equate response time and throughput issues with the cable company…and when you think of cable companies, you want to think of fast, reliable service. Just like the deli counter example from a few weeks back, an outlier can be really problematic. The outlier can become an antagonist causing some pretty severe issues. These could be PR related or something else…
That being said, if I’m an outlier, doesn’t my experience matter?