I’ve been running through Galileo studying how some of the most recent calibration efforts have been going. I wanted to get a sense of what lies ahead for the benchmark team as we prepare for the Dell benchmark next week. What’s interesting about our recent calibration runs is how long they take. In our most recent PVT for SP1, calibration took 265 to 285 minutes (~5 hours). That’s not bad when you compare it to the old way of calibrating, which could take upwards of 8 hours to complete. Still, I don’t want to spend 5 hours calibrating when I think the answer can be obtained in 3. So I would like to consider an alternative calibration method. Don’t worry, Anand and Nakisa, I’ve done a sample run or two to prove it out. Before I get into this possible new calibration method, let me explain the old and current methods.
Our old method of calibration was based on our original concepts of user abandonment. We believed in the idea of a peak of concurrency (POC) and a level of concurrency (LOC). In this model, abandonment began when user response time thresholds could no longer be maintained. The LOC was the point where the rate of arrival equaled the rate of departure. The AOC was our term for the midpoint between the POC and LOC; essentially, it mattered because we felt there had to be a workload worth studying between the peak and the level.
Once we determined the POC, AOC, and LOC, we would then run a simulation with abandonment disabled. All simulations were 70 minutes in length, but each run took about 2 hours start to finish, so the whole process took a little more than 8 hours to complete. A long while ago we found fault with this approach. I’m feverishly looking for my old blog post about it, but can’t find it. Basically, my frustration with this effort was that we settled on the AOC and LOC results most of the time. I don’t have statistics on that, which I should; I’m going mostly from anecdotal memory.
There were other faults as well. It turned out that the workload we settled on often wasn’t sufficient to saturate our systems, causing our PARs to be off substantially. So from this old method came the current one…
Our current calibration approach is derived from steady-state workloads. We essentially run a staircase of workloads over a period of time, with a 2-minute recovery period between steps. Each workload is responsible for 10 samples after a short ramp-up period. What’s interesting about this type of test is that each step lets us look at a different workload level.
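To make the staircase concrete, here’s a minimal sketch of how the schedule lays out in time. The workload levels and the ramp length are illustrative assumptions; only the 2-minute recovery and the 10-sample windows come from our actual setup.

```python
# Sketch of a staircase calibration schedule: ramp into a step, hold for
# the sample window, recover, then climb to the next workload level.
# Levels and ramp_min are illustrative, not our real PVT settings.

def staircase_schedule(levels, samples_per_level=10, sample_min=2,
                       ramp_min=2, recovery_min=2):
    """Return (workload, start_min, end_min) for each step's sample window."""
    schedule = []
    t = 0
    for level in levels:
        t += ramp_min                        # short ramp-up into the step
        start = t
        t += samples_per_level * sample_min  # steady-state sample window
        schedule.append((level, start, t))
        t += recovery_min                    # 2-minute recovery between steps
    return schedule

for level, start, end in staircase_schedule([20, 40, 60, 80, 100]):
    print(f"workload {level:>3}: minutes {start}-{end}")
```

With enough steps in the staircase, the minutes add up quickly, which is where the ~5-hour total comes from.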
I have 2 problems right now with this type of calibration. The first is that the process takes upwards of 5 hours to complete. That’s way too long, because if a calibration fails we have at best 2 shots a day at running it. The second issue is a fault I find with arrivals. We ramp up in each cycle, so with a workload of 20 and a 2-minute ramp, we are essentially ramping 1 VUser every 6 seconds. If the sample period is based on 10 iterations occurring over 20 minutes, the first 2 minutes skew our data: we are likely to see 90-9-1 acceptable responses during that 2-minute bin, and the data might be interpreted incorrectly.
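The ramp skew is easy to quantify with the numbers above (20 VUsers, a 2-minute ramp, and a 20-minute sample window):

```python
# Back-of-the-envelope check on the ramp skew described above.

workload = 20            # VUsers at this step
ramp_seconds = 120       # 2-minute ramp-up
sample_window_min = 20   # 10 samples x 2 minutes each

interval = ramp_seconds / workload                   # seconds between arrivals
skewed_fraction = (ramp_seconds / 60) / sample_window_min

print(f"1 VUser every {interval:.0f} seconds")                    # 6 seconds
print(f"{skewed_fraction:.0%} of the sample window is ramping")   # 10%
```

So a tenth of every sample window is spent below full concurrency, which is plenty to drag the bin’s numbers toward the optimistic side.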
So with that I’ve come to propose a third type of calibration…
I’ve only had the chance to test this twice at the time of this blog, and I’m already interested in what I’m seeing. Basically, I’ve designed a very simple scenario in which we take an extraordinary number of available VUsers into one pool. Let’s say for grins we take 500 VUsers. We arrive that pool of users over about an hour; in our case it was about 58 minutes, arriving 1 VUser every 7 seconds. Abandonment was on.
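The arrival pattern above can be sketched as a single linear ramp. The pool size and interval match the trial run described in the text; everything else here is just illustration.

```python
# Minimal sketch of the proposed single-ramp calibration: one big pool of
# VUsers arriving at a fixed interval, with abandonment left enabled.

POOL = 500          # total VUsers available
INTERVAL_S = 7      # one new arrival every 7 seconds

def arrivals(pool=POOL, interval_s=INTERVAL_S):
    """Yield (vuser_index, arrival_time_seconds) across the whole ramp."""
    for i in range(pool):
        yield i + 1, i * interval_s

last_vuser, last_t = list(arrivals())[-1]
print(f"last VUser {last_vuser} arrives at {last_t / 60:.0f} minutes")
```

Note the arithmetic checks out: 499 intervals of 7 seconds puts the last arrival at roughly the 58-minute mark mentioned above.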
Here’s what we learned. First off, at some point we see diminishing returns: from an abandonment perspective, the users going in start to get outnumbered by those coming out. The same can be said about hits/second; we eventually reach a maximum, and that’s all the system can do.
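One way to spot that diminishing-returns point automatically is to walk the hits/second series and flag where adding VUsers stops buying throughput. The gain threshold and the sample series below are illustrative assumptions, not data from our runs.

```python
# Flag the first sample where relative throughput gain falls below a
# threshold, i.e. where arrivals are outnumbered by departing/abandoning
# VUsers and hits/second flattens out.

def saturation_point(hits_per_sec, min_gain=0.02):
    """Index of the first sample whose relative gain drops below min_gain."""
    for i in range(1, len(hits_per_sec)):
        prev, cur = hits_per_sec[i - 1], hits_per_sec[i]
        if prev > 0 and (cur - prev) / prev < min_gain:
            return i
    return None

# Throughput climbing during the ramp, then flattening at saturation:
series = [50, 80, 110, 130, 140, 141, 140, 139]
print(saturation_point(series))  # → 5
```

The same scan works on the abandonment counts: once completions plus abandons outpace arrivals, the curve has told you everything it’s going to.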
In our scenario, our problem was unique to us: we were CPU bound on an 8-core system, and we got there with only 125 VUsers. I couldn’t believe my eyes. That’s crazy when you think about it, because we should have been memory bound like we’ve been in the past. With the recent JVM changes, the balance of resources has shifted from memory to CPU. This could be problematic at Dell, or it could be a blessing.
What’s interesting about calibration is that I still think you have to bring the data into 90-9-1. The difference is that we are really looking at a true VUser curve rather than a VUser cardiograph.
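For readers new to the 90-9-1 shorthand, here’s a hedged sketch of binning response times into three buckets and checking the split. The 2-second and 6-second thresholds are purely illustrative assumptions; only the 90/9/1 target proportions come from this post.

```python
# Bucket response times into good / tolerable / bad and report the split.
# Threshold values are hypothetical, chosen only for illustration.

def bucket_split(response_times, good_s=2.0, tolerable_s=6.0):
    """Return the fraction of responses landing in each bucket."""
    n = len(response_times)
    good = sum(1 for t in response_times if t <= good_s)
    tolerable = sum(1 for t in response_times if good_s < t <= tolerable_s)
    bad = n - good - tolerable
    return good / n, tolerable / n, bad / n

times = [1.2] * 90 + [4.0] * 9 + [9.5] * 1
print(bucket_split(times))  # → (0.9, 0.09, 0.01)
```

On a true VUser curve, this split can be computed per arrival-level rather than per ramp bin, which sidesteps the skew problem from the staircase approach.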