In the past I have discussed an approach to performance testing individual clickpaths in which we arrive users in batches at a steady state, then increase the batch size over time until the system breaks down. I then stated that we identify the aggregate workload from the sum of executions that occurred during the steady state. Once we have an idea of the peak concurrent workload in the steady state, plus the aggregate workload from the entire test, we can evaluate the maximum workload.
I hypothesize that the maximum workload falls between the peak workload and the aggregate workload.
This can be seen in the visual below:
I definitely think this type of test provides very important insight into concurrency testing. I am less confident, however, that we can place much trust in response time metrics when we consider only one sample per batch. My main argument is based on a point Cary Millsap (formerly of Hotsos) made in his paper about skew.
Example: All of the following lists have a sum of 10, a count of 5, and thus a mean of 2:
A = (2, 2, 2, 2, 2) Has No Skew
B = (2, 2, 3, 1, 2) Has a Little Skew
C = (0, 0, 10, 0, 0) Has a High Skew
Essentially, if we don’t understand our skew factor, whether for response times or resource instrumentation metrics, then we are not looking at our data effectively.
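To make the skew factor concrete, here is a minimal sketch using only Python's standard library. The population standard deviation stands in as one simple dispersion measure for what the lists above illustrate; three lists with identical means can have very different spreads:

```python
from statistics import mean, pstdev

# The three lists from the example: same sum (10), count (5), and mean (2)
A = [2, 2, 2, 2, 2]    # no skew
B = [2, 2, 3, 1, 2]    # a little skew
C = [0, 0, 10, 0, 0]   # high skew

for name, data in (("A", A), ("B", B), ("C", C)):
    # identical means, but the dispersion exposes the difference
    print(f"{name}: mean={mean(data):.1f}, pstdev={pstdev(data):.3f}")
# A has pstdev 0.000, B ~0.632, C 4.000
```

The mean alone would report all three transactions as identical, which is exactly the trap the skew argument warns against.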
I would like to consider a test that uses multiple samples so that we can have greater confidence and determine how much skew exists in the sample response times. I would also like the test to start at the lowest possible workload (i.e., 1). Here’s how I would envision the test executing:
- (1) VUser executes Use Case X, (5) samples (each sample has a short (1) minute recovery time)
- (5) VUsers execute Use Case X, (5) samples (each sample has a short (1) minute recovery time)
- (25) VUsers execute Use Case X, (5) samples (each sample has a short (1) minute recovery time)
- (50) VUsers execute Use Case X, (5) samples (each sample has a short (1) minute recovery time)
- (N) VUsers execute Use Case X, (5) samples (each sample has a short (1) minute recovery time)
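The ramp above could be sketched as a simple driver loop. This is only an illustration under assumptions: `run_use_case` is a hypothetical placeholder for whatever tool actually drives Use Case X, and `RAMP_STEPS` just continues the pattern listed:

```python
import time

# Hypothetical constants mirroring the ramp described above
RAMP_STEPS = [1, 5, 25, 50]   # extend toward (N) as the system allows
SAMPLES_PER_STEP = 5
RECOVERY_SECONDS = 60         # the short (1) minute recovery time

def build_schedule(steps, samples_per_step):
    """Expand the ramp into an ordered list of (vusers, sample_number) runs."""
    return [(v, s + 1) for v in steps for s in range(samples_per_step)]

def run_use_case(vusers):
    """Placeholder: drive `vusers` concurrent users through Use Case X
    with the actual load tool and return the sampled response times."""
    raise NotImplementedError

def run_test():
    results = []
    for vusers, sample in build_schedule(RAMP_STEPS, SAMPLES_PER_STEP):
        results.append((vusers, sample, run_use_case(vusers)))
        time.sleep(RECOVERY_SECONDS)  # recovery between samples
    return results
```

Keeping the schedule explicit makes it easy to compute per-step skew afterward, since each workload level contributes a fixed population of 5 samples.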
My thought is that a population of 5 will provide enough samples to determine the skew factor of a given transaction. Although (1) minute is not especially long for recovery, we may find that it provides little relief at higher workloads.
The new model would look like this:
There are multiple considerations as to when a test is considered complete. First, we should have a quantifiable performance objective defined. The most obvious objectives are a response time threshold and an error rate percentage. Typically, we want to define three dimensions of response time: Satisfied, Tolerating, and Frustrated. This is based on our Apdex philosophy.
Assume we consider a response time range of 0 to 10 seconds as our threshold definition. Using Apdex, we might define T=2.5s and 4T=10s. This would mean that we are satisfied when response times are below 2.5s, tolerating when response times fall between 2.5s and 10s, and frustrated when they exceed 10s. We would essentially expect abandonment beyond 10s.
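These thresholds plug directly into the standard Apdex formula, score = (satisfied + tolerating/2) / total. A minimal sketch, assuming response times are collected in seconds:

```python
def apdex(response_times, t=2.5):
    """Apdex score for a list of response times (seconds).

    Satisfied: <= T; Tolerating: between T and 4T; Frustrated: > 4T.
    Score = (satisfied + tolerating / 2) / total, ranging 0.0 to 1.0.
    """
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# With T=2.5s: two satisfied, one tolerating, one frustrated
print(apdex([1.0, 2.0, 3.0, 12.0]))  # (2 + 0.5) / 4 = 0.625
```

Because tolerating samples count only half, the score degrades smoothly as responses drift from the satisfied zone toward abandonment.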
The same kind of philosophy could be applied to performance based on error rate. Ideally we would want 0 errors; that would be considered successful and acceptable. There is a chance, however, that we might accept some failure. We might create a concept similar to Apdex based on error rate, setting thresholds of E=1% and 4E=4%. We would subsequently graph workloads at 0%, 1% and 4%.
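A sketch of that error-rate analog, assuming the E=1% / 4E=4% thresholds proposed above (the band names here are illustrative, not part of Apdex):

```python
def classify_error_rate(errors, total, e=0.01):
    """Classify a run by error rate, mirroring the Apdex-style bands.

    acceptable: rate <= E; tolerable: E < rate <= 4E; unacceptable: > 4E.
    Defaults assume the proposed E=1% (so 4E=4%).
    """
    rate = errors / total
    if rate <= e:
        return "acceptable"
    if rate <= 4 * e:
        return "tolerable"
    return "unacceptable"

print(classify_error_rate(0, 100))  # acceptable (0% error rate)
print(classify_error_rate(3, 100))  # tolerable (3% is between 1% and 4%)
```

Tagging each workload step with its band makes it straightforward to graph the workloads achieved at 0%, 1%, and 4% error rates.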
This doesn’t necessarily tell us when to end the test. Until we begin defining performance requirements, we may have to make assumptions about whether we believe a use case is scalable.