If you’ve done any performance testing, or even seen a demo, you are probably familiar with the “big line graph” that is created as the test runs.
The typical scenario is simple: Have the load tool start with one simultaneous user running a script. Then add a new user, say, every thirty seconds plus a small random pause. The script consists of a set of pre-defined operations – login, search, click on some item, and logout.
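As a rough sketch, assuming a hypothetical run_script() placeholder for the login/search/click/logout sequence (none of these names come from any particular tool), that ramp-up might look like this in plain Python:

```python
import random
import threading
import time

def run_script(user_id):
    # Hypothetical placeholder for the scripted session:
    # login, search, click on some item, logout -- repeated until the test ends.
    pass

def ramp_up(max_users=1000, step_seconds=30, jitter_seconds=5):
    # Start with one simulated user, then add another roughly every
    # thirty seconds plus a small random pause.
    for user_id in range(max_users):
        threading.Thread(target=run_script, args=(user_id,), daemon=True).start()
        time.sleep(step_seconds + random.uniform(0, jitter_seconds))
```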
As more users are added we expect to see response times get higher. We might see response times go through the roof at some magic number as the system hits a bottleneck, or, perhaps it goes up gracefully. With some cloud-based systems, ideally, performance stays within a range over time.
It’s a simple way to do testing. It is straightforward, even easy to implement.
And it has a pile of problems.
Let’s dig in, shall we?
Do Averages Matter?
The report at top-left gives us one data point: Average Response Time. That is not the actual response time of any individual request; it is calculated by adding up all the time-to-respond values over that time interval and dividing by the number of requests.
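That calculation is nothing more than a mean over the interval. A minimal sketch, with names of my own choosing:

```python
def average_response_time(samples):
    # samples: per-request response times, in seconds, collected
    # during one reporting interval of the test run.
    return sum(samples) / len(samples)

# e.g. average_response_time([0.9, 1.4, 1.7, 2.1, 2.4]) -> 1.7
```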
Say we have a million of these requests, averaging around 1.7 seconds. If we plot the number of responses near 0.1 seconds, 0.2 seconds, 0.3 … on up to 3 or 4 seconds, well, we’d expect to see something like this, right?
If we look at all the response times, we see a distinct trend toward that 1.7 second average. With a ‘tighter’ bell curve, the average means more, because the user experience is more like that, more of the time. With a wider bell curve, there are more people on the edges, having better (and worse) experiences.
But who said that data looks like a bell curve? What if we plotted the data over time and it looked more like this?
The distribution above has two modes: a large peak around 2.2 seconds, where most of the responses cluster, and a small bump – a ‘little hump’ – around 4 seconds.
That little hump is a problem.
If the hump is small enough, and all we are looking at is averages, it can easily be lost in the line noise.
Here’s a simple example with some math:
Say you model the system with a simple, one-time operation (login), and a large number of varied operations – search, create, read, update, delete, tag, and so on. You want to model the user behavior, so you bump around all the features – say thirty in all.
Most of the features show true bell-curve-like behavior. The twenty-nine operations other than login are relatively fast, taking an average of two seconds each under moderate load – say a simulated load of a thousand users. (Yes, I realize that for Amazon.com, a thousand users is not moderate. But that’s a different blog post.)
Login, however, is a problem. It gets progressively slower. By the time you are at a thousand users, login takes an average of thirty-five seconds. That’s well beyond the point at which users simply fire up a new browser tab and go to a different site.
The problem is, login is one of thirty operations. The average response time is (2 seconds × 29 other features + 35 seconds) / 30 total features.
That puts our “average response time” at 3.1 seconds.
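Spelled out as a quick check:

```python
other_features = 29      # operations averaging 2 seconds each
login = 35               # the one badly misbehaving operation

average = (2 * other_features + login) / 30
print(average)           # 3.1 -- the 35-second login all but vanishes
```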
Not perfect, but reasonable, right?
Not really. That number doesn’t show us our one critical outlier function. Beyond that, the 3.1-second figure distorts the picture. If the team decides 3.1 seconds is not fast enough, they may spend time trying to improve search, tag, or other ‘intuitively’ slow features that actually aren’t slow at all.
That’s waste.
Lesson: Don’t stop at the Average Response Time. Graph a frequency distribution of the response time. Better yet, chart the raw data.
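A minimal sketch of that, assuming the load tool can export raw per-request timings to a CSV file (the file name here is hypothetical):

```python
import numpy as np

# Raw per-request durations, in seconds, exported from the load tool's log.
response_times = np.loadtxt("raw_response_times.csv")

# Frequency distribution in 0.1-second buckets -- the 'little hump' shows up here.
counts, edges = np.histogram(response_times, bins=np.arange(0.0, 5.1, 0.1))

# Percentiles tell the story the mean hides.
p50, p95, p99 = np.percentile(response_times, [50, 95, 99])
print(f"median={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
```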
Another Wrinkle
Say that under a few hundred users, search begins to fail, returning someone else’s search results.
Note: This actually happened for a website called cuil.com. Founded by former Googlers, the company raised the largest amount of cash of any Silicon Valley startup in 2008 … and was out of business by October of 2010. I have experienced it in my consulting career as well.
Again, we’ve got a mistake in the program, and it is returning someone else’s search results.
If all our test does is repeat the same sequence of operations, over and over, with a cleanup process at the end, then the wrong search results will look just fine.
If the tool is a performance tool, not a functional tool, it is probably recording and playing back traffic, not examining the actual links on the page. So the search results could be malformed, corrupt – heck, the wrong page could be served entirely – and we wouldn’t know. The load tester plays back the traffic that occurs when the user clicks the link that should be there; it doesn’t actually check to see whether the link is there.
Doing One Better
Several test tools let the tester capture and watch a session as it occurs, as a spot check on quality. Even if yours doesn’t, you can get the same result by performing functional testing on the system under test while it is under load.
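One way to do that is to add a content assertion alongside the recorded traffic. A rough sketch, with a placeholder URL, parameter name, and marker text of my own invention:

```python
import requests

def search_returns_my_results(session, term, expected_marker):
    # Issue a real search and inspect the body, not just the status code.
    # expected_marker should be something only *this* user's results page
    # would contain -- their own username in the header, for example.
    response = session.get("https://example.com/search", params={"q": term})
    if response.status_code != 200:
        return False
    return expected_marker in response.text

# Usage sketch: log in with a requests.Session(), then
# assert search_returns_my_results(session, "blue widgets", "logged in as: testuser01")
```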
You also want to make sure the load is valid. If something happens to make the request invalid, it is possible that you are not simulating one thousand searches, but instead one thousand page-not-valid rejections.
Most test tools show ‘red’ when the server throws a 404, 500, or other ‘invalid page’ error – but it’s possible that the website in question never throws one at all, serving its error page with a ‘200 OK’ instead. (There’s one more thing to check.)
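A rough sketch of that kind of sanity check, assuming the site’s friendly error page contains some recognizable text (the phrase below is a placeholder):

```python
from collections import Counter

import requests

def classify_responses(urls, error_text="Something went wrong"):
    # Tally status codes, and also flag 'soft' failures: pages that come
    # back 200 but actually contain the site's error message.
    tally = Counter()
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.ok and error_text in response.text:
            tally["soft_error"] += 1
        else:
            tally[response.status_code] += 1
    return tally
```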
Lesson: There will be plenty more ‘one more thing’s to check.
Don’t worry, there’s also more to come.