I was fortunate to be able to speak at Eurostar 2013 on performance testing. They had a very interesting concept where the track sessions were limited to 30 minutes. The extra time was used for questions and answers. I really liked the concept. It did, however, mean that I could not tell everything I wanted in the half hour. One important thing I only spoke about briefly is the weak points that usually exist in performance testing. In this blog post I want to make up for that and elaborate a bit more on them.
Is there anything wrong with performance testing?
I asked the audience the following questions:
The first question especially was very familiar to the audience. I have seen it myself as well. In my experience this really is a common phenomenon.
So what is wrong?
I named 5 weak points at the conference:
Let's look at these weak points:
When we perform load or stress tests, we nearly always use a tool to simulate multiple users. To create a test, the test tool simulates the communication between the client software on the users' computers and the server. For a web application, for instance, we first perform the test manually in the web browser while the test tool captures the HTTP traffic to the server. Then, to simulate the load or stress, the test tool sends the same traffic again, only multiplied by the number of users we test for. This is a bit oversimplified. In reality we need to do more, such as adapting the calls so that they carry different data.
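The record-and-replay idea above can be sketched in a few lines. This is a minimal illustration, not how any particular tool works: the recorded steps, the `send` stand-in (which only sleeps instead of making a real HTTP call) and the search terms are all made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical recorded scenario: the steps one user performed in the browser.
RECORDED_STEPS = ["GET /login", "POST /search?q={term}", "GET /result/1"]


def send(step):
    """Stand-in for a real HTTP call; it only waits briefly and reports timing."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for network and server time
    return step, time.perf_counter() - start


def virtual_user(term):
    """Replay the recorded steps once, with this user's own test data."""
    return [send(step.format(term=term)) for step in RECORDED_STEPS]


def run_load(terms):
    """Run one virtual user per data item, all at the same time."""
    with ThreadPoolExecutor(max_workers=len(terms)) as pool:
        return list(pool.map(virtual_user, terms))


results = run_load(["bread", "milk", "eggs"])
```

Even in this toy version you can see where the simplifications creep in: every virtual user follows exactly the same steps, from the same machine, with only the data swapped.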
The problem here is that the test tool will behave differently from actual users. Even if we apply advanced ramp-up scenarios to simulate that users in reality don't click at the same time, the behavior is always a bit different. In reality the server may reply differently for the second user, and the test tool may have its network settings a bit different from the client's. Even the fact that all calls come from one IP address may cause different behavior.
Another important factor is a bit less technical. In our tests we only simulate a subset of the actual functionality. Even if we combine tests to run at the same time, there is always a limit on how much functionality we actually simulate.
To summarize both aspects, the simulated behavior really is a lot different from what we will experience in production. The tests may teach us a lot and may help us assess whether we can handle the load in production, but we can never state with certainty that we have simulated the real load and that our test results are equal to what we'll see in production.
This is a nice one as well. Performance testers are used to making a model of the behavior we expect to serve as a base for the test. This is often also called a load profile.
The model is based on the number of users we expect to have, how they will use the application and when they will use it. The thing is: all of this is based on assumptions, each one further from reality than the last. The uncertainty of all those assumptions doesn't just add up: it multiplies.
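One way to read "it multiplies" is as a simple probability calculation. The sketch below is only an illustration: the three confidence figures are invented, and it assumes the assumptions are independent of each other, which in practice they may not be.

```python
import math

# Hypothetical confidence (0..1) in each assumption behind a load profile.
assumptions = {
    "number_of_users": 0.8,
    "usage_pattern": 0.7,
    "peak_moment": 0.6,
}


def combined_confidence(confidences):
    """Chance that all assumptions hold, assuming they are independent."""
    return math.prod(confidences)


print(round(combined_confidence(assumptions.values()), 3))
```

With these made-up numbers, three assumptions that each look fairly reasonable leave a model that is right about one time in three.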
I once saw a report where everything was based on a figure for the number of clicks that a business analyst had produced. I asked the business analyst, and he stated that even if the number of clicks were a proper measure, the figure was still nonsense since he had just guessed.
Sometimes organizations take a very serious approach to making these models. They use usage labs to determine how people actually use the application. They assume worst-case scenarios for forecasting the peak loads. This will bring them closer to information that is very useful. But if the behavior in production turns out to be exactly the same, that still takes a lot of luck.
And then something interesting happens. The test team produces a report that states things like: the application responds in 2.6 ms for function x.
That's interesting. We state as a fact that the application will respond in a certain way, even though we know our tools behave differently from real users and our tests simulate behavior that is probably very different from what will happen in production.
In high school math I learned that you may never report more precisely than your measurements allow. I wrote a blog about this some time ago.
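To make that rule concrete, here is a minimal sketch of one possible convention: round the mean to the first significant digit of the spread in the measurements. The function name `report` and the sample response times are made up for the illustration, and other conventions for reporting uncertainty exist.

```python
import math
import statistics


def report(samples_ms):
    """Format a mean response time no more precisely than its spread allows.

    Needs at least two samples, since the spread is a sample standard deviation.
    """
    mean = statistics.mean(samples_ms)
    spread = statistics.stdev(samples_ms)
    if spread == 0:
        return f"{mean:g} ms"
    # Keep one significant digit of the spread and round the mean to match.
    digits = -int(math.floor(math.log10(spread)))
    return f"{round(mean, digits):g} +/- {round(spread, digits):g} ms"


# Hypothetical response times in milliseconds from five test runs.
print(report([120.0, 340.0, 260.0, 190.0, 300.0]))
```

The raw mean of these samples is 242 ms, but with this much variation between runs the last digit carries no information, so the report drops it.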
This may seem like an effect of the other items. I mention it separately, though, since I believe it to be a problem on its own. If the report were much more in line with the uncertainty of the measurements, it could still have actual value.
This one is tricky as well. The brochures of the test tools make it look like recording tests and then running them under load is simple and takes just a few clicks. In reality recording and adapting the scripts is cumbersome and takes a lot of time. One test performed manually for one user may take 2 minutes. Getting this same test to run as multiple virtual users may cost you days of troubleshooting. So we severely limit the functionality we put under load. As a result, we only test a very small subset of the functionality. The coverage is therefore low and we may miss important functions.
It has happened to me that we judged the performance of an application to be unchanged for a certain delivery. After go-live there were huge performance issues. It turned out that a small function that we didn't even know existed had received just a minor change, causing it to wreak havoc on the database.
Not being able to test everything is a common fact of life for a tester. For performance testing, however, it is even worse.
So if it happens so often that performance tests are way off, surely the testers must get into trouble?
Actually, no, that doesn't happen. There are many reasons for this. One reason is that when you go live and the performance is not good enough, everyone is focused on getting it fixed. Evaluating why it went wrong is for later.
And if later actually materializes, it turns out that if you really read the report, you could see that no certainty could be had from the results. There are no blunt lies in there. Usually there just is a load of technical information in there, illustrated with impressive graphs and stories.
Imagine if I were to test the quality of a loaf of bread by measuring how high it bounces when I throw it out the window on the first floor. If I were to report on it I would not actually lie. But you'd realize how useless it was and that it would not tell you anything.
The loaf of bread example is easy to understand. But when there is so much technical information, impressive graphs, jargon, and models that at first glance seem logical, how are you to tell?
Now I certainly don't want to claim that performance testers do all this deliberately. Most will be professional and want to deliver the best result and value. The theme of Eurostar 2013 was questioning testing. One keynote in particular showed how we assume many things and forget to question the way we work. The usual way of doing performance testing seems appropriate. But as my question to the audience showed, so many have experienced performance testing being way off that I think it is fair to question it. The 5 points mentioned here are not the only weak points, and these points should and will be questioned as well. But for performance testing to improve, we will have to look at them.
Complaining is easy. Is there a way to do things better? I've proposed a different approach during the talk at Eurostar. Most of all I think it starts by being aware and being explicit on the value and limitations of the tests we do. We should do performance testing. We should also work on doing it better.