The weak points of performance testing

posted 14 Dec 2013, 06:05 by Albert Witteveen   [ updated 14 Dec 2013, 06:11 ]

I was fortunate to be able to speak at Eurostar 2013 on performance testing. They had a very interesting concept where the track sessions were limited to 30 minutes, with the extra time used for questions and answers. I really liked the concept. It did mean, however, that I could not tell everything I wanted in the half hour. One important thing I only touched on briefly is the weak points that usually exist in performance testing. In this blog I want to make up for that and elaborate on them a bit more.

Is there anything wrong with performance testing?

I asked the audience the following questions:

  • Have you ever experienced that the performance test passed with flying colors but straight after go live they had to at least quadruple the amount of hardware?

  • Have you ever experienced it the other way around, where the performance testers said it wasn't good enough, a go was given anyway, and in production it turned out nothing was wrong?

The first question in particular was very familiar to the audience. I have seen it myself as well. In my experience this is a really common phenomenon.

So what is wrong?

I named 5 weak points at the conference:

  1. the tools simulate but are not quite equal

  2. load profiles are based on too many assumptions

  3. we report more accurately than we can measure

  4. long setup time → limited amount of tests

  5. we hide it all in complex reports

Let's look at these weak points:

  1. the tools simulate but are not quite equal

When we perform load or stress tests, we nearly always use a tool to simulate multiple users. To create a test, the tool simulates the communication between the client software on the users' computers and the server. For instance, if we have a web application, the HTTP traffic to the server is first captured by performing the test manually in a web browser while the test tool records the traffic. Then, to simulate the load or stress, the test tool sends the same traffic again, multiplied by the number of users we test for. This is a bit oversimplified; in reality we need to do more, such as adapting the calls to represent different data.

The problem here is that the test tool will behave differently than actual users. Even if we apply advanced ramp-up scenarios to simulate that in reality users don't all click at the same time, the behavior is always a bit different. In reality the server may reply differently for the second user, the test tool may have its network settings a bit different from the client's, and even the fact that all calls come from one IP address may cause different behavior.
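The record-and-replay idea can be sketched in a few lines. This is a minimal illustration, not a real tool: `captured_request` is a hypothetical stand-in for the recorded HTTP call, and a real tool would also rewrite the calls to vary test data.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def replay(captured_request, virtual_users, ramp_up_s=0.0):
    """Replay one captured call as a number of virtual users.

    `captured_request` is a stand-in callable for the recorded HTTP
    call. Returns the response time observed by each virtual user.
    """
    def one_user(user_id):
        # Stagger the starts so the users don't all fire at once.
        time.sleep(ramp_up_s * user_id / max(virtual_users - 1, 1))
        start = time.perf_counter()
        captured_request(user_id)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        return list(pool.map(one_user, range(virtual_users)))
```

Note that even this sketch shares one process, one network stack and one IP address across all "users", which is exactly where the discrepancy with real users comes from.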

Another important factor is a bit less technical. In our tests we only simulate a subset of the actual functionality. Even if we combine tests to run at the same time, there is always a limit on the amount of functionality we actually simulate.

To summarize both aspects: the behavior really is a lot different from what we will experience in production. The tests may teach us a lot and may help us assess whether we can handle the load in production, but we can never state with certainty that the load we simulated and our test results are equal to what we'll see in production.

  2. load profiles are based on too many assumptions

This is a nice one as well. Performance testers are used to making a model of the behavior we expect, to serve as a base for the test. This is often called a load profile.

The model is based on the number of users we expect to have, how they will use the application and when they will use it. The thing is: all of this is based on assumptions, each one further from reality than the last. The inaccuracies of all those assumptions don't add up to decreasing reliability: they multiply it.
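The multiplication effect can be made concrete with a back-of-the-envelope sketch. The accuracy figures below are made up purely for illustration:

```python
def profile_accuracy(assumption_accuracies):
    """Overall accuracy of a load profile built from several
    independent assumptions: the product, not the sum, of the parts."""
    result = 1.0
    for accuracy in assumption_accuracies:
        result *= accuracy
    return result

# Four assumptions that are each a fairly optimistic 80% accurate:
# the profile as a whole is only about 41% accurate.
overall = profile_accuracy([0.8, 0.8, 0.8, 0.8])
```

Four individually reasonable guesses leave you with a profile that is more likely wrong than right.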

I once saw a report where everything was based on a figure for the number of clicks that a business analyst had produced. When I asked the business analyst, he stated that even if the number of clicks were a proper measure, the figure was nonsense anyway since he had just guessed.

Sometimes organizations take a very serious approach to making these models. They use usage labs to determine how people actually use the application. They assume worst-case scenarios for forecasting the peak loads. This brings them closer to information that is very useful. But if the actual behavior turns out to be exactly the same, that is still mostly luck.

  3. we report more accurately than we can measure

And then something interesting happens. The test team produces a report that states things like: the application responds in 2.6 ms for function x.

That's interesting. We state as a fact that the application will respond in a certain way, even though we know our tools behave differently and our tests simulate behavior that is probably very different from what will happen in production.

In high school math I learned that you may never report more accurately than your measurements. I wrote a blog about this some time ago.

This may seem like an effect of the other items. I mention it separately, though, since I believe it to be a problem of its own. If the report were much more in line with the uncertainty of the measurements, it could still have actual value.

  4. long setup time → limited amount of tests

This one is tricky as well. The brochures of the test tools make it look like recording tests and then running them under load is simple and just a few clicks. In reality, recording and adapting the scripts is cumbersome and takes a lot of time. One test performed manually for one user may take 2 minutes; getting that same test to run as multiple virtual users may cost you days of troubleshooting. So we limit the functionality we put under load very much. As a result, we only test a very small subset of the functionality. Coverage is therefore low and we may miss important functions.

It has happened to me that we judged the performance of an application unchanged for a certain delivery. After go-live there were huge performance issues. It turned out that a small function we didn't even know existed had received a minor change, causing it to wreak havoc on the database.

Not being able to test everything is a common fact of life for a tester. For performance testing, however, it is even worse.

  5. we hide it all in complex reports

So if it happens so often that performance tests are way off, surely the testers must get into trouble?

Actually, no, that doesn't happen. There are many reasons for this. One reason is that when you go live and the performance is not good enough, everyone is focused on getting it fixed. Evaluating why it went wrong comes later.

And if later actually materializes, it turns out that if you really read the report, you could see that no certainty could be had from the results. There are no blunt lies in it. Usually there is just a load of technical information, illustrated with impressive graphs and stories.

Imagine if I were to test the quality of a loaf of bread by measuring how high it bounces when I throw it out of a first-floor window. If I were to report on it, I would not actually lie. But you'd realize how useless it was and that it would tell you nothing.

The loaf of bread example is easy to understand. But when there is so much technical information, impressive graphs, jargon, and models that at first glance seem logical, how are you to tell?


Now I certainly don't want to claim that performance testers do all this deliberately. Most are professional and want to deliver the best result and value. The theme of Eurostar 2013 was questioning testing. One keynote in particular showed how we assume many things and forget to question the way we work. The usual way of performance testing seems appropriate. But as my question to the audience showed, so many people have experienced performance testing being way off that I think it is fair to question it. The 5 points mentioned here are not the only weak points, and these points should and will be questioned too. But for performance testing to improve, we will have to look at them.

Complaining is easy. Is there a way to do things better? I proposed a different approach during my talk at Eurostar. Most of all, I think it starts with being aware of, and explicit about, the value and limitations of the tests we do. We should do performance testing. We should also work on doing it better.

Get Splunked.

posted 8 Dec 2013, 08:41 by Albert Witteveen

Originally posted July the 5th 2013

Part of my job is to take part in impromptu task forces for complex issues in production. A common denominator for these task forces is that the usual analyses have already been done and we haven't found a solution so far.

Naturally we follow a process where we try to exclude possible causes, look for correlation, check if anything outside of our direct view has changed, etc. But most of all, the solution often depends on one of the engineers having a 'hey, what's happening there?' moment.

The way to get to these moments is by digging through log files, monitoring traffic and checking event logs. What do they look for? Usually, you can't tell. You know it when you see it.

Does this sound familiar to testers? To me it does: many bugs are found by experienced testers having these moments. Whilst testing, either scripted or exploratory, they'll just notice something and dig deeper.


In production we are in the process of introducing a new tool for this kind of analysis. And I must say, I am impressed. The tool, Splunk, allows you to feed all sorts of data into it. Usually that means unstructured log files, or event logging from databases, middleware, etc. Once this has been set up you can very easily drill down through the information, correlate between different systems, and so on. The product boasts that it provides answers to questions you did not even know you had.

And indeed it does. We used it for some hard-to-tackle issues. It was amazing how quickly we noticed that one server was showing dramatically different behavior; we dug deeper and saw the root cause. It is a bit too technical to explain what it was, but as soon as we noticed it there was an immediate 'but of course' moment.

We could have found the issue manually. The particular log files were not loaded into Splunk for nothing. But what made it special is that the moment we loaded these files into the system, the issue practically stared us in the face.

In our regular monitoring we did not see this, simply because we weren't looking for it. No matter how much monitoring you apply, to a large degree it is always there to prevent issues you already encountered or anticipated you could get.


Another powerful feature is that it holds the logging of other systems as well, and everything in it is properly timestamped. If you see errors in your middleware system and at the same moment a lot of database errors popping up, chances are they are related. Correlation does not imply causation, but... (don't forget the mouseover)
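Splunk does this kind of cross-system correlation out of the box; the underlying idea can be shown with a toy sketch. This is only an illustration of the principle, not how Splunk implements it:

```python
def correlated_windows(errors_a, errors_b, window_s=1.0):
    """Count the time windows in which *both* systems logged errors.

    `errors_a` and `errors_b` are lists of error timestamps in
    seconds, e.g. from a middleware log and a database log.
    """
    windows_a = {int(t // window_s) for t in errors_a}
    windows_b = {int(t // window_s) for t in errors_b}
    return len(windows_a & windows_b)

# Middleware error at 12.1 s, database error at 12.7 s: same window,
# so they are flagged as possibly related.
hits = correlated_windows([12.1, 40.0], [12.7, 90.0])
```

A high count across two systems is a strong hint to dig deeper, nothing more.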

For engineers, not much is more fun than a new toy, and we are now enthusiastically going through the systems to see what we can find. Not all errors in production will lead to incidents, but they do have impact. Getting them fixed will improve the user experience. Possibly we can find performance improvements this way.

What does this have to do with testing?

Well, we are often involved in testing complex systems. Those systems have the same logging. As in my view the same thought process occurs when finding the less obvious issues, the same tool can most likely help to find issues in test in a similar way. The more stable new software gets, the more interested I would be in the errors that still show up in the logging.

The dashboard options of the tool also provide an easy way to monitor business processes end to end. That same monitoring in test can (if well interpreted!) provide some measurement of stability. It can even provide some basic info on test coverage, such as the number of transactions, or whether we saw every message type pass by or must create an extra case. All in all I see possibilities.

The biggest financial limit on introducing Splunk is that if you process more than 500 MB of log files per day, you need a paid license. Less than 500 MB is free. Fortunately, in test we often have much smaller log files than production, as we have far fewer transactions.

So I would say: go for it. Download it and see what it can do for you. You may need to follow a few tutorials before you know how to get the best out of it, but it may well help you a lot in testing.

Coffee machine syndrome

posted 8 Dec 2013, 08:40 by Albert Witteveen

Originally posted June the 13th 2013

Apparently the medieval book "The Secret" by Francesco Petrarch discusses the following assertion:

“The assertion that humans experience misery because they do not sufficiently desire not to do so”.

We don't do something about a situation we don't like because we don't actually want to badly enough. This, in my view, often still applies. Apparently in the Middle Ages they already experienced what I tend to call the coffee machine syndrome.

Now, in the Middle Ages people in Europe did not have coffee machines, let alone coffee. This is why it's called the Dark Ages. But what I call the coffee machine syndrome is the tendency of people to complain, but take no action.

I first started calling this the coffee machine syndrome a long time ago, in one of my first experiences working in a large company. We had just formed a team. The atmosphere was good, teammates got along fine, and for some reason we formed the habit of occasionally hanging out at the coffee machine for a break. And of course we would discuss the situation at work. We were a team of testers, and testers are good at complaining and seeing flaws in things. Maybe that's what draws us into testing. So naturally, we bitched and moaned about everything that was going on in the project, the organisation and sometimes the rest of the world.

Oddly enough, the best complaining was about things in the project, and we were all drawn to it. That was until I naively suggested something weird: namely, that we do something about it.

The project had its flaws, although I now can't even remember what they were. But throughout the project there was a good atmosphere. Nothing was blocking us from discussing this with the project manager and proposing some changes. So that's what I suggested. When I actually did, the faces changed. It was as if I was taking their favorite toy away. If we fix things, then what do we have left to complain about?

After a few 'yeah, but's, the break at the coffee machine ended prematurely with people taking their coffee to their desks to finish it there. I had broken the magic. I ended up actually talking to the project manager. He was responsive and changes were made, including some changes for me, where I got some extra responsibilities. So it seems the assertion we started with, 'people don't want change enough to do something about it', should actually be 'people prefer complaining over improvements, for the sake of complaining'…

Something else I learned quickly afterwards: I am part of this 'people' too. Yes, I took initiative that time, but I too have a tendency to enjoy complaining at the coffee machine. In so many other situations and at other clients I noticed the same thing, including how easy it is to join in. I am no expert, but I believe a large part of this has to do with the built-in human need to feel part of the group. This behaviour is not productive, though. It often serves as an excuse for why your team does not make much progress.

I have even seen cases where it can get you into trouble. Sometimes an entire team thinks of itself as the only ones who 'get it' and they all end up getting the boot. While they were busy complaining in their own circle, others in power started to question their value. Had they acted, the same people in power could have concluded that the team itself had value, but that the issues had to be solved to achieve that value.

So what can you do about it? The most important thing is to be aware of it. Complaining in a group feels good, but it improves neither the issue itself nor your situation. Go ahead and join the group in doing so, but as soon as you start to repeat yourself, ask yourself and the group: so what are we going to do about it? Usually what you can do is come up with a solution and present it to those who will have to approve it.

Be sure you don’t go to them to complain and be negative. Go to them with a positive attitude. If you are just going to complain about issues to them, they will focus on you and not the problem! They will defend, rationalize the issue and try to make you feel more comfortable with the situation. By bringing them a solution and being positive you can get them in the right mindset to focus on the issue instead of handling you. As long as you only complain it is your problem not theirs and their only problem is you.

Why should you do anything about it? In my personal experience it has not only actually changed things for the better, it has gotten me ahead as well. There really are opportunities for most people to get ahead, but no one is going to point them out for you. As soon as you start helping management by providing them with solutions they’ll be more than happy to give you the room to grow.

Context driven testing and Mars rovers

posted 4 Dec 2013, 10:03 by Albert Witteveen   [ updated 4 Dec 2013, 10:03 ]

(originally posted may 25 2013)

I stay out of heated discussions on context-driven testing versus scripted testing and plan to keep it that way. However, if I were put on the spot, I would certainly consider myself context-driven. I never bothered, though, to define why.

A discussion on whether we should put astronauts or rovers on Mars, however, somehow gave me a nice analogy. In the discussion someone gave some nice numbers. Now I don't know if the numbers are exactly true, but I think there is a lot of truth in them.

Here are the numbers: the Mars rover Opportunity has been happily roving on Mars since early 2004 and is still at it. That is over nine years of activity, a great and awesome achievement for NASA. But as the guy in the discussion pointed out: in those nine years it travelled the same distance as the astronauts of the last Moon mission did in a week, and the science performed was equivalent to a month's worth of an astronaut's.

These numbers are probably debatable. But I do believe men and/or women could do a lot more in the same time on Mars.

Mars is a long distance away from us. A radio signal takes between four and twenty minutes to get from Earth to Mars and vice versa. So if an engineer on Earth sees something on the camera that forms a hazard, it happened at least four minutes ago. If the engineer sends the signal to brake, that takes another four minutes. We don't have to do the math here; that is much too long.

If you were to take all steps sequentially, i.e. move ten centimeters and then judge whether the rover should go on or make a bit of a turn, things would take forever.

The answer to this is to carefully plan and 'script' how the rover moves around and what it does. And there is no option to quickly respond to something interesting. If, for instance, it were drilling in some rock and the rock showed some green material, by the time the stop signal arrived the drilling would have gone on for at least eight minutes, and the opportunity to stop and change what you are doing would be gone.

All this means that a lot of work goes into planning what the rover is to do, and I think we can safely say that an astronaut on site could do a lot more. Time is spent on planning and scripting; then, in a relatively short time compared to all that planning, the real actions take place, with no opportunity to act on anomalies.

I am still in awe of NASA's achievement! And of course this is a way-over-the-top comparison with scripted testing. You would assume that a scripted tester does notice things, and stops and acts if an anomaly occurs during a scripted test. It still doesn't quite explain why I prefer context-driven testing, but I do believe in the value of it.

You know, maybe it actually does explain it: ever since I was eleven, in my heart I have always wanted to be an astronaut!

Latency: Trucks, Blondes, Ferraris and Beer

posted 4 Dec 2013, 09:53 by Albert Witteveen   [ updated 4 Dec 2013, 09:53 ]

(originally posted may 19 2013)

"We have a gigabit connection so it can't be the network". Sound familiar? It is odd how even some seasoned IT staff don't know or forget the difference between speed and transfer rate.

So what is the difference? This is best explained by a simple example. Suppose you want to bring beer to someone 10 miles from where you are. What is the fastest way to get the beer there: in a Ferrari or in a truck?
The answer depends on whether you want to bring just a crate or a whole pallet. For the crate the Ferrari is the best option; for the pallet, the truck.

If the Ferrari drives 120 miles per hour, the 10 miles are done in 5 minutes. The truck, driving 60, will take twice as long and deliver in 10 minutes.

Now delivering is one thing. You expect to get something back for the beer, such as cash or a hot blonde to go with the Ferrari. Every Ferrari needs one. So we care about when the Ferrari and the truck return.

Ignoring the time to load and offload, the round trip for the Ferrari takes 10 minutes and the truck takes 20 minutes for a round trip.

So how many cases can each transport? If a pallet has 60 cases and the truck is limited to one pallet, the truck can transport 3×60=180 cases per hour. If the Ferrari can only hold one case, to leave space for the blonde, it can transport 6 cases per hour.

The Ferrari is faster, but the truck can transport an amazing 30 times more in an hour.  
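The arithmetic above can be put into a quick sketch, ignoring loading and offloading time just as the example does:

```python
def cases_per_hour(speed_mph, distance_miles, cases_per_trip):
    """Throughput of a vehicle shuttling beer back and forth."""
    round_trip_hours = 2 * distance_miles / speed_mph  # there and back
    return cases_per_trip / round_trip_hours

ferrari = cases_per_hour(120, 10, 1)   # low latency, low bandwidth
truck = cases_per_hour(60, 10, 60)     # high latency, high bandwidth
```

The Ferrari wins on latency (5 minutes to deliver), the truck on bandwidth (180 versus 6 cases per hour): the two measures are simply different things.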

For network performance we need to keep this in mind. The performance of the network is nearly always given as Xbit/s, like Gb/s or Mb/s. But this only tells us how much data it can transport in a certain timeframe. It does not tell us the speed!

The way to quantify speed is by measuring latency. Latency is the time it takes one data packet to reach its destination. If we use the familiar ping command from a shell or a Windows prompt we see something like this:

64 bytes from icmp_req=3 ttl=249 time=26.3 ms

The time value here is called the round trip time. That's how long it takes to send a data packet to the pinged computer and back again. We remember round trips from the beer example.

So is the analogy with the truck and the car completely correct? Not quite. The difference is that on the network the large packets travel just as fast as the small ones. It is more like a highway where the speed limit is set and every packet travels at exactly that limit. What we need to realise most of all is that speed and bandwidth are not the same.

What does influence speed?

There are several factors that do influence speed. One of the most important is distance. It is not uncommon for international organisations to have systems that connect to and rely on each other while sitting in different countries. Most of the journey for the data will be on fiber optics, which means it travels at close to the speed of light. That is very fast, but still, the round trip for a connection to a machine next door is very different from one to a machine thousands of miles away.
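The distance effect is easy to put a lower bound on. The sketch below uses the common rule of thumb that light travels at roughly two-thirds of its vacuum speed in glass fiber; actual routes are longer than the straight-line distance, so real round trips are higher still:

```python
# Rough speed of light in glass fiber (about two-thirds of c).
SPEED_IN_FIBER_KM_S = 200_000

def min_round_trip_ms(distance_km):
    """Lower bound on the round trip imposed by physics alone,
    ignoring every switch, router and firewall on the path."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_S * 1000

next_door = min_round_trip_ms(0.1)       # a fraction of a millisecond
transatlantic = min_round_trip_ms(6000)  # tens of milliseconds, minimum
```

No amount of extra bandwidth buys that time back.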

Another factor obviously is the underlying network equipment. The quality and performance of the network interface cards, switches, firewalls etc is crucial.

The third factor is other use of the network. You are sending your packets across the network, but other applications are doing the same. When we overload the network, things seem to slow down. The packets, however, travel at the same speed; the speed of light does not all of a sudden slow down. What happens is that if there is too much traffic, packets get dropped. In our car analogy: we put vehicles on the highway, but if there is no room, they simply get dropped.

That is a great way to prevent jams, but in real life you would not want to be in a car that simply disappears. For the network it is not so bad, as there are failsafes in place that ensure that missed packets are sent again. But it takes time before a packet is resent. Packets being dropped (lost) is a major cause of latency. So the Xbit/s does have a relation with speed: if we have a lot of bandwidth, we decrease the chance of overloading and the subsequent packet and speed loss.
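The cost of those retransmissions can be sketched with a deliberately crude model: assume each lost packet costs one retransmission timeout before it is sent again. Real TCP recovery is considerably more involved, so treat the numbers as illustration only:

```python
def expected_delivery_ms(rtt_ms, loss_rate, retransmit_timeout_ms):
    """Crude expected delivery time under packet loss.

    Assumes each loss costs one retransmission timeout; real TCP
    behavior (fast retransmit, backoff) is more complicated.
    """
    # Expected number of retransmissions per delivered packet.
    retries = loss_rate / (1 - loss_rate)
    return rtt_ms + retries * retransmit_timeout_ms

clean = expected_delivery_ms(3, 0.0, 200)   # no loss: just the RTT
lossy = expected_delivery_ms(3, 0.05, 200)  # 5% loss dwarfs the RTT
```

Even modest loss rates make the retransmission timeout, not the raw round trip, the dominant term.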


Why should we care? If I have a high-bandwidth connection with bad latency, the big file I'm downloading still gets here nearly as fast; the few milliseconds' difference are not noticeable to me. For the performance of applications, on the other hand, latency can become a huge issue if there is a lot of back-and-forth communication over a connection.

A recent example demonstrates this well. There was an issue with an application on a development server which took a very long time to start up. On production the startup was fast, whereas on development it took over half an hour. For development that is actually a bigger issue, as they restart much more often than production.

They had some issues tackling this problem and when I let it slip that I dabble with performance issues I was asked to join the group trying to solve this.

What they had already established was that the delay was caused by a single query. When the query was performed from the application server it took half an hour. Performing the same query on the server itself (in this case: logging in to a shell on the server and running the query there) returned the result within seconds. Considering how lightweight the client we used for running the query was, this pointed at the network. But the network seemed fine, with a round trip time of 3 ms. Not a very fast value, but not that bad either. We had other network settings checked, such as whether it was on full duplex, the bandwidth, etc. Everything seemed fine.

Baffled, I performed the test myself. That was a good reminder to be careful with assumptions. The statement that the query took a long time made me think it would take a lot of time before we would see a result. In fact the first results came within seconds, but on the badly performing development server it took a long time to finish. The result was 50,000 records, and they scrolled slowly over the screen.

After some digging and googling we found the cause: the protocol used by the client performing the query was a 'chatty' protocol. It received the result in small chunks, and it got them sequentially. So for 50,000 records, that was a lot of round trips of the client saying "give me the next bit" and getting the next bit. We were able to increase the size of the chunks it retrieved, improving the performance dramatically.
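The effect of the chunk size is simple multiplication. The chunk sizes below are illustrative, not the actual values from the incident:

```python
import math

def network_time_s(rows, fetch_size, round_trip_ms):
    """Time spent purely on round trips by a 'chatty' client that
    fetches the result set in fixed-size chunks, one chunk per trip."""
    round_trips = math.ceil(rows / fetch_size)
    return round_trips * round_trip_ms / 1000

chatty = network_time_s(50_000, 10, 3)    # 5,000 trips -> 15 s
tuned = network_time_s(50_000, 1_000, 3)  # 50 trips -> 0.15 s
```

A hundredfold change in chunk size gives a hundredfold change in time spent on the wire, with a perfectly healthy 3 ms network throughout.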

And remember, this was not even a load test. This was just an SQL query by one user. Imagine this in a load situation.

Latency is important when we have many sequential calls to another system. This can be application-related, where a function does many calls instead of one large call, or, like in the previous example, based on the underlying infrastructure, where there was just one call but the retrieval method was 'chatty'. Many calls running in parallel do not have to be an issue. To sum it up, latency matters when:

  • the volume of requests is high

  • and the requests are dependent on each other (sequential)

The remedy can sometimes be cheap, like changing some settings; it can be complex, like changing your infrastructure to bring the different servers closer together with a dedicated connection; or it can be very expensive, if you need a big change in software to account for the lack of speed.

But most of all, latency in the network cannot be neglected when dealing with performance issues. Not even when dealing with just a single user and just one query!

We don’t like our applications and protocol to be chatty any more than we like our blondes to be.

High school lessons for performance testers

posted 3 Dec 2013, 12:40 by Albert Witteveen   [ updated 3 Dec 2013, 12:40 ]

(originally posted on may 14 2013)

At the latest TestNet event I attended a presentation on the need for calibration of performance test tools. The speaker showed, through tests he and some colleagues had performed, that performance test tools don't behave exactly like real users. For instance, when testing manually the browser didn't open more than 5 TCP/IP connections, whereas the load generator opened nearly 32 sessions. This can make a big difference.

That may not be news, and the tests may not meet high scientific standards of proof, but it was a good thing they showed this aspect of the discrepancy between a load generated artificially by tools and the load generated by real users. They did a good job making this clear.

It was their conclusion, however, that reminded me that performance testers have forgotten some basic lessons we learned in high school math. The reason they investigated the difference between the artificially generated and user-generated load was that they, like so many of us, have been confronted with situations where software passed the performance test but failed in production, as well as the other way around. They used their findings to state that we should 'calibrate' the performance tool. Perhaps we should, but I doubt that doing so would prevent performance testing being way off.

High school math

In 'high school' (it's not called high school where I am from) I did take math classes. And there was something I learned there, which is by itself odd, since the few times that I actually showed up for class I hardly ever paid attention. What I learned is that an answer cannot be more precise than the measurements on which it is based.

As an example: suppose you measure two distances in mm, with an accuracy of at best half a millimeter, and get the following two values: value A: 53, value B: 27.5. If you then divide them, what is the correct answer?

  1. 1.9272727272727272727272727272727
  2. 1.9
  3. 2
  4. Something else

Answer 1 is what you get if you use a calculator to divide the values. If you chose answer 1, it would not only be considered wrong, you would actually get points deducted. The answer may never be more precise than the measurements; the accuracy is never better than that of the least accurate measurement. So if one measurement is accurate to half a millimeter, there is no reason to get the second measurement accurate to the micrometer.
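The rule can be sketched in code. This is a simplistic take on significant figures, good enough for the example above:

```python
import math

def report(value, measurement_precision):
    """Round a computed answer to the number of decimals justified
    by the least precise measurement, instead of echoing whatever
    the calculator produced."""
    decimals = max(0, -math.floor(math.log10(measurement_precision)))
    return round(value, decimals)

# Measured to half a millimeter, 53 / 27.5 should be reported as 1.9,
# not 1.9272727272727272727272727272727.
answer = report(53 / 27.5, 0.5)
```

A performance report could apply the same discipline: round every response time to the precision the measurement setup actually justifies.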

Yet in performance testing this basic knowledge seems forgotten or people don’t realize it applies. In performance testing we are faced with a lot of assumptions that make the base of our test inaccurate:

  • The concurrent users dilemma. What are concurrent users? Is that the number of users that click on something at exactly the same moment? How to define concurrent users is the subject of many debates and at best an educated guess.
  • How do we expect the users to actually use the system, and how many of them? If you ask the project manager of the development team, he will expect that the application will be used once every two hours by some old and very patient granny. If you ask marketing, they will expect that the entire user base of Facebook is anxiously awaiting the new functionality and will open it directly at launch. Reality, however, is often even more bizarre. We just don't know; we just assume some behavior.
  • We often put a particular function under load. In production, however, the other functions get usage at the same time as well. Simulating this is not only hard; we can often only guess at the exact 'mix' of usage. There will be other processes, batch jobs, maintenance jobs, reports being generated, etc.
  • Sometimes, just sometimes, we get a test environment that has the same power as production-to-be. But then we still have some different parameters. The network latency, for instance, often cannot be guaranteed to represent production; test is often in a different network with completely different values. And yes, network latency can have a huge impact.
  • Forecasts on usage are usually based on averages, whereas the biggest thing we usually need to test for is a peak load. Forecasting feasible peaks is difficult and more often than not just wishful thinking.

All in all, we create a test scenario and a load profile based on many assumptions. The more assumptions you have, the less accurately you can predict. So we base our conclusions on a situation that hardly looks like reality, put load on it that does not represent reality in type or quantity, and then: someone concludes 'passed'… In reality we can only report on the risk that the system in production will actually meet or fail generic requirements.

I don't think they overcame the issues mentioned. If they did, I would very much like to see a presentation on how they achieved that!

So while I share their conclusion that we should be aware of the inaccuracy caused by load-generating tools not behaving exactly like reality, I think its accuracy is still orders of magnitude higher than that of other aspects of our test.

It is a bit like someone asking you why the men's room is so smelly. You find out that the mop is so large it doesn't reach the corners. You could advise using a toothbrush for cleaning the men's room. Although your analysis is true, the reason the men's room is so smelly is that men have lousy aim, not the few spots in the corners.

So yes, be aware of the limitations on how the load generated by tools represents real load. Be aware of the possible settings (such as caching, network settings, etc.). You'll find that most tools actually are aware of this and let you control these settings and this behavior.

Cleaning large areas with a toothbrush by the way is not a good idea. That’s another lesson some of us got in high school.
