Originally posted July 5th, 2013
Part of my job is to take part in impromptu task forces for complex production issues. A common denominator for these task forces is that the usual analyses have already been done and no solution has been found so far.
Naturally we follow a process: we try to exclude possible causes, look for correlations, check whether anything outside of our direct view has changed, and so on. But most of all, the solution often depends on one of the engineers having a ‘hey, what’s happening there?’ moment.
The way to get to these moments is by digging through log files, monitoring traffic, and checking event logs. What do they look for? Usually, you can’t tell. You know it when you see it.
Does this sound familiar to testers? To me it does: many bugs are found by experienced testers who have these moments. Whilst testing, either scripted or exploratory, they’ll just notice something and dig deeper.
In production we are in the process of introducing a new tool for this kind of analysis. And I must say, I am impressed. The tool, Splunk, lets you feed all sorts of data into it. Usually that means unstructured log files, or event logging from databases, middleware and so on. Once this has been set up, you can very easily drill down through the information, correlate between different systems, and more. The product claims to provide answers to questions you did not even know you had.
And indeed it does. We used it for some hard-to-tackle issues, and it was amazing how quickly we noticed that one server was showing dramatically different behavior; we dug deeper and found the root cause. It is a bit too technical to explain what it was, but as soon as we noticed it there was an immediate ‘but of course’ moment.
We could have found the issue manually; the particular log files were not loaded into Splunk for nothing. But what made it special is that the moment we loaded these files into the system, the issue practically stared us in the face.
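The kind of check that made the deviant server jump out can be sketched in a few lines. This is only an illustration of the idea, not how Splunk works internally; the log format and server names below are made up for the example.

```python
from collections import Counter

# Hypothetical log lines in the form "<timestamp> <server> <level> <message>"
# (illustrative sample data, not the actual production logs).
log_lines = [
    "2013-07-01T10:00:01 app01 INFO request handled",
    "2013-07-01T10:00:02 app02 ERROR connection reset",
    "2013-07-01T10:00:03 app01 INFO request handled",
    "2013-07-01T10:00:04 app02 ERROR connection reset",
    "2013-07-01T10:00:05 app03 INFO request handled",
    "2013-07-01T10:00:06 app02 ERROR connection reset",
]

errors_per_server = Counter()
for line in log_lines:
    timestamp, server, level, message = line.split(" ", 3)
    if level == "ERROR":
        errors_per_server[server] += 1

# A server with far more errors than its peers stands out immediately.
for server, count in errors_per_server.most_common():
    print(server, count)
```

A tool like Splunk does this aggregation for you across millions of lines, which is exactly why the outlier stared us in the face the moment the files were loaded.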
In our regular monitoring we did not see this, simply because we weren’t looking for it. No matter how much monitoring you apply, to a large degree it is always aimed at issues you have already encountered or anticipated you might get.
Another powerful feature is that it holds the logging of other systems as well, and everything in it is properly timestamped. If you see errors in your middleware system and at the same moment a lot of database errors popping up, chances are they are related. Correlation does not imply causation, but... (don’t forget the mouseover)
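The value of shared timestamps is easy to demonstrate. The sketch below pairs up errors from two systems that occur within the same short window; the system names, times, and the ten-second window are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical error timestamps pulled from two separate systems.
middleware_errors = [
    datetime(2013, 7, 1, 10, 15, 2),
    datetime(2013, 7, 1, 11, 40, 0),
]
database_errors = [
    datetime(2013, 7, 1, 10, 15, 4),
    datetime(2013, 7, 1, 14, 5, 30),
]

# Errors from both systems landing within this window are worth a look.
window = timedelta(seconds=10)

correlated = [
    (m, d)
    for m in middleware_errors
    for d in database_errors
    if abs(m - d) <= window
]

for m, d in correlated:
    print("middleware", m, "<-> database", d)
```

This is a crude pairwise version of what a timestamp-aligned view gives you for free: the 10:15 errors line up, the others don’t, and that alignment is what prompts you to dig deeper.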
For engineers, not much is more fun than a new toy, and we are now enthusiastically going through the systems to see what we can find. Not all errors in production will lead to incidents, but they do have impact; getting them fixed will improve the user experience. Possibly we can find performance improvements this way as well.
What does this have to do with testing?
Well, we are often involved in testing complex systems, and those systems have the same kind of logging. Since, in my view, the same thought process applies to finding the less obvious issues, the same tool can most likely help find issues in test in a similar way. The more stable new software gets, the more interested I would be in the errors that still show up in the logging.
The dashboard options of the tool also provide an easy way to monitor business processes end to end. That same monitoring in test can (if well interpreted!) provide some measure of stability. It can even provide some basic information on test coverage, such as the number of transactions, or whether we saw every message type pass by or need to create an extra case. All in all, I see possibilities.
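That message-type coverage check boils down to a set difference. A minimal sketch, with invented message-type names standing in for whatever your system actually exchanges:

```python
# Hypothetical message types we expect the system under test to handle,
# versus the types actually observed in the test-run logging.
expected_types = {"OrderCreated", "OrderShipped", "OrderCancelled", "PaymentReceived"}
observed_types = {"OrderCreated", "OrderShipped", "PaymentReceived"}

# Types that never passed by during the test run: each one suggests
# an extra test case is still needed.
missing = expected_types - observed_types
if missing:
    print("Not yet covered:", sorted(missing))
```

On a dashboard this becomes a live list that shrinks as the test run progresses, which is a cheap but tangible coverage signal.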
The biggest financial constraint on introducing Splunk is that if you process more than 500 MB of log files per day, you need a paid license. Less than 500 MB per day is free. Fortunately, in test we often have much smaller log files than in production, as we have far fewer transactions.