Wednesday, July 21, 2010

Who (re)moved my 'testing'?

What is the most under-emphasized aspect of IT projects, the one always swept under the carpet without blinking an eyelid, citing reasons like timelines and budget? I am sure everyone would answer in chorus: ‘testing’.  I am not going to recount the importance of testing here, but I would like to express my horror at seeing a lot of testing activities (non-functional ones in particular) ignored in one of the large programs I was recently involved in. To elaborate a little, I was bemused by the tactical, non-exhaustive approaches taken for key activities like performance testing and failover (local and DR) testing, and here I am referring to a really large re-engineering program.

In general, there are multiple reasons for testing to take a back seat in almost all large programs. In most cases the foremost reason is the inability to capture the business SLAs and then map them to system SLAs, i.e. the non-functional requirements (NFRs). For example, a business SLA of ‘orders must be confirmed within two seconds’ only becomes testable once it is translated into system-level targets for response time, throughput and availability. This is where the business-IT gap becomes quite evident! It is very difficult to carry out activities like load, performance, stress and failover testing without the NFRs, which provide the objectives and pass/fail criteria for these kinds of tests.

There are also other reasons, such as:
  • Lack of organizational maturity and no governance framework for IT testing
  • Non-availability of toolsets to automate these tests for some packaged applications
  • Lack of skill-sets to perform these activities and interpret the results
  • Over-dependence on product and IT services vendors, who have historically treated testing as a mundane task

I also feel that there is very little understanding of these non-functional testing aspects.  I am not very choosy about the terms and their definitions; there are people who put load and stress testing under the umbrella of performance testing, and I am fine with that. However, the following aspects need to be tested irrespective of what you call them.
  1. How fast can the system respond to a single user request or perform a single automated activity? Normally I call this performance testing for simplicity (many may disagree). Identify the components that could be contributing to unsatisfactory performance, profile them for possible causes (e.g. a database call taking too long, or a thread waiting for an object lock to be released) and look at ways to improve them, at times revisiting the design if required. This is more of a white-box exercise, unlike load and stress testing, which are kinds of black-box testing. A minimal timing sketch covering this and the next two items appears after this list.
  2. What is the average time taken to complete a specific activity (process a user request) under normal load? This should be well within the latency SLAs; if not, look at steps to improve the design.
  3. What load can the system take without misbehaving while still giving good enough performance? This helps in assessing the stability of the system under peak load. Measuring resource consumption (memory, CPU, IO etc.) and understanding the resource constraints helps us work out how to scale and size the system. It also helps in identifying the system parameters that should be monitored after the system goes live.
  4. Identifying the limits of the system by pushing load beyond the expected peak, sometimes called stress testing
  5. Failover tests for the system, both local and DR (disaster recovery), based on the deployment strategy, architecture and infrastructure
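
To make items 1 to 3 concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: processRequest() is a hypothetical placeholder for whatever operation is under test, and the thread count and request volume are made up. It times a single call in isolation and then measures average/p95 latency and throughput under concurrent load.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LatencyProbe {

    // Hypothetical placeholder for the operation under test
    // (a service call, a database query, a batch step, etc.).
    static void processRequest() throws Exception {
        Thread.sleep(20); // stand-in for real work
    }

    // Item 1: time a single request in isolation.
    static long singleRequestMillis() throws Exception {
        long start = System.nanoTime();
        processRequest();
        return (System.nanoTime() - start) / 1000000L;
    }

    // Items 2 and 3: average/p95 latency and throughput under concurrent load.
    static void underLoad(int threads, int totalRequests) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<Future<Long>>();
        long wallStart = System.nanoTime();
        for (int i = 0; i < totalRequests; i++) {
            futures.add(pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    return singleRequestMillis();
                }
            }));
        }
        List<Long> latencies = new ArrayList<Long>();
        for (Future<Long> f : futures) {
            latencies.add(f.get()); // waits for each request to finish
        }
        pool.shutdown();
        long wallMillis = (System.nanoTime() - wallStart) / 1000000L;
        Collections.sort(latencies);
        long avg = sum(latencies) / latencies.size();
        long p95 = latencies.get((int) (latencies.size() * 0.95) - 1); // naive percentile
        System.out.println("avg: " + avg + " ms, p95: " + p95 + " ms, throughput: "
                + (totalRequests * 1000L / wallMillis) + " req/s");
    }

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("single request: " + singleRequestMillis() + " ms");
        underLoad(20, 1000);
    }
}

In practice a dedicated tool (JMeter, LoadRunner and the like) would drive this with far better controls and reporting, but the basic idea of comparing single-request latency against latency and throughput under load is the same.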

It is quite common for people to identify features and stability as the foremost factors in system or application design, with scalability and availability coming next, followed by other aspects like governance, security and so on. In my opinion this is not true in every situation. You cannot have Facebook-like social applications add feature after feature without making them scalable and available. So the overall testing strategy should be in line with the expectations from the system.
 
Talking of application performance, there is a great post on the top ten performance problems, drawn from Zappos, Monster, Thomson and Co.  I have worked on resolving performance problems in many applications and platforms, mostly written in Java. I have seen platforms like webMethods cache all the services (code) in memory, which results in a huge amount of memory being consumed right after system startup. Of course there are advantages to having the whole object tree loaded in memory to improve performance. However, this is a double-edged sword: more objects in memory means a bigger heap and hence longer GC cycles! Some may say the language is to blame, but that is not exactly true; as mentioned in this interesting post, everything written in C or C++ is not always faster than Java or other languages.
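
To illustrate the heap-size/GC trade-off, here is a minimal sketch; the growing list is just a stand-in for a platform that keeps everything cached in memory, and the 5 MB chunks and iteration count are arbitrary assumptions. It keeps adding objects to a long-lived 'cache' and prints heap usage along with the cumulative GC time reported by the JVM's standard management beans.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class HeapPressureDemo {

    // Stand-in for a platform cache that keeps all its objects resident in memory.
    static final List<byte[]> cache = new ArrayList<byte[]>();

    // Total time spent in GC so far, summed across all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime();
        }
        return total;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Run with enough heap (e.g. -Xmx512m) or reduce the numbers below.
        for (int i = 1; i <= 50; i++) {
            cache.add(new byte[5 * 1024 * 1024]); // grow the live set by ~5 MB
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("live set ~" + (i * 5) + " MB, heap used " + usedMb
                    + " MB, cumulative GC time " + totalGcMillis() + " ms");
        }
    }
}

Watching how the cumulative GC time climbs as the live set grows gives a feel for the cost of keeping everything resident in memory, which is exactly the trade-off described above.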

To conclude, it is extremely important to understand the requirements and design of the application or program (e.g. parallel vs. sequential execution) and the nature of its system resource requirements (CPU, memory, disk, network, etc.). Only then can the right steps to achieve the expected performance be worked out cost-effectively.