Wednesday, July 21, 2010

Who (re)moved my 'testing'?

What is the most under-emphasized aspect of IT projects, the one always swept under the carpet without blinking an eyelid, citing reasons like timeline and budget? I am sure everyone would answer in chorus: 'testing'. I am not going to recite the importance of testing here, but I would like to express my horror at seeing a lot of (non-functional) testing activities ignored in one of the large programs I was recently involved in. To elaborate a little, I was bemused by the tactical, non-exhaustive approaches taken for key activities like performance testing and failover (local and DR) testing, and here I am referring to a really large re-engineering program.

In general, there are multiple reasons for testing to take a back seat in almost all large programs. In most cases the foremost reason is the inability to capture the business SLAs and then map them to system SLAs (NFRs, Non-Functional Requirements). This is where the business-IT gap becomes quite evident! It is very difficult to carry out activities like load, performance, stress and failover testing without the NFRs, which form the objectives and goals for these kinds of tests.

There are also other reasons, such as:
  • Lack of organizational maturity and no governance framework for IT testing
  • Non-availability of toolsets to automate these tests for some packaged applications
  • Lack of skill-sets to perform these activities and interpret the results
  • Over-dependence on product and IT services vendors, who have historically treated testing as a mundane task

I also feel that there is very little understanding of these non-functional testing aspects. I am not very particular about the terms and their definitions; some people put load and stress testing under the umbrella of performance testing, and I am OK with that. However, the following aspects need to be tested irrespective of what you call them.
  1. How fast can the system respond to a single user request or perform a single automated activity? For simplicity I normally call this performance testing (many may disagree). Identify the components that could be contributing to unsatisfactory performance, profile them for possible causes (e.g. a database call taking too long, or a thread waiting for an object lock to be released) and look at ways to improve them, revisiting the design if required. This is more of a white-box exercise, unlike load and stress testing, which are kinds of black-box testing. (A minimal measurement sketch follows this list.)
  2. What is the average time taken to complete a specific activity (process a user request) under normal load? This should be well within the latency SLAs.  If not, look at steps to improve the design.
  3. What load can the system take without misbehaving, while still giving acceptable performance? This helps establish the stability of the system under peak load. Measuring resource consumption (memory, CPU, IO etc.) and understanding the resource constraints tells us how to scale and size the system. It also identifies the system parameters that should be monitored after the system goes live.
  4. Identifying the limits of the system, sometimes called stress testing…
  5. Fail-over tests for the system, both local and DR, based on the deployment strategy, architecture and infrastructure
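To make the first three items concrete, here is a minimal sketch (in Java, since most of my examples are from the Java world) of measuring single-request latency and behaviour under concurrent load. `processRequest()` and the user/request counts are hypothetical placeholders to be replaced with the actual operation and the load profile from your NFRs:

```java
import java.util.*;
import java.util.concurrent.*;

public class LoadProbe {
    // Hypothetical stand-in for the operation under test.
    static void processRequest() throws InterruptedException {
        Thread.sleep(5); // simulate some work
    }

    public static void main(String[] args) throws Exception {
        int users = 50, requestsPerUser = 100; // assumed load profile
        ExecutorService pool = Executors.newFixedThreadPool(users);
        List<Future<long[]>> results = new ArrayList<>();

        for (int u = 0; u < users; u++) {
            results.add(pool.submit(() -> {
                long[] latencies = new long[requestsPerUser];
                for (int i = 0; i < requestsPerUser; i++) {
                    long t0 = System.nanoTime();
                    processRequest();
                    latencies[i] = System.nanoTime() - t0;
                }
                return latencies;
            }));
        }

        List<Long> all = new ArrayList<>();
        for (Future<long[]> f : results)
            for (long l : f.get()) all.add(l);
        pool.shutdown();

        // Average and 95th percentile latency: compare these against the NFRs.
        Collections.sort(all);
        long avg = all.stream().mapToLong(Long::longValue).sum() / all.size();
        long p95 = all.get((int) (all.size() * 0.95));
        System.out.printf("avg=%.2f ms  p95=%.2f ms%n", avg / 1e6, p95 / 1e6);
    }
}
```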

It is quite common for people to identify features and stability as the foremost factors in system or application design, with scalability and availability next, and aspects like governance and security after that. In my opinion this is not true in every situation: a Facebook-like social application cannot keep adding feature after feature without being scalable and available. So the overall testing strategy should be in line with the expectations from the system.
Talking of application performance, there is a great post on the top ten performance problems, drawn from Zappos, Monster, Thomson and Co. I have worked on resolving performance problems in many applications and platforms, mostly written in Java. I have seen platforms like webMethods cache all the services (code) in memory, which results in a huge amount of memory being consumed right after system startup. Of course, having the whole object tree loaded in memory has performance advantages; however, it is a double-edged sword. More objects in memory means a bigger heap, and hence longer GC cycles! Some may blame the language, but that is not exactly true: as mentioned in an interesting post on the subject, code written in C or C++ is not always faster than Java or other languages.
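One common mitigation, rather than letting the whole object tree live in the heap forever, is to bound the cache. Here is a minimal sketch using `LinkedHashMap` as an LRU cache; the entry limit is an assumption you would tune against your heap size and GC behaviour:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simple LRU cache: caps the number of cached entries so the heap
// (and hence GC pause times) stays bounded.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access-order iteration gives LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry
    }
}
```

Something like `new BoundedCache<String, Object>(10_000)` then caps the memory the cache can pin, trading occasional reloads for shorter GC cycles.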

To conclude, it is extremely important to understand the requirements and design of the application (e.g. parallel vs. sequential execution) and the nature of its system resource requirements (CPU, memory, disk, network, etc.). Only then can the right steps to achieve the expected performance be worked out cost-effectively.

Wednesday, May 05, 2010

Lessons from Designing a Two Billion Page View Mobile Application

In his guest post on High Scalability, Jamie Hall, CTO of MocoSpace, detailed some key architectural lessons that can serve as a good guide for designing large-scale enterprise and web applications. Below are my views on some of them.

1. Make your boxes/servers sweat.

In my experience most enterprises have their server resources under-utilized, mostly due to improper or absent capacity planning, over-sizing, and too little emphasis on monitoring utilization. Surprisingly, in spite of the low resource utilization, they still never achieve their performance SLAs. This brings us to the second point that Jamie mentions...

2. Understand where your bottlenecks are in each tier

There is limited understanding of the application and its technologies... for example, is the application CPU-, memory- or IO-intensive? Enterprises that go for COTS applications like SAP very rarely understand the application architecture and internals. Without this knowledge they have to depend blindly on the vendors to do the initial sizing for them, and they are in no position to understand the application behaviour themselves during its life cycle and resolve the associated issues. There is also limited load and performance testing carried out in-house... All this results in extra processing power in machines that actually need more memory, and vice versa!
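As a starting point, the JVM itself exposes enough information to get a first feel for whether an application is CPU- or memory-bound. A rough sketch using the standard management beans (only a first cut; proper profiling and OS-level tools go much further):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Rough first-cut probe: is this JVM CPU-, memory- or IO-bound?
public class ResourceProbe {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        System.out.println("load average : " + os.getSystemLoadAverage());
        System.out.println("heap used    : " + mem.getHeapMemoryUsage().getUsed() / (1024 * 1024) + " MB");
        System.out.println("heap max     : " + mem.getHeapMemoryUsage().getMax() / (1024 * 1024) + " MB");
        // Many threads parked in WAITING/BLOCKED states often hint at IO or lock contention.
        System.out.println("live threads : " + threads.getThreadCount());
    }
}
```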

3. Profile the database religiously.

The database is normally the most critical component of any business application, and most performance issues can be traced back to it. While optimizing databases, apart from profiling the database itself, the focus should also be on caching read-only data in the application layer, database sharding, and alternative data stores (NoSQL key-value stores).
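As an illustration of the caching point, here is a minimal read-through cache sketch: hit the database only on a miss, and keep read-only reference data in the application layer. The loader function is a hypothetical placeholder for your actual query:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Read-through cache for read-only (or rarely changing) reference data:
// the database is queried only on a cache miss.
public class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. one DB query per key

    public ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        // computeIfAbsent invokes the loader once per missing key
        return cache.computeIfAbsent(key, loader);
    }
}
```

A call site would construct it as, say, `new ReadThroughCache<Long, Customer>(id -> loadCustomerFromDb(id))`, where `loadCustomerFromDb` stands in for the real query.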

4. Design to disable.

Hot deployment and the ability to disable rolled-out features through configuration are critical for application life-cycle management. That is where evolving languages like Erlang, which support hot deployment of code, are very promising, despite the fact that there is still some way to go for their enterprise adoption.
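In the absence of Erlang-style hot code swapping, the poor man's version of "design to disable" is a configuration-driven kill switch. A minimal sketch; the property file path and flag names are assumptions:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Configuration-driven kill switch: features can be disabled at runtime
// by flipping a property, without redeploying the application.
public class FeatureFlags {
    private volatile Properties props = new Properties();

    // Re-read the flag file periodically, or on an admin trigger.
    public void reload(String path) throws IOException {
        Properties fresh = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            fresh.load(in);
        }
        props = fresh; // volatile write publishes the new flags atomically
    }

    public boolean isEnabled(String feature) {
        return Boolean.parseBoolean(props.getProperty(feature, "false"));
    }
}

// Call sites then guard each rolled-out feature:
//   if (flags.isEnabled("new.checkout.flow")) { ... } else { ... }
```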

5. Communicate synchronously only when absolutely necessary

Asynchronous communication is the key to identifying failures and error conditions, and thereby to easily managing distributed applications. Yet I still see people finding ways to implement synchronous interfaces between applications.
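A minimal sketch of the asynchronous alternative: the producer hands work to a queue and moves on, so a slow or failed consumer does not stall the caller. In a real system the queue would typically be a broker (JMS and the like); this in-process version just illustrates the decoupling:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Asynchronous hand-off: the producer enqueues work and moves on,
// instead of blocking on a synchronous call to the downstream system.
public class AsyncDispatcher {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);

    public boolean submit(String message) {
        // offer() fails fast when the queue is full rather than blocking the caller
        return queue.offer(message);
    }

    public void startConsumer() {
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take(); // blocks until work arrives
                    process(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    private void process(String msg) { /* hypothetical downstream call */ }
}
```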

6. Think about monitoring during design, not after.

Do have your applications designed for monitoring, and identify the KPIs that need monitoring. Otherwise you will have no way to troubleshoot when you end up with issues in production.
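A bare-bones sketch of what "designed for monitoring" can mean in code: counters for the KPIs are wired in from day one, so there is something to look at when production misbehaves. The metric names are assumptions; a real system would export these via JMX or a monitoring agent:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Bare-bones in-process metrics: count events by name so they can be
// scraped or logged periodically.
public class Metrics {
    private static final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public static void increment(String name) {
        counters.computeIfAbsent(name, k -> new LongAdder()).increment();
    }

    public static long value(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0 : adder.sum();
    }
}

// At design time, instrument the KPIs you will need in production:
//   Metrics.increment("orders.placed");
//   Metrics.increment("payment.failures");
```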

7. Distributed sessions can be a lot of overhead.  

Although distributed session management using technologies like application clustering is common in all server-side applications and tools, it is often a bottleneck for scalability, particularly when you want to scale out. If you can design applications with stateless sessions, or with the session info stored on the client and passed with every request, life becomes much easier. Jamie also advises using sticky sessions, which are nowadays available on all load-balancing appliances.
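A rough sketch of the client-held session idea: the server signs the session payload with an HMAC and hands it to the client, then verifies the signature on each request instead of looking up server-side state. Secret management and token expiry are deliberately left out of the sketch:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Stateless session: the server signs the session data and hands it to the
// client; on each request it verifies the signature instead of looking up
// server-side state. (Secret handling is simplified for this sketch.)
public class StatelessSession {
    private static final byte[] SECRET = "change-me".getBytes(StandardCharsets.UTF_8);

    public static String issue(String sessionData) throws Exception {
        return sessionData + "." + sign(sessionData);
    }

    public static boolean verify(String token) throws Exception {
        int dot = token.lastIndexOf('.');
        if (dot < 0) return false;
        String data = token.substring(0, dot);
        return sign(data).equals(token.substring(dot + 1));
    }

    private static String sign(String data) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(SECRET, "HmacSHA256"));
        return Base64.getUrlEncoder().withoutPadding()
                     .encodeToString(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
    }
}
```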

8. N+1 design.

Prefer an N+1 design over clustering and local failover for the web and app servers.

Finally, a few other things that could be important...

a. Keep it simple... Use the best tool/framework for your requirements with a low CPU and memory footprint... You can compromise on some of the unrealistic non-functional requirements to keep it simple.

b. Design your application to manage failure rather than to avoid failure.

c. Try to leverage client-side processing as much as possible, keeping in mind the browser or other client capabilities and the client devices to be used... Of course, for mobile applications client-side processing should be kept to a minimum.

Thursday, March 04, 2010

Cynicism is not always bad...

“The Paradoxical Success of Aspect-Oriented Programming” by Friedrich Steimann includes a fantastic quote and graphic from an IEEE editorial by James Bezdek in IEEE Transactions on Fuzzy Systems.

Every new technology begins with naive euphoria—its inventor(s) are usually submersed in the ideas themselves; it is their immediate colleagues that experience most of the wild enthusiasm. Most technologies are overpromised, more often than not simply to generate funds to continue the work, for funding is an integral part of scientific development; without it, only the most imaginative and revolutionary ideas make it beyond the embryonic stage. Hype is a natural handmaiden to overpromise, and most technologies build rapidly to a peak of hype. Following this, there is almost always an overreaction to ideas that are not fully developed, and this inevitably leads to a crash of sorts, followed by a period of wallowing in the depths of cynicism. Many new technologies evolve to this point, and then fade away. The ones that survive do so because someone finds a good use (= true user benefit) for the basic ideas.

How true... Without cynicism, the true potential of a technological innovation cannot be discovered...

Note: I ended up on this while reading a post from Dennis Forbes in his defence of SQL/RDBMS.

Wednesday, February 17, 2010

Where not to Cloud?

The latest buzz... Intel's Cloud Chip. It will have 48 cores and, as Intel claims, will deliver 10-20 times the computing power available today...

Many more cloudy things are popping up every day... However, it would be good to know the shortcomings of the cloud and related technologies before jumping onto it like every other chap round the corner...

As rightly pointed out by Gojko Adzic in his excellent post, with cloud platforms all you have is "a bunch of cheap servers with poor IO".

He also mentions the key constraints of cloud deployment:

- All cloud servers are equally unreliable
- All servers are equally impacted by network and IO constraints
- The network is fundamentally unreliable
- There is no fast shared storage

He also offers some fundamental guidelines for cloud deployments:

Partition, partition, partition: avoid funnels or single points of failure. Remember that all you have is a bunch of cheap web servers with poor IO. This will prevent bottlenecks and keep you from scoring an own goal by designing a denial-of-service attack into the system yourself.

Plan on resources not being there for short periods of time. Break the system apart into pieces that work together, but can keep working in isolation at least for several minutes. This will help make the system resilient to networking issues and help with deployment.

Plan on any machine going down at any time. Build in mechanisms for automated recovery and reconfiguration of the cluster. We accept failure in hardware as a fact of life – that’s why people buy database servers with redundant disks and power supplies, and buy them in pairs. Designing applications for cloud deployment simply makes us accept this as a fact with software as well.
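In that spirit, a minimal retry-with-backoff sketch: the caller assumes transient failure and recovers automatically instead of failing the whole operation. The attempt counts and delays are assumptions to tune, and a real system would distinguish transient from permanent errors:

```java
import java.util.concurrent.Callable;

// "Plan on any machine going down at any time": retry transient failures
// with exponential backoff instead of failing the whole operation.
public class Retry {
    public static <T> T withBackoff(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // assume transient; a real system would whitelist error types
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff between attempts
                }
            }
        }
        throw last;
    }
}

// Usage (fetchFromCloudNode is a hypothetical remote call):
//   String result = Retry.withBackoff(() -> fetchFromCloudNode(), 5, 200);
```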