Wednesday, July 21, 2010

Who (re)moved my 'testing'?

What is the most under-emphasized aspect of IT projects, always swept under the carpet without blinking an eyelid, citing reasons like timelines and budgets? I am sure everyone would answer in chorus that it is ‘testing’. I am not going to cite here the importance of testing, but I would like to express my horror at seeing a lot of (non-functional) testing activities ignored in one of the large programs I was recently involved in. To elaborate a little, I was bemused by the tactical, non-exhaustive approaches taken for key activities like performance testing and failover (local and DR) testing, and here I am referring to a really large re-engineering program.

In general, there are multiple reasons for testing to take a back seat in almost all large programs. In most cases the foremost reason is the inability to capture the business SLAs and then map them to system SLAs (NFRs, or Non-Functional Requirements). This is where the business-IT gap becomes quite evident! It is very difficult to carry out activities like load, performance, stress and failover testing without the NFRs, which form the objectives and goals for these kinds of tests.

There are also other reasons, such as:
  • Lack of organizational maturity and no governance framework for IT testing
  • Non-availability of toolsets to automate these tests for some packaged applications
  • Lack of skill-sets to perform these activities and interpret the results
  • Over-dependence on product and IT services vendors who historically treat testing as a mundane task.

I also feel that there is very little understanding of these non-functional testing aspects.  I am not very choosy about the terms and their definitions. There are people who define load and stress testing under the umbrella of performance testing. I am ok with it; however, the following aspects need to be tested irrespective of what you call them.
  1. How fast can a system respond to a single user request or perform a single automated activity? For simplicity I normally call this performance testing (many may disagree). Identify the components that could be contributing to the unsatisfactory performance, profile them for possible causes (e.g., a database call taking too long, or a thread waiting for an object lock to be released) and look at possible ways to improve them, revisiting the design if required. This is more of a white-box exercise, unlike load and stress testing, which are kinds of black-box testing.
  2. What is the average time taken to complete a specific activity (process a user request) under normal load? This should be well within the latency SLAs; if not, look at steps to improve the design. (A minimal measurement sketch follows this list.)
  3. What is the load the system can take without misbehaving while still giving good enough performance? This helps in assessing the stability of the system under peak load. Measuring resource consumption (memory, CPU, IO, etc.) and understanding the resource constraints helps us work out how to scale and size the system. It also helps in identifying the system parameters that should be monitored after the system goes live.
  4. Identifying the limits of the system, sometimes called stress testing…
  5. Fail-over tests for the system, both local and DR, based on the deployment strategy, architecture and infrastructure.
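
As a rough illustration of points 2 and 3, here is a minimal load-measurement sketch. It assumes a hypothetical processRequest() standing in for the real operation under test, and simply reports average and 95th-percentile latency for a given number of concurrent users; a real exercise would use a proper load-testing tool and measure against the agreed NFRs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

// Minimal load-test sketch: fire N concurrent "users" at a hypothetical
// operation and report average and 95th-percentile latency. processRequest()
// is a stand-in for whatever the system under test actually does.
public class LatencyProbe {

    static void processRequest() throws InterruptedException {
        Thread.sleep(20); // placeholder for the real call (HTTP, DB, service...)
    }

    public static void main(String[] args) throws Exception {
        int users = 50, requestsPerUser = 100;
        ExecutorService pool = Executors.newFixedThreadPool(users);
        List<Future<List<Long>>> results = new ArrayList<>();

        for (int u = 0; u < users; u++) {
            results.add(pool.submit(() -> {
                List<Long> latencies = new ArrayList<>();
                for (int i = 0; i < requestsPerUser; i++) {
                    long start = System.nanoTime();
                    processRequest();
                    latencies.add((System.nanoTime() - start) / 1_000_000); // ms
                }
                return latencies;
            }));
        }

        List<Long> all = new ArrayList<>();
        for (Future<List<Long>> f : results) all.addAll(f.get());
        pool.shutdown();

        Collections.sort(all);
        long avg = all.stream().mapToLong(Long::longValue).sum() / all.size();
        long p95 = all.get((int) (all.size() * 0.95) - 1);
        System.out.printf("requests=%d avg=%dms p95=%dms%n", all.size(), avg, p95);
    }
}
```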

It is quite common for people to identify features and stability as the foremost factors in system or application design, followed by scalability and availability, and then other aspects like governance, security and so on. In my opinion this is not true in every situation. You cannot have Facebook-like social applications add feature after feature without making them scalable and available. So the overall testing strategy should be in line with the expectations from the system.
 
Talking of application performance, there is a great post on the top ten performance problems taken from Zappos, Monster, Thomson and Co. I have worked on resolving performance problems of many applications and platforms, mostly written in Java. I have seen platforms like webMethods always cache the services (code) in memory. This results in a huge amount of memory being consumed right after system startup. Of course there are advantages to having the whole object tree loaded in memory to improve performance. However, this is a double-edged sword: more objects in memory means a bigger heap and hence longer GC cycles! Some may say it's the language to blame. However, this is not exactly true; as mentioned in an interesting post, everything written in C or C++ is not always faster than Java or other languages.
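
To illustrate the trade-off (and this is only a sketch, not how webMethods or any particular product actually manages its cache), a size-bounded LRU cache keeps the hot entries in memory while capping heap growth and, with it, GC pause times:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the heap/GC trade-off: instead of caching every compiled service
// or object tree for the life of the JVM, cap the cache so the heap (and GC
// pause times) stay bounded. Illustrative only.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // access-order = LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used entry
    }

    public static void main(String[] args) {
        BoundedCache<String, byte[]> cache = new BoundedCache<>(1000);
        for (int i = 0; i < 10_000; i++) {
            cache.put("service-" + i, new byte[64 * 1024]); // ~64 KB payloads
        }
        // Only the most recent 1000 entries are retained, so this cache stays
        // around 64 MB rather than growing without bound.
        System.out.println("cached entries: " + cache.size());
    }
}
```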

To conclude, it is extremely important to understand the requirements and design of the application or program (e.g., parallel or sequential execution) and the nature of its system resource requirements (CPU, memory, disks, network, etc.). Only then can the right steps to achieve the expected performance be worked out cost-effectively.

Wednesday, May 05, 2010

Lessons from Designing a Two Billion Page View Mobile Application

In his guest post on High Scalability, Jamie Hall, CTO of MocoSpace, detailed some key architectural lessons that can serve as a good guide for designing large-scale enterprise and web applications. Below are my views on some of them.

1. Make your boxes/servers sweat.

In my experience, most enterprises have their server resources under-utilized. This is mostly due to improper or no capacity planning, over-sizing, and too little emphasis on monitoring utilization. Surprisingly, in spite of very low resource utilization, they still never achieve their performance SLAs. This brings us to the second point that Jamie mentions...

2. Understand where your bottlenecks are in each tier

There is limited understanding of the application and its technologies... for example, is the application CPU, memory or IO intensive? Enterprises going for COTS applications like SAP very rarely understand the application architecture and internals. Without this knowledge they have to depend blindly on the vendors to do the initial sizing for them, and are in no position to understand the application behaviour themselves during the application life-cycle and resolve the associated issues. There is also limited load and performance testing carried out in house... All this results in extra processing power on machines that need more memory, and vice versa!
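
Even without deep knowledge of a vendor product's internals, a first-cut answer to "where is the bottleneck?" can come from sampling basic resource metrics while a representative load runs. A minimal JVM-side sketch (illustrative only; real work would use proper profiling/APM tooling):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.OperatingSystemMXBean;

// Sketch: sample heap usage and system load average while a workload runs,
// to get a rough feel for whether the application is CPU, memory or IO bound.
public class ResourceSampler {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        for (int i = 0; i < 10; i++) {
            long usedMb = mem.getHeapMemoryUsage().getUsed() / (1024 * 1024);
            double load = os.getSystemLoadAverage(); // -1 on platforms that lack it
            System.out.printf("heap=%dMB loadAvg=%.2f cores=%d%n",
                    usedMb, load, os.getAvailableProcessors());
            Thread.sleep(1000); // sample once a second while the load test runs
        }
    }
}
```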

3. Profile the database religiously.

The database is normally the most critical component of any business application, and performance issues can very often be traced back to it. While optimizing databases for performance, apart from profiling the database itself, the focus should also be on caching read-only data in the application layer, database sharding and alternative data stores (NoSQL key-value stores).
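
As a small illustration of the application-layer caching point, here is a minimal read-through cache; loadFromDatabase() is a hypothetical stand-in for the real, expensive query:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of caching read-only reference data in the application layer so that
// repeated lookups never hit the database.
public class ReadThroughCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;

    public ReadThroughCache(Function<String, String> loader) {
        this.loader = loader;
    }

    public String get(String key) {
        // computeIfAbsent loads the value on first access, then serves it from memory
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        ReadThroughCache countries = new ReadThroughCache(ReadThroughCache::loadFromDatabase);
        System.out.println(countries.get("IN")); // hits the "database"
        System.out.println(countries.get("IN")); // served from the cache
    }

    // Hypothetical expensive lookup standing in for a real query.
    private static String loadFromDatabase(String code) {
        System.out.println("expensive DB lookup for " + code);
        return "IN".equals(code) ? "India" : "Unknown";
    }
}
```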

4. Design to disable.

Hot deployment and the ability to disable rolled-out features through configuration are critical for application life-cycle management. That's where evolving languages like Erlang, which provide hot deployment of code, are very promising, in spite of the fact that there is still some way to go for their enterprise adoption.
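
A minimal sketch of the "design to disable" idea: features are guarded by flags read from an external configuration file, so a problematic feature can be switched off without redeploying. The file path and flag names here are made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Features are toggled from configuration rather than code, so disabling a
// rolled-out feature is an operational change, not a redeployment.
public class FeatureFlags {

    private final Properties flags = new Properties();

    public FeatureFlags(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            flags.load(in);
        }
    }

    public boolean isEnabled(String feature) {
        // default to "off" so a missing flag never surprises you in production
        return Boolean.parseBoolean(flags.getProperty(feature, "false"));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path and flag name, purely for illustration.
        FeatureFlags features = new FeatureFlags("/etc/myapp/features.properties");
        if (features.isEnabled("new.checkout.flow")) {
            System.out.println("running new checkout flow");
        } else {
            System.out.println("running old checkout flow");
        }
    }
}
```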

5. Communicate synchronously only when absolutely necessary

Asynchronous communication is key to isolating failures and error conditions, and thereby to managing distributed applications easily. Yet I still see people find ways to implement synchronous interfaces between applications.
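
A toy sketch of the idea: work is handed off through a queue so the producer never blocks on the consumer, and a slow or failed consumer does not stall the caller. In a real system the in-memory queue would of course be a message broker.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Asynchronous, queue-based hand-off between two components.
public class AsyncHandoff {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

        // Consumer: processes orders at its own pace.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String order = queue.take();
                    System.out.println("processing " + order);
                    Thread.sleep(200); // simulate slow downstream work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: hands off work and immediately moves on.
        for (int i = 1; i <= 5; i++) {
            queue.put("order-" + i);
            System.out.println("submitted order-" + i);
        }
        Thread.sleep(2000); // give the consumer time to drain before exiting
    }
}
```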

6. Think about monitoring during design, not after.

Do have your applications designed for monitoring. Identify the KPIs that need monitoring; otherwise you will have no way to troubleshoot when you end up with issues in production.
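
As a sketch of what "designed for monitoring" can mean at the code level, the application can increment named KPI counters as part of its normal flow and expose them to whatever monitoring system is in place. The KPI names here are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Simple KPI registry: instrumented code paths bump counters; a scheduled
// job or JMX endpoint would publish them periodically.
public class Kpis {

    private static final Map<String, LongAdder> COUNTERS = new ConcurrentHashMap<>();

    public static void increment(String kpi) {
        COUNTERS.computeIfAbsent(kpi, k -> new LongAdder()).increment();
    }

    public static void report() {
        COUNTERS.forEach((name, count) ->
                System.out.println(name + " = " + count.sum()));
    }

    public static void main(String[] args) {
        // Counters incremented as part of the normal application flow.
        Kpis.increment("orders.received");
        Kpis.increment("orders.received");
        Kpis.increment("orders.failed");

        Kpis.report();
    }
}
```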

7. Distributed sessions can be a lot of overhead.  

In spite of the fact that distributed session management using technologies like application clustering is common across server-side applications and tools, it is often a bottleneck for scalability, particularly when you want to scale out. If you can design applications with stateless sessions, or with the session information stored on the client and passed with every request, life becomes much easier. Jamie also advises using sticky sessions, which are nowadays available with most load-balancing appliances.
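
One way to keep session state on the client (a simplified sketch, not a production recipe) is to issue an HMAC-signed token that any server in the farm can verify on the next request, so no shared or replicated session store is needed:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch of a client-side session: the server issues a token containing the
// session data plus an HMAC signature; any server can verify it statelessly.
// Key handling and payload format are deliberately simplified.
public class StatelessSession {

    private static final byte[] KEY = "change-this-secret-key".getBytes(StandardCharsets.UTF_8);

    static String issueToken(String sessionData) throws Exception {
        String payload = Base64.getUrlEncoder()
                .encodeToString(sessionData.getBytes(StandardCharsets.UTF_8));
        return payload + "." + sign(payload);
    }

    static String verifyToken(String token) throws Exception {
        String[] parts = token.split("\\.");
        // Note: a real implementation would use a constant-time comparison.
        if (parts.length != 2 || !sign(parts[0]).equals(parts[1])) {
            throw new SecurityException("tampered or malformed session token");
        }
        return new String(Base64.getUrlDecoder().decode(parts[0]), StandardCharsets.UTF_8);
    }

    private static String sign(String payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KEY, "HmacSHA256"));
        return Base64.getUrlEncoder()
                .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        String token = issueToken("userId=42;role=customer");
        System.out.println("cookie value: " + token);
        System.out.println("verified on next request: " + verifyToken(token));
    }
}
```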

8. N+1 design.

Use an N+1 design rather than clustering and local failover for the web and app servers.

Finally, a few other things that could be important are..


a. Keep it simple... Use the best tool/framework for your requirement that has a low CPU and memory footprint... You can compromise on some of the unrealistic non-functional requirements to keep it simple.

b. Design your application to manage failure rather than to avoid failure.

c. Try and leverage client-side processing as much as possible, keeping in mind the browser or other client capabilities and the client devices to be used... Of course, for mobile applications client-side processing should be kept to a minimum.

Thursday, March 04, 2010

Cynicism is not always bad...

“The Paradoxical Success of Aspect-Oriented Programming” by Friedrich Steimann includes a fantastic quote and graphic from an IEEE editorial by James Bezdek in IEEE Transactions on Fuzzy Systems.

Every new technology begins with naive euphoria—its inventor(s) are usually submersed in the ideas themselves; it is their immediate colleagues that experience most of the wild enthusiasm. Most technologies are overpromised, more often than not simply to generate funds to continue the work, for funding is an integral part of scientific development; without it, only the most imaginative and revolutionary ideas make it beyond the embryonic stage. Hype is a natural handmaiden to overpromise, and most technologies build rapidly to a peak of hype. Following this, there is almost always an overreaction to ideas that are not fully developed, and this inevitably leads to a crash of sorts, followed by a period of wallowing in the depths of cynicism. Many new technologies evolve to this point, and then fade away. The ones that survive do so because someone finds a good use (= true user benefit) for the basic ideas.

How true... Without cynicism, the true potential of a technological innovation cannot be discovered...

Note: I ended up on this while reading a post from Dennis Forbes in his defence of SQL/RDBMS.

Wednesday, February 17, 2010

Where not to Cloud?

Latest buzz... Intel's Cloud Chip. It will have 48 cores and will increase the computing power available today by 10-20 times (as quoted by Intel)...

Many more cloudy things are popping up every day... However, it would be good to know the shortcomings of the cloud and related technologies before jumping onto it like every other chap round the corner...

As rightly pointed out by Gojko Adzic in his excellent post, with cloud platforms all you have is "a bunch of cheap servers with poor IO".

He also mentions the key constraints of cloud deployment:

- All cloud servers are equally unreliable
- All servers will be equally impacted by network and IO constraints
- The network is fundamentally unreliable
- There is no fast shared storage

He then suggests some fundamental guidelines for cloud deployments:

Partition, partition, partition: avoid funnels or single points of failure. Remember that all you have is a bunch of cheap web servers with poor IO. This will prevent bottlenecks and stop you from scoring an own goal by designing a denial-of-service attack into your own system.

Plan on resources not being there for short periods of time. Break the system apart into pieces that work together, but can keep working in isolation at least for several minutes. This will help make the system resilient to networking issues and help with deployment.

Plan on any machine going down at any time. Build in mechanisms for automated recovery and reconfiguration of the cluster. We accept failure in hardware as a fact of life – that’s why people buy database servers with redundant disks and power supplies, and buy them in pairs. Designing applications for cloud deployment simply makes us accept this as a fact with software as well.
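
Picking up the "automated recovery" point in that last guideline, here is a minimal sketch (with made-up thresholds) of absorbing short outages by retrying with exponential backoff before escalating to failover or alerting:

```java
import java.util.concurrent.Callable;

// Wrap calls to other nodes in a retry with exponential backoff so short
// outages are absorbed automatically instead of cascading into failures.
public class RetryWithBackoff {

    static <T> T call(Callable<T> operation, int maxAttempts) throws Exception {
        long delayMs = 200;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and let a higher layer fail over or alert
                }
                System.out.println("attempt " + attempt + " failed: " + e.getMessage()
                        + ", retrying in " + delayMs + "ms");
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky remote call: fails twice, then succeeds.
        int[] calls = {0};
        String result = call(() -> {
            if (++calls[0] < 3) throw new RuntimeException("node unreachable");
            return "ok";
        }, 5);
        System.out.println("result: " + result);
    }
}
```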

Tuesday, October 27, 2009

Application Design or Hosting Strategy.. What should be addressed first?

Larry O'Brien recently interviewed three of the Gang of Four (GoF) on the applicability of design patterns to application design after 15 years. The consensus among the authors was that these patterns are more or less associated with object-oriented languages like C++, Java, Smalltalk and C#. Some of the current languages have different ways of solving the same problems (e.g., functional languages have a different set of design principles/patterns). It makes a lot of sense to understand the different ways to resolve a problem within the constraints before jumping onto something. Constraints can be of any nature (the language of choice, deployment options, available computing resources, etc.).

I am at present working on a solution (a transformation project) where the vendor-packaged applications and their technologies more or less decide the deployment architecture, sizing and infrastructure requirements. There are cases where virtualization of servers can add up to 50% overhead on the server infrastructure. So the question is: do you decide on the deployment/hosting strategy first (where and how you want to deploy your application) before designing it, or do you design the application and then decide the deployment strategy and infrastructure requirements?

With new paradigms in computing emerging day by day (e.g., cloud, grid and space-based architectures, REST), applications can now be designed based on how you plan to host them (i.e., what the most cost-effective way of deploying them is). However, you are bound to fixed application designs when you are using packaged applications (most business application vendors like SAP and Oracle are still largely in the standard client-server or three-tier architecture space) and cannot do much about it, as in my current project.

Normally, infrastructure and operations are an afterthought, with no consideration given to them during application design. However, future trends point towards using the existing/available infrastructure options and operations requirements to help drive the application design, thereby closing the gap between apps and ops in an organization.

Sunday, September 06, 2009

Amazon Virtual Private Cloud - A Silver Lining in the Cloud!

Cloud as a technology is gathering momentum. It is quite an onerous job to keep track of the developments every day, with cloud service providers mushrooming by the minute and lots of venture capitalists throwing their weight behind them. It is not uncommon for the skeptics to expect a 'cloud burst' in the times to come.

Who does not want to be at the center of attention? Every vendor has thrown a substantial amount of its R&D budget at cloud offerings and research. There have been efforts by a number of organizations to 'standardize the cloud' with their own versions of standardization requirements around cloud resource definition, cloud federation, cloud interoperability et al. There are also a number of ongoing efforts, including by the US Government, to create communities and de-facto standards for cloud computing.

In spite of so much hype around the technology, many vendors have been making efforts to turn the cloud into a feasible alternative for enterprises. In my opinion, Amazon's latest effort around the Virtual Private Cloud (VPC), which allows customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work, is a step in the right direction.

Elasticity and pay-as-you-go are the two key requirements for any cloud platform. Until cloud platforms can truly prove themselves as extensions of an enterprise's existing data centers, leveraging the existing investments in tools and technologies, every IT decision maker has the difficult task of selling them to all stakeholders. Amazon CTO Werner Vogels has a good post introducing Amazon VPC.

Introducing Amazon Virtual Private Cloud

We have developed Amazon Virtual Private Cloud (Amazon VPC) to allow our customers to seamlessly extend their IT infrastructure into the cloud while maintaining the levels of isolation required for their enterprise management tools to do their work.

With Amazon VPC you can:

  • Create a Virtual Private Cloud and assign an IP address block to the VPC. The address block needs to be a CIDR block such that it will be easy for your internal networking to route traffic to and from the VPC instance. These are addresses you own and control, most likely as part of your current datacenter addressing practice.
  • Divide the VPC addressing up into subnets in a manner that is convenient for managing the applications and services you want to run in the VPC.
  • Create a VPN connection between the VPN Gateway that is part of the VPC instance and an IPSec-based VPN router on your own premises. Configure your internal routers such that traffic for the VPC address block will flow over the VPN.
  • Start adding AWS cloud resources to your VPC. These resources are fully isolated and can only communicate to other resources in the same VPC and with those resources accessible via the VPN router. Accessibility of other resources, including those on the public internet, is subject to the standard enterprise routing and firewall policies

Amazon VPC offers customers the best of both the cloud and the enterprise managed data center:

  • Full flexibility in creating a network layout in the cloud that complies with the manner in which IT resources are managed in your own infrastructure.
  • Isolating resources allocated in the cloud by only making them accessible through industry standard IPSec VPNs.
  • Familiar cloud paradigm to acquire and release resources on demand within your VPC, making sure that you only use those resources you really need.
  • Only pay for what you use. The resources that you place within a VPC are metered and billed using the familiar pay-as-you-go approach at the standard pricing levels published for all cloud customers. The creation of VPCs, subnets and VPN gateways is free of charge. VPN usage and VPN traffic are also priced at the familiar usage based structure
  • All the benefits from the cloud with respect to scalability and reliability, freeing up your engineers to work on things that really matter to your business.
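
Going back to the addressing and subnet steps in the quoted list above, here is a trivial sketch of carving a made-up 10.1.0.0/16 VPC block into /24 subnets, one per tier; it is pure address arithmetic and nothing AWS-specific:

```java
// Illustration of planning subnets within a VPC address block: a /16 block
// (addresses are invented) divided into /24 subnets, e.g. one per tier.
public class SubnetPlan {
    public static void main(String[] args) {
        String vpcBlock = "10.1"; // the /16 prefix assigned to the VPC (hypothetical)
        int subnetsNeeded = 4;    // e.g. web, app, db, management
        for (int i = 0; i < subnetsNeeded; i++) {
            // each /24 gives 256 addresses (254 usable hosts)
            System.out.printf("subnet %d: %s.%d.0/24 (%s.%d.1 - %s.%d.254)%n",
                    i + 1, vpcBlock, i, vpcBlock, i, vpcBlock, i);
        }
    }
}
```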

Friday, May 08, 2009

Cloud Ecosystem - US Federal View

Peter Mell and Tim Grance of the National Institute of Standards and Technology, Information Technology Laboratory, have put forward the following definition of cloud computing in the draft NIST Definition of Cloud Computing. This is the most exhaustive cloud definition I have seen to date.

Definition of Cloud Computing:

Cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is comprised of five key characteristics, three delivery models, and four deployment models.

Key Characteristics:

· On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed without requiring human interaction with each service’s provider.
· Ubiquitous network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
· Location independent resource pooling. The provider’s computing resources are pooled to serve all consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. The customer generally has no control or knowledge over the exact location of the provided resources. Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
· Rapid elasticity. Capabilities can be rapidly and elastically provisioned to quickly scale up and rapidly released to quickly scale down. To the consumer, the capabilities available for rent often appear to be infinite and can be purchased in any quantity at any time.
· Pay per use. Capabilities are charged using a metered, fee-for-service, or advertising based billing model to promote optimization of resource use. Examples are measuring the storage, bandwidth, and computing resources consumed and charging for the number of active user accounts per month. Clouds within an organization accrue cost between business units and may or may not use actual currency.
· Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.

Delivery Models:

· Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure and accessible from various client devices through a thin client interface such as a Web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure, network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
· Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created applications using programming languages and tools supported by the provider (e.g., java, python, .Net). The consumer does not manage or control the underlying cloud infrastructure, network, servers, operating systems, or storage, but the consumer has control over the deployed applications and possibly application hosting environment configurations.
· Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to rent processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly select networking components (e.g., firewalls, load balancers).

Deployment Models:

· Private cloud. The cloud infrastructure is owned or leased by a single organization and is operated solely for that organization.
· Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations).
· Public cloud. The cloud infrastructure is owned by an organization selling cloud services to the general public or to a large industry group.
· Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (internal, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting).

Each deployment model instance has one of two types: internal or external. Internal clouds reside within an organizations network security perimeter and external clouds reside outside the same perimeter.