Archive for ‘Performance Engineering’

June 13, 2011

Symantec.Cloud Review

I recently had the opportunity to attend a Symantec conference (Enterprise 2011). I was mostly attracted by the 45-minute session on cloud computing, which was scheduled towards the end of the half-day conference.

I waited with eager anticipation, as I knew nothing about Symantec’s cloud offerings. Being an ex-Veritas File System developer with a long association with Symantec, I was naturally curious.

As the hour approached, the presenter went on to unveil a set of Symantec products offered via the SaaS model (mostly concerned with data security and availability – full list here http://www.symantec.com/business/theme.jsp?themeid=symantec-cloud).

I was disappointed, to say the least, at this attempt to pass off SaaS as cloud computing. I was hoping to learn of some cool new storage/compute virtualization story (Symantec and VMware are buddies) tied together with utility computing, with security thrown into the mix. Alas, no such thing.

Something did not add up in my own mind after that presentation. I was not sure why I should be disappointed that “SaaS is not the same as Cloud Computing” (A similar debate is going on concerning Apple’s iCloud).

SaaS – Cloud Computing? It is and It isn’t.

It is difficult to find a precise and widely accepted definition of what cloud computing is and what it isn’t. However, there are some cloud computing guarantees (well, promises) that are generally well accepted. Some of these are –

  1. Availability – promises on service availability (five-nines availability, for example)
  2. Scalability – promises on how well the service scales (horizontally and vertically)
  3. Utility Computing Based Billing – pay as you go
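
To put an availability promise in perspective, it helps to convert the percentage into the downtime it permits per year. The sketch below does that arithmetic; the function name and the 365-day year are assumptions made for illustration, not part of any provider's SLA.

```python
# Downtime budget implied by an availability promise.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime (in seconds) permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 100.0):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target:>7}% -> {minutes:10.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year, while a 100% promise leaves no downtime budget at all.
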

My own first thought when someone mentions cloud computing is of the cloud computing infrastructure: the hardware and operating system software, sans the (business logic) applications. Examples are Google App Engine, Amazon EC2 and Microsoft’s Azure.

Applications written to run on such infrastructure are expected to exhibit availability and scalability properties.

It is interesting to note that Symantec promises 100% service availability and a guaranteed latency (from which you can draw inferences about the scalability). I suppose from a service consumer’s point of view this is all that matters.

It is anybody’s guess as to what kind of infrastructure powers Symantec.Cloud, and consumers should not be unduly concerned. They should instead focus on the service availability and response times, which are very good indeed.

April 18, 2011

competitive analysis

Comparing performance is a game of doing an apples-to-apples comparison and yet making your system look better. The simplest trick is to choose hardware or software that favors your system, and then use the same platform for the competing system. Under identical conditions, using the same benchmark, your system would obviously perform better. If you look at the experiment as a whole, it’s a clean apples-to-apples comparison. Many competitive papers published or sponsored by vendors may be such studies.

To understand the real game, one needs to know more about how the two software products are designed and how they work with different underlying hardware. Each system has unique features that are designed to perform better for specific applications. For example, one may use an on-board cache to give better performance, while the other may use multiple paths to the storage. Any differentiating feature between the two products can be exploited to tilt the playing field. By ‘correctly’ choosing the underlying hardware, one of them can be shown in a better light than the other.

That is why, even though system performance is a very rigorous subject, its application to comparing the performance of commercial systems is a subtle art.

So, the next time you come across a ‘performance comparison’, be sure to look at the conditions under which the performance measurement was carried out.

April 18, 2011

the correct measure of system performance

System performance is a general term. It means how the system (hardware, software or both) is doing under some specified conditions, or how well the system can perform under real load conditions. An appropriate measure of performance should be chosen based on the aspect that is most valuable in the given situation. There isn’t a single common parameter that fits all situations.

One of the most commonly used parameters for measuring performance is the number of transactions the system performs in one second, or ‘transactions/sec’. This simply counts the number of operations a system can do in one second. This parameter is appropriate if the requirement is that the system deliver the maximum number of transactions.

Another frequently used parameter is how long a specified operation takes to complete. For a system that handles one transaction at a time, this is the ‘inverse’ of the previous parameter. Here the interest is in the average time (say in milliseconds) taken to complete a single transaction. This may be of interest if one wants a measure of how FAST the system serves users.
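
Both parameters can be taken from the same run. The sketch below is one minimal way to do it, assuming the transactions are issued back to back from a single thread; `measure` and `sample_transaction` are hypothetical names, and the sample workload is only a stand-in for a real transaction.

```python
import time
from typing import Callable, Tuple


def measure(transaction: Callable[[], None], count: int) -> Tuple[float, float]:
    """Run `transaction` `count` times back to back.

    Returns (transactions per second, average latency in milliseconds).
    """
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        transaction()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    tps = count / elapsed
    avg_latency_ms = 1000 * sum(latencies) / count
    return tps, avg_latency_ms


def sample_transaction() -> None:
    """Hypothetical stand-in for a real transaction."""
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    tps, latency_ms = measure(sample_transaction, 1000)
    print(f"{tps:.0f} transactions/sec, {latency_ms:.4f} ms per transaction")
```

With a single thread the two numbers are close to reciprocals of each other. Once several transactions are in flight at the same time that no longer holds, which is why both are worth reporting.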

There are other performance parameters as well. For example, if one is interested in how efficient the code is (especially code which does computation), then one looks at the transactions delivered per second per unit of CPU consumed, or ‘transactions/sec/cpu’. Efficient code will deliver more transactions per second while using less CPU.

If the interest is in raw data performance, that is, data written to or read from the disk, then the appropriate parameter is the number of blocks written to or read from the disk per second. If one is studying the performance of a database, then one may want to look at how many SQL operations (or scripts) are executed per second.

For a web server, it could be the time taken to load pages while multiple users are accessing it.

The measure of performance should capture the quality that is being sought in the system. The performance numbers or claims may prove misleading simply because one is not using the correct measure of performance. It is thus very important to choose the correct measure of system performance.

April 12, 2011

Looking at code performance

In any system under performance consideration, there are 3 important timescales.

  1. The timescale at which the CPU works
  2. The timescale at which the Memory can be accessed
  3. The timescale at which data can be stored on disk

These three have widely different timescales in a traditional computer setup. The CPU works at nanosecond resolution, main memory access takes tens to hundreds of nanoseconds, and disks respond in milliseconds.

This makes it difficult to estimate the true efficiency of code, because the rate at which instructions can be processed also depends on the rate at which memory is accessed and the disk is utilized. Hence, to truly measure code efficiency, one needs to make sure that disk response times do not play a major role in the flow of instructions. In other words, one should take care to remove the IO bottlenecks before analyzing the performance of code.
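
One rough way to check whether a measurement is dominated by waiting rather than by computation is to compare CPU time with wall-clock time. The sketch below uses Python's `time.process_time` and `time.perf_counter` for that; `cpu_fraction` and the two sample workloads are hypothetical names, and `time.sleep` merely stands in for a slow disk.

```python
import time
from typing import Callable


def cpu_fraction(work: Callable[[], None]) -> float:
    """Return the fraction of wall-clock time that `work` spent on the CPU."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def compute_bound() -> None:
    """Pure computation: almost all wall time is CPU time."""
    sum(i * i for i in range(200_000))


def waiting() -> None:
    """Stand-in for an IO wait: sleeping uses almost no CPU."""
    time.sleep(0.2)


if __name__ == "__main__":
    print(f"compute-bound: {cpu_fraction(compute_bound):.2f}")
    print(f"waiting:       {cpu_fraction(waiting):.2f}")
```

A fraction well below 1 for a single-threaded workload suggests that the code is mostly waiting, so the numbers say more about the storage than about the code.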

In the presence of a strong influence of disk IOs, the measurements will be biased by the characteristics of the storage system. The orders-of-magnitude difference between the response times of disk and memory, and of memory and CPU, makes it even more difficult to remove the IO bottleneck. While CPUs have become faster in the last few years and there are now CPUs with multiple cores, memory and disk access speeds haven’t kept pace. As a result, the contrast between the response times of CPU, memory and storage has widened.

April 9, 2011

3 simple steps to adopt cloud computing

Cloud computing is now synonymous with Flexible Provisioning and Scale. Find out below if you are taking full advantage of cloud computing.

The As Is deployment – lowest adoption cost, reasonable benefits:

Move the server application “as is” to a cloud server. This is nothing but a co-located server, at Amazon for example. The provisioning and maintenance of the application is still a self-driven task.

The win is in the dynamic, on-demand provisioning. It is easy to compute the ROI here. Let us say that your application needs to be available all year round but caters to seasonal demand. Say it costs $400 a month to host your application with enough capacity for peak demand. You would end up paying 12*400 = $4800 per annum to keep your application up, and most of the time it would be underutilized. Cloud computing has made changing your compute capacity as easy as setting a reminder in your Outlook calendar. With Amazon or Google, you could just log into the admin panel and say that you need additional resources only on certain dates. At the end of the month you get billed for the amount of resources you actually consume.
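
The ROI arithmetic can be sketched as follows. Only the $400 peak figure comes from the example above; the off-peak cost of $100 a month and the two peak months are assumptions picked purely for illustration, and real provider pricing will differ.

```python
PEAK_MONTHLY_COST = 400      # from the example above
OFF_PEAK_MONTHLY_COST = 100  # assumption, for illustration only
PEAK_MONTHS = 2              # assumption, for illustration only


def fixed_provisioning_cost() -> int:
    """Yearly cost when peak capacity is kept running all year."""
    return 12 * PEAK_MONTHLY_COST


def on_demand_cost(peak_months: int = PEAK_MONTHS,
                   off_peak_monthly_cost: int = OFF_PEAK_MONTHLY_COST) -> int:
    """Yearly cost when peak capacity is paid for only in the peak months."""
    if not 0 <= peak_months <= 12:
        raise ValueError("peak_months must be between 0 and 12")
    return (peak_months * PEAK_MONTHLY_COST
            + (12 - peak_months) * off_peak_monthly_cost)


if __name__ == "__main__":
    fixed = fixed_provisioning_cost()
    elastic = on_demand_cost()
    print(f"fixed: ${fixed}, on demand: ${elastic}, saving: ${fixed - elastic}")
```

Under these assumed numbers the yearly bill drops from $4800 to $1800. The saving shrinks as the number of peak months grows and disappears when demand is at peak all year.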

The Managed RDBMS deployment – reasonably low adoption cost, reasonable benefits:

A lot of work has to be done to ensure that the application is available, i.e. a replication strategy and policy to keep the database available. This is still a lot of effort and money. The alternative is a managed RDBMS, where the provider (Amazon or Google) manages the database. They worry about keeping the data safe from being lost. It is much harder to compute the ROI here, as the time spent in managing the database yourself would have to be offset against opportunity costs. Note that there would be some amount of code restructuring (not a lot) to get this going. An example of this is Amazon RDS for MySQL. At the time of writing, Google is yet to announce the availability of their hosted SQL service.

The Application Rewrite – highest adoption cost, highest benefits (arguably)

If your goal is to write an application which scales very well, then you should consider a complete application rewrite to take advantage of the storage APIs. A hosted RDBMS is still a single machine (or a cluster) running a database server – with bottlenecks – be it memory, CPU, network or disk.

Cloud computing offers storage APIs to access and manage data, unlike traditional methods of file or RDBMS storage. Because of the underlying architectural differences, a cloud datastore offers better scalability – http://labs.google.com/papers/bigtable.html.
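
To give a feel for why such a datastore scales differently, here is a toy sketch of a key-value store whose keys are split across partitions by key range, loosely in the spirit of the row-range splitting described in the Bigtable paper. It is not Bigtable's API or implementation; the class name, the split points and the in-memory dictionaries are all assumptions made for illustration.

```python
import bisect
from typing import Any, Dict, List, Optional


class RangePartitionedStore:
    """Toy key-value store whose keys are split across partitions by key range."""

    def __init__(self, split_points: List[str]) -> None:
        # N sorted split points define N + 1 partitions. In a real system each
        # partition could live on a different machine.
        self._split_points = sorted(split_points)
        self._partitions: List[Dict[str, Any]] = [
            {} for _ in range(len(self._split_points) + 1)
        ]

    def partition_for(self, key: str) -> int:
        """Return the index of the partition responsible for `key`."""
        return bisect.bisect_right(self._split_points, key)

    def put(self, key: str, value: Any) -> None:
        self._partitions[self.partition_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._partitions[self.partition_for(key)].get(key)


if __name__ == "__main__":
    store = RangePartitionedStore(["g", "p"])
    for name in ("apple", "grape", "zebra"):
        store.put(name, len(name))
        print(name, "-> partition", store.partition_for(name))
```

Because every request names its key, it can be routed to exactly one partition, and adding partitions adds capacity. A single database server offers no such split point.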

April 8, 2011

Performance Engineering – SSD file systems.

Solid-state devices (SSDs) threaten to challenge and change existing computing paradigms.

While traditional disk access times are of the order of a few milliseconds, SSD access times are under 100 microseconds for both reads and writes. That is a speed-up by a factor of tens to a hundred, and that is significant.
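
The effect of access time on a stream of requests issued one at a time is simple arithmetic, sketched below. The 5 ms disk latency is an assumed reading of "a few milliseconds"; the 100 microsecond figure is the upper bound quoted above; `serial_iops` is a hypothetical helper name.

```python
def serial_iops(latency_seconds: float) -> float:
    """IO operations per second when requests are issued one at a time."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_seconds


DISK_LATENCY_S = 5e-3   # assumption: 5 ms per access
SSD_LATENCY_S = 100e-6  # 100 microseconds per access

if __name__ == "__main__":
    disk = serial_iops(DISK_LATENCY_S)
    ssd = serial_iops(SSD_LATENCY_S)
    print(f"disk: {disk:,.0f} IOPS, SSD: {ssd:,.0f} IOPS, ratio: {ssd / disk:.0f}x")
```

With these assumed figures the disk serves about 200 serial requests per second and the SSD about 10,000. A slower disk or a faster SSD pushes the ratio higher.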

Operating system components have evolved over the last 4 decades at a much slower speed. For an enterprise platform like IBM AIX or HP-UX, it takes a few years (my guess is a minimum of 6) to push a new technology. The cycle is as follows: a new hardware technology is invented; OS vendors take a few years to adopt and evangelize it; enterprise customers take even longer to test, adopt and deploy it.

SSD promises to deliver better performance by lowering IO latency and increasing throughput. File systems have evolved to do the same. Specialized caches have been invented to speed up performance. For example directory name lookup cache, page cache, inode cache, large directory cache, buffer cache and so on.
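
As an illustration of the idea behind such caches, here is a minimal least-recently-used cache that maps path names to inode numbers, in the spirit of a directory name lookup cache. It is a sketch only; real file systems implement these caches in the kernel with very different data structures, and the class and method names here are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class NameLookupCache:
    """A tiny least-recently-used cache mapping path names to inode numbers."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[str, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, path: str) -> Optional[int]:
        """Return the cached inode number, or None on a miss."""
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        return None

    def insert(self, path: str, inode: int) -> None:
        """Cache a name, evicting the least recently used entry if full."""
        self._entries[path] = inode
        self._entries.move_to_end(path)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


if __name__ == "__main__":
    cache = NameLookupCache(capacity=2)
    cache.insert("/etc/hosts", 101)
    cache.insert("/etc/passwd", 102)
    cache.lookup("/etc/hosts")       # hit, keeps /etc/hosts fresh
    cache.insert("/tmp/scratch", 103)  # evicts /etc/passwd
    print(cache.lookup("/etc/passwd"), cache.hits, cache.misses)
```

Every hit saves a trip to the disk. That saving is what such caches were designed around, and it becomes much smaller when the device behind the cache answers in microseconds.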

Quite a lot of focus is on being clever with reads and writes of application data. Engineers go to great lengths to squeeze the last bit of performance out of the system. Sadly, performance is not a main consideration during implementation (functionality is) and is often applied as an afterthought.

The result is hacks rather than elegant solutions to performance issues.

Coming back to SSDs – a large portion of the file system implementation has to be re-examined, and parts of it have to be thrown away completely. This is especially true when complete file systems are laid out on SSD. We need to look at how file systems can take advantage of SSDs.

April 7, 2011

Balance File System Caching and Flushing policies

Most file-systems have intelligent caching policies. The policies are designed to increase throughput and decrease IO latency, thereby providing faster service times to users and applications. Based on the nature of the workload, appropriate caching policies can be set to achieve maximum cache-hit rates.

However, this works only as long as there is enough memory to store revisited pages. Once the number of pages cached exceeds the set memory limit, the file-system decides to reclaim space by flushing older pages to the storage. If flushing is not done frequently, a lot of data may suddenly be dumped on the disk, leading to a storage bottleneck.

Depending on storage bandwidth, flushing a large amount of file-system data can lead to very large service times for the users or the application. For all-round good performance, one thus also needs to look at how frequently and how much data is flushed to the storage. Just as caching policies should take into account the nature of the workload, flushing policies should take into account the nature of the storage and its I/O bandwidth.

Good sustained file-system performance is possible only when both the caching and the flushing policies are set optimally.
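
The trade-off can be seen in a toy simulation. The sketch below models a write-back cache that flushes everything once the number of dirty pages reaches a limit, and reports how long the disk would be busy with the largest burst. The model, the function names and all the numbers are assumptions for illustration; real file systems flush incrementally and in the background.

```python
from typing import List


def flush_bursts(ticks: int, pages_per_tick: int, dirty_limit: int) -> List[int]:
    """Simulate a write-back cache; return the size of each flush in pages."""
    dirty = 0
    bursts = []
    for _ in range(ticks):
        dirty += pages_per_tick      # the workload dirties pages every tick
        if dirty >= dirty_limit:     # limit reached: flush everything at once
            bursts.append(dirty)
            dirty = 0
    return bursts


def worst_stall_ticks(bursts: List[int], disk_pages_per_tick: int) -> float:
    """Ticks the disk needs to absorb the largest single flush."""
    return max(bursts, default=0) / disk_pages_per_tick


if __name__ == "__main__":
    for limit in (50, 500):
        bursts = flush_bursts(ticks=100, pages_per_tick=10, dirty_limit=limit)
        stall = worst_stall_ticks(bursts, disk_pages_per_tick=25)
        print(f"limit={limit}: {len(bursts)} flushes, worst stall {stall} ticks")
```

Both settings write the same 1000 pages in total. The low limit spreads them over 20 small flushes that the disk absorbs in 2 ticks each, while the high limit produces 2 bursts that keep the disk busy for 20 ticks each.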

April 7, 2011

Performance Engineering: Watch out for dynamic CPU revving.

These days, most machines come with advanced power-saving options. To save power, machines reduce CPU speed whenever the CPU is not in great demand. Even though the CPU may have a specified speed of 2 or 3 GHz, these speeds are reached only when there is enough load on the system and full processing power is needed.

This is very important to remember while discussing or measuring system performance. Since performance can change drastically with CPU speed, CPU speed must be held constant while doing any performance measurement. Typically, you will need to switch off the dynamic CPU speed revving by setting appropriate flags in the power-saving setup files.
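
On Linux, one place to check is the cpufreq interface in sysfs, where each CPU exposes its frequency governor in a `scaling_governor` file. The sketch below reads those files before a benchmark run. It assumes a Linux machine with cpufreq enabled; on other platforms, or when the files are absent, it simply finds nothing. The function names are hypothetical, and a `performance` governor does not rule out other sources of frequency change such as turbo modes or thermal throttling.

```python
from pathlib import Path
from typing import Dict

CPU_ROOT = Path("/sys/devices/system/cpu")  # Linux sysfs location


def read_governors(cpu_root: Path = CPU_ROOT) -> Dict[str, str]:
    """Return {cpu name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    for governor_file in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = governor_file.parent.parent.name
        try:
            governors[cpu_name] = governor_file.read_text().strip()
        except OSError:
            continue  # unreadable entry: skip it rather than guess
    return governors


def frequency_may_vary(governors: Dict[str, str]) -> bool:
    """True when any CPU uses a governor other than 'performance'."""
    return any(governor != "performance" for governor in governors.values())


if __name__ == "__main__":
    found = read_governors()
    if not found:
        print("no cpufreq information found on this machine")
    elif frequency_may_vary(found):
        print("warning: CPU frequency may change during the run:", found)
    else:
        print("all CPUs use the performance governor")
```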

Measuring the maximum performance of a system or comparing the performance of two software products is meaningful only when CPU speed is held constant.

April 5, 2011

Performance Engineering: Best Practices

Documenting best practices is often a sign of failure. In a software development environment, one often starts to document best practices. The intention is to make everyone in the team aware of how good work can be done.

However, over time, writing down best practices becomes the norm, and they become a set format in which the work needs to be done. As the team grows, best practices are used as rules. However, the scope and magnitude of the team’s work might have far outgrown the original mandate, and so the best practices may no longer be applicable. In such cases, writing down best practices sounds like admitting that the team members cannot evolve appropriate strategies to deal with the contemporary nature of projects.

April 5, 2011

Performance Engineering: True measure of code performance

If you understand throughput as the effective number of transactions that users experience per second from your hardware and software stack, then you would ideally want the maximum possible transactions per second from your setup.

We understand that any software stack ultimately uses CPU cycles to process all these transactions. Hence, the transactions per second delivered per unit of CPU consumed is the true measure of the performance of your system.

Ideally, we should be measuring the throughput of a system and how many CPU cycles it takes to deliver that throughput. Inefficient code would spend many more CPU cycles to deliver X transactions per second (aka tps). Optimized and efficient code would deliver the same transactions per second using far fewer CPU cycles.

Thus a good measure of system performance is throughput per CPU usage. This is the number to watch.
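
A rough sketch of how this number could be collected is shown below. It uses CPU seconds, as reported by Python's `time.process_time`, as a stand-in for CPU cycles; the function names and the two sample implementations are hypothetical and only illustrate the comparison.

```python
import time
from typing import Callable, Tuple


def throughput_per_cpu_second(transaction: Callable[[], None],
                              count: int) -> Tuple[float, float]:
    """Return (transactions per wall second, transactions per CPU second)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(count):
        transaction()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    tps = count / wall_used
    per_cpu = count / cpu_used if cpu_used > 0 else float("inf")
    return tps, per_cpu


def wasteful_sum() -> int:
    """Adds the numbers one by one in a Python loop."""
    total = 0
    for i in range(20_000):
        total += i
    return total


def efficient_sum() -> int:
    """Same result from the closed-form formula n * (n - 1) / 2."""
    n = 20_000
    return n * (n - 1) // 2


if __name__ == "__main__":
    for name, fn in (("wasteful", wasteful_sum), ("efficient", efficient_sum)):
        tps, per_cpu = throughput_per_cpu_second(fn, 500)
        print(f"{name}: {tps:,.0f} tps, {per_cpu:,.0f} transactions per CPU second")
```

Both functions return the same answer, but the second delivers far more transactions for each second of CPU time it consumes.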