Monday, November 7, 2011

A look at the Fundamentals in the Cloud

If you are interested in the Cloud and testing, the following is a talk I gave at GTAC 2011 that might interest you.

Part the Clouds and See Fact from Fiction

Sunday, October 30, 2011

Old performance adage... Polling is bad

It's long been known that polling is bad.  It burns a ton of resources.  The challenge is that it trips up even great developers.  Check out... http://m.guardian.co.uk/technology/2011/oct/29/iphone-4s-battery-location-services-bug?cat=technology&type=article

One way to catch this is to have a good set of resource monitoring tests.  It's very likely Apple had these, however it's hard to catch every case with so many ways to configure software.  This is where collecting the same resource measurements from released devices can help (crowd-sourced testing).  Check out for example Microsoft's SQM (aka Customer Experience Improvement Program) data.
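
To make the idea concrete, here is a minimal sketch (Python, using the third-party psutil package) of a resource monitoring test that samples CPU while the system is supposed to be idle and fails if the average goes over a budget.  The 2% budget and 30-sample window are made-up numbers for illustration, not recommendations.

  # Minimal sketch of a resource monitoring test: sample CPU while the
  # system should be idle and fail if polling-style busy work shows up.
  # Requires the third-party psutil package; the budget and window are
  # illustrative values only.
  import psutil

  CPU_BUDGET_PERCENT = 2.0   # what "idle" is allowed to cost
  SAMPLES = 30               # one sample per second

  def test_idle_cpu_budget():
      readings = [psutil.cpu_percent(interval=1.0) for _ in range(SAMPLES)]
      average = sum(readings) / len(readings)
      assert average <= CPU_BUDGET_PERCENT, (
          "idle CPU averaged %.1f%%, budget is %.1f%% - something may be "
          "polling instead of waiting on events" % (average, CPU_BUDGET_PERCENT))

  if __name__ == "__main__":
      test_idle_cpu_budget()
      print("idle CPU within budget")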

Should you decide to collect telemetry, just remember the second adage... Bad collection is like polling.

Friday, October 28, 2011

Crowd sourcing Apple iPhone 4S power performance

An interesting way to try to solve the issue...

http://m.guardian.co.uk/technology/2011/oct/28/iphone-4s-battery-apple-engineers?cat=technology&type=article

Monday, October 17, 2011

Performance Test Pattern

One of the biggest challenges in monitoring and tracking performance is getting stable and repeatable numbers.  Check out the following plot of two performance tests.  On which do you think it will be easier to spot regressions?


There are two tests here: test 1, which looks pretty erratic, and test 2, which looks pretty stable and repeatable around 3 ms.  Given this I'm pretty sure you are going to choose test 2 to monitor and track performance.  That test is very stable, meaning the variance is very low, and repeatable because it does not drift over time.  For an example of drift check out the following:


Here you can see the results are pretty stable in that the variance from number to number is pretty low, however the results are not very repeatable and seem to drift up over time.  For this particular test a high ping time is bad.  This is also an example of "death by a thousand cuts", where from test to test the results look fine but over long periods of time you see performance dropping off.

So the question then becomes: how do you make stable and repeatable performance tests?  The answer is to follow a test pattern like the xUnit pattern with a couple of extra steps.  The pattern is the following:

  1. setup 
  2. warmup
  3. execute
  4. something most tests forget
  5. publish
  6. cleanup
  7. teardown
Notice the additional steps - warmup, step 4, publish, and cleanup.  Now let me explain them.

Warmup - This step allows the performance test to "warm up" the system under test.  For example, if you want to measure database queries you generally have to decide whether you want hot numbers (most likely the common case), where the database has been in use for a while, or cold numbers, which reflect the state right after boot/init/etc.  By having a warmup step you can test both hot and cold cases through the addition or removal of this step.  An example might be selecting 10 rows from the database before doing the general select tests.

Step 4 - Ahh... the mystery.  What is step 4?  Take a quick look back at the first graph.  Any ideas?  Well the answer is VALIDATE.  Most performance tests forget to validate the results they are getting.  In the warmup step we said to select 10 rows.  Did the test actually return 10 rows?  If not there is likely some error.  Be sure to check your results and don't publish them if there was an error.  Generally on performance graphs invalid results show up as super high, zero, or super low numbers.

Publish - This is the act of pushing the result into your tracking infrastructure.  Performance results tend to have a limited lifetime of usefulness, however there is always good cause to be able to look back over time.

Cleanup - Cleanup is like teardown without exiting all layers of initialization.  Generally the role of cleanup is to put things back in order so the test can be run again with minimal side effects.  For cold performance results you will need to tear down.

While execute is not a new step in the performance pattern I wanted to mention it because oftentimes in performance tests you want "stable" numbers.  This is generally achieved by running the execute step a number of times and averaging, or by repeating steps 3 - 6 a number of times.  While averaging is often the right answer it can sometimes hide performance issues.  Perhaps I'll blog on that another day.
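
To make the pattern concrete, here is a minimal sketch in Python against an in-memory SQLite table.  The table, the 10-row select, and the print-based publish step are stand-ins for whatever your real system under test and tracking infrastructure look like - the point is the order of the steps and the validate gate before anything gets published.

  # Sketch of the performance test pattern: setup, warmup, execute,
  # validate, publish, cleanup, teardown.  The SQLite table and the
  # print-based publish are illustrative stand-ins only.
  import sqlite3
  import statistics
  import time

  def run_performance_test(iterations=20):
      # 1. setup: build the system under test and its fixtures
      db = sqlite3.connect(":memory:")
      db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
      db.executemany("INSERT INTO t (payload) VALUES (?)",
                     [("x" * 100,) for _ in range(1000)])
      db.commit()
      try:
          # 2. warmup: select 10 rows so the measured runs are "hot"
          db.execute("SELECT * FROM t LIMIT 10").fetchall()

          timings_ms = []
          for _ in range(iterations):
              # 3. execute: the operation being measured
              start = time.perf_counter()
              rows = db.execute("SELECT * FROM t LIMIT 10").fetchall()
              elapsed_ms = (time.perf_counter() - start) * 1000.0

              # 4. validate: did the query really return 10 rows?
              if len(rows) != 10:
                  raise RuntimeError("invalid result - do not publish this run")
              timings_ms.append(elapsed_ms)

              # 6. cleanup: undo any side effects so the next run starts clean
              db.rollback()

          # 5. publish: push the numbers into your tracking infrastructure
          print("median_ms=%.3f max_ms=%.3f" %
                (statistics.median(timings_ms), max(timings_ms)))
      finally:
          # 7. teardown: exit all layers of initialization
          db.close()

  if __name__ == "__main__":
      run_performance_test()

Here the execute/validate/cleanup steps repeat and the numbers are published once at the end; the sketch also reports the median and max rather than only an average, which is one simple way to keep outliers from being averaged away.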

Now that you have a solid performance test pattern go forth and create amazing results....

  Tony





Friday, September 9, 2011

Remember WPR? Check out the drive for better power use at Google

A couple of posts ago I talked about Watts Per Request (WPR) and how power is becoming ever more important: http://perfguy.blogspot.com/2011/05/single-most-import-performance-metric.html.  What is cool is that Google just released its power consumption numbers to the world, and they give some good insights.  In Google fashion everything was accounted for, right down to the Street View cars.  Check out the NY Times article http://www.nytimes.com/2011/09/09/technology/google-details-electricity-output-of-its-data-centers.html.  To learn more about Google's power use and the industry-standard metric PUE (Power Usage Effectiveness) check out http://www.google.com/about/datacenters/index.html.
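
As a quick reference, PUE is just total facility energy divided by the energy delivered to the IT equipment, so a PUE of 1.0 would mean every watt goes to computing.  A toy calculation with made-up numbers:

  # Toy PUE calculation.  The kWh figures are made up for illustration.
  total_facility_kwh = 1200000.0   # servers + cooling + lighting + losses
  it_equipment_kwh = 1000000.0     # energy actually delivered to the servers

  pue = total_facility_kwh / it_equipment_kwh
  print("PUE = %.2f" % pue)        # 1.20, i.e. 20% overhead beyond the compute itself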

Enjoy,
   Tony

Monday, July 11, 2011

Ahh... the world has evolved. No more 1TB sorts.

The 1TB sort competition has ended because the winners finish it so quickly now. The good news is there are other competitions... http://sortbenchmark.org/

Thursday, June 30, 2011

The second most important metric - Location, Location, Location

You might think from the title "Location, location, location" that this will be a post about real estate, however in the new era of Cloud and Mobile Computing location is going to be a huge factor both in the design of your service and in testing it, in particular its performance.  Cloud Computing is a growing trend which is enabling all kinds of new applications.  If you want to find out more on Cloud Computing you can check out the slides from a talk I did at the Better Software Conference this year in Las Vegas here.

So why does location matter?  Location matters because of physics, and no one has figured out how to outrun the speed of light.  Ok... that is a little abstract so let me give an example.  Imagine you live in South Africa and want to deploy your cool new service in California because lots of Cloud providers have data centers there - super fast to do and cheap.  A single packet of data can take around half a second to go from South Africa to California and back - that is roughly how long the round trip takes over today's networks.  Now imagine your new site serves pages with images to fetch, database rows to read, etc.  Each new object you serve means the user (using a browser) in South Africa has to request data from California.  Each round trip is 0.5 seconds, so a page with 10 images could take 5 to 10 seconds just trying to initiate the fetches.  Wow... it's not looking good for your service if you want people in South Africa to use it.
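
As a back-of-the-envelope sketch (the numbers below are the rough figures from the paragraph above, not measurements):

  # Back-of-the-envelope estimate: each object costs at least one round
  # trip just to start its fetch.  Numbers are the rough figures from
  # the text, not measurements.
  RTT_SECONDS = 0.5   # South Africa <-> California round trip, roughly
  OBJECTS = 10        # images and other resources on the page

  # Serial worst case: one round trip per object before any data flows.
  print("~%.0f seconds spent just initiating fetches" % (OBJECTS * RTT_SECONDS))
  # Browsers fetch a few objects in parallel, which helps, but every new
  # request still pays the same half-second round trip.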

Some generally accepted time constraints for operations are the following -
  • User interfaces should respond in around 100 millisec or less.  Human perception is around 30 millisec.
  • A user can detect a software hang in around one second and will take action at around 5 seconds to fix it.
  • The good news for the web is that users are willing to wait up to 15 seconds for a page, however they will likely never come back if it takes more than 30 seconds.
So now that you have seen that deploying your new service in California for your South African users is not such a good idea, what should you do?  The answer is to find zones closer to your users, like Europe or possibly Asia.  At the time of writing I don't know of anyone providing Cloud resources directly in Africa, however the landscape is changing quickly as demand rises.

Another example of how location matters is interactive games.  Imagine a multi-player game that is really popular on the East Coast of the US with all the game servers on the West Coast of the US.  In general a packet takes around 50 to 70 millisec to travel there and back.  This means a game can only get around 15 to 20 corrections a second.  These long latencies can show up as you shooting another player first but for some reason you die anyway.  Gamers hate this.

Given the growing dependence of cool applications on Cloud resources, it's time to really start thinking about where your users are and where your services are located.  The shorter the physical distance the better.

Location also matters for legal and privacy issues, however that's a whole topic unto itself.

Sunday, May 8, 2011

The single most important performance metric - WPR

Before we dive into WPR I'd like to take a moment to write about metrics because without metrics there is nothing to measure or tune.  Metrics are the quantities you are going to measure on software and hardware.  More formally a metric is a unit of measure.  There are tons of interesting metrics like %CPU for CPU utilization, Packets Per Second, QPS (Queries per second), FPS (frames per second), RTT (round trip time), and so on.

In general metrics fall into two classes - Utilization and Throughput/Latency.  Utilization is a measure of how much something is used, so from the previous example %CPU is a utilization metric.  Throughput metrics measure the rate at which things get done, like QPS.  Latency is how long it takes for an individual piece of work to complete, like RTT.  Another example of the Throughput/Latency split: while Google might do millions of queries a second (throughput), you the end user are concerned with how fast your one query runs (latency).  Server software tends to tune for throughput while interactive software like mobile phone apps tunes for latency.
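
A minimal sketch of the difference, timing a made-up unit of work: one run yields both a latency number (what each caller feels) and a throughput number (what the server operator cares about).  do_work() is a placeholder, not any particular API.

  # Measure the same workload two ways: latency per operation (what one
  # user feels) and throughput (operations completed per second overall).
  import time

  def do_work():
      sum(i * i for i in range(10000))   # placeholder unit of work

  N = 500
  latencies_ms = []
  start = time.perf_counter()
  for _ in range(N):
      t0 = time.perf_counter()
      do_work()
      latencies_ms.append((time.perf_counter() - t0) * 1000.0)
  elapsed = time.perf_counter() - start

  print("throughput: %.0f ops/sec" % (N / elapsed))
  print("median latency: %.3f ms" % sorted(latencies_ms)[N // 2])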

Another term you might have heard is efficiency.  Efficiency is a measure of wasted work.  The more efficient something is, the less work is wasted (i.e. driving around the block twice before parking is likely wasted work).  I don't list it in the metric classes above because both utilization and throughput/latency metrics can be used to derive efficiency.

In my experience throughput/latency measures are more reliable than utilization metrics.  There are lots of reasons for this; for example, Virtual Machines and advanced CPUs tend to skew utilization but not throughput.  You can see a past post of mine that talks about skew on virtual machines here.  If there is interest I can write more on this topic.

Now back to WPR... WPR is Watts Per Request.  A watt is a measure of power used.  You might have seen references to Power Performance or Power Utilization over the last couple of years, but why does it matter so much?  Power is so important these days because of portable devices and data centers.

Ten years ago most computing was under the desk, and prior to that it was in a central room.  Power in the central room was interesting, however issues like the speed of computation drove engineering.  Under the desk the costs of inefficient computation (high WPR) were so spread out that most people did not notice or care.  However we all have really fast processors now (and yes, they can be faster) and a computer in your pocket[book].  In your pocket[book] watts = surfing/talking/dorking time, and in the data center watts = heat.  Heat means you have to pay a lot for space and cooling.  The biggest costs for a data center are not the computers but rather the space and power used for cooling.

Put more simply, WPR is important because a lower WPR can make your phone/tablet/laptop last longer and save you money in the datacenter.

So now you might be wondering how to measure power.  On Windows machines you can use "c:\windows\system32\powercfg -energy" and on Linux machines you can read the sensors (lm-sensors).  The internal computer sensors can be useful, however most engineers looking to drive down WPR use external measurement tools like Extech or Intech meters, which are more accurate and have computer readouts that can be used for automation.
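
Here is a minimal sketch of turning power samples into a per-request energy number.  read_power_watts() and served_requests() are hypothetical placeholders for whatever your meter and your server's request counter actually provide - they are not real APIs - and the 60 second window is arbitrary.

  # Sketch of deriving an energy-per-request number over a window.
  # read_power_watts() and served_requests() are hypothetical stand-ins
  # for your power meter and your server's request counter.
  import time

  def read_power_watts():
      return 185.0                 # hypothetical instantaneous reading, in watts

  def served_requests():
      return 0                     # hypothetical cumulative request counter

  def energy_per_request(window_seconds=60, interval=1.0):
      samples = []
      start_requests = served_requests()
      end = time.time() + window_seconds
      while time.time() < end:
          samples.append(read_power_watts())
          time.sleep(interval)
      requests = max(served_requests() - start_requests, 1)
      avg_watts = sum(samples) / len(samples)
      # average watts * seconds = joules; divide by the requests served in
      # the same window to get the per-request energy cost to drive down
      return avg_watts * window_seconds / requests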

Little did you know that choosing Quicksort [O(n*log(n))] over BubbleSort [O(n^2)] was making users happier, saving money, and making the world a little greener.  Happy power hunting.

  Tony Voellm

Sunday, May 1, 2011

Three steps to making great performant software

Over the last 20+ years I have been teaching and learning about performance, and it's time to return more of what I've learned to the public domain.  My knowledge is based in OS work (I was a Windows and IRIX kernel developer as well as the Hyper-V perf lead), databases (I led the SQL Server perf team), web apps, compilers, image processing, optimization, and much more.  I've worked at some of the best companies - SGI, MSFT, and now Google - which has also given me a wider perspective.

So now that you have a little of my background, I'm going to teach you the three steps to making performant software.  You ready?

Step 1: Have a plan
Step 2: Instrument
Step 3: Measure and Track

Yep... that's it.  Now to put this into perspective, the diet industry also has three steps:

Step 1: Eat less
Step 2: Exercise more
Step 3: Keep doing 1 and 2

However simple those three steps are, there is a multi-billion dollar business out there to teach them to us.  The steps are not easy and there are a lot of nuances, like "What should I eat less of?", etc.  The three steps to great performance are a lot like the diet steps.  There are a lot of nuances, and in coming posts I'll detail them more.  For now I'll give you a quick rundown.

Step 1 is to have a plan.  This means you have an idea of why you are trying to improve the software and how you want to improve it.  You have some goal in mind.  If you have no goal then why are you performance tuning?

Step 2 is to instrument.  This means you will be putting markers into the code you are measuring in such a way that you can figure out how close you are to your goal.  There are lots of ways to instrument, with the simplest being a printf; others include performance counters, Windows ETW, etc.
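
For example, the simplest kind of instrumentation is just a timing marker around the code you care about.  A tiny sketch in Python (printing here, but in a real system the numbers would feed the counters or ETW events you track in step 3):

  # Simplest possible instrumentation: a timing marker around a region
  # of code, printed here but destined for whatever counter or tracking
  # system you use in step 3.
  import time
  from contextlib import contextmanager

  @contextmanager
  def perf_marker(name):
      start = time.perf_counter()
      try:
          yield
      finally:
          elapsed_ms = (time.perf_counter() - start) * 1000.0
          print("[perf] %s: %.2f ms" % (name, elapsed_ms))

  # Usage: wrap the region you want to measure.
  with perf_marker("load_config"):
      time.sleep(0.05)   # stand-in for the real work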

Step 3 is to measure and track.  This means with each change you make you'll measure the impact it has on your performance goals and track it over time.  If a regression shows up you'll be ready to fix it.

I can't wait to dig into the three steps more with you...

  Tony Voellm

Monday, April 25, 2011

Ask your Hyper-V or any other software fundamentals questions here...

If you have questions about Hyper-V or other topics around software fundamentals in general you can post a comment here and in due time I'll answer it.

What's up with "/", "\", ";", "," ?

Since this blog is dedicated to all things fundamental, I thought I'd put out a quick post on usability.  Over the years of computing many common memes have come about, such as using "CTRL-C" (or the Apple key plus C on a Mac) for copy, holding the power key for 8+ seconds to do a hard power off, and much more.  These memes work across all platforms and really make computing accessible.  This is great.

Why is it then that email clients like Outlook don't accept both ";" and "," to separate names?  To Gmail's credit, while it prefers "," it accepts ";" without issue.  It seems like such a simple usability improvement for Outlook and Windows Phone 7.  If anyone knows the history behind the choice of ";" vs. "," and the reasons for or against accepting both, I'd love to hear it.
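
For what it's worth, accepting either separator is close to a one-liner in most languages.  A tiny sketch (the addresses are made up):

  # Accept either ";" or "," (plus stray whitespace) as the separator.
  import re

  raw = "alice@example.com; bob@example.com, carol@example.com"
  addresses = [a for a in re.split(r"[;,]\s*", raw) if a]
  print(addresses)   # ['alice@example.com', 'bob@example.com', 'carol@example.com']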

As for "/" and "\" for directory naming we could go on for ages on this topic.  Fortunately most users never have to worry about this now because directory access is abstracted via GUI's.

This post was really intended to just help you think about the little things... they really matter!

Wednesday, March 2, 2011

Which is faster Windows Hyper-V or VMWare ESX?

Catchy title, and I am sure you want to know the simple answer - Hyper-V or ESX.  The challenge is that the answer is not so simple.  This post will help you understand why, and should arm you with some questions to ask when trying to decide.

Before I go too far I should let you know I am definitely biased toward Hyper-V after being the Performance Lead for three releases.  You can check out my Hyper-V-only blog at http://blogs.msdn.com/tvoellm.  However, now that I am no longer with Microsoft I'll do my best to give some good, balanced insights.

While ESX has been around longer than Hyper-V, I don't think you should use this to determine how fast or functional one is compared to the other.  For example, is MySpace better than Facebook?  What you will likely see from a more mature product is higher reliability, just because the engineers for the mature product have had more time to shake out bugs.  I can't really speak to ESX reliability, however I can say Hyper-V runs on everything from 1 processor up to 64 processors and we literally tested it day and night for thousands of hours of uptime, fault free.  I think it is more than ready for your mission-critical applications.

Now back to performance.  First you need to understand there are a couple of types of virtualization.  You can get all the details on http://en.wikipedia.org/wiki/Virtualization, however for the purposes of this article you just need to understand that Hyper-V and ESX virtualize the CPU, network, storage, and graphics using hardware support, emulation, and binary patching (aka binary translation).  Hardware support means hardware vendors like Intel/AMD have added capabilities to their hardware that allow certain operations, like page resolution or routing a packet to a Virtual Machine (VM), to be done in hardware (hardware helps because it does not require expensive switches into the hypervisor).  Emulation means the CPU/device instructions are executed in software rather than running on real hardware (you might wonder how this can be faster - it can be when there are a lot of hardware events that would otherwise keep engaging the hypervisor).  Last there is binary patching, where the original software being run in a VM is changed in some way - only ESX does this (binary patching is a useful technique because it allows the behavior of the virtualized operating system to be adjusted ahead of time rather than trying to determine it when an event happens).

Now that you have the basics of virtualization you can begin to understand some of the questions to ask, and you might also realize that there is a lot more than just the Hyper-V and ESX bits determining how performant your virtualized workloads will be.

Really, there is more than just Hyper-V and ESX to worry about?  Yes.  For example, Hyper-V and ESX both support special CPU instructions for things like second-level address translation, however that does not mean the CPU you have supports those functions.  CPUs that are three years old likely do not support second-level address translation, which is a key feature for making VMs run fast, especially VMs running memory-intensive operations.  So the first question you should ask when looking to virtualize is "What machine should I buy and in particular what CPU does it have?".  The simple answer is to look for an Intel Nehalem-based processor such as the Xeon 5500+ or Core i7+, and for AMD look for recent Opteron or Phenom II processors.  You can see more on Intel virtualization here and AMD virtualization here.  I can't stress enough how much the CPU virtualization features have an impact.  Both Hyper-V and ESX make good use of the CPU features and neither really has an advantage over the other.  Choose your CPU with the workload you want to run in mind.

This leads us to question #2 - What do the workloads you are planning to run look like?  The reason the workload matters so much is how Hyper-V and ESX virtualize networking and storage.  For example, if you have a fully cached web server you want to run in a virtual machine, it's very likely that Hyper-V will run better because its networking virtualization is better, although ESX is catching up.  If however you are running a database, it may be that ESX will be better because it has more support from big hardware vendors like EMC and NetApp to improve storage performance.  As for 3D graphics, I don't have a clear winner for you.  If you are mostly a Microsoft shop you should go with Hyper-V because of the deep integration with Terminal Services.  Your workload is largely characterized by how it uses the network, storage, and graphics; more use means a more intensive workload.  For example, databases are storage intensive, web servers are generally network intensive, and simulations like weather modeling are CPU and graphics intensive.

So on to question #3 - What is your storage environment?  The environment includes not only the host machine where the virtual machines will be running but also the storage infrastructure.  For example, will you be running a SAN or an iSCSI storage network?  If you want iSCSI to a VM then Hyper-V will likely be better because its networking performance is better overall, however if you run a SAN then ESX might be the better choice.  There are also other questions to ask around storage like LUN provisioning, snapshotting, and migration (moving storage between host machines).  The deeper the level of integration of the solution, the more performant it is likely to be.  Hyper-V has great basic I/O performance, however I've seen more integration of VMWare with storage solutions.

Given we touched on storage, we need to cover the importance of networking.  It is worth asking: what virtualization networking features does your NIC support?  Believe it or not, Intel and Broadcom have both adopted features like VMQ (aka NetQueue), TCP offload (checksum and large send), Jumbo Frames, and RDMA.  Hyper-V has traditionally been ahead here but there has been some leapfrogging.

Another important performance dimension is power, so the question is: what power management features does your virtualization solution support?  This is important because lots of power use means lots of heat, and lots of heat means lots of cost for cooling.  Both Hyper-V and ESX have power management features, however at the time of this article VMWare is a bit ahead on this front.  Overall, virtualization with either solution will save power because of physical-hardware-to-virtual-machine consolidation.  What I am talking about is, once you have virtualized, which solution will use less power per VM operation.  They are both very competitive here.

Last but not least, what virtualization features do Hyper-V and VMWare support?  For Hyper-V you can see here and for ESX see here.

So to recap the questions:

#1 - What machine should I buy and in particular what CPU does it have?  Intel here and AMD here
#2 - What do the workloads look like that you are planning to run?
#3 - What is your storage environment?  Check out the NetApp and EMC sites.
#4 - What virtualization networking features does your NIC support?
#5 - What power management features does your virtualization solution support?

There are many, many more questions you could ask, however the real purpose of this article was to help you understand that "Which is faster, Windows Hyper-V or VMWare ESX?" is not such an easy question to answer, and to arm you with some questions you should ask.

In the end my recommendation is to try before you buy.  You can ask all the questions you want, however in the end you need to make a decision.  My suggestion is to borrow an environment if you can and try your workloads on it.  Whichever is better for you, go for it (P.S.... don't forget the cost).

  Tony Voellm

Sunday, February 13, 2011

Welcome!

Welcome to my new blog, where I'll be exploring a range of topics from the performance to the security of real systems, along with occasional topics like software testing and computer science in general.  My previous blog, http://blogs.msdn.com/tvoellm, was dedicated to Microsoft technologies, however this blog will have wider coverage.  If there are questions you would like me to answer just send a message and I'll get to them as time allows.