tag:blogger.com,1999:blog-27204876400929133062024-03-13T20:11:52.978-07:00All topics fundamentalThis blog will be about all topics fundamental to computer science, computer users, IT admins, and much more.Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.comBlogger18125tag:blogger.com,1999:blog-2720487640092913306.post-55773858204471441802012-10-26T10:41:00.001-07:002012-10-26T10:41:24.197-07:00<div dir="ltr" style="text-align: left;" trbidi="on">
<h2 style="text-align: left;">
<span style="font-size: large;">Google Compute Engine breaks Hadoop Terasort World Record</span></h2>
While I've been light on posts as of late, I thought I'd share a cool result that Google Cloud and MapR accomplished together.<br />
<br />
This result came from a lot of performance analysis, tuning, and much more, but nothing about the setup is specialized. You get the same experience without any extra effort - just boot virtual machines and go.<br />
<br />
Here is a press release on the accomplishment:<br />
<a href="http://insights.wired.com/video/mapr-google-compute-engine-set-new-world-record-for-hadoop-teraso">http://insights.wired.com/video/mapr-google-compute-engine-set-new-world-record-for-hadoop-teraso</a><br />
<br />
The Cloud is powerful.<br />
<br />
Enjoy,<br />
Tony<br />
<br />
PS. My Google Cloud Security, Performance and Test team is hiring :)</div>
Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-78128607719424588522012-10-03T00:56:00.001-07:002012-10-03T00:56:33.995-07:00Testing 2.0: Enter the human<div dir="ltr" style="text-align: left;" trbidi="on">
Over the last couple of weeks I've started writing and speaking on what I'm calling Testing 2.0. To get an overview of this new chapter in testing, check out this <a href="http://googletesting.blogspot.com/2012/08/testing-20.html">post</a>.<br />
<br />
As a follow-on to the Google Testing Blog post, I had the honor of speaking at <a href="http://events.yandex.ru/events/yac/2012/">YaC</a> in Russia. The talk focused on how to improve engineering productivity, one of the focuses of Testing 2.0. You can see the talk <a href="http://events.yandex.ru/talks/335/">here</a> (video soon).<br />
<br />
On a side note... visiting Russia was awesome.<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-69556980231059896852012-03-24T19:00:00.001-07:002012-03-24T19:01:13.566-07:00Ads can steal your power - Mobile trade-offs<div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: left;">Reading the following article the other day, <a href="http://www.pcmag.com/article2/0,2817,2401797,00.asp">"<span style="background-color: white; font-family: arial; line-height: 32px;">Study: Free Android Apps Can Steal Your Phone's Power"</span></a>, I was reminded of all the trade-offs one has to make when designing mobile applications. Before we dig into some of those trade-offs, one does have to wonder about the purpose of one company doing a sanctioned study of another company's products. We'll leave digging into that topic for another day.</div><div style="text-align: left;"><br />
</div><div style="text-align: left;">Now back to the topic of mobile trade-offs. Some might argue this, but the single most important thing to design for is minimal power consumption. Power is so important on mobile platforms because users really don't want to hang out at charging pods in airports, plug in on the train, or plug in at a friend's house. One of the big delighters of the original Kindle was that it could run for weeks on a single charge. The newer Kindle Fire lasts less than a day, like most current-generation platforms. The new iPad 3 has a battery almost twice the size, without any additional device life (still ~10 hours). So where is all that power going? The two biggest draws are the screen and the radio. You have some control over the screen's power consumption: the dimmer you go, the longer you go. The iPad 3 has a super-dense screen and new graphics processors. I wonder how many people would keep the iPad 2's screen and processors in exchange for 20 hours of use? </div><div style="text-align: left;"><br />
</div><div style="text-align: left;">I have a Windows Phone 7 (yes, I am willing to admit I have one - I'm a techie) and just use it as an in-house Wi-Fi device. What totally surprised me is that the phone lasts for weeks without the radio on. The first time that happened I was pretty surprised by just how much impact the radio had.</div><div style="text-align: left;"><br />
</div><div style="text-align: left;">While I've talked about the screen and radio, we can't ignore use of the processor (CPU). Most mobile devices have ARM chips, which do all kinds of cool things like clocking the CPU lower when not in use, having low-power cores for when the phone is idle, etc. The take-away is that really CPU-intensive work, like my fractal app at http://www.tonyware.com/fractals, will drain your power. It has drained my phone because zooming in is so cool :)</div><div style="text-align: left;"><br />
</div><div style="text-align: left;">So what are some of the mobile trade-offs you should be thinking about?</div><div style="text-align: left;"></div><ul style="text-align: left;"><li>Do you really need to send data to your backend server every 10 seconds, or would once an hour be good? Not only will this save power, it will also use less of your customers' network bandwidth. (Rule #1 - Use the radio less)</li>
<li>Does having a white background really make the App better? Can you choose dimmer colors? If the App is idle, should you dim the screen? (Rule #2 - Lighting up more pixels uses more power)</li>
<li>What about pushing more computation to the server rather than having the phone do it? (Rule #3 - Don't use the CPU for big calculations)</li>
</ul><div>The last point, doing work on the server rather than the phone, is one of the reasons <a href="http://www.google.com/appengine">Google AppEngine</a> is growing so quickly for mobile. The more you can do on the server, the more power will be saved. It's also very likely the computation will go much faster on the server. The trade-off on the developer end is how to manage cost while getting the best customer experience. By the way, computations in the Cloud are far "greener" than on any other computing device. But that is a topic for another day.</div><div><br />
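Rule #1 above usually means batching in practice. Here is a minimal sketch of the idea (the class, method, and parameter names are my own illustrations, not any real mobile SDK): queue readings locally and wake the radio only when a batch fills up or a flush interval passes.

```python
import time

class TelemetryBatcher:
    """Illustrative sketch: buffer readings locally and send them in
    batches so the radio powers up once per batch, not per reading."""

    def __init__(self, send, batch_size=50, flush_interval_s=3600.0):
        self.send = send                  # callback that uses the radio
        self.batch_size = batch_size
        self.flush_interval_s = flush_interval_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def record(self, reading):
        self.buffer.append(reading)
        overdue = time.monotonic() - self.last_flush >= self.flush_interval_s
        if len(self.buffer) >= self.batch_size or overdue:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)        # one radio wake-up, many readings
            self.buffer = []
        self.last_flush = time.monotonic()
```

Compared with sending every 10 seconds (360 radio wake-ups an hour), an hourly flush wakes the radio once, and the batch still fits in a single request.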
</div><div>Hope this got you thinking....</div><div><br />
</div><div> Enjoy,</div><div> Anthony F. Voellm (aka Tony)</div><br />
<div style="text-align: left;"><br />
</div></div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-28079897327765566872012-03-02T11:29:00.000-08:002012-03-02T11:29:49.483-08:00Old rules of thumb always need to be reconsidered<div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-family: Arial, Helvetica, sans-serif;">After being in the computer industry for a while you begin to appreciate just how much machine capabilities change and the need to change designs along with them. For example, just 10 years ago developers would spend hours trying to find ways to save a few bytes of memory. Now most of the code the world runs goes through interpreters and JITs (Java, Python, PHP, JScript, ...), and a few bytes are less interesting. I'm not saying to waste them, but I personally would not make them my first priority. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Let me give you an example of how changes in machine capabilities caused the rethinking of an OS. In Windows XP, Microsoft engineers designed the memory manager to aggressively push data from main memory (RAM) to disk. This was done because RAM was costly and very small (~128MB) at the time, so if more memory could be freed up, new applications could start faster. If you waited until an application started to free memory, users would wait from 30 seconds to minutes before the application was usable because of paging RAM to disk. Between the time Windows XP and Vista shipped, RAM prices dropped dramatically (from $40 for 128MB to just $2). </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg87ApaV1qcwGcivLyyBtTnsAM91HRSepqvQq8Zjo6cYmLfL7XB6UEiiRcrW2ocyL16ACoaaAMAJc_wFAc_3uEIrQpvC96ES953tUSMX2YNHSmH_rfTTYCTmxtxhGhW2TvmQ9xxsjQuiI0/s1600/Screenshot-4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" height="396" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg87ApaV1qcwGcivLyyBtTnsAM91HRSepqvQq8Zjo6cYmLfL7XB6UEiiRcrW2ocyL16ACoaaAMAJc_wFAc_3uEIrQpvC96ES953tUSMX2YNHSmH_rfTTYCTmxtxhGhW2TvmQ9xxsjQuiI0/s640/Screenshot-4.png" width="640" /></span></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEkRqcIR9mJdKcXH7J3kTTnNWUKyDSQSWXPn1FF_q2iZW-178-MbICZxuhof3IldkEiuBDqVWmrYNsF_RGph-CMxXlKbHM2YyzGvgaiA-TwsQGxBs9cYEDhWvXBFKmOWC6hVRcTfZw4ok/s1600/Screenshot-5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEkRqcIR9mJdKcXH7J3kTTnNWUKyDSQSWXPn1FF_q2iZW-178-MbICZxuhof3IldkEiuBDqVWmrYNsF_RGph-CMxXlKbHM2YyzGvgaiA-TwsQGxBs9cYEDhWvXBFKmOWC6hVRcTfZw4ok/s640/Screenshot-5.png" width="640" /></span></a></div><br />
<span style="font-family: Arial, Helvetica, sans-serif;">With the dramatic change in memory prices, and the fact that disks did not really get any faster, Vista fundamentally reversed the old rule of thumb of freeing up as much RAM as possible by pushing it to disk. RAM was cheap and relatively plentiful, so a feature called <a href="http://blogs.technet.com/b/askperf/archive/2007/03/29/windows-vista-superfetch-readyboost.aspx">SuperFetch</a> was created to aggressively page data from disk into RAM. With RAM no longer forced to disk, overall UI performance felt snappier in Vista. No more shaking the mouse after lunch, as with XP, and waiting a minute or more before logging in.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Well, it looks like with the improved performance of CPUs and networks, old rules of thumb around UI responsiveness are starting to be reconsidered. Early UI research by Miller in 1968 and Card in 1991 led to rules of thumb for UI regularly cited in <a href="http://www.useit.com/papers/responsetime.html">"Response Times: The 3 important limits"</a> and extended for the World Wide Wait, I mean Web.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Here is a recap of those rules and a few more that have been adopted from experience and, very likely, papers I've read long ago and forgotten:</span><br />
<span id="internal-source-marker_0.40357438125647604" style="text-align: -webkit-auto;"></span><br />
<ul style="font-weight: bold;"><li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">Users consider 100ms response times fast</span></span></li>
<li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">At around 1 second users will notice a delay but are tolerant</span></span></li>
<li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">At 5 seconds users are starting to get impatient and may take action</span></span></li>
<li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">At 10 seconds they lose focus</span></span></li>
<li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">At 15 seconds they are likely to hit “refresh”</span></span></li>
<li style="font-size: 15px; font-weight: normal; list-style-type: disc; vertical-align: baseline;"><span style="vertical-align: baseline; white-space: pre-wrap;"><span style="font-family: Arial, Helvetica, sans-serif;">At 30 seconds they generally navigate away and don’t come back if there is an acceptable alternative.</span></span></li>
</ul><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="white-space: pre-wrap;">Well, it looks like even hard-earned rules of thumb for UI and the Web are now falling, as seen in a recent New York Times article </span><span style="background-color: white; color: #333333; font-weight: bold; text-align: left;"><a href="http://mobile.nytimes.com/article;jsessionid=1637B981EC2094F40D6DCB7A0DA46F97.w6?a=919045&f=24">"For Impatient Web Users, an Eye Blink Is Just Too Long to Wait".</a> </span><span style="text-align: left;">Based on this article, it looks like 250 milliseconds is the new goal for web responsiveness, rather than the 1 second we had all used.</span></span></div><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="text-align: left;"><br />
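The thresholds in the list above are easy to encode. Here is a small sketch (the function name and reaction strings are my own illustrative wording) that maps a measured response time to the rule-of-thumb user reaction:

```python
def user_perception(response_ms):
    """Map a response time in milliseconds to the rule-of-thumb
    user reaction from the list above (illustrative sketch)."""
    if response_ms <= 100:
        return "feels instant"
    if response_ms <= 1000:
        return "noticeable but tolerated"
    if response_ms <= 5000:
        return "impatient, may take action"
    if response_ms <= 10000:
        return "losing focus"
    if response_ms <= 15000:
        return "likely to hit refresh"
    if response_ms <= 30000:
        return "about to navigate away"
    return "gone, probably for good"
```

Note that by the old rules a 250 ms response was comfortably in the "tolerated" band; the point of the article is that the bar has moved.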
</span></span></div><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="text-align: left;">The overall moral of the story is don't hold on too dearly to those rules of thumb, and perhaps you should rethink them often. </span></span></div><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="text-align: left;"><br />
</span></span></div><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="text-align: left;"> -- </span></span><span style="font-family: Arial, Helvetica, sans-serif; text-align: left;">Anthony F. Voellm</span></div><div><span style="font-family: Arial, Helvetica, sans-serif;"><span style="text-align: left;"><br />
</span></span></div></div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-66406800287652508052012-02-27T13:53:00.002-08:002012-02-27T13:53:59.702-08:00Fix security bugs early - Interesting paper<div dir="ltr" style="text-align: left;" trbidi="on"><span style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;">Interesting paper - find security bugs before release, because of the high cost of fixing them later. Internet apps change some of the cost dynamics; however, that does not mean fixing early is less important, because it's hard to fix your reputation.</span><br style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;" /><br style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;" /><span style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;">http://www.stickyminds.com/Files/Automated%20Testing%20With%20Commerical%20Fuzzing%20Tools.pdf</span><br />
<span style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj31bD2ZDrvgVOHdkWlTNo3z-uesJBs8joii2tQP-Scp43ZMvkMi91_Vb1uP4RpT5ntoKpOvyWT_5FVsjr2Kj5HgF1VpU5loC_Z4GLyG42Vd11VPmyR3v_05NK5W20OmP1ID1ycNywXPc/s1600/Screenshot-6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="219" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj31bD2ZDrvgVOHdkWlTNo3z-uesJBs8joii2tQP-Scp43ZMvkMi91_Vb1uP4RpT5ntoKpOvyWT_5FVsjr2Kj5HgF1VpU5loC_Z4GLyG42Vd11VPmyR3v_05NK5W20OmP1ID1ycNywXPc/s320/Screenshot-6.png" width="320" /></a></div><span style="background-color: white; font-family: Arial, sans-serif; font-size: 13px; line-height: 18px; text-align: -webkit-auto;"><br />
</span></div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-4266419960962210662011-11-07T14:52:00.000-08:002011-11-07T14:52:50.675-08:00A look at the Fundamentals in the CloudIf you are interested in the Cloud and testing, the following talk I gave at GTAC 2011 might be interesting to you.<div><br />
</div><div class="separator" style="clear: both; text-align: center;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUG6gPOitUMz4l478D5BWqzwgrYdJRmYmazftNX3ngqlpoVZNZ0MH5S57jgamhqFFlyzzwIZ1tqroXBUPimdZxtgom7GWaI826vr9Dp3RY4EnsELi2jCwPnPebTrpTVEHH5CHlAW53P5Y/s320/Screenshot-3.png" width="320" /></div><div class="separator" style="clear: both; text-align: center;"><span style="text-align: -webkit-auto;">Part the Clouds and See Fact from Fiction</span></div><div><br />
</div><div style="text-align: center;"><a href="http://www.youtube.com/watch?v=nXIA3VYN1To&list=PLBB2CAFDDBD7B7265&index=9&feature=plpp_video">http://www.youtube.com/watch?v=nXIA3VYN1To&list=PLBB2CAFDDBD7B7265&index=9&feature=plpp_video</a></div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com1tag:blogger.com,1999:blog-2720487640092913306.post-46334173567872970192011-10-30T07:42:00.001-07:002011-10-30T07:42:11.222-07:00Old performance adage... Polling is bad<div><p>It's long been known that polling is bad. It uses a ton of resources. The challenge is that it trips up even great developers. Check out... http://m.guardian.co.uk/technology/2011/oct/29/iphone-4s-battery-location-services-bug?cat=technology&type=article</p>
<p>One way to catch this is to have a good set of resource monitoring tests. It's very likely Apple had these; however, it's hard to catch with so many ways to configure software. This is where collecting these same resource metrics from released devices can help (crowd-sourcing your testing). Check out, for example, Microsoft's SQM (aka Customer Experience Improvement Program) data.</p>
<p>Should you decide to collect telemetry, just remember the second adage... Bad collection is like polling.</p>
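One common way to make collection (or any polling) cheaper, offered here as my own suggestion rather than anything from the linked article, is exponential backoff: wake up geometrically less often while nothing is changing. A minimal sketch:

```python
def backoff_intervals(min_s=1.0, max_s=300.0, factor=2.0):
    """Yield polling intervals that grow geometrically while nothing
    changes, capped at max_s. When new activity arrives, recreate the
    generator to reset back to min_s."""
    interval = min_s
    while True:
        yield interval
        interval = min(interval * factor, max_s)
```

An idle device then quickly settles at one wake-up per max_s instead of hammering the CPU and radio at the minimum interval forever.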
</div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-55757857135402445092011-10-28T15:04:00.001-07:002011-10-28T16:21:47.271-07:00Crowd sourcing Apple iPhone 4S power performance<div><p>An interesting, easy way to tackle the issue...</p>
<p>http://m.guardian.co.uk/technology/2011/oct/28/iphone-4s-battery-apple-engineers?cat=technology&type=article</p>
</div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-10366409907879355692011-10-17T08:55:00.000-07:002011-10-17T08:55:58.632-07:00Performance Test PatternOne of the biggest challenges in monitoring and tracking performance is getting stable and repeatable numbers. Check out the following plot of two performance tests. On which do you think it will be easier to spot regressions?<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAZ5LZxWeL7lHKrSrCA_BThgQdu6f78U0Z1ZT8uh_XM35Frakm5qB38ruzM4AA6jSsU1A9MA7A9Ba95Mp4Cc4DA2fr19Z_A8_s5VqfVv7IPGPScyjGOA4Zwxp2eZWJWVgHIWdfLJtCBvE/s1600/Screen+shot+2011-10-17+at+8.24.44+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAZ5LZxWeL7lHKrSrCA_BThgQdu6f78U0Z1ZT8uh_XM35Frakm5qB38ruzM4AA6jSsU1A9MA7A9Ba95Mp4Cc4DA2fr19Z_A8_s5VqfVv7IPGPScyjGOA4Zwxp2eZWJWVgHIWdfLJtCBvE/s400/Screen+shot+2011-10-17+at+8.24.44+AM.png" width="400" /></a></div><br />
There are two tests here: test 1, which looks pretty erratic, and test 2, which looks pretty stable and repeatable at around 3 ms. Given this, I'm pretty sure you are going to choose test 2 on which to monitor and track performance. That test is very stable, meaning the variance is very low, and repeatable, because it does not drift over time. For an example of drift, check out the following:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCaEVp3afE-ZhqMVRBT6SqOTf1LsXACeifZ23FTUEWGerXV6ZxVPpqNt5HalIKkdNTVnhUBknNf__yhdPBTE-MlY60iCNSI440jCiXZu1SwPg6FI5Y4Y5Vp4XtC25c1UYBW0SPT6i_6E4/s1600/Screen+shot+2011-10-17+at+8.30.01+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCaEVp3afE-ZhqMVRBT6SqOTf1LsXACeifZ23FTUEWGerXV6ZxVPpqNt5HalIKkdNTVnhUBknNf__yhdPBTE-MlY60iCNSI440jCiXZu1SwPg6FI5Y4Y5Vp4XtC25c1UYBW0SPT6i_6E4/s400/Screen+shot+2011-10-17+at+8.30.01+AM.png" width="400" /></a></div><br />
Here you can see the results are pretty stable, in that the overall variance from number to number is pretty low; however, the results are not very repeatable and seem to drift up over time. For this particular test, high ping time is bad. This is also an example of "death by a thousand cuts," where from test to test the results look good, but over long periods of time you see performance dropping off.<br />
<br />
So the question then comes up: how do you make stable and repeatable performance tests? The answer is to follow a test pattern like the <a href="http://en.wikipedia.org/wiki/XUnit">xUnit</a> pattern with a couple of extra steps. The pattern is the following:<br />
<br />
<ol><li>setup </li>
<li>warmup</li>
<li>execute</li>
<li>something most tests forget</li>
<li>publish</li>
<li>cleanup</li>
<li>teardown</li>
</ol><div>Notice the additional steps - warmup, step 4, publish, and cleanup. Now let me explain them. </div><div><br />
</div><div><b>Warmup</b> - This step is here to allow the performance test to "warm-up" the system under test. For example if you want to measure database queries generally you have to decide if you want hot (most likely the common case) numbers where the database has been in use for a while or cold numbers which is the state right after boot/init/etc. By having warmup you can test both hot and cold tests by the additional or removal of this step. An example might be selecting 10 rows from a database before doing the general select tests.</div><div><br />
</div><div><b>Step 4</b> - Ahh... the mystery. What is step 4? Take a quick look back at the first graph. Any ideas? Well the answer is VALIDATE. Most performance tests forget to validate the results they are getting. In the previous step of warmup we said to select 10 rows. Did the test actually return 10 rows? If not there is likely some error. Be sure to check your results and dont publish them if there was an error. Generally on performance graphs invalid results look like super high, 0, or super low numbers.</div><div><br />
</div><div><b>Publish</b> - This is the act of pushing the result into your tracking infrastructure. Performance results tend to have a lifttime of usefulness however there is always good cause to look back over time.</div><div><br />
</div><div><b>Cleanup</b> - Cleanup is like teardown without exiting all layers of initialization. Generally the role of cleanup is to get things put back in order so the test can be run again with minimal side-effects. For cold performance results you will need to teardown.</div><div><br />
</div><div>While execute is not a new step in the performance pattern I wanted to mention it because often times in performance tests you want "stable" numbers. This is generally achieve by running the execute step a number of times and averaging or repeating steps 3 - 6 a number of times. While averaging is often the right answer it can sometimes hide performance issues. Perhaps I blog on that another day.</div><div><br />
</div><div>Now that you have a solid performance test pattern go forth and create amazing results....</div><div><br />
</div><div> Tony</div><div><br />
</div><div><br />
</div><div><br />
</div><div><br />
</div><div><br />
</div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-10671022556029142882011-09-09T10:03:00.000-07:002011-09-09T10:11:00.106-07:00Remember WPR? Check out the drive for better power use at GoogleA couple of posts ago I talked about Watts Per Request (WPR) and how power is becoming ever more important in <a href="http://perfguy.blogspot.com/2011/05/single-most-import-performance-metric.html">http://perfguy.blogspot.com/2011/05/single-most-import-performance-metric.html</a>. What is cool is that Google just released its power consumption to the world, and it gives some good insights. In Google fashion, everything was accounted for, right down to the Google Street View cars. Check out the New York Times article <a href="http://www.nytimes.com/2011/09/09/technology/google-details-electricity-output-of-its-data-centers.html">http://www.nytimes.com/2011/09/09/technology/google-details-electricity-output-of-its-data-centers.html</a>. To learn more about Google's power use and the industry-standard metric PUE (Power Usage Effectiveness), check out <a href="http://www.google.com/about/datacenters/index.html">http://www.google.com/about/datacenters/index.html</a>.<br />
<div><br />
</div><div>Enjoy,</div><div> Tony</div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-71131567523375863462011-07-11T16:40:00.001-07:002011-07-11T16:40:56.674-07:00Ahh... the world has evolved. No more 1TB sorts.<span class="Apple-style-span" style="background-color: white; color: black; font-family: arial,sans-serif; font-size: 13px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 18px; orphans: 2; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">The 1TB sort competition has ended because winners take less than a second now. The good news is there are other competitions...<span class="Apple-converted-space"> </span><a class="ot-anchor" href="http://sortbenchmark.org/" style="color: #3366cc; cursor: pointer; text-decoration: underline;">http://sortbenchmark.org/</a></span>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com1tag:blogger.com,1999:blog-2720487640092913306.post-2406194780267530932011-06-30T10:29:00.000-07:002011-06-30T10:29:49.378-07:00The second most important metric - Location, Location, Location<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">You might think from the title "Location, location, location" that this will be a post about real estate; however, in the new era of Cloud and Mobile Computing, location is going to be a huge factor both in the design of your service and in testing it, performance in particular. Cloud Computing is a growing trend that is enabling all kinds of new applications. 
If you want to find out more on Cloud Computing, you can check out the slides from a talk I did at the<a href="http://www.sqe.com/conferencearchive/bscwest2011/"> Better Software Conference</a> this year in Las Vegas <a href="https://docs.google.com/present/edit?id=0AZiICDqrB_E9ZGZocHJ0ZnpfM2ZxdjkzamR2&hl=en_US&authkey=CPzYqbUO">here</a>.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">So why does location matter? Location matters because of physics, and no one has figured out how to outrun the speed of light. OK... that is a little abstract, so let me give an example. Imagine you live in South Africa and want to deploy your cool new service in California because lots of Cloud providers have data centers there - super fast to do and cheap. A single packet of data will take about half a second to go from South Africa to California and back over real networks; the speed of light sets the floor, and fiber paths and routing add the rest. Now imagine your new site serves pages with images to fetch, database rows to read, etc. Each new object you serve means the user (using a browser) in South Africa has to request data from California. Each round trip is 0.5 seconds, so a page with 10 images could take 5 to 10 seconds just trying to initiate the fetches. Wow... it's not looking good for your service if you want people in South Africa to use it.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
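The arithmetic above is easy to sketch. Assuming each object costs one full round trip and is fetched sequentially (a worst case that ignores parallel connections and transfer time; function names are my own):

```python
SPEED_OF_LIGHT_KM_PER_S = 300_000  # approximate, in a vacuum

def physics_floor_rtt_s(distance_km):
    """Lower bound on round-trip time imposed by the speed of light;
    real networks (fiber, routing, queuing) are slower."""
    return 2 * distance_km / SPEED_OF_LIGHT_KM_PER_S

def naive_page_load_s(rtt_s, num_objects):
    """Worst-case sketch: one sequential round trip per object,
    ignoring transfer time and connection reuse."""
    return rtt_s * num_objects
```

With the 0.5 s observed round trip above, `naive_page_load_s(0.5, 10)` gives the 5 seconds quoted; even the physics floor for a ~15,000 km path is about 0.1 s per round trip, which no engineering can remove.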
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">Some generally accepted time constraints for operations are the following:</div><ul><li>User interfaces should respond in around 100 milliseconds or less. Human perception is around 30 milliseconds.</li>
<li>A user can detect a software hang in around one second and will take action at around 5 seconds to fix it.</li>
<li>The good news for the web is that users are willing to wait up to 15 seconds for a page; however, they will likely never come back if it takes more than 30 seconds.</li>
</ul><div>So now that you have seen that deploying your new service in California for your South African users is not such a good idea, what should you do? The answer is to find zones closer to your users, like Europe or possibly Asia. At the time of writing I don't know of anyone providing Cloud resources directly in Africa; however, the landscape is changing quickly as demand rises.</div><div><br />
</div><div>Another example of how location matters is interactive games. Imagine a multi-player game really popular on the East Coast of the US with all the game servers on the West Coast. In general, a packet takes around 50 to 70 milliseconds to travel there and back. This means a game can only get around 10 to 15 corrections a second. These long latencies can show up as you shooting another player first, but for some reason you die. Gamers hate this.</div><div><br />
</div><div>Given the growing dependence of cool applications on Cloud resources, it's time to really start thinking about where your users are and where your services are located. The shorter the physical distance, the better.</div><div><br />
</div><div>Location also matters for legal and privacy issues; however, that's a whole topic unto itself. </div>Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-22839254514981025022011-05-08T08:43:00.000-07:002011-05-08T08:43:06.553-07:00The single most important performance metric - WPRBefore we dive into WPR, I'd like to take a moment to write about metrics, because without metrics there is nothing to measure or tune. Metrics are the quantities you measure on software and hardware. More formally, a metric is a unit of measure. There are tons of interesting metrics, like %CPU for CPU utilization, packets per second, QPS (queries per second), FPS (frames per second), RTT (round trip time), and so on.<br />
<br />
In general metrics are thought of in two classes: utilization and throughput/latency. Utilization is a measure of how much something is used; from the previous example, %CPU is a utilization metric. Throughput metrics measure the rate at which things get done, like QPS. Latency is how long an individual piece of work takes to complete, like RTT. As another example of the throughput/latency split: while Google might serve millions of queries a second (throughput), you the end user care about how fast your one query runs (latency). Server software tends to tune for throughput while interactive software, like mobile phone apps, tunes for latency.<br />
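To make the throughput/latency distinction concrete, here is a minimal sketch (the workload passed in is a stand-in for whatever you actually measure):

```python
import time

def measure(fn, n):
    # Throughput: completed operations divided by total wall-clock time.
    # Latency: time taken by each individual operation.
    latencies = []
    start = time.perf_counter()
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    qps = n / elapsed
    avg_latency_ms = 1000.0 * sum(latencies) / n
    return qps, avg_latency_ms

qps, lat_ms = measure(lambda: sum(range(1000)), 100)
print(f"{qps:.0f} ops/sec, {lat_ms:.3f} ms average latency")
```

Note the two numbers can diverge: a batched or pipelined server can raise QPS without improving, or even while worsening, per-request latency.<br />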
<div><br />
</div>Another term you might have heard is efficiency. Efficiency is a measure of wasted work: the more efficient something is, the less work it wastes (e.g., driving around the block twice before parking is likely wasted work). I don't list it in the metric classes above because both utilization and throughput/latency metrics can be used to derive efficiency.<br />
<br />
In my experience throughput/latency metrics are more reliable than utilization metrics. There are lots of reasons for this; for example, virtual machines and advanced CPUs tend to skew utilization but not throughput. You can see a past post of mine that talks about skew on virtual machines <a href="http://blogs.msdn.com/b/tvoellm/archive/2008/03/20/hyper-v-clocks-lie.aspx">here</a>. If there is interest I can write more on this topic.<br />
<br />
Now back to WPR... WPR is Watts per request, where watts are a measure of the power used. You might have seen references to power performance or power utilization over the last couple of years, but why does it matter so much? Power utilization is so important these days because of portable devices and data centers.<br />
<br />
Ten years ago most computing was under the desk, and prior to that it was in a central room. Power in the central room was interesting, but important issues like the speed of computation drove engineering. Under the desk, the cost of inefficient computations (high WPR) was so spread out that most people did not notice or care. However, we all have really fast processors now (and yes, they can be faster) and a computer in your pocket[book]. In your pocket[book], watts = surfing/talking/dorking time, and in the data center, watts = heat. Heat means you have to pay a lot for space and cooling. The biggest cost for a data center is not the computers but rather the space and the power used for cooling.<br />
<br />
Put in simpler terms, WPR is so important because lower WPR can make your phone/tablet/laptop last longer and save you money in the datacenter.<br />
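The arithmetic behind WPR is just measured power divided by request rate. A tiny sketch (the 200 W and 5,000 QPS figures are made-up illustrations, not measurements):

```python
def watts_per_request(avg_watts, requests_per_second):
    # WPR: the energy cost of each unit of useful work.
    return avg_watts / requests_per_second

# A hypothetical server drawing 200 W while serving 5,000 QPS:
wpr = watts_per_request(200.0, 5000.0)
print(f"{wpr * 1000:.0f} mW per request")  # 40 mW per request
```

<br />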
<br />
So now you might be wondering how to measure power. On Windows machines you can use "c:\windows\system32\powercfg -energy" and on Linux machines you can read the sensors (lm-sensors). The internal computer sensors can be useful; however, most engineers looking to drive down WPR use external measurement tools like <a href="http://www.extech.com/instruments/product.asp?catid=14&prodid=206">Extech</a> or <a href="http://www.intech21.com/products/pm_2104_3_v6.html">Intech</a>, which are more accurate and have computer readouts that can be used for automation.<br />
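On Linux you can also estimate average watts in software from cumulative energy counters, for example the RAPL counters some recent Intel systems expose under /sys/class/powercap. The exact path varies by machine, so treat this as a hedged sketch and prefer an external meter for accuracy:

```python
RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0; varies by machine

def read_energy_uj(path=RAPL_PATH):
    # Cumulative energy counter in microjoules (it wraps eventually).
    with open(path) as f:
        return int(f.read())

def average_watts(energy_uj_start, energy_uj_end, seconds):
    # watts = joules / seconds; the counter is in microjoules.
    return (energy_uj_end - energy_uj_start) / 1e6 / seconds

# On a supported machine: e0 = read_energy_uj(); sleep 5 s; e1 = read_energy_uj();
# then average_watts(e0, e1, 5.0) is the package power over that window.
print(average_watts(0, 10_000_000, 2.0))  # 10 J over 2 s -> 5.0 W
```

<br />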
<br />
Little did you know that choosing Quicksort [O(n*log(n))] over BubbleSort [O(n^2)] was making happier users, saving money, and making the world a little greener. Happy power hunting.<br />
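As a back-of-envelope illustration of that last point (operation counts only; real energy also depends on constant factors and memory behavior):

```python
import math

n = 1_000_000
quick_ops = n * math.log2(n)  # roughly O(n log n) comparisons
bubble_ops = n * n            # roughly O(n^2) comparisons
ratio = bubble_ops / quick_ops
print(f"At n={n:,}, BubbleSort does roughly {ratio:,.0f}x the work of Quicksort")
```

At a million elements that is tens of thousands of times more work, and every extra comparison costs watts.<br />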
<br />
Tony VoellmAnonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-69567431833714649952011-05-01T16:51:00.000-07:002011-05-01T16:51:10.475-07:00Three steps to making great performant softwareOver the last 20+ years I have been teaching and learning about performance, and it's time to return more of what I've learned to the public domain. My background is in OS (I was a Windows and IRIX kernel developer as well as the Hyper-V perf lead), DB (I led the SQL Server perf team), web apps, compilers, image processing, optimization, and much more. I've worked at the best companies, like SGI, MSFT, and now Google, which has also given me a wider perspective.<br />
<br />
So now that you have a little of my background, I'm going to teach you the three steps to making performant software. You ready?<br />
<br />
Step 1: Have a plan<br />
Step 2: Instrument<br />
Step 3: Measure and Track<br />
<br />
Yep... that's it. To put this into perspective, the diet industry also has three steps:<br />
<br />
Step 1: Eat less<br />
Step 2: Exercise more<br />
Step 3: Keep doing 1 and 2<br />
<br />
However simple those three steps are, there is a multi-billion dollar business out there to teach them to us. The steps are not easy and there are a lot of nuances, like "What should I eat less of?". The three steps to great performance are a lot like the diet steps: there are a lot of nuances, and in coming posts I'll detail them more. For now I'll give you a quick rundown.<br />
<br />
Step 1 is to have a plan. This means you have an idea of why you are trying to improve the software and how you want to improve it. You have some goal in mind. If you have no goal then why are you performance tuning?<br />
<br />
Step 2 is to instrument. This means you will put markers into the code you are measuring in such a way that you can figure out how close you are to your goal. There are lots of ways to instrument, with the simplest being a printf; others include performance counters, Windows ETW, etc.<br />
<br />
Step 3 is to measure and track. This means with each change you make you'll measure the impact it has on your performance goals and track it over time. If a regression shows up you'll be ready to fix it.<br />
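The three steps above can be sketched in a few lines of code; the 50 ms goal and the workload here are placeholder assumptions, not a recommendation:

```python
import time

GOAL_MS = 50.0  # Step 1: the plan - e.g., "requests must finish in under 50 ms"

def instrumented(fn):
    # Step 2: instrument - wrap the code with a marker that records latency.
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.history.append(1000.0 * (time.perf_counter() - t0))
        return result
    wrapper.history = []
    return wrapper

@instrumented
def handle_request():
    sum(range(10_000))  # stand-in for real work

# Step 3: measure and track - compare each run against the goal over time.
for _ in range(5):
    handle_request()
worst = max(handle_request.history)
print("PASS" if worst <= GOAL_MS else f"REGRESSION: {worst:.1f} ms > {GOAL_MS} ms")
```

The point is not the mechanism (a printf, performance counters, or ETW all work) but that the goal, the markers, and the tracking loop all exist.<br />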
<br />
I can't wait to dig into the three steps more with you...<br />
<br />
Tony VoellmAnonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com1tag:blogger.com,1999:blog-2720487640092913306.post-84480505743208126082011-04-25T11:52:00.001-07:002011-04-25T11:52:51.956-07:00Ask your Hyper-V or any other software fundamentals questions here...If you have questions about Hyper-V or other topics around software fundamentals in general you can post a comment here and in due time I'll answer it.Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com6tag:blogger.com,1999:blog-2720487640092913306.post-1677341443056969842011-04-25T10:32:00.000-07:002011-04-25T10:32:19.924-07:00What's up with "/", "\", ";", "," ?Since this blog is dedicated to all things fundamental I thought I'd put out a quick post on usability. Over the years of computing, many common memes have come about, such as using "CTRL-C" (or Apple Key-C on Mac) for copy, holding the power key for 8+ seconds to do a hard power off, and much more. These memes work across all platforms and really make computing accessible. This is great.<br />
<br />
Why is it, then, that email clients like Outlook don't accept both ";" and "," to separate names? To Gmail's credit, while it prefers ",", it accepts ";" without issue. It seems like such a simple usability improvement for Outlook and Windows Phone 7. If anyone knows the history behind the choice of ";" vs. "," and why or why not to accept both, I'd love to hear it.<br />
<br />
As for "/" vs. "\" in directory paths, we could go on for ages on this topic. Fortunately most users never have to worry about this now because directory access is abstracted via GUIs.<br />
<br />
This post was really intended to just help you think about the little things... they really matter!Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com0tag:blogger.com,1999:blog-2720487640092913306.post-23510919761920139962011-03-02T06:59:00.000-08:002011-03-02T06:59:20.438-08:00Which is faster Windows Hyper-V or VMWare ESX?Catchy title, and I am sure you want the simple answer - Hyper-V or ESX. The challenge is that the answer is not so simple. This post will help you understand why, and should help you in asking some questions when trying to decide.<br />
<br />
Before I go too far I should let you know I am definitely biased toward Hyper-V after being the Performance Lead for three releases. You can check out my Hyper-V-only blog at http://blogs.msdn.com/tvoellm. However, now that I am no longer with Microsoft I'll do my best to give some balanced insights.<br />
<br />
While ESX has been around longer than Hyper-V, I don't think you should use this to determine how fast or functional one is compared to the other. For example, is MySpace better than Facebook? What you will likely see from a more mature product is higher reliability, just because the engineers for the mature product have had more time to shake out bugs. I can't really speak to ESX reliability; however, I can say Hyper-V runs on everything from 1 processor up to 64 processors, and we literally tested it day and night for thousands of hours of uptime, fault free. I think it is more than ready for your mission-critical applications.<br />
<br />
Now back to performance. First you need to understand there are a couple of types of virtualization. You can get all the details on <a href="http://en.wikipedia.org/wiki/Virtualization">http://en.wikipedia.org/wiki/Virtualization</a>; however, for the purposes of this article you just need to understand that Hyper-V and ESX virtualize the CPU, network, storage, and graphics using hardware support, emulation, and binary patching. Hardware support means hardware vendors like Intel/AMD have added capabilities to their hardware that allow certain operations, like page resolution or routing a packet to a Virtual Machine (VM), to be done in hardware (hardware helps because it does not require expensive switches into the hypervisor). Emulation means the instructions to be executed are emulated in software rather than running on real hardware (you might wonder how this can be faster - it can be when there are a lot of events in the hardware that would keep engaging the hypervisor). Last, there is binary patching, where the original software being run in a VM is changed in some way - only ESX does this (binary patching is a useful technique because it allows more direct control, ahead of time, of what a piece of software - the virtualized operating system in this case - should do, rather than trying to determine it when an event happens).<br />
<br />
Now that you have the basics of virtualization, you can begin to understand some of the questions to ask, and you might also realize that there is a lot more than just the Hyper-V and ESX bits that determines how performant your virtualized workloads will be.<br />
<br />
Really, there is more than just Hyper-V and ESX to worry about? Yes. For example, Hyper-V and ESX both support special instructions in the CPU for things like second-level address translation; however, that does not mean the CPU you have supports that function. For example, CPUs that are three years old likely do not support second-level address translation, which is a key feature for making VMs run fast, especially VMs running memory-intensive operations. So the first question you should ask when looking to virtualize is "What machine should I buy, and in particular what CPU does it have?". The simple answer is to look for an Intel Nehalem-based processor such as the Xeon 5500+ or Core i7+, and for AMD look for recent Opteron or Phenom II processors. You can see more on Intel virtualization <a href="http://www.intel.com/technology/virtualization/technology.htm?wapkw=(virtualization)">here </a>and AMD virtualization <a href="http://sites.amd.com/us/business/it-solutions/virtualization/Pages/amd-v.aspx">here</a>. I can't stress enough how much the CPU virtualization features have an impact. Both Hyper-V and ESX make good use of the CPU features, and neither really has an advantage over the other. Choose your CPU with the workload you want to run in mind.<br />
<br />
This leads us to question #2 - what do the workloads you are planning to run look like? The reason the workload matters so much is how Hyper-V and ESX virtualize networking and storage. For example, if you have a fully cached web server you want to run in a virtual machine, it's very likely that Hyper-V will run better because its networking virtualization is better, although ESX is catching up. If, however, you are running a database, ESX may be better because it has more support from big hardware vendors like EMC and NetApp to improve storage performance. As for 3D graphics, I don't have a clear winner for you; if you are mostly a Microsoft shop you should go with Hyper-V because of the deep integration with Terminal Services. A workload is largely characterized by how it uses the network, storage, and graphics; more use means a more intensive workload. For example, databases are storage intensive, web servers are generally network intensive, and simulations like weather modeling are CPU and graphics intensive.<br />
<br />
So on to question #3 - what is your storage environment? The environment includes not only the host machine where the virtual machines will be running but also the storage infrastructure. For example, will you be running a SAN or an iSCSI storage network? If you want iSCSI to a VM, then Hyper-V will likely be better because its networking performance is better overall; however, if you run a SAN, then ESX might be the better choice. There are also other questions to ask around storage, like LUN provisioning, snapshotting, and migration (moving storage between host machines). The deeper the level of integration of the solution, the more performant it is likely to be. Hyper-V has great basic I/O performance; however, I've seen more integration of VMWare with storage solutions.<br />
<br />
Given we touched on storage, we need to cover the importance of networking. It would be worth asking what virtualization networking features your NIC supports. Believe it or not, Intel and Broadcom have both adopted certain features like <a href="http://en.wikipedia.org/wiki/VMQ">VMQ</a> (aka NetQueue), <a href="http://en.wikipedia.org/wiki/TCP/IP_offload_engine">TCP offload</a> (checksum and large send), <a href="http://en.wikipedia.org/wiki/Jumbo_frames">Jumbo Frames</a>, and <a href="http://en.wikipedia.org/wiki/Rdma">RDMA</a>. Hyper-V has traditionally been ahead here, but there has been some leapfrogging.<br />
<br />
Another important performance dimension is power, so the question is: what power management features does your virtualization solution support? This is important because lots of power use means lots of heat, and lots of heat means lots of cost for cooling. Both Hyper-V and ESX have power management features; however, at the time of this article VMWare is a bit ahead on this front. Overall, virtualization with either solution will save power because of hardware-to-virtual-machine consolidation. What I am talking about is, once you have virtualized, which solution will use less power per VM operation. They are both very competitive here.<br />
<br />
Last but not least, what virtualization features do Hyper-V and VMWare support? For Hyper-V see <a href="http://www.microsoft.com/windowsserver2008/en/us/hyperv-features.aspx">here </a>and for ESX see <a href="http://www.vmware.com/files/pdf/key_features_vsphere.pdf">here</a>.<br />
<br />
So to recap the questions:<br />
<br />
#1 - What machine should I buy and in particular what CPU does it have? Intel <a href="http://www.intel.com/technology/virtualization/technology.htm?wapkw=(virtualization)">here</a> and AMD <a href="http://sites.amd.com/us/business/it-solutions/virtualization/Pages/amd-v.aspx">here</a><br />
#2 - What do the workloads look like that you are planning to run?<br />
#3 - What is your storage environment? Check out the <a href="http://www.netapp.com/us/technology/storage.html">NetApp</a> and <a href="http://www.emc.com/utilities/search.esp">EMC </a>sites.<br />
#4 - What virtualization networking features does your NIC support?<br />
#5 - What power management features does your virtualization solution support?<br />
<br />
There are many, many more questions you could ask; however, the real purpose of this article was to help you understand that asking "Which is faster, Windows Hyper-V or VMWare ESX?" is not such an easy question to answer, and to arm you with some questions you should ask.<br />
<br />
In the end my recommendation is to try before you buy. You can ask all the questions you want; however, in the end you need to make a decision. My suggestion is to borrow an environment if you can and try your workloads on it. Whichever is better for you, go for it (PS... don't forget the cost).<br />
<br />
Tony VoellmAnonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com3tag:blogger.com,1999:blog-2720487640092913306.post-59288757927081145572011-02-13T22:07:00.000-08:002011-02-13T22:20:27.853-08:00Welcome!Welcome to my new blog, where I'll be exploring a range of topics from the performance to the security of real systems, along with occasional topics like software testing and computer science in general. My previous blog http://blogs.msdn.com/tvoellm was dedicated to Microsoft technologies; this blog will have wider coverage. If there are questions you would like me to answer, just send a message and I'll get to them as time allows.Anonymoushttp://www.blogger.com/profile/08059418838373215504noreply@blogger.com2