Wednesday, 17 October 2012

Distributed testing simulation

Developing the distributed framework

In the midst of the chaos that is the last week before a release, there's not a right lot for a build and release manager to do! You'd be forgiven for thinking otherwise, but I'd fear for my life if I interrupted the testers during this busy phase, so I find myself more or less locked out of my environments until after we've gone live and the dust has settled.

The Skills Matrix is a no-go, since the team are busy bug hunting.
The FxCop implementation is likewise postponed.

Time then, to return to the distributed testing project.

Today, I finished the framework and rejoiced at the sight of my temp folder flooding with test results. So far this framework has only been run against my own PC, but I effectively treated it as a remote host, using a CredSSP-authenticated session. I'm very confident that few or no modifications will be needed to bring in the rest of the hosts.

Change of host

I'd spoken to our developers a lot about this project (a few read this blog), and there's not much love for the idea of sharing their cores. Whilst some have argued against sharing their CPU on the grounds that it would interrupt their own local builds, I suspect ever-so-slightly that it's got more to do with an unscheduled interruption during a lunchtime session of StarCraft or WoW at a crucial moment.

Anyhow, all I want to do is keep the devs happy, so I scrapped the idea of using their PCs and instead turned my focus towards our two build agents.

As I'd previously mentioned, our two build agents are liquid cooled monsters, but they do sit idle when there's no build going on. Even then, it's only really the unit-tests that can be orchestrated to near-saturate the CPU. The build and packaging stages seem to peak at about 70% CPU utilisation, then some other bottleneck manifests. So, more often than not, the build agents have plenty of spare capacity.

So, my new plan, which is the same as the old plan, is to develop a distributed testing framework. The only difference is, I'll be load balancing over the build agents instead of the developers' own workstations.

The structure

Today, I effectively created four scripts:
  1. The client
  2. The invoker
  3. The receiver
  4. The command
Hmm, sounds familiar, doesn't it? But don't get too excited or hung up about it: it's similar to the command pattern, but it is not the command pattern, nor am I pretending it is.

DistributedTests.ps1
The Client? perhaps

Discovers all the test containers from our solutions. At present, it finds over 50, which is a nice number to work with. 

It then performs a random allocation of test containers to available hosts. I chose a random allocation to ensure that some of the heavier and lengthier tests don't go to the same host over and over. 

Once the tests have been allocated to hosts, they're packaged up into n PSJobs, where n is the number of hosts I have at my disposal. This is effectively creating packaged commands that will be given to another process to invoke. One package per remote host.

Each PSJob takes the package it's given, and executes the RemoteAgentController.ps1 script.
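The scripts themselves are PowerShell, but the allocation step can be sketched in Python. Everything here is illustrative (container names, host names, the round-robin deal), not the real DistributedTests.ps1:

```python
import random

def allocate(containers, hosts, seed=None):
    """Randomly deal test containers across hosts, one package per host.

    Illustrative sketch only. Shuffling first means the heavier,
    lengthier containers don't land on the same host run after run.
    """
    rng = random.Random(seed)
    shuffled = list(containers)
    rng.shuffle(shuffled)
    # Round-robin deal of the shuffled list, so package sizes differ
    # by at most one container.
    packages = {host: [] for host in hosts}
    for i, container in enumerate(shuffled):
        packages[hosts[i % len(hosts)]].append(container)
    return packages

packages = allocate([f"Tests{i:02}.dll" for i in range(50)],
                    ["AGENT1", "AGENT2"], seed=1)
```

The shuffle is the whole point: a deterministic split would pin the slowest containers to the same host every build.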

RemoteAgentController.ps1
The Invoker? possibly

This script is designed to run on the local host. It receives the test package from DistributedTests, which describes the work to be done.

This script is designed to be the point of contact for the remote tests once they're running, so there is one instance of this script per host; each instance runs as a PS background job.

The remote host session is created and then given RemoteAgentReceiver.ps1 to process.
The test package is also passed up to the RemoteAgentReceiver session. 

RemoteAgentReceiver.ps1
The Receiver? whatever gave you that impression?

This script runs on the remote test agent, within a CredSSP authenticated session.

Its purpose is to finally do some work. It churns through all the tests in the test package defined by DistributedTests.

Each test container will be executed locally, and the results packaged up and returned to the RemoteAgentController.

Using the same tried-and-tested technique as the current build process, it breaks out as many PSJobs as it can, and uses each one to process a single test container, collecting the results as it goes.
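That fan-out can be sketched like so, with a thread pool standing in for the PSJob mechanics; run_container is a placeholder, not the real per-container work:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_container(container):
    # Placeholder for the real per-container work (MSTest/NCover).
    return {"container": container, "outcome": "Passed"}

def run_package(package, max_jobs=None):
    """Process each container in its own job, collecting results.

    Sketch of RemoteAgentReceiver.ps1's fan-out; a pool capped at the
    core count plays the part of "as many PSJobs as it can".
    """
    max_jobs = max_jobs or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(run_container, package))

results = run_package(["TestsA.dll", "TestsB.dll", "TestsC.dll"])
```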

TestRunner.ps1
The Command? you may very well think that

This is the script performed by each PSJob created by RemoteAgentReceiver.
It just encapsulates the invocation of MSTest and NCover, reads the output files, and packages up the results.

Yes, finally, someone is doing some work! 

The results are passed back up the call chain.
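TestRunner's job, invoking an external runner and packaging what it produced, can be sketched like this. The command line below is a trivial stand-in so the sketch is self-contained; the real script shells out to MSTest and NCover and parses their output files:

```python
import subprocess
import sys

def run_tests(container):
    """Invoke a test runner for one container and package the result.

    Hypothetical sketch: substitute the real MSTest/NCover invocation
    and output-file parsing for the stand-in command below.
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"print('ran {container}')"],
        capture_output=True, text=True)
    return {
        "container": container,
        "exit_code": proc.returncode,
        "output": proc.stdout.strip(),
    }

result = run_tests("Tests01.dll")
```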

Results handling

The results object is created at the business end of the chain, in TestRunner. After this, it's a matter of packaging and aggregation before the final results are echoed out to the host by DistributedTests.

The RemoteAgentReceiver packages up all its test results to return to the RemoteAgentController.

The RemoteAgentController aggregates the result sets from all hosts, then returns them to the DistributedTests caller.

DistributedTests then echoes the results to the screen, looks for any warning or failure counts in the results, and affects the build accordingly.
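The roll-up can be sketched as follows; the outcome names and the pass/partial/fail rule are my own illustration, not lifted from the actual scripts:

```python
def aggregate(host_results):
    """Merge per-host result sets and derive a build status.

    Illustrative rule: any failure fails the build; warnings alone
    mark it partial; otherwise it passes.
    """
    merged = [r for results in host_results.values() for r in results]
    failures = sum(1 for r in merged if r["outcome"] == "Failed")
    warnings = sum(1 for r in merged if r["outcome"] == "Warning")
    if failures:
        status = "Failed"
    elif warnings:
        status = "Partial"
    else:
        status = "Passed"
    return merged, status

merged, status = aggregate({
    "AGENT1": [{"outcome": "Passed"}, {"outcome": "Warning"}],
    "AGENT2": [{"outcome": "Passed"}],
})
```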

In short, it does this... 

Topology

Acceptance and Load Testing

In real-world practice this is likely to reduce the test running times by half, which isn't a bad saving, but we're only talking a 2-minute gain. That alone isn't really worth the effort I've gone to so far, but there is a much bigger pay-off just around the corner.

Acceptance tests and load tests are traditionally performed in a distributed manner where possible. All too often, they're run from a single host due to the difficulties in orchestrating a distributed test. This framework can be used to conduct distributed acceptance and load tests. In this light, the potential time savings are simply enormous.

Developer workstations can be used overnight to perform browser-based acceptance tests from a nightly build. One machine using Chrome, another Firefox, another IE, and so on. The same applies to load testing.

A speed boost for local builds

Local builds still take around 10 minutes to perform, and the unit test phase takes about 4-5 minutes. If the unit tests can be farmed out to an idle build agent or two, we can expect a reasonable reduction in build times of 2-3 minutes.

The Scripts

DistributedTests.ps1

RemoteTestInvoker.ps1

RemoteTestReceiver.ps1

TestRunner.ps1


Helpers/QueueHelper.ps1

Helpers/TextHelper.ps1


Ah yes, the reason you're here and I've delayed you long enough!

Simulation results

The results were a little surprising; I wasn't expecting diminishing returns to bite at such a low level of concurrency, but six concurrent agents is where we achieved the shortest possible time. Beyond that, the runs started to take longer and longer again.

Having looked at my specific test data, I should have expected this result. In the real-world situation, we'll have over 110 test containers, each of varying complexity, intensity and duration. I'll perform the same method again and compare the real-world results against my simulation. If there is a similar pattern of diminishing returns, then at six concurrent agents, we can expect to gain 50%. Perhaps :)
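The shape of that curve is easy to reproduce with a toy model: give each agent a fixed session set-up cost, deal randomly-sized containers round-robin, and wall-clock time falls and then rises again as agents are added. Every number below is made up for illustration; it is not my actual test data.

```python
import random

def simulate(durations, agents, setup=30.0, seed=7):
    """Estimated wall-clock time for a random allocation over n agents.

    Toy model: the controller opens one remote session per agent
    serially (paying `setup` seconds each), then the run finishes when
    the slowest agent's bucket of tests does.
    """
    rng = random.Random(seed)
    shuffled = sorted(durations, key=lambda _: rng.random())
    buckets = [0.0] * agents
    for i, duration in enumerate(shuffled):   # random round-robin deal
        buckets[i % agents] += duration
    return setup * agents + max(buckets)

rng = random.Random(1)
durations = [rng.uniform(5, 40) for _ in range(50)]   # fake containers
times = {n: simulate(durations, n) for n in range(1, 13)}
best = min(times, key=times.get)
```

With these made-up numbers the sweet spot sits somewhere between the extremes: one agent pays the full serial test time, while twelve agents pay twelve lots of session set-up for buckets that can't shrink much further.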

An overriding goal of mine is to push the total build times under 10 minutes. If I can leverage this distribution model in other aspects of the build pipeline, static analysis being a prime candidate, I might just achieve it!