Just-in-time compilation

My attempts at creating a distributed grid of build agents shortly before Christmas faltered when I realised that it wasn't going to make any impact on our overall build times. Sure, it was cool, but there simply wasn't enough work that could be performed concurrently to take full advantage of it.

This was a bit deflating, so I put the concept to bed for a few months, with the idea of coming back to it at a later stage.

A fresh start...

The main reason anyone bothers with parallel processing is to get things done faster, and that was the thinking behind my distributed builds: increase throughput by using grid computing.

Sadly the project stumbled when certain compilers didn't work too well over a network path, but setting those issues aside, the real problem was plainly the lack of opportunity for concurrent execution in our build plan - there simply wasn't enough work to be done at any given time.

As I watched the configurations of each solution compiling, following our hand-crafted build plan, I started to see a pattern I hadn't noticed previously: much of our build plan was actually sequential. Attempting to run this mostly linear build plan over a grid was simply pointless.

The real challenge was making better use of the resources we already had, namely the 12 over-clocked cores on our i7 build agents. I roughly estimated (from paper notes) that for over half of the entire build duration, only a single core was being utilised. The causes of these flat-spots were:
  1. Waiting for NCover to report on the test runs
  2. Waiting for MSBuild to pre-compile the views on MVC sites
  3. Waiting for core components in the critical path to build
  4. Caterpillar effect of performing work in blocks
Maintaining the hand-crafted build execution plan had also become onerous, and the whole thing could best be described as brittle. It seemed that whilst it worked, it was more often by luck than by skill.

Automatic discovery

With some coercion from a colleague (I'm all about procrastination), I decided to tackle the idea of automating the dependency discovery, in the hope it might provide greater opportunity for concurrency and lead to fewer flat-spots. The other bonus was that it would eliminate the need to manually adjust our build plan ever again.

Using PowerShell to parse the Visual Studio .sln and .csproj files of the entire platform, I was able to produce an exact, granular build execution plan of 436 succinct activities - a big improvement over the 96 steps we'd programmed in by hand.
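
The heart of the discovery is simple enough to sketch. Something like the following, assuming classic .csproj files with ProjectReference elements (the paths and variable names here are illustrative, not those of the real script), walks the platform and records each project's dependencies:

    # Illustrative sketch of dependency discovery; not the production script
    $platformRoot = 'C:\Source\Platform'
    $dependencies = @{}

    foreach ($projFile in Get-ChildItem -Path $platformRoot -Recurse -Filter *.csproj) {
        [xml]$proj = Get-Content -Path $projFile.FullName
        # Every <ProjectReference> element is an edge in the dependency graph
        $refs = @($proj.Project.ItemGroup.ProjectReference |
            Where-Object { $_ } |
            ForEach-Object {
                # Resolve the relative reference against this project's folder
                (Resolve-Path -Path (Join-Path $projFile.DirectoryName $_.Include)).Path
            })
        $dependencies[$projFile.FullName] = $refs
    }

With the graph in hand, producing a build order is just a topological sort over those edges.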

In addition to automating the discovery of the build order, the new scripts could also automate the generation of artefacts within solutions, such as databases, websites, console applications and Windows services. In short, the configuration object now only points to the location of each solution in the platform, and everything else is entirely automated - automagically :)

The discovery process takes about 30 seconds, but with a little caching - the generated build order is persisted to an XML file - we can skip this step entirely after the first build.
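
The caching itself is trivial. A minimal sketch, using PowerShell's built-in Clixml serialisation rather than our actual XML format:

    # Reuse the persisted plan when present; otherwise discover and persist it
    $planPath = Join-Path $platformRoot 'BuildPlan.xml'
    if (Test-Path -Path $planPath) {
        $dependencies = Import-Clixml -Path $planPath   # skips the 30-second discovery
    } else {
        # ...run the discovery shown earlier, then save the result
        $dependencies | Export-Clixml -Path $planPath
    }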

Single-purpose actions

A significant contributor to the flat-spots was the "logical grouping of tasks" in the original build scripts. Work was grouped by activity:
  1. Clean all solutions
  2. Build the platform
  3. Run the unit tests
  4. Run static analysis
  5. Package
  6. Build databases
  7. Publish
Many flat-spots materialised whilst waiting for one phase to end, and during the interim pause whilst the next phase of jobs was queued up for processing. The best way I can describe this is as a caterpillar shuffle, like cars in a traffic jam.

Flat-spots were also caused by constrictions along the critical path, such as waiting for core framework components to build. These components were contained in a single solution that could take over 2 minutes to build, and with every other solution depending on it, there was a long wait before we could really kick off anything else.

The static analysis phase was noticeably the largest flat-spot: a single-threaded process that routinely took over 2 minutes to complete.

Just-in-time scheduling

The newly generated build execution plan gave us an opportunity to maximise the resources of the local hardware. Knowing precisely each dependency of every build configuration (within a solution) allowed jobs to be executed in a just-in-time fashion. Each of those 436 jobs remains dormant until we're ready for it to run.
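
To make the mechanics concrete, here's a heavily simplified scheduler sketch. The job names are made up, and the real implementation dispatches jobs to workers concurrently rather than running them inline:

    # A job becomes runnable the moment its last dependency completes
    $plan = @{
        'Core.Build' = @()
        'Web.Build'  = @('Core.Build')
        'Core.Test'  = @('Core.Build')
        'Web.Test'   = @('Web.Build')
    }
    $completed = New-Object System.Collections.Generic.HashSet[string]
    $pending   = New-Object System.Collections.Generic.List[string]
    $pending.AddRange([string[]]$plan.Keys)

    while ($pending.Count -gt 0) {
        # Find every job whose dependencies have all completed
        $runnable = @($pending | Where-Object {
            $job = $_
            @($plan[$job] | Where-Object { -not $completed.Contains($_) }).Count -eq 0
        })
        foreach ($job in $runnable) {
            Write-Host "Running $job"    # the real script hands this to a worker
            [void]$completed.Add($job)
            [void]$pending.Remove($job)
        }
    }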

All the actions that were previously phased are now performed fluidly, just-in-time.
Clean, build, test, analyse, package, publish and database actions have all been broken down into single-purpose jobs that run just as soon as they can.

The static analysis phase has also been broken down, so that rather than attempting to parse every solution in one go, analysis runs just-in-time, once every test container in a solution has completed. Analysing a single solution takes between 2 and 20 seconds, which is much more workable.

Breaking the entire build process down into single-purpose actions that are processed just-in-time has lowered the overall build time from 15 minutes to 6. And for the duration of those 6 minutes, the build agents' CPU utilisation rarely falls below 80%.

If I've not made it clear what's really changed, perhaps this might help.


Convention over configuration

To keep the process of discovery and location as simple as possible, I spent a great deal of time harmonising conventions across every solution. Thanks to this, the discovery process can rely on inference (as opposed to explicit configuration) to locate dependencies and other components.
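
As a flavour of what that inference looks like, a hypothetical convention might map a project's name straight to the type of artefact it produces (the actual conventions we settled on differ, but the principle is the same):

    # Hypothetical naming convention -> artefact type inference
    function Get-ArtefactType([string]$projectName) {
        switch -Wildcard ($projectName) {
            '*.Web*'      { 'Website';        break }
            '*.Database*' { 'Database';       break }
            '*.Service*'  { 'WindowsService'; break }
            '*.Tests'     { 'TestContainer';  break }
            default       { 'Library' }
        }
    }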

The final generated build plan is 3MB in total, largely because it contains a lot of enrichment, meaning no further calculation or discovery is needed when the time comes to run a job. We have all the information we need, and the CPUs are entirely dedicated to building the platform as quickly as possible.

Extensible

The build execution plan is really nothing more than objects with pointers. It's therefore possible to extend the computed build order with more actions.

For instance, I've been asked if we can include NuGet packaging, and the simple answer is yes. Given the information I now have to work with...
  • I could package every configuration that is built
  • Or... I could package at the end of every build in a particular domain
  • Or... I could package after a particular logical group has built (i.e. Core.*)
  • Or... even wait until every build has completed and package the lot in one go.
What's specifically interesting with regard to NuGet is that, because I have a complete dependency hierarchy, I will be able to dynamically add the dependency information to each NuGet package produced.
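
As a sketch of what that could look like - a hypothetical function, assuming one package per project, named after its project file - the graph can be projected straight into a .nuspec dependencies block:

    # Illustrative only: emit <dependency> elements from the computed graph
    function New-NuspecDependencies([string]$project, [hashtable]$plan, [string]$version) {
        $entries = foreach ($dep in $plan[$project]) {
            $id = [System.IO.Path]::GetFileNameWithoutExtension($dep)
            '  <dependency id="{0}" version="{1}" />' -f $id, $version
        }
        "<dependencies>`n" + ($entries -join "`n") + "`n</dependencies>"
    }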

The FxCop project has been waiting for some time, and the completion of the JIT mechanism has opened up new and more useful options, such as applying FxCop over a logical grouping rather than per project.

And there are other goals in my work pipeline that will slip seamlessly into this mechanism. 

Was it worth the effort?

As with everything I do, I like to evaluate the outcome; it's all part of the Continuous Improvement ethos. The headline figure of a 9-minute reduction in build times is fairly conclusive evidence. That the build is much better is beyond doubt, but was it actually worth the effort?

In total, I will have invested about 10 days of effort into this refactoring exercise. There are a lot of "9 minutes" to be had in 10 working days (4,800 minutes, to be precise), so why bother?

First, let's examine what can happen in 9 minutes.

Our platform is a high-volume, high-transaction, online gaming platform. Going offline, even for a moment or two, can impact revenue. Earning customer loyalty is expensive. Maintaining customer loyalty is also expensive. So, if our platform has to be taken offline for unscheduled maintenance, it's going to be expensive, and in this context a mere 9 minutes can feel more like a lifetime.

Productivity is my daily concern; enabling our developers to be more productive than they were yesterday is my job. Our 10 developers work constantly on improving the platform, and as a consequence will trigger a build of the platform many times a day. It's the frequency of these builds that determines the payback on the 10 days invested.
  1. Just once a day is a combined saving of 90 minutes, and we'd break even in 53 days.
  2. Twice a day, 26 days (so now we're under a month!)
  3. Five times a day, and hey-presto! 10 days. 
So optimistically, it would take only 10 days to recoup my efforts, if we were looking at this from a purely resourcing standpoint. The short sketch below reproduces the arithmetic.
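
For the sceptical, the figures above fall straight out of the assumptions already stated: 10 days invested at 8 hours a day, and each build saving 9 minutes for all 10 developers.

    # Break-even arithmetic, reproducing the figures in the list above
    $minutesInvested = 10 * 8 * 60    # 4,800 minutes
    $savingPerBuild  = 9 * 10         # 90 developer-minutes per build
    foreach ($buildsPerDay in 1, 2, 5) {
        $days = [math]::Floor($minutesInvested / ($savingPerBuild * $buildsPerDay))
        '{0} build(s) a day -> break even in {1} working days' -f $buildsPerDay, $days
    }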

Extend this argument over the long term and the savings really start to stack up. Over the course of a single year? Let's assume 252 working days...
  1. So, just once a day: 378 hours saved - over 15 whole days.
    That's easily two medium-sized features gained.
  2. Twice a day? Easy - over an entire developer-month is reclaimed.
    What's the cost of hiring a contractor?
  3. Five times a day? 1,890 hours - nearly 79 whole days! We've gained the equivalent of a full developer-year!
    That's a lot of extra features!
Amazing how quickly things can add up! 10 days, well spent.

I'm off to buy a car ;)



Comments

  1. That's impressive! One might argue that the additional complexity of the overall process hides some maintenance costs, but quicker builds are certainly worth it. I am curious as to whether you are using SSDs in your TFS/build solution.

    Replies
    1. Hi Paolo,

      The maintenance of this revised process should be absolutely minimal. There is of course a set of conventions that the developers need to observe, but aside from this the entire process of discovery and build is fully automated.

      The discovery process itself takes about 30 seconds, but to avoid that cost each time, I've cached the discovery output in an XML file that is re-used on every subsequent build run.

      We're actually using 10k/rpm disks in one server, and 15k/rpm in another. Oddly, the 10k/rpm build agent is always faster than the other!!!

      The problem with SSDs is their relatively smaller size, and build agent workspaces quickly consume all the available space. I have now written a few tidy-up scripts in PowerShell that run on a schedule. About once every 3 days, the script runs, removes the workspaces from the build agent, then deletes the files. This keeps the footprint relatively small, so we could now use SSDs if we wanted. The reason behind creating these workspace tidy scripts was simply to avoid the downtime once every 6 or so months when TFS had completely filled the disk with its workspaces! Also, the nightly defragment job completes much quicker when it doesn't have 1TB of files to process :)
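
      The gist of the tidy-up is roughly this (a sketch, not the scheduled script itself - it assumes tf.exe is on the PATH and that workspace names match their folder names, which won't hold everywhere):

          # Remove build workspaces untouched for 3+ days, then their files
          $buildRoot = 'D:\Builds'
          Get-ChildItem -Path $buildRoot -Directory |
              Where-Object { $_.LastWriteTime -lt (Get-Date).AddDays(-3) } |
              ForEach-Object {
                  & tf workspace /delete ("{0};{1}" -f $_.Name, $env:USERNAME) /noprompt
                  Remove-Item -Path $_.FullName -Recurse -Force
              }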

      I'm looking at adding a third build agent in the next month or so, and I believe we're going to use a 300GB SSD to give us the very last word in disk performance. This new agent is also going to be given a different processor architecture, with an emphasis on concurrency as opposed to raw speed. I'm hoping to secure funding for a 24-core system, so I'm thinking that an SSD will certainly minimise disk I/O bottlenecks under such an aggressive load.

      Many thanks,

      Matt

    2. I should also have mentioned that, by doubling the concurrency from 12 to 24 (one job per core), I'm hoping to achieve build times of around 4-5 minutes. A build time under 5 minutes would be out of this world. :)

