Monday, 16 September 2013

DevOps | Supporting your platform

I am the build and release manager for an online gaming platform, and much of my time is spent supporting and improving the development and testing infrastructure: reducing friction, removing obstacles, streamlining processes and generally improving our teams' productivity.

Obvious activities include adapting and optimizing build and deployment processes, and provisioning new environments. Less visible is the maintenance and housekeeping required to keep things ticking along on a day-to-day basis.

Examples of this are:
  • IIS metabase corruption
  • Bespoke deployments
  • Corrupt message queues
  • Stopped Windows services
  • Invalid HTTP routing
  • Firewall exceptions
  • Crashed application pools
  • AppFabric corruption
  • Missing message queues
The list is long, and without automation these issues would seriously impede the development and I.T. teams.

In an ideal world, we could engineer out these repetitive issues, but diverting resources towards C.I. is often as hard as getting water to flow uphill. The ancient Greeks of course solved this problem, but it did involve a giant screw, and I seem to have mislaid mine. With resource allocation beyond my control, the next best thing I can do to ease the pressure is put things right as quickly as possible.

PowerShell 'maintenance' module

Requests for support are constant and unrelenting, and some days they are more numerous than others. For common problems I have automated fixes, and I've bundled them into a single package known to me as the "Maintenance" module.

This is a standard PowerShell module that is "Platform Aware". The module is able to resolve the run-time configuration of any given instance of the platform, and therefore knows where to find specific websites, endpoints, services and hosts. This provides the foundation for the automated fixes that the module provides:
  • Add and remove test users
  • Flush cache hosts
    • Traversing all cache nodes in a cluster
  • Repair MSMQ queues
    • Journalling messages
    • Clearing down queues
  • Reset firewall rules
  • Reset firewall routing
  • Reset application request routes
  • Restart Windows services
  • Enable/disable diagnostic operations
  • Enable/disable the scheduled tasks 
    • Noting which were already disabled
  • Gracefully re-start the platform
    • Specific services, in a specific order
  • Warm-up the platform by visiting specific sites and end-points
    • Crawling site maps
    • Crawling message queue handlers
  • Clear-out logs

The general theme throughout is the maintenance and repair of any given instance.
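
As an illustration of one of the fixes listed above, here is a minimal sketch of how a warm-up step might look. The URLs are hypothetical placeholders; the real module resolves its endpoints from the instance configuration, and this is not its actual implementation.

```powershell
# Hypothetical endpoints; the real module looks these up per instance.
$warmUpUrls = @(
    'http://dev01.example.local/',
    'http://dev01.example.local/api/health'
)

foreach ($url in $warmUpUrls) {
    try {
        # A plain GET is enough to make IIS spin up the application pool.
        $response = Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 120
        Write-Host " Warmed $url ($($response.StatusCode))" -ForegroundColor Cyan
    }
    catch {
        # Keep going: one cold or broken endpoint should not stop the rest.
        Write-Warning "Warm-up failed for ${url}: $($_.Exception.Message)"
    }
}
```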

Using the module

The module is available on all the development computers.

Import-Module DevMaintenance

With the module now connected to a given instance, complex multi-step operations can be initiated with simple commands.

Reset-Cache

Identifies all the AppFabric cache hosts in a given instance and visits each one in turn, clearing out the existing caches and removing the host from the cluster. Once the cluster is effectively torn down, each node is rebuilt and re-added to the cluster.

e.g.

Import-Module DistributedCacheAdministration
Import-Module DistributedCacheConfiguration

# On the cache cluster host
# -------------------------

Use-CacheCluster -Provider "XML" -ConnectionString "c:\AppFabric\AppFabricCacheShare"

# Remove every named cache before stopping the cluster
Get-Cache | ForEach-Object {
    Write-Host " Removing $($_.CacheName)" -ForegroundColor Cyan
    Remove-Cache $_.CacheName
}

Stop-CacheCluster


# Per host found
# --------------
# ($cacheClusterHost is the cache host entry currently being visited)

Remove-CacheHost
Remove-CacheAdmin

Unregister-CacheHost -Provider "XML" -ConnectionString "\\$($cacheClusterHost.host)\AppFabricCache" # -EA SilentlyContinue

Remove-CacheCluster -Provider "XML" -ConnectionString "\\$($cacheClusterHost.host)\AppFabricCache" -Force


# Back on the cache cluster host
# ------------------------------

Remove-CacheCluster -Provider "XML" -ConnectionString "\\$($cacheClusterHost.host)\AppFabricCache" -Force


Reset-RequestRouting & Reset-TCPForwarding

These commands are invoked frequently, as developers (legitimately) alter the application request routes on their environments. However, the next developer to approach the instance gets very confused when messages go missing. The quickest and simplest fix is to invoke these commands, and everything is reset back to exactly how it should be.

netsh advfirewall firewall delete rule name="Platform infrastructure - $($service.name)"  | Out-Null

netsh advfirewall firewall add rule name="Platform infrastructure - $($service.name)" dir=in protocol=$($service.protocol) localport=$($servicePorts) action=allow profile=domain  | Out-Null
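
The two commands above reference `$service` and `$servicePorts` without showing where they come from. A minimal sketch of the surrounding loop might look like this; the service definitions are hypothetical stand-ins for whatever the module resolves from the instance configuration, and the commands need an elevated session.

```powershell
# Hypothetical service definitions; the real module resolves these per instance.
$services = @(
    [pscustomobject]@{ Name = 'Messaging'; Protocol = 'TCP'; Ports = 8080, 8081 },
    [pscustomobject]@{ Name = 'Discovery'; Protocol = 'UDP'; Ports = 3702 }
)

foreach ($service in $services) {
    $ruleName     = "Platform infrastructure - $($service.Name)"
    # netsh expects a comma-separated port list, e.g. 8080,8081
    $servicePorts = $service.Ports -join ','

    # Delete first so that repeated runs never accumulate duplicate rules.
    netsh advfirewall firewall delete rule name="$ruleName" | Out-Null

    netsh advfirewall firewall add rule name="$ruleName" dir=in protocol=$($service.Protocol) localport=$servicePorts action=allow profile=domain | Out-Null
}
```

Deleting before adding is what makes the reset safe to run repeatedly: however the rules were altered, the end state is the same.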

Reset-Services

The platform depends upon numerous Windows services, but from time to time they fail to respond in a timely fashion and Windows terminates them. Any given instance of the platform can span multiple hosts, and visiting each in turn and resetting the services is a time-consuming process.

This command can complete that onerous task in under a minute, visiting every service host in a parallel operation and performing the necessary actions.

e.g.
Get-Service | Where-Object {$_.name -match "Suffix.*"} | Restart-Service 
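
The one-liner above acts on a single host. One way to fan it out across every service host in parallel is PowerShell remoting, sketched below. The host names are hypothetical, remoting is assumed to be enabled on each host, and the real module may achieve the parallelism differently.

```powershell
# Hypothetical host names; the real module resolves these per instance.
$serviceHosts = 'devhost01', 'devhost02', 'devhost03'

# Invoke-Command runs the script block on the listed computers concurrently
# (up to -ThrottleLimit at a time) rather than visiting each host in turn.
Invoke-Command -ComputerName $serviceHosts -ThrottleLimit 16 -ScriptBlock {
    Get-Service |
        Where-Object { $_.Name -match 'Suffix.*' } |
        Restart-Service -PassThru |
        Select-Object Name, Status
} |
    Sort-Object PSComputerName, Name |
    Format-Table PSComputerName, Name, Status -AutoSize
```

Remoting tags each returned object with `PSComputerName`, so the final table shows which services were restarted on which host.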

Restart-Platform

The platform has a very specific start-up sequence which, if not adhered to, can be detrimental to its overall function. After system patching and reboots, the incumbent platform must be restarted before development can resume.
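
A minimal sketch of an ordered restart on a single host follows. The service names are hypothetical, and stopping in the reverse of the start order is my assumption for the illustration rather than a description of the real command.

```powershell
# Hypothetical service names, listed in the order they must start.
$startOrder = 'Platform.Messaging', 'Platform.Cache', 'Platform.Gateway'

# Assumption: stop in reverse order, so dependants go down before
# the services they rely on.
$stopOrder = $startOrder.Clone()
[array]::Reverse($stopOrder)

foreach ($name in $stopOrder) {
    Stop-Service -Name $name -Force -ErrorAction Stop
}

foreach ($name in $startOrder) {
    Start-Service -Name $name -ErrorAction Stop
    # Do not move on until the service reports Running (give up after 60 seconds).
    (Get-Service -Name $name).WaitForStatus('Running', [TimeSpan]::FromSeconds(60))
}
```

Using `-ErrorAction Stop` aborts the sequence at the first failure, which is preferable to starting later services on top of one that never came up.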

Pipelining

These actions can be invoked individually, or they can be pipelined into a series of actions that flow from one to the next.

Furthermore, they can be applied to one instance of the platform, or as many as I need.

e.g.
Get-Platform | ? {$_.Types -eq "Development"} | Attach-Platform | Reset-Cache | Reset-TCP | Reset-Services | Publish-ETL | Reset-Hosts | Restart-Platform

This is a very handy facility to have, especially when an unnoticed bug has made it back into the main trunk and out on to every development instance.

An entire legion of environments and hosts can be repaired invisibly and more importantly, without fuss.

The PowerShell pipeline makes it easy to chain together as many actions as I need, and I can add more and more as the platform evolves.
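
For anyone wanting to build commands that chain this way, the pattern is roughly as follows: each function accepts an instance from the pipeline, does its work, and emits the same instance for the next command. The function names and the `Steps` property are hypothetical and exist only to make the flow visible; they are not part of the real module.

```powershell
function Reset-ExampleCache {
    [CmdletBinding()]
    param(
        # Each platform instance arrives here, one at a time, from the pipeline.
        [Parameter(Mandatory, ValueFromPipeline)]
        [psobject] $Platform
    )
    process {
        # ... the real repair work would go here ...
        $Platform.Steps += 'Reset-ExampleCache'
        # Emit the instance again so the next command in the chain receives it.
        $Platform
    }
}

function Restart-ExamplePlatform {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [psobject] $Platform
    )
    process {
        $Platform.Steps += 'Restart-ExamplePlatform'
        $Platform
    }
}

# Two stand-in instances; Steps records which actions have touched each one.
$instances = @(
    [pscustomobject]@{ Name = 'dev-01'; Steps = @() },
    [pscustomobject]@{ Name = 'dev-02'; Steps = @() }
)

$result = $instances | Reset-ExampleCache | Restart-ExamplePlatform
$result | ForEach-Object { "$($_.Name): $($_.Steps -join ' -> ')" }
```

Because every function passes its input straight through, adding a new action to the chain is just a matter of writing one more function with the same shape.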

I need this module: I have over 30 deployed instances of our platform to support, spanning more than 50 hardware nodes. And at the rate we are expanding, these numbers are likely to grow.