These links contain white papers I've produced discussing my experience with designing, installing, testing, and maintaining highly-available systems.
A Few Thoughts on Uptime: I pull together experience, insights into human brain functioning, and the normal accidents model.
A Few Thoughts on Uptime References: The books I read which informed A Few Thoughts on Uptime.
How Complex Systems Fail by Richard Cook -- the tightest summary of this space I've encountered.
One Page Summary A Few Thoughts on Uptime: An effort to squish the entire paper onto one page.
Tools for Managing Outages: Tips and techniques for handling planned downtime.
DaPlan-Hobbes: Sample plan for coordinating an outage.
Normal Accidents: Early attempt at cataloguing Normal Accidents.
I'm particularly fond of reducing cognitive distraction and clutter as a technique for contributing to clear thinking, which supports all sorts of aims, including architectural design for uptime as well as reducing mean time to repair. This photo essay illustrates these principles using the physical layer: in these examples, cabling in IDFs, MDFs, and Data Centers. As an aside, this approach to managing Ethernet cabling produces IDFs with a low operational cost for moves/adds/changes (MAC) ... and, intriguingly, a constant cost for MAC as density and churn increase: these IDFs were 10-15 years old when these photos were taken and yet show none of the chaos which typical IT equipment rooms display after mere months of use, despite the high rate of MAC they have experienced (reconfiguration of work-spaces every ~3-5 years).
For the phy layer geeks amongst us: the Cat6 termination gear portrayed here comes from the ADC Krone Ultim8 line, while the Net IG folks designed and built the custom cabling (switch tails). The success of this Wall Punch strategy stems from several factors, including these two: (a) removing the plastic sheathing from the Cat6 cables shrinks the space they consume by a factor of ~5 and substantially reduces between-cable friction, and (b) every cable is precisely the right length, since the installer custom-cuts it during installation.
The cable installation standards used to deliver the above equipment room designs:
A couple of papers I've written on the topic of designing and managing cabling:
Long Road to High-Availability: The psychological, operational, and process-oriented potholes I have encountered as we have gradually increased uptime.
Testing Highly-Available Hardware: Tips I've developed for validating that highly-available systems can survive the loss of a component.
Testing the Transport Side of Highly-Available Hosts: Tactics for validating that multi-NIC hosts are configured to take advantage of redundant Ethernet switches, published in the June 2011 issue of ;login:.
Last modified: 16-January-2016