Accepting Failure as the Baseline Reality

I remember when I first encountered the Google File System. I was working at Microsoft and attending one of the internal technology trainings. I was in the Macintosh Business Unit at the time, but I was interested in everything! Back then the new hotness was a file system for Windows NT based on the concept of a database. It was going to change everything!

There was a small room where a Microsoft engineer and researcher was explaining something Google had recently released: the Google File System (GFS). It blew my mind.

Just listen to this from their published research paper:

GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

The Google File System: Introduction, page 1

Up until that point, reliability in hard drives and the file systems that sat on top of them was, well, not that great. I had personal experience with the “click of death” that so often indicated that your storage system had just failed and, well, you had probably lost everything.

What was so fascinating to me then, and now, was that the novel solution wasn’t to focus on increasing the reliability of these underlying systems. The solution was to accept that these systems were in fact always going to fail. Accepting the fact that failure was going to happen, was expected, and was never going to change allowed the engineers to build a system that was reliable on top of that!

This is a profound change in viewpoint.
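To make the idea concrete, here is a minimal sketch in Python of what "reliable on top of unreliable" can look like. It is not how GFS itself is implemented. The class names `UnreliableReplica` and `ReplicatedStore`, and the `alive` flag used to simulate a dead machine, are made up purely for illustration. The point is only that the store assumes any single replica may be down and keeps working as long as one copy survives.

```python
class UnreliableReplica:
    """Hypothetical storage node that can be down at any moment."""

    def __init__(self, name):
        self.name = name
        self.alive = True   # flip to False to simulate a failed machine
        self.blocks = {}

    def write(self, key, data):
        if not self.alive:
            raise OSError(f"{self.name} is down")
        self.blocks[key] = data

    def read(self, key):
        if not self.alive:
            raise OSError(f"{self.name} is down")
        return self.blocks[key]  # KeyError if this replica missed the write


class ReplicatedStore:
    """Treats replica failure as normal: write everywhere, read from anyone."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, data):
        stored = 0
        for replica in self.replicas:
            try:
                replica.write(key, data)
                stored += 1
            except OSError:
                continue  # expected; move on to the next replica
        if stored == 0:
            raise OSError("no replica accepted the write")
        return stored  # number of copies actually written

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except (OSError, KeyError):
                continue  # replica is down or missed the write; try another
        raise OSError("no replica could serve the read")


if __name__ == "__main__":
    replicas = [UnreliableReplica(f"replica-{i}") for i in range(3)]
    store = ReplicatedStore(replicas)
    store.write("photo", b"family picture")
    replicas[0].alive = False  # two of the three machines die
    replicas[1].alive = False
    print(store.read("photo"))  # still served by the surviving copy
```

Nothing in this sketch makes an individual replica more dependable. The reliability comes entirely from the layer above, which expects failures and routes around them.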

I remember when I was a teenager. My father set up a small family business and let each of us take turns running the company to give us experience. The company was a simple manufacturing business. We purchased parts, assembled them and sold them for a profit. Part of the assembly required that 2 screw holes be drilled in exactly 2 locations to attach the next part. I remember getting so frustrated with my sister, whose job on the assembly line was to drill these holes. She was sloppy. Didn’t get them vertically aligned. Missed the mark. Sometimes the holes were so far off that the whole piece couldn’t be used. Slower output. Less profit. “Why couldn’t she just be conscientious and careful?!” I complained to my Dad. His response: “Son, fix the process, not the people.” So we got a drill press, built a jig, and not only did the failure rate basically go to zero, the output was faster too!

This taught me a deep lesson: most people want to do well, and most of the time, the reason they are not performing is process related. Something in the context of their job is likely amiss and needs improving. Sometimes the fix is additive, like a jig, but other times it’s removing stuff. Either way, when you accept failure as the baseline reality, you are free to fix the process, which in turn will help real people, fumbling, bumbling and stumbling, to succeed.
