Positive outcomes from server failures
Well, the September quarter is done and dusted, and if it weren’t for a few busy weeks of fixing servers it would have been pretty quiet. The past few days, however, have seen numerous PC failures - it is as if the servers work fine while the workstations have issues, and the workstations work fine when the servers are down. If only the users would work fine 10% of the time.
In the past six months I have gone through two serious (one of them majorly serious) failures relating to the same server. I’ve noticed there are two good things to come out of a serious server crash/outage. When I say serious, I mean on the level where it impacts the majority of the organization for longer than 24 hours, but only time is lost; all company data is safe.
The first positive outcome is that those who are responsible for maintaining systems and bringing them back online get invaluable experience. You can’t train for it or study for it, because in a lab environment you don’t have the bosses and staff breathing down your neck. For systems where there is little or no redundancy (probably due to budgetary reasons or the management-imposed requirement to KISS), the pressure starts to build.
Trying to get the system running again, diagnosing the faults for vendor tech support staff, and keeping management up to date, all while answering the calls and the same repetitive questions from the user base, can make for a stressful few days. Especially when the exact same server crashes 80 days after it was rebuilt from the first crash.
As the pressure and weariness from long days build, so does the likelihood of rushing through processes and making mistakes, which in turn increases the time it takes to get the system back online.
The experience gained in these scenarios is invaluable; you can train and study for recoveries, but you can’t get the experience without being put through the wringer.
The second positive outcome is that the majority of business managers will reassess their IT systems and be more willing to act on suggested improvements or increase the budget. Because the managers become more aware of the impact of downtime on the business, they will likely see the recent proposal from IT to implement clusters/load balancing/replace hardware/etc. as worthwhile, rather than as the IT team asking for more toys.
These recent crashes have cost the client around $8000 in IT labour, which is not huge, but it has cost them many times more in lost productivity. In life, hindsight is 20/20, but in business, foresight is generally listening to the IT department’s recommendations.