An incident affecting UK national security occurred on 13th June 2015, but remained unreported until the Telegraph covered it on 21st April 2016, followed by Channel Four News on 24th May 2016.
“Semaphore, the computer system that checks passengers on their way to the UK against watch lists of suspect individuals, had faltered after being flooded by tens of thousands of messages. The malfunction, believed to have originated on British Airways’ systems, had been spotted on Saturday 13 June last year but snowballed over the weekend. Between 7pm and 8pm on Sunday some 175,000 error messages swarmed the system as BA officials scrambled to contain the meltdown.”
“The backdrop could not have been more serious, with the country on heightened alert after jihadists had gunned down 12 people in Paris after storming the offices of satirical magazine Charlie Hebdo. At the time England football fans who had gone to Slovenia for their team’s 3-2 victory that day would likely have been on their way home.”
Given that business context, it is clear that both the UK Home Office and British Airways were far too cavalier in allowing a contractor to make changes to such a critical system without peer review and without any means of anticipating and fixing a possible failure. At one point 1.8 million messages were queued at the Home Office Semaphore system, which crashed completely, leaving all airlines flying to Britain cut off from Semaphore, and from the ‘no-fly list’, for 48 hours: an unacceptable business interruption.
It is well known, and widely reported in the news media (as in this case, after a year of secrecy), that software upgrades and changes can go wrong, and when they do, critical business services suffer an outage. The way to manage this risk is to have a regression plan.
The process begins with peer-reviewing and testing new code to make sure there are no identifiable bugs within it. It is quite common for old faults to re-emerge during the evolving life cycle of software. Sometimes this happens because a previous fix gets lost through poor revision control, either because a faulty process was applied or through human error in applying a good process. In other cases a previous fix may be ‘fragile’, in that it fixes a bug only in the local context where it was first observed, when in fact the bug exists in the global context of the software and will re-emerge later in the life cycle as other changes are implemented. Then there is the possibility that when a feature or function is redesigned and replaced with a new version, the same mistakes will be made in the redesign that were made in the original design, discarding any fixes that had been made to correct the problem.
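A ‘fragile’ fix of this kind can be sketched in a few lines. This is a hypothetical illustration, not code from any real system: the defect lives in a shared function, but the first team to encounter it patches only their own call site, so a later feature that uses the shared function directly hits the same bug again.

```python
def parse_amount(text):
    # Shared parser used across the (hypothetical) codebase.
    # The real defect lives here: it cannot handle thousands separators.
    return float(text)

def report_total(text):
    # The 'fragile' fix: the first team to hit the bug stripped the
    # commas at their own call site, not in the shared parser...
    return parse_amount(text.replace(",", ""))

def invoice_total(text):
    # ...so a later feature that calls the shared parser directly
    # re-triggers the very same defect.
    try:
        return parse_amount(text)
    except ValueError:
        return None  # the 'fixed' bug has re-emerged

print(report_total("1,250.00"))   # 1250.0 - the fix appears to work
print(invoice_total("1,250.00"))  # None   - same bug, new context
```

The robust fix would of course go into `parse_amount` itself, which is exactly what regression testing is designed to check over the long term.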
Regression testing is the process of going back over the entire history of bug fixes and retesting them to ensure that they still work and that the bugs have not re-emerged in the latest version. It requires that a complete history of previous releases, bugs, the fixes for them and the test procedures that expose them be maintained. There are tools on the market that can automate this regression-testing process.
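The idea of maintaining that history and re-running it against every release can be sketched as a simple regression-test register. This is a minimal, hypothetical illustration (the bug IDs, function and names are invented, not drawn from any real tool): each historical bug keeps its identifier and the test that originally exposed it, and every release is checked against the full set.

```python
def strip_trailing_whitespace(line):
    # The (hypothetical) function under test in this sketch.
    return line.rstrip()

# BUG-041: trailing tabs were not removed, only spaces.
def test_bug_041():
    assert strip_trailing_whitespace("value\t\t") == "value"

# BUG-107: a line consisting only of whitespace crashed the parser.
def test_bug_107():
    assert strip_trailing_whitespace("   ") == ""

# The register: bug ID -> the test procedure that exposes it.
REGRESSION_SUITE = {
    "BUG-041": test_bug_041,
    "BUG-107": test_bug_107,
}

def run_regression_suite():
    # Re-run every historical bug's test; any failure means an old
    # defect has re-emerged in the current version.
    failures = []
    for bug_id, test in REGRESSION_SUITE.items():
        try:
            test()
        except AssertionError:
            failures.append(bug_id)
    return failures

print(run_regression_suite())  # [] when no fix has regressed
```

In practice this register lives in a test framework and a bug tracker rather than a dictionary, but the principle is the same: the suite only ever grows, and a release ships only when it passes in full.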
Finally, there should be the capacity to regress to a previous release of the software that is known to work, albeit without the new or redesigned features, so that business as usual can be maintained while the problem is investigated.
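That roll-back capacity can be sketched as follows. This is a hedged, hypothetical example (the version labels, the broken release and the smoke test are all invented): the last known-good release is kept alongside the new one, and if the new release fails its post-deployment check, service reverts to the previous version rather than suffering an outage.

```python
# Two releases kept side by side; v1.5 is the new, faulty deployment.
RELEASES = {
    "v1.4": lambda msg: msg.upper(),      # known-good release
    "v1.5": lambda msg: msg.uppercase(),  # broken: no such str method
}

active = "v1.5"
previous = "v1.4"

def smoke_test(version):
    # Post-deployment check: does the release handle a known input?
    try:
        return RELEASES[version]("hello") == "HELLO"
    except Exception:
        return False

if not smoke_test(active):
    # Regress to the known-good release: business as usual continues
    # while the fault in the new release is investigated offline.
    active = previous

print(active)  # v1.4
```

Had the Semaphore change been deployed with an equivalent switch-back path, the 48-hour interruption could plausibly have been a matter of minutes.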
It seems that so many organisations still do not join the dots between business critical services and the ICT that underpins them. Technologists are in charge of the shop, with apparently no notion of what their actions can mean to the business. An IT department that did understand this principle would never allow a critical software release without a full-scale regression plan. It seems that much more SABSA thinking is needed to make the world a better, safer place.