The Great TSB Meltdown
Don't blame the Testers!
Whenever there is a significant IT systems failure, the finger of blame is often pointed at Testing and you will hear the inevitable cry of ‘why didn’t they test it?’ Testing and Testers seem to take the brunt of the blame even though people don’t have the full facts in front of them.
Well it was no different with the meltdown of TSB’s online banking system in April 2018 following an attempted systems migration of grand proportions. TSB are still counting the cost of this failure and have so far had to pay nearly £370 million in ‘post-migration charges’ as a result. Costs could increase further if fines are subsequently imposed by regulators.
Now that the dust has settled, an investigative report by IBM and a new in-depth inquiry by law firm Slaughter and May have been published. Both have been looking for cause and blame for the failure, and both reports focus heavily on testing and cite insufficient testing as a significant reason for the failure.
In light of this, techTesters have taken an in-depth look at the whole event and the report findings. We intend to explain how unjust it is that Testing as a whole gets tarnished, and why any blame that implies Testers were at fault appears to be very unfair. We conclude by looking into what Testers can learn from the situation and incorporate into their daily testing routine.
The IBM Report was heavily critical of the amount of testing performed prior to the system migration. Below are some statements from the IBM Report:
- Test Documentation: There was little evidence found of test design documents and expected outcomes of tests
- Performance Testing: the performance test results did not provide the required level of evidence that capacity could be managed successfully
- Operational Testing: there was no evidence of operational testing or operational support processes being developed and tested
- Cut-over Testing: there was no evidence of any cutover testing or dry-runs being performed
- Test Environments: There was a lack of realistic test environments
- Design Documentation: There was little evidence of system design, architecture and configuration documents that would have normally been used to help with test design and test script preparation.
The recent Slaughter and May Report is also heavily critical of Testing:
- There were two data centres involved in the migration, but only one was tested
- IT services company Sabis, which was responsible for the data centre testing, concealed the lack of testing from TSB board members, who were kept in the dark and led to believe the migration testing had been sufficient.
So, on the surface this all looks pretty damning from a testing perspective. However, the facts below show that there is another side to the story and that testing appears to have been used in the main as a scapegoat when there were so many other key factors involved in the failure.
- TSB were paying Lloyds TSB £100 million a year to use their systems and this project would remove the cost. Therefore the project needed to be completed as a matter of urgency and there was pressure applied to the tech teams to do so
- The project was meant to take 18 months and by April 2018 (when they did the actual cutover) the project was significantly behind schedule. This added to the pressure to complete the migration with urgency
- It was a hugely complex project to migrate all TSB UK customers from Lloyds TSB systems to a new platform, and then keep the new platform running in an identical manner to the Lloyds system. It was business critical and a significant amount of time was required for testing, but this did not fit in with the project deadlines.
- To add to the complexity, new applications were introduced along with advanced use of microservices and the use of active data centres. This all compounded the risk and enhanced the need for rigorous testing.
So what can we conclude from this?
The following suggests that testing and Testers shouldn’t be blamed directly for the TSB Meltdown and that significant responsibility lies with the management team and decision makers.
- The testers did not have all the facts required to allow them to write adequate testing documentation and know what the expected outcomes of tests should be
- There was huge pressure within the project to go-live with urgency. More testing would most likely have been completed had corners not been cut across the project to meet deadlines
- The performance testing that was performed showed that the system could not handle the required load and it appears this evidence was ignored
- Testing progress (and lack of it) was being concealed from decision makers
- It is highly unlikely any Test Manager would have agreed that testing was complete or believed that all risk had been mitigated
- It is unlikely anyone within the testing department deliberately set out to minimise the creation of test documentation and to perform significantly less testing than was required
- The IBM report cited a number of other reasons for the failure, which shows that testing alone is not enough to ensure quality and shouldn’t take such a large proportion of the blame
What many of us already know is that when a project has gone live, and there are subsequent issues, testers can be made the scapegoat. However, in reality it is not the testers that actually cause these issues; they are there to help detect them in advance if given the time and tools to carry out the job.
Many of us will be very aware how cautious testers have to be when proceeding with testing in difficult circumstances and how they communicate progress. Here is some guidance:
- If you do not have all the necessary information available to write adequate test cases and expected results, call this out immediately, seek to have it rectified, and until it is resolved ensure it is stated on a risk log or similar document. If one doesn’t exist, create one.
- If you are asked to reduce the amount of testing due to time constraints, document the risks of doing so and ensure the decision makers see it. Let someone else specify ‘what they don’t want tested’ so they will be accountable, not you. Get it in writing.
- Be careful when reporting test progress that you don’t give the perception that things are going well based simply on the number of test cases performed, when in reality, many have failed and will need to be rerun
- Often you may be asked if ‘all the testing is finished’. Now this can be a bit of a trick question and one that can subsequently be your downfall. Testing is rarely ever ‘finished’. There is nearly always something extra you could test if time, resource or infrastructure allowed. A model answer to this sort of question would be… ‘With the information and time provided to the Test Team, I believe there has been sufficient testing performed to help mitigate significant risk’. Therefore if things do go wrong after a go-live, no one can justifiably quote you as saying you had “finished all the testing”
- As testing is often the last thing to be completed before a project goes live, when the testing phase is nearing completion the Test Manager may well be asked if the project is ready to go live. Once again, be very careful with your response. A Test Manager should not be declaring a project ready for go-live. They will not have all the facts about the project as a whole and should not be the one to make this decision. They can of course contribute to a decision.
- If blame is directed at the Test Team, be prepared to fight your corner and make it clear that blame should not be apportioned without proper justification, facts and evidence
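To illustrate the point above about reporting test progress, here is a minimal Python sketch of a progress summary that keeps ‘executed’ and ‘passed’ apart. The names (CaseResult, summarise_progress) and the status values are hypothetical and not taken from any real test management tool; treat it as one possible way to avoid a misleading headline figure, not a prescribed method.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One test case and its latest status (hypothetical structure)."""
    case_id: str
    status: str  # assumed values: "passed", "failed", "blocked", "not_run"


def summarise_progress(results):
    """Report executed and passed figures separately, as counts and percentages."""
    counts = Counter(r.status for r in results)
    total = len(results)
    # "Executed" means the case was actually run, whatever the outcome.
    executed = counts["passed"] + counts["failed"]

    def pct(n):
        return round(100 * n / total, 1) if total else 0.0

    return {
        "total": total,
        "executed": executed,
        "passed": counts["passed"],
        "failed": counts["failed"],
        "blocked": counts["blocked"],
        "not_run": counts["not_run"],
        "executed_pct": pct(executed),
        "passed_pct": pct(counts["passed"]),
    }


# Illustrative data only: 10 cases, 8 executed, but only 4 passed.
example = (
    [CaseResult(f"TC-{i}", "passed") for i in range(1, 5)]
    + [CaseResult(f"TC-{i}", "failed") for i in range(5, 9)]
    + [CaseResult("TC-9", "blocked"), CaseResult("TC-10", "not_run")]
)
summary = summarise_progress(example)
print(summary)
```

With this made-up data, reporting only that 80% of test cases have been executed would hide the fact that just 40% have passed and that four failures still need fixing and rerunning.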
So to summarise: if you are in an uncomfortable situation from a testing perspective when a project is about to go live, write down the reasons with justifications and ensure the decision makers have read them. Think carefully about the questions you are being asked and how you answer them, so as not to give out misleading information or information that can be misinterpreted.
There does, though, have to be pragmatism, and it is understandable that often a go-live cannot wait for all the planned testing to be completed. If, though, as a Test Manager or Tester you follow our guidance and deal with situations like this carefully, you will have something to fall back on afterwards in your defence should issues subsequently arise.
On a positive note, attitudes have changed significantly towards apportioning blame when things go wrong in tech and there is a general understanding that what we do is difficult, and we do our best. However, in a scenario like TSB’s, where millions of customers were impacted, the cost to the company has been millions of pounds, fraud increased and people lost their jobs, someone is going to get the blame, and unfortunately it was testing that seemed to take the brunt of it.