What if the Internet Fails?
Are you as prepared as you think for an internet outage?
When I first studied business continuity, our instructor gave the typical cause of a telecom outage as being a backhoe (UK - JCB) accidentally digging through a telephone cable. The outage might last for a few hours. If you judged the probability of this happening as high enough and the impact as severe enough, your options were:
- Have two telecommunication feeds into two different areas of the building. The reasoning is that even if a backhoe severed one of the cables, the other would still be operational.
- Have a hot backup site or alternative call center to which you could route calls instead.
- Have a number of mobile phones to which calls could be redirected if the landlines failed.
All these approaches were reasonable, although they did have problems. Paying two different companies to supply telecommunications and having feeds to two different parts of the building might give the appearance of redundancy, but unless you were careful, both companies would lease capacity from the same third party: your protection against badly-behaved backhoes might extend only as far as the next manhole cover.
Backhoes haven't stopped wreaking havoc on buried cables, but they are probably no longer your biggest telecommunication threat.
What is the most likely threat now? Human error.
Starting at around 08:15 UTC on 8 July 2022, Rogers Communications, one of the three biggest internet and mobile phone providers in Canada, started withdrawing the routes it advertised to other internet providers. The internet works using a protocol called BGP (Border Gateway Protocol), where each router (network node) tells adjacent routers which parts of the network it can connect with. When those advertisements are withdrawn, other networks no longer know how to reach the addresses behind them. Progressively the entire Rogers network collapsed. (There's a mesmerizing animation of it.) By 08:45 UTC there was nothing left. The entire network was unreachable, both internally and externally. And it stayed that way for most users for over 24 hours - for some, nearer 72 hours.
We haven't yet heard any technical explanations of what went wrong - a "maintenance update failure" was the official "explanation".
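To make the route-withdrawal mechanism a little more concrete, here is a toy sketch in Python. It is not real BGP and says nothing about what actually happened inside Rogers: the `ToyRouter` class, the neighbor names and the prefix (taken from a documentation-only address range) are all made up for illustration. It only shows the basic idea that a prefix stays reachable while at least one neighbor advertises it, and disappears when the last advertisement is withdrawn.

```python
from collections import defaultdict


class ToyRouter:
    """A drastically simplified, hypothetical stand-in for a BGP speaker."""

    def __init__(self, name):
        self.name = name
        # prefix -> set of neighbor names currently advertising that prefix
        self.routes = defaultdict(set)

    def receive_advertisement(self, prefix, neighbor):
        """Record that a neighbor says it can reach this prefix."""
        self.routes[prefix].add(neighbor)

    def receive_withdrawal(self, prefix, neighbor):
        """Forget a neighbor's route; drop the prefix if no routes remain."""
        self.routes[prefix].discard(neighbor)
        if not self.routes[prefix]:
            del self.routes[prefix]

    def can_reach(self, prefix):
        """A prefix is reachable only while some neighbor advertises it."""
        return prefix in self.routes


# Illustrative values only: 203.0.113.0/24 is a documentation range.
prefix = "203.0.113.0/24"
other_isp = ToyRouter("other-isp")
other_isp.receive_advertisement(prefix, "provider-edge-1")
other_isp.receive_advertisement(prefix, "provider-edge-2")

other_isp.receive_withdrawal(prefix, "provider-edge-1")
reachable_after_first = other_isp.can_reach(prefix)   # one route still left

other_isp.receive_withdrawal(prefix, "provider-edge-2")
reachable_after_second = other_isp.can_reach(prefix)  # no routes left

print(reachable_after_first, reachable_after_second)
```

In this sketch the first withdrawal changes nothing visible, because a second path remains; it is only when every advertisement has gone that the prefix becomes unreachable.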
The days when there was a simple distinction between the phone system and the internet are long gone. Rogers phone customers (both mobile and "landline") lost phone service. A (deleted) post on Reddit suggests that Rogers couldn't communicate with its engineers because they were using Rogers mobile phones.
Among the services disrupted:
- 911 Emergency calls (the call centers worked but people couldn't call them from many phones)
- Canadian passport offices (!)
- Canadian ArriveCan app (must be used by travelers entering Canada).
- The Interac (Canadian debit card) payment system.
- ATMs at many banks
- Point of sale systems: shops which could not operate without internet connectivity were forced to shut.
- Handheld devices used by baggage handlers to scan luggage at Toronto airport
- Major concerts were cancelled. (It's unclear whether this was due to electronic ticketing or security staff relying on the Rogers network.)
- Local radio stations and repeaters failed.
- A bike share system was unavailable
- Court cases were postponed due to lack of video conferencing
- Transit ticket purchasing was unavailable
- Two-factor authentication systems (requiring SMS messages) locked people out of systems.
- Even Rogers customers using their phones abroad were unable to make calls or send texts.
More and more systems are being deployed which assume mobile network connectivity. Although I couldn't find reports of disruption in these cases, cost pressures mean that mobile networks are increasingly being used for:
- Traffic control, in particular prioritizing emergency vehicles at traffic lights
- Transit system vehicle location tracking and ticketing
- Autonomous vehicles. (If you're not an inhabitant of San Francisco, you may be amused to read about what happens when some "autonomous" vehicles lose connectivity.)
This isn't the first time this has happened - a previous outage (affecting mobile services) in April 2021 lasted almost a day.
So when planning, don't worry so much about the backhoes of this world. Ask yourself what would happen if:
- Your ISP failed for 24 hours?
- Your phone system failed for 24 hours?
- Your ISP and phone system both failed at the same time because they use the same infrastructure?
Mitigation steps
- Have a tested manual system you can use.
- Use multiple ISPs who do not have common systems.
- Be aware of which ISPs share parts of their infrastructure to ensure you have practical redundancy.
- Don't use a single mobile phone carrier for corporate phones.
- Do not negotiate a major discount for staff with a single carrier, since that concentrates staff's personal phones on the same network.
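As a rough illustration of checking for "practical redundancy", the sketch below compares the upstream dependencies of each pair of providers and reports any overlap. Everything in it is hypothetical: the `shared_dependencies` function, the provider names and the dependency sets are invented, and in practice you would have to build that mapping yourself by asking your suppliers which carriers, exchanges and ducts they rely on.

```python
from itertools import combinations


def shared_dependencies(providers):
    """Return {(a, b): common} for every provider pair with shared upstreams.

    `providers` maps a provider name to the set of upstream dependencies
    (carriers, exchanges, ducts...) you have been able to identify for it.
    """
    overlaps = {}
    for a, b in combinations(sorted(providers), 2):
        common = set(providers[a]) & set(providers[b])
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical inventory: two ISPs that look independent on paper,
# plus a mobile carrier kept as a fallback.
providers = {
    "isp-a": {"carrier-x", "exchange-1"},
    "isp-b": {"carrier-x", "exchange-2"},
    "mobile-c": {"carrier-y", "exchange-3"},
}

overlaps = shared_dependencies(providers)
for (a, b), common in overlaps.items():
    print(f"{a} and {b} both depend on: {', '.join(sorted(common))}")
```

With this made-up inventory the two ISPs turn out to lease from the same carrier, which is exactly the "next manhole cover" problem described earlier, while the mobile carrier shares nothing with either.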
Resources
- Cloudflare's blog on this outage.
- Cloudflare's blog on a similar Facebook outage.
- blogTO on its effects on Toronto services.
- IBM explaining its global Cloud outage caused by BGP errors
- The Weeknd and other concerts postponed
- Ironically The Weeknd concert was at the Rogers Centre: more details from The Globe and Mail. Some commentators have taken this to mean that they literally couldn't open the doors due to the network outage. Also interesting because it discusses some of the third party impacts of the cancellation.