Analysis of Today's CenturyLink/Level(3) Outage

pritambarhate · on Aug 31, 2020

There was the finals of The FIDE Chess Olympiad going on between India and Russia when this outage started. 2 Indian players lost connect to Chess.com at this time and ended up losing their games on time. This ensued a lot of drama. In the FIDE ended up declaring both India and Russia as joint gold medalist. Russian players were not happy. Also Armenia had forfeited from tournament as they had faced similar disconnection issue against India during quarterfinals.

Here are a few links if you want to follow the drama:

BBC: Chess Olympiad: India and Russia both get gold after controversial final [1]

YouTube: Joint Gold for Team India and Russia at the Online Olympiad 2020 | Full [2] story

FINALE!! INDIA vs RUSSIA CHESS OLYMPIAD LIVE STREAM [3]: Actual live stream of the final. The actual incident starts around 01:56:00

[1] https://www.bbc.com/news/world-53965748

[2] https://www.youtube.com/watch?v=YgZAVOUmWcg

[3] https://www.youtube.com/watch?v=kJLyFSVuRnk

tzs · on Aug 31, 2020

I don't know how these events are conducted, but I'm assuming that it is not simply each player playing from home unsupervised. That would make it way too hard to prevent cheating. So I assume that there is a tournament official onsite at each player location watching, just like there would be for an in-person tournament.

If that is the case, there are a couple of reasonable ways to handle this.

One is to bring back something that used to be common in high level chess: the adjournment.

This used to be common when most championships used time controls that were slow enough that you often would not finish the game in one play session.

The way it worked is that at the end of the play session, the arbiter calls for an adjournment. When the player on the move decides their next move they write it down on a piece of paper rather than actually making it on the board, the move is sealed in an envelope which is kept by the arbiter, and play stops.

When it is time to resume, the arbiter opens the envelope and plays the sealed moved on the board and starts the clocks.

Both players are free to analyze as much as they want between play sessions, and can get outside help. For world championships, especially back when it was Fischer vs. Spassky or Korchnoi vs. Karpov and the match was serving as a proxy for the cold war, each player would have whole teams of top GMs to help analyze during an adjournment.

The player who sealed the move has the advantage of knowing for certain what position will be on the board when the game resumes. On the other hand, the other player is going to be the first one to have the move after several GMs have pulled an all-nighter analyzing it for them so any inaccuracy in the sealed move is much more likely to be punished than it would have been without adjournment.

Or get a freaking modem. Exchanging chess moves does not require high bandwidth, and most of these internet outages do not take out phone service.

Do a mini-adjournment (seal the move, but keep the players on site), establish a dial-up connection between the playing sites, and then unseal the move and resume.

timsally · on Aug 31, 2020

Generally in professional online tournaments there is not a tournament official onsite. Players are competing from their homes. Instead, tournaments require several Zoom sessions from multiple angles in an attempt to verify that the neither the player's computer nor a secondary device are being used to cheat. These Zoom sessions are monitored by tournament officials.

On top of this, commercial chess websites have extensive anti-cheating measures that are used to analyze the games after the fact. For example, one of the major players has 5+ engineers and several strong chess players on their anti-cheating team. These teams have caught professional players cheating a surprising number of times. Being caught results in a lifetime ban from the chess website and there are often consequences for the player in real life as well.

I don't really buy the idea of an adjournment. Are you supposed to have one every time a connection problem happens? Historically players knew at what point in the game an adjournment would happen. Having them happen at random throughout the game would change the whole dynamic.

Your idea of a cellular connection as backup is an excellent one. I think the first commercial chess website to implement a turn key way for players to utilize one will see huge returns from that investment.

hinkley · on Aug 31, 2020

Every time this conversation comes up, I'm reminded of a world-building subplot in one of Vernor Vinge's first books. Instead of banning computers in chess, let the competitor use a computer that they built themselves, so it's one augmented human versus another augmented human.

In his world, there were no supercomputers elsewhere, so you didn't have to worry about covert channels phoning home to a much bigger computer. I suppose you could put everyone in a Faraday cage...

loneranger_11x · on Aug 31, 2020

which book is this? and would you recommend?

hinkley · on Aug 31, 2020

I'm fairly sure it's The Peace War, which has a sequel (Marooned in Realtime) where the concept of The Technology Singularity in introduced.

I get a little salty about Ray Kurzweil getting the credit for a concept that Vernor Vinge had already published in 1986.

Aperocky · on Aug 31, 2020

Why can't they have a rematch though? At least the Russian player should have a say if they wanted a rematch.

achiang · on Aug 31, 2020

Google networking SRE here (my team runs ns[1-4].google.com among other services).

Regardless of original intent, the blog doesn't land well with me. It could have provided the background on flowspec, using their own past outage as a case study, without any of the speculation or blameyness that came across here. The #hugops at the end reads quite disingenuously.

We see other networks break all the time and we often have pretty good guesses as to why. But I personally would never sign off on a public blog speculating on a WAG of why someone else's network went down. That's uncouth.

badrabbit · on Aug 31, 2020

I think you're stuck on the politics. Level3 is their competition but initially CF was blamed. CF owes it to their customers and investors to explain to them why they had an outage and how they responded to it, and they do not need talk in detail about an unrelated past incident (just because it was related to flowspec does not mean it was a similar outage), and they certainly should not wait for Level3's investigation.

I would expect Google to have a similar explanation if a significant number of GCP customers faced an outage.

You should know, it wasn't just someone else's network that went down, that network brought down a big chunk of the internet with it. I think technical honesty comes before political appearances. The #hugops and mention of their past experience with a flowspec outage is clearly there to signal that the blogpost is not there for blaming or making L3 look bad.

achiang · on Aug 31, 2020

The politics is exactly the point of my comment.

The professional way to write a blog post like this is from your own perspective. Identify the proximate cause (the peer), name names if you must, talk about how awesome your own systems are, show some of your monitoring if you like, and talk about what you'll do in the future to be even more resilient to this class of problems.

That's all to the good and much of Cloudflare's blog was exactly that. Would've been fine if they left it like that.

Acknowledging there is no postmortem (yet) but then pointlessly speculating about what it might contain is what I have a problem with.

I don't speak for Google but if I found out we had written a post like this, I would speak up and advocate to change it.

badrabbit · on Aug 31, 2020

There is nothing professional about avoiding a topic for the sake of appearances. Level3 put out details knowing others in the industry will discuss and speculate based on that information. They could have witheld details such as flowspec and edge routers bouncing but they did not, it's perfectly professional to discuss speculative details of someone elses outage that affected your customers based on details they chose to make public.

In infosec for example, it's extremely common to speculate about a vulnerability based on details in the CVE. Entire news articles are based on such speculation. Like I said, you are giving too much weight to optics and appearances. I would like to see anyone actually at Level3 complain about this post.

timcosta · on Aug 31, 2020

Honestly, I'm wondering if this blog is a response to https://web.archive.org/web/20200830171114/https://www.cnn.c... which has since been heavily modified, but was on the front page of CNN making it sound like Cloudflare was responsible if you only read the first bit.

badrabbit · on Aug 31, 2020

Could be, even yesterday the BBC was blaming CF at the text scroll thingy on their newscast

iJohnDoe · on Aug 31, 2020

Normally CF throws a lot of mud in these situations. Karma got them on their recent last outage.

While reading this I got impression they were genuinely trying to tone down the mud slinging they normally do while also trying to make it clear the outage wasn’t their fault. They just need more practice.

A provider as large as Cloudflare will always be impacted by other providers. Hopefully that point is clear to them now. The worst thing that can happen is they get a reputation they can’t play ball nicely and their peers and partners get tired of them. Service to their customers will erode over time because those peers and partners will screw with CF behind the scenes. It’s better to have friends than enemies with the type of business CF is in.

evntdrvn · on Aug 31, 2020

pretty common pattern for CF though.

kryogen1c · on Aug 31, 2020

sometimes people remark at how extensive ancient civilations became with such simple technology, yet here we are. billions of people being served by things like BGP and SS7.

as i get older, i become more and more concerned with humanity's lack of fault tolerance.

brett weinstein clued me into this as an evolutionary phenomena. if a gene activates a short-term solution and long-term problem, that gene is likely to be favored.

how do we transcend this problem that seems to be inherent with existing? a first-principal problem?

CloudNetworking · on Aug 31, 2020

If anything, these kind of issues we see popping up here and there are proof of the high availability of the Internet and specifically how protocols such as BGP helped on making it what it is today.

It is not that we have built the Internet despite BGP. We have built it thanks to BGP. If we didn't have BGP we would have to invent it :)

alexchamberlain · on Aug 31, 2020

Isn’t the point that BGP is great in the same way as the Model T was great? No one is saying it wasn’t needed or - to a certain extent - doesn’t do the job, but given recent (and not so recent) improvements to technology and security standards, maybe we need a BGP 2.0?

PietjePukster · on Aug 31, 2020

BGP 2.0, like self driving cars, are five years away.

For the last 20 years..

p_l · on Aug 31, 2020

Good that we're running BGP 4, right? ;)

Aperocky · on Aug 31, 2020

This makes total sense because humans are inherently lazy. Hurd would be out in production in 199x if not for Linux. But it's still being worked on in 2020.

phone8675309 · on Aug 31, 2020

> humans are inherently lazy

> Hurd would be out in production in 199x if not for Linux.

I think Hurd is not a good example of this. Hurd being sidelined seems to me to be a result of bikeshedding (which microkernel to use) and realizing that Linux (as a kernel) had more effort being poured into it because it had more mindshare.

> But it's still being worked on in 2020.

At more or less a leisurely pace as a passion project more than the end goal being production, precisely because the social goal that it was trying to achieve has been mostly achieved by the Linux kernel.

cm2187 · on Aug 31, 2020

You could say the same thing of http 1.0. Doesn’t mean it doesn’t have flaws or room for improvement.

ATsch · on Aug 31, 2020

This outage really doesn't really appear to have anything to do with BGP? As in, it would have happened even if you replaced it with some super shiny modern distributed protocol. Your comment is also strange because especially in networking, BGP is far from the oldest thing we are using.

There is this tendency in startup bros to assume BGP must be bad because it's old but in general BGP still holds up as a great protocol and is still something people regularly deploy into shiny new systems internally (e.g. bgp evpn vxlan) with no hesitation or regret, which is what L3 was doing here. The only real fundamental problem with BGP is that the assumption that "the internet" is such an internal system no longer holds. Solving that is primarily a social and economic problem in the same way other internet wide protocol upgrades are, just with the added difficulty of convincing a bean counter you need a few 100k for new routers.

selecsosi · on Aug 31, 2020

I think potentially the concern/idea of optimization, for nature, is at a different scale and goal than what humans are capable of observing to be linear, predictable, "right" or "good".

On a similar scale I think a "scientific" observer of nature might say one shouldn't worry about local maximums because there is always some input stimuli to disrupt seemingly stable long term maxima, and that measurable dynamism of categorical evolution comes through dipping to local mins, and finding a new path to the new temp max because the topology of the surface changed.

Humanity challenge I think is in recognizing that almost all successful solutions to long term problems is built from small functional systems that scaled up rather than top down predictive and prescribed methodologies (mostly accidents until recent history). We can stand in nature and appreciate the diverse and varied approaches nature has taken to solving the particular localized problem (soil, land topology, etc...) but we don't evaluate the suffering and time frame it might of taken to find that particular solution. The plants that were either prematurely stunted or culled are just as much part of the ecosystem and fertilizer that holds the visible flora.

While appreciable suffering can and should be addressed and planned for, localized short term solutions are the things that take hold and spread until the collapse under their own lack of structural stability. Luckily nature is always at the ready to continue the experiment. Trail blazing is not about clearing the path, but finding it, the shortest path for nature is almost always the most preferred, you could say many of the foundations of physics[0] are rooted in it

[0]https://en.wikipedia.org/wiki/Fermat%27s_principle

jonathanliu · on Aug 31, 2020

> brett weinstein clued me into this as an evolutionary phenomena. if a gene activates a short-term solution and long-term problem, that gene is likely to be favored.

It's tough because of path dependence. If you don't survive the short term problem, the long term doesn't matter.

tenebrisalietum · on Aug 31, 2020

Short-term solutions per X is bad for an individual X, but great for many X's spread over the world, if X has the the ability replicate and grow over time.

PaulDavisThe1st · on Sept 1, 2020

being cut off from certain parts of the internet for about 5 or 6 hours is "being served" ?

your definition of civilizational collapse differs a bit from mine. i'm not altogether certain which one is more useful.

cm2187 · on Aug 31, 2020

...and smtp!

indigodaddy · on Aug 31, 2020

I love Cloudflare’s writeups and Cloudflare in general, and while this was again well written and excellent analysis, it contained a tad bit too much speculation for my taste.

hayleox · on Aug 31, 2020

I liked the speculation in this piece. It was well marked as such and just seemed to be "here are the umpteen possibilities that come to mind" from an expert perspective. I don't know enough about backbone networking to speculate like that myself, so it was really interesting to me to get that perspective on the problem solving process for a company like Cloudflare.

rkachowski · on Aug 31, 2020

It has to be speculation, a lot of cloudflare's customers were affected by this incident and yet there is very little information from level 3 about it. It's good faith speculation and cloudflare mention they've had a similar experience with the same protocols

danfritz · on Aug 31, 2020

I think thats because they don't know anything more atm. CenturyLink didn't do a postmortem yet so they can only guess and see what happend. What they suggest all seems possible, but it will be guesswork until CenturyLink explains what happens (if that ever happens)

jabroni_salad · on Aug 31, 2020

This is probably as good as it gets for outside analysis. I doubt we will learn much more until CL/L3 posts their own post-mortem with all the data.

indigodaddy · on Aug 31, 2020

Didn’t say it wasn’t good— it was that. I just think it was slightly on the inappropriate side in terms of vendor relationships. I mean obviously CF was kind of making a point to CTL with the article, I get that too.

ilkkao · on Aug 31, 2020

yes, it's easy to see they are at least somewhat frustrated because of not getting more details about what happened from CenturyLink. I think that's understandable.

rossdavidh · on Aug 31, 2020

Horror thought: if the internet ever "breaks" enough that all access to StackOverflow is lost, no one will be able to fix anything to get it back up, and we'll be back to the stone ages to start again.

Kidding. Mostly.

j8014 · on Aug 31, 2020

As a teen in the early 90's, who was self taught, there is only one correct answer for all questions, RTFM. lol. Man that sucked.

myself248 · on Aug 31, 2020

Kiwix offers downloadable mirrors of the various StackExchange sites. I haven't played with 'em to see how searchable they are, but there's something to be said for keeping a local copy.

Of that, and a Debian mirror, and maybe the top ten-thousand github repos or something...

Grimm665 · on Aug 31, 2020

Maybe we need a Global Seed Vault of certain key SO answers and GitHub projects :)

EE84M3i · on Aug 31, 2020

Did other providers have similar issues to Cloudflare? I only noticed cloudflare sites being particularly down, not other CDNs, but maybe that was selection bias?

rwky · on Aug 31, 2020

I couldn't connect to anything in the US (I'm in the UK) which broke a large chunk of the internet mainly due to being unable to access Cloudfront.

mcspiff · on Aug 31, 2020

That’s surprising given CloudFront has a decent number of Points of Presence in London/Europe.

ngold · on Aug 31, 2020

Everything was going haywire for me except the amazon prime movie i was watching, and it was fine.

It was a head scratcher until this mornings news.

longtermdd · on Aug 31, 2020

Fastly was also down

EE84M3i · on Aug 31, 2020

Youtube and Akamai didn't seem to be.

tyfon · on Aug 31, 2020

Youtube and twitch (which I belive use akamai?) was down for me in Norway.

I think there were many routes going haywire and it was different depending on your location.

tambre · on Aug 31, 2020

Youtube is Google's own global CDN, Twitch is Fastly.

cbg0 · on Aug 31, 2020

Only the main twitch site seems to be using fastly, the streams are powered by AWS' infrastructure as far as I can tell, which would be expected, seeing as how Amazon owns Twitch.

tyfon · on Aug 31, 2020

Oddly enough the user interface worked somewhat, it was the streams themselves that failed to load for me (even with mpv+youtube-dl)

nik736 · on Aug 31, 2020

Twitch is not using Akamai.

kraftomatic · on Aug 31, 2020

ISPs often have Youtube and Akamai caches forward deployed to their networks, which would reduce the dependancy on big ISPs. Google also buys a lot of its own international links so it can pair directly with or close to ISPs, so big ISP issues don't affect their services as much.

RealStickman_ · on Aug 31, 2020

Youtube definitely had problems for me. (Through Mullvad VPN in Switzerland)

7demons · on Aug 31, 2020

That technician who pushed this rule... Man, what a story he will have to tell to his grandchild.

system2 · on Aug 31, 2020

I wish we could read an analysis from CenturyLink. Their status page doesn't have anything useful: https://status.ctl.io/

davio · on Aug 31, 2020

That's CenturyLink Cloud (formerly Tier 3) - doesn't really have anything to do with CTL/Level 3 backbone.

chkaloon · on Aug 31, 2020

BGP and its associated tools is beginning to look like a national, actually global, systemic risk.

_-___________-_ · on Aug 31, 2020

Typo in title

danfritz · on Aug 31, 2020

Fixed!

chrismorgan · on Aug 31, 2020

I still see “Cloudlfare”.

malwarebytess · on Aug 31, 2020

Bit off topic. Does anyone know the reason why Centurylink doesn't operate in California?

mxmasster · on Aug 31, 2020

CenturyLink “residential” is limited to territories where they are the ILEC.

CenturyLink “business” is everywhere in CA and a large provider.