The outage started shortly after 12pm on June 2nd, impacting global users connecting to GCP us-east4-c. (Credit: ThousandEyes)

Earlier this week, the Internet had a conniption. In broad patches around the globe, YouTube sputtered. Shopify stores shut down. Snapchat blinked out. And millions of people couldn't access their Gmail accounts. The disruptions all stemmed from Google Cloud, which suffered a prolonged outage—an outage that also prevented Google engineers from pushing a fix. And so, for an entire afternoon and into the night, the Internet was stuck in a crippling ouroboros: Google couldn't fix its cloud, because Google's cloud was broken.

The root cause of the outage, as Google explained this week, was fairly unremarkable. (And no, it wasn't hackers.) At 2:45pm ET on Sunday, the company initiated what should have been a routine configuration change, a maintenance event intended for a few servers in one geographic region. When that happens, Google routinely reroutes jobs those servers are running to other machines, like customers switching lines at Target when a register closes. Or sometimes, importantly, it just pauses those jobs until the maintenance is over.
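To make that concrete, here is a rough sketch, in Python, of the drain behavior described above. This is not Google's scheduler; the job names, the "pausable" flag, and the machine list are all invented for illustration.

```python
# Hypothetical sketch of a maintenance "drain": jobs on machines being taken
# down are either rescheduled onto healthy machines or, if they can tolerate
# it, simply paused until the maintenance event is over.

def drain_for_maintenance(jobs, healthy_machines):
    """Return (migrations, paused) for jobs running on machines being drained."""
    migrations, paused = {}, []
    for job in jobs:
        if job.get("pausable"):
            paused.append(job["name"])           # wait out the maintenance window
        elif healthy_machines:
            target = healthy_machines.pop(0)     # "switch lines at Target"
            migrations[job["name"]] = target
        else:
            paused.append(job["name"])           # nowhere to go, so pause as a fallback
    return migrations, paused

migrations, paused = drain_for_maintenance(
    jobs=[{"name": "net-control", "pausable": False},
          {"name": "batch-index", "pausable": True}],
    healthy_machines=["machine-b7"],
)
print(migrations)  # {'net-control': 'machine-b7'}
print(paused)      # ['batch-index']
```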

What happened next gets technically complicated—a cascading combination of two misconfigurations and a software bug—but had a simple upshot. Rather than that small cluster of servers blinking out temporarily, Google's automation software descheduled network control jobs in multiple locations. Think of the traffic running through Google's cloud like cars approaching the Lincoln Tunnel. In that moment, its capacity effectively went from six tunnels to two. The result: Internet-wide gridlock.

Still, even then, everything held steady for a couple of minutes. Google's network is designed to “fail static,” which means that even after a control plane has been descheduled, the network can keep functioning normally for a short period of time. It wasn't long enough. By 2:47pm ET, that window had run out, and traffic started dropping across Google's network.
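Here is a minimal sketch of what “fail static” means in practice, assuming a hypothetical data plane that keeps forwarding with its last-known-good routes for a fixed grace window after the control plane goes quiet. The class, the window length, and the error handling are assumptions for illustration, not Google's implementation.

```python
import time

# Sketch of "fail static": if the control plane stops responding, the data
# plane keeps forwarding with its last-known-good routes for a grace window
# instead of failing immediately.

FAIL_STATIC_WINDOW_SECONDS = 120  # assumed value, for illustration only

class DataPlane:
    def __init__(self, routes):
        self.routes = routes                    # last routes pushed by the control plane
        self.last_control_update = time.time()

    def on_control_update(self, routes):
        """Called whenever the control plane pushes fresh routing state."""
        self.routes = routes
        self.last_control_update = time.time()

    def forward(self, packet_dest):
        """Forward using cached routes until the fail-static window expires."""
        age = time.time() - self.last_control_update
        if age > FAIL_STATIC_WINDOW_SECONDS:
            # Grace period exhausted: the stale state can no longer be trusted,
            # and capacity effectively collapses.
            raise RuntimeError("control plane unreachable; fail-static window expired")
        return self.routes.get(packet_dest, "drop")
```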

In moments like this, not all traffic fails equally. Google has automated systems in place to ensure that when it starts sinking, the lifeboats fill up in a specific order. “The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows,” wrote Google vice president of engineering Benjamin Treynor Sloss in an incident debrief, “much as urgent packages may be couriered by bicycle through even the worst traffic jam.” See? Lincoln Tunnel.
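As a toy illustration of that triage policy (not Google's actual QoS machinery), the sketch below sheds the largest, least latency-sensitive flows first when offered load exceeds capacity; the flow names and numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    gbps: float               # offered load
    latency_sensitive: bool   # e.g. search queries vs. bulk video bytes

def triage(flows, capacity_gbps):
    """Return (admitted, dropped) flow lists for the given capacity."""
    # Latency-sensitive traffic first, then smaller flows before larger ones.
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.gbps))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.gbps <= capacity_gbps:
            admitted.append(flow)
            used += flow.gbps
        else:
            dropped.append(flow)
    return admitted, dropped

# Capacity suddenly shrinks from "six tunnels to two":
flows = [
    Flow("search queries", 1.0, True),
    Flow("gmail sync", 2.0, True),
    Flow("youtube video bytes", 12.0, False),
    Flow("cloud storage transfers", 9.0, False),
]
admitted, dropped = triage(flows, capacity_gbps=8.0)
print([f.name for f in admitted])  # small, latency-sensitive flows survive
print([f.name for f in dropped])   # bulk traffic gets shed
```

That ordering tracks, loosely, with the real downtime numbers below.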

You can see how Google prioritized in the downtimes experienced by various services. According to Sloss, Google Cloud lost nearly a third of its traffic, which is why third parties like Shopify got nailed. YouTube lost 2.5 percent of views in a single hour. One percent of Gmail users ran into issues. And Google search skipped merrily along, at worst experiencing a barely perceptible slowdown in returning results.

“If I type in a search and it doesn't respond right away, I'm going to Yahoo or something,” says Alex Henthorn-Iwane, vice president at digital experience monitoring company ThousandEyes. “So that was prioritized. It's latency-sensitive, and it happens to be the cash cow. That's not a surprising business decision to make on your network.”

But those decisions don't only apply to the sites and services you saw flailing last week. In those moments, Google has to triage not just user traffic but also the network's control plane (which tells the network where to route traffic) and management traffic (which encompasses the sort of administrative tools that Google engineers would need to correct, say, a configuration problem that knocks a bunch of the Internet offline).

“Management traffic, because it can be quite voluminous, you're always careful. It's a little bit scary to prioritize that, because it can eat up the network if something wrong happens with your management tools,” Henthorn-Iwane says. “It's kind of a Catch-22 that happens with network management.”
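The bind Henthorn-Iwane describes can be pictured with a toy model like the one below: three planes of traffic sharing the same congested links, with an assumed priority ordering in which management tooling is shed first. None of this reflects Google's real priority scheme; it just makes the Catch-22 concrete.

```python
from enum import IntEnum

# Toy model only: three planes of traffic competing for the same congested
# links. The priority ordering below is an assumption for illustration.

class Plane(IntEnum):
    CONTROL = 0      # programs the network: where traffic should be routed
    USER = 1         # customer-facing flows: Search, Gmail, YouTube, Cloud
    MANAGEMENT = 2   # administrative tooling engineers would use to push a fix

def surviving_planes(congestion: float) -> list:
    """Return which planes still get through as congestion (0.0 to 1.0) rises.

    Deliberately simplistic: each step of congestion sheds the
    lowest-priority plane still being carried.
    """
    cutoff = len(Plane) - int(congestion * len(Plane))
    return [p.name for p in Plane if p < cutoff]

print(surviving_planes(0.0))  # ['CONTROL', 'USER', 'MANAGEMENT']
print(surviving_planes(0.5))  # ['CONTROL', 'USER']: the tools needed to push a fix are starved
```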

Packet loss was total between ThousandEyes' global monitoring agents and content hosted in a GCE instance in GCP us-west2-a. (Credit: ThousandEyes)

Which is exactly what played out on Sunday. Google says its engineers were aware of the problem within two minutes. And yet! “Debugging the problem was significantly hampered by failure of tools competing over use of the now-congested network.”
