
Google Cloud Stumbled and Unplugged the Internet: Here's How the 'Crash Loop' Began

For an uncomfortable stretch on Thursday, a self-inflicted outage wiped out popular websites and services, including Google's own.

By Corinne Reichert
Image: Google logo (Tobias Schwarz/CNET)

Last Thursday, a massive internet outage took down vast swaths of the online world, kicked off by issues with Google Cloud services and affecting sites including Spotify, Discord, Character.ai, Snapchat, UPS, Pokemon and many of Google's own Workspace offerings.

Google has since acknowledged -- in a lengthy post on its Cloud status page at the end of the Friday workday -- the technical issues that spurred the outage, which stemmed from a series of failures in how a new system management feature was set up and deployed. It also described the steps it is taking to prevent similar problems from occurring in the future. 

Within about 3 hours of its start, the issue had been resolved and most sites and services had returned to normal operations. Late Thursday evening, Google issued a preliminary incident report on its Cloud status page, tying the incident to disruptions in its API management system. It also apologized: "We are deeply sorry for the impact to all of our users and their customers."

The Google Cloud platform supports a wide array of fundamental internet services, from networking to storage to AI, that businesses rely on.

IT services company Cloudflare, describing how the outage affected its customers, said that "the proximate cause (or trigger) for this outage was a third-party vendor failure."

Earlier in the day, as we all found ourselves blocked from accessing favorite services or necessary work tools, Google said its engineers had "identified the root cause" and had taken steps to mitigate the issues. Still, some areas were recovering more slowly than others. "Our infrastructure has recovered in all regions except us-central1," Google said.

Affected companies began noting that they were seeing recovery. By 2 p.m. PT Thursday, the Downdetector service was showing that the spikes in outages were largely past their peaks and quickly heading toward zero reports of problems. (Disclosure: Downdetector is owned by Ziff Davis, which is also the parent company of CNET.)

Google noted at the time on its Workspace status page that its services were back up and running.

"The problem with Gmail, Google Calendar, Google Chat, Google Cloud Search, Google Docs, Google Drive, Google Meet, Google Tasks, and Google Voice has been resolved," it posted at 12:53 p.m. PT. "We apologize for the inconvenience and thank you for your patience and continued support."

See below for the latest updates and notes on individual sites and services.

A late-Friday post mortem

By Jon Skillings

Late Friday afternoon, Google posted a fuller incident report to its Google Cloud status page. Among other things, it noted that on May 29, the Service Control system gained a new feature for quota policy checks -- a quota being the amount of a shared cloud resource that a customer can use. But that code change "did not have appropriate error handling nor was it feature flag protected," Google said, noting: "If this had been flag protected, the issue would have been caught in staging." As it turned out, when the code change was deployed region by region, "the code path that failed was never exercised during this rollout."
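To make the two missing safeguards concrete, here is a minimal sketch of a quota check that sits behind a feature flag and handles blank policy fields without crashing. It is an illustration only, not Google's code: the names `FLAGS`, `QuotaPolicy`, `validate` and `check_quota` are hypothetical, and the choice to allow requests when the policy data is bad is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory flag store; a real service would read this from
# a configuration system so the feature can be switched off without a deploy.
FLAGS = {"quota_policy_checks": False}


@dataclass
class QuotaPolicy:
    # Either field may arrive blank (None) if the policy data is incomplete.
    customer: Optional[str]
    limit: Optional[int]


def validate(policy: QuotaPolicy) -> None:
    """Raise ValueError instead of letting a blank field crash the caller."""
    if policy.customer is None or policy.limit is None:
        raise ValueError("quota policy has blank fields")


def check_quota(policy: QuotaPolicy, usage: int) -> bool:
    """Return True if the request should be allowed."""
    if not FLAGS["quota_policy_checks"]:
        return True  # flag off: the new code path is never exercised
    try:
        validate(policy)
    except ValueError:
        # Assumption: fail open (allow the request) on bad policy data.
        return True
    return usage <= policy.limit
```

With the flag off, the new check is skipped entirely, which is what lets a team turn the feature on in a test environment first. With the flag on, a policy containing blank fields is rejected by `validate` and the request is still served, rather than the process exiting and restarting.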

Then at 10:45 a.m. PT on June 12 -- the day of the outage -- a change in the quota policy was added in and "replicated globally within seconds." But blank fields in the policy data started a chain of events, "causing the binaries to go into a crash loop." Google says its site reliability engineers identified the root problem within 10 minutes, and recovery began some 25 to 40 minutes after the start of the incident. The restarting of Service Control tasks "created a herd effect on the underlying infrastructure ... overloading the infrastructure." And because Service Control had not implemented "the appropriate randomized exponential backoff," a full recovery in the us-central1 region (Google's largest) took up to about 2 hours, 40 minutes.
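For readers unfamiliar with the term, randomized exponential backoff means each restarting task waits a random amount of time before retrying, with the maximum wait doubling after every failed attempt, so that thousands of tasks do not hit the same backend at the same moment. The sketch below shows one common variant (sometimes called "full jitter"). The function name `backoff_delays` and the `base` and `cap` values are assumptions chosen for illustration, not details from Google's report.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...)
    up to `cap`, and the actual wait is drawn uniformly between 0 and
    that ceiling so that separate tasks spread their retries out.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Two tasks restarting at the same instant pick different schedules.
task_a = backoff_delays(5, rng=random.Random(1))
task_b = backoff_delays(5, rng=random.Random(2))
```

Without the random component, every task would retry on the same 1, 2, 4, 8-second schedule and arrive together each time, which is the herd effect Google described.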

Google's Mini Incident Report

By Jon Skillings

Here are some details Google provided in its Mini Incident Report on the June 12 outage, based on an initial analysis. The company also noted that its investigation is ongoing and that it will publish a full incident report in the coming days.

  • The incident started at 10:49 a.m. PT on June 12 and ended at 1:49 p.m. PT, a duration of 3 hours.
  • The outage had a global reach.
  • The apparent cause was "an invalid automated quota update to our API management system which was distributed globally, causing external API requests to be rejected."
  • When Google "bypassed the offending quota check," that allowed for recovery in most regions within 2 hours.
  • Problems dragged on, however, in Google's us-central1 region. Some products had a backlog of up to an hour after the primary issue was addressed, "and a small number recovering after that." (The us-central1 region refers to systems in Council Bluffs, Iowa.)

Google: 'We are deeply sorry'

By Jon Skillings

Late Thursday, at 11:34 p.m. PT, Google posted a "Mini Incident Report" on its Cloud status page leading with an apology: "We are deeply sorry for the impact to all of our users and their customers that this service disruption/outage caused." Google noted that customers had "intermittent API and user-interface access issues to the affected services." Those affected services included a long list of about 75 Google Cloud products including Google Cloud Storage, Google BigQuery, Vertex Gemini API and Cloud Load Balancing, as well as Google Workspace products including Gmail, Google Docs and Google Meet.

Spotify is back, so enjoy your tunes again

By Gael Cooper

If we had to guess, we'd say more people were upset about losing access to their Spotify music than were bummed about Google Meet being down and preventing work meetings from happening. So rejoice, music lovers, Spotify seems to have returned to musical business as usual.

The music site went down as part of Thursday's large Google Cloud outage, and just after noon PT, Downdetector recorded more than 46,000 reports of problems with the site. 

A representative for Spotify said the site had no official comment.

Services being restored across the web

By Corinne Reichert

AI chatbot Claude says the "incident has been resolved" as of 1:55 p.m. PT.

"We have confirmed that all success rates have returned to normal at this time, and are closely monitoring to ensure no further issues," Anthropic said on Claude's status page, noting that the outage lasted for 2 hours, 20 minutes.

Discord also restored services, posting at 1:32 p.m. PT, "Our Windows updater should be fixed everywhere and we should have fully recovered. Happy gaming." The chat service's outage lasted 1 hour, 33 minutes.

Character.ai said its incident was resolved at 1:57 p.m. PT. Its major outage lasted 2 hours, 22 minutes, while a partial outage lasted 27 minutes.

Issues were all caused by a Google Cloud outage

By Corinne Reichert

Cloudflare says the issues were caused by a Google Cloud outage.

"This is a Google Cloud outage. A limited number of services at Cloudflare use Google Cloud and were impacted. We expect them to come back shortly. The core Cloudflare services were not impacted," a Cloudflare spokesperson told CNET via email.

A Google Cloud spokesperson told CNET, "We are currently investigating a service disruption to some Google Cloud services. Please view our public status dashboard for the latest updates."

Anthropic's AI chatbot Claude also suffering outage

By Corinne Reichert

Claude, an artificial intelligence chatbot, has also been affected by the outage, citing "elevated errors on the API, console and Claude.ai."

"Our engineers are continuing to work to recover public API success rates, though text-only requests to Claude.ai should return responses as expected. We will continue to provide updates as we make progress towards full recovery," developer Anthropic posted at 12:52 p.m. PT.

Discord still 'seeing issues across a lot of features'

By Corinne Reichert

Messaging platform Discord, which is popular among gamers, said just after 12 p.m. PT that it was "seeing issues across a lot of features" and working with its provider on resolving them.

"Subscriptions and shop should have recovered. Most of our features should also be back online and we are seeing recovery progressively," the company posted at 12:35 p.m. PT.

It notes some Windows users might still be seeing issues with updating or downloading its Windows client.

Music service Spotify is also down

By Gael Cooper

Music streamer Spotify is one of the sites that seemed to be experiencing an outage on Thursday as of around 1 p.m. PT. A representative for the company did not immediately respond to a request for comment.

Spotify's Ongoing Issues blog had not been updated as of 1 p.m. PT.

Character.ai investigating the issue

By Corinne Reichert

Character.ai said on its status page at 11:32 a.m. PT that its service was being impacted by a third-party issue.

"We are currently investigating this issue," Character.ai's latest update at 12:36 p.m. PT said.

Snapchat says it's 'working to get things fixed'

By Corinne Reichert

Snapchat has an alert on its Support page about the outage.

"We're aware some Snapchatters are experiencing issues and we're working to get things fixed. Thanks for your patience!" the company posted at the top of the Support page.