What on earth happened to Cloudflare last week?

The trio of data centers is far enough apart that a single natural disaster shouldn’t knock them all out at once. At the same time, they’re close enough together that they can all run active-redundant data clusters. So, by design, if any one of the facilities goes offline, the remaining ones should pick up the load and keep operating.
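To make that design goal concrete, here is a minimal Python sketch of the idea, not Cloudflare’s actual implementation. The facility names (`dc-a`, `dc-b`, `dc-c`), the capacity figures, and the `redistribute` helper are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    name: str            # hypothetical facility name
    capacity: float      # hypothetical units of load the site can serve
    online: bool = True


def redistribute(facilities, total_load):
    """Spread total_load evenly across the facilities that are still online.

    Raises RuntimeError if nothing is online or if the survivors
    cannot absorb the load.
    """
    survivors = [f for f in facilities if f.online]
    if not survivors:
        raise RuntimeError("no facility online")
    share = total_load / len(survivors)
    overloaded = [f.name for f in survivors if share > f.capacity]
    if overloaded:
        raise RuntimeError("insufficient capacity at: " + ", ".join(overloaded))
    return {f.name: share for f in survivors}


# Three hypothetical sites, each sized to carry more than a third of the load.
sites = [Facility("dc-a", 60.0), Facility("dc-b", 60.0), Facility("dc-c", 60.0)]
normal = redistribute(sites, 90.0)

# Simulate losing one site: the other two should pick up its share.
sites[0].online = False
degraded = redistribute(sites, 90.0)
```

The point of the sketch is the sizing rule: the scheme only works if every surviving site has enough headroom to carry its larger share once a peer drops out.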

Sounds great, doesn’t it? However, that’s not what happened.

What happened first was that a power failure at Flexential’s facility caused an unexpected service disruption. Portland General Electric (PGE) was forced to shut down one of its independent power feeds into the building. The data center has multiple feeds, with some level of independence, that can power the facility. However, Flexential powered up its generators to supplement the feed that was down.

That approach, by the way, for those of you who don’t know data center best practices, is a no-no. You don’t use off-premises power and generators at the same time. Adding insult to injury, Flexential didn’t tell Cloudflare that it had sort of, kind of, transitioned to generator power.

Then, there was a ground fault on a PGE transformer feeding the data center. And, when I say ground fault, I don’t mean a short, like the one that has you going down into the basement to fix a fuse. I mean a 12,470-volt bad boy that took down the connection and all the generators in less time than it took you to read this sentence.

In theory, a bank of UPS batteries should have kept the servers going for 10 minutes, which in turn should have been enough time to crank the generators back on. Instead, the UPSs started dying in about four minutes, and the generators never made it back on in time anyway.

Whoops.

Perhaps no one could have saved the situation, but with an onsite overnight staff that “consisted of security and an unaccompanied technician who had only been on the job for a week,” it was hopeless.

In the meantime, Cloudflare discovered the hard way that some critical systems and newer services were not yet integrated into its high-availability setup. Furthermore, Cloudflare’s decision to keep logging systems out of the high-availability cluster, on the grounds that analytics delays would be acceptable, turned out to be wrong. Because Cloudflare’s staff couldn’t get a good look at the logs to see what was going wrong, the outage lingered on.

It turned out that, while the three data centers were “mostly” redundant, they weren’t completely so. The other two data centers running in the area did take over responsibility for the high-availability cluster and kept critical services online.

So far, so good. However, a subset of services that were supposed to be on the high-availability cluster had dependencies on services that were running exclusively in the dead data center.

Specifically, two critical services that process logs and power Cloudflare’s analytics, Kafka and ClickHouse, were only available in the offline data center. So, when services in the high-availability cluster called for Kafka and ClickHouse, they failed.
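This kind of hidden dependency is the sort of thing an automated audit can catch. The Python sketch below is one hedged illustration of the idea, not Cloudflare’s tooling: given where each service runs and what it depends on, it flags high-availability services that ultimately rely on something hosted in only one facility. Apart from Kafka and ClickHouse, every name here (`dashboard`, `auth`, `analytics-api`, the `dc-*` facilities, and the `find_single_site_dependencies` helper) is hypothetical.

```python
def find_single_site_dependencies(placement, depends_on, ha_services):
    """Map each HA service to the dependencies it has in fewer than two facilities.

    placement:   service name -> set of facilities where it runs
    depends_on:  service name -> list of direct dependencies
    ha_services: names of services expected to survive losing one facility
    Dependencies with unknown placement are treated as risky.
    """
    risky = {}
    for svc in sorted(ha_services):
        # Walk the dependency graph to collect direct and indirect dependencies.
        seen, stack = set(), list(depends_on.get(svc, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
        single = sorted(d for d in seen if len(placement.get(d, set())) < 2)
        if single:
            risky[svc] = single
    return risky


# Hypothetical inventory: only Kafka and ClickHouse live in a single facility.
all_sites = {"dc-a", "dc-b", "dc-c"}
placement = {
    "dashboard": all_sites,
    "auth": all_sites,
    "analytics-api": all_sites,
    "kafka": {"dc-a"},
    "clickhouse": {"dc-a"},
}
depends_on = {
    "dashboard": ["auth", "analytics-api"],
    "analytics-api": ["kafka", "clickhouse"],
}
report = find_single_site_dependencies(placement, depends_on, {"dashboard", "auth"})
```

In this made-up inventory, `dashboard` looks fully redundant on paper, but the audit reports that it indirectly depends on `clickhouse` and `kafka`, which exist in only one place.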

Cloudflare admits it was “far too lax about requiring new products and their associated databases to integrate with the high-availability cluster.” Moreover, far too many of its services depend on the availability of its core facilities. 

Lots of companies do things this way, but, Cloudflare CEO Matthew Prince admitted, this “does not play to Cloudflare’s strength. We are good at distributed systems. Throughout this incident, our global network continued to perform as expected, but far too many fail if the core is unavailable. We need to use the distributed systems products that we make available to all our customers for all our services, so they continue to function mostly as normal even if our core facilities are disrupted.”

Hours later, everything was finally back up and running, and it wasn’t easy. For example, almost all the power breakers were fried, and Flexential had to go and buy more to replace them all.

Suspecting that there had been multiple power surges, Cloudflare also decided the “only safe process to recover was to follow a complete bootstrap of the entire facility.” That approach meant rebuilding and rebooting all the servers, which took hours.

The incident was not fully resolved until November 4. Looking forward, Prince concluded: “We have the right systems and procedures in place to be able to withstand even the cascading string of failures we saw at our data center provider, but we need to be more rigorous about enforcing that they are followed and tested for unknown dependencies. This will have my full attention and the attention of a large portion of our team through the balance of the year. And the pain from the last couple of days will make us better.”

  • Article source: https://www.zdnet.com/home-and-office/networking/what-on-earth-happened-to-cloudflare-last-week/#ftag=RSSbaffb68