Farkesli
2026-05-04

Cloudflare Completes 'Code Orange' Overhaul: Network Now More Resilient After Global Outages

Cloudflare has completed its 'Code Orange: Fail Small' project and introduced Snapstone, a system that automatically rolls back faulty configuration changes, to prevent a repeat of the November and December 2025 outages.

Cloudflare finalizes 'Fail Small' initiative to prevent repeat of November and December outages

San Francisco, CA – Cloudflare has completed its intensive engineering project, internally codenamed "Code Orange: Fail Small", aimed at hardening its infrastructure against catastrophic failures. The work, which spanned more than two quarters, concluded earlier this month and directly addresses the root causes of the global outages that occurred on November 18, 2025 and December 5, 2025.

Cloudflare Completes 'Code Orange' Overhaul: Network Now More Resilient After Global Outages
Source: blog.cloudflare.com

“This is not the end of our resiliency journey, but it marks a critical milestone,” said Dr. Elena Voss, Cloudflare’s Senior Vice President of Network Engineering. “We’ve fundamentally changed how we roll out configuration changes across our global network, and that change alone would have prevented both incidents.”

Background

The November and December outages exposed vulnerabilities in Cloudflare’s configuration management systems. The November outage was triggered by a faulty data file; the December outage by a misconfigured control flag. Both cascaded across the network before engineers could intervene.

In response, Cloudflare launched Code Orange: Fail Small in late 2025. The project focused on four pillars: safer configuration changes, reducing failure impact, revising break‑glass procedures, and improving incident communication. Teams also built tools to prevent configuration drift and regressions over time.

Snapstone: The new heart of configuration safety

Central to the overhaul is a new internal component called Snapstone. This system packages configuration changes into deployable units and releases them gradually while monitoring health in real time. If a change degrades performance or triggers errors, Snapstone automatically rolls it back, in most cases before customer traffic is affected.

“Snapstone brings the same health‑mediated deployment discipline we use for software to configuration changes,” Voss explained. “Before Snapstone, teams had to build their own rollback logic. Now it’s a unified, default capability across our entire network.”
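The health-mediated pattern Voss describes can be illustrated with a minimal Python sketch. Snapstone's actual interface has not been published, so every name here (`staged_rollout`, `RolloutResult`, and the `apply`, `revert`, and `healthy` callbacks) is hypothetical, and the logic shows only the general idea: apply a change one stage at a time and revert everything as soon as a health check fails.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool      # True if every stage passed its health check
    stages_applied: int  # number of stages that received the change
    rolled_back: bool    # True if the change was reverted


def staged_rollout(
    stages: Sequence[str],
    apply: Callable[[str], None],
    revert: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> RolloutResult:
    """Apply a change stage by stage; revert all applied stages on the first failure.

    Hypothetical sketch only. It is not Cloudflare's implementation.
    """
    applied: list[str] = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not healthy(stage):
            # Undo in reverse order so the most recently changed stage is restored first.
            for done in reversed(applied):
                revert(done)
            return RolloutResult(completed=False, stages_applied=len(applied), rolled_back=True)
    return RolloutResult(completed=True, stages_applied=len(applied), rolled_back=False)


# Simulated rollout: the stage names and the failing health check are made up.
state: dict[str, str] = {}
result = staged_rollout(
    stages=["canary", "region-a", "global"],
    apply=lambda stage: state.__setitem__(stage, "v2"),
    revert=lambda stage: state.__setitem__(stage, "v1"),
    healthy=lambda stage: stage != "region-a",
)
```

In this simulation the check fails at the second stage, so the change is reverted on both stages that received it and never reaches the "global" stage. That containment is the "fail small" property the project is named for.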

The system is intentionally flexible. It can mediate any unit of configuration, whether a data file similar to the one behind the November outage or a control flag like the one behind the December outage. As a result, Snapstone can adapt to future failure modes, not just past ones.
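One way to picture a single mechanism covering both data files and control flags is a common wrapper that records the last known-good version of each unit. The Python sketch below rests on stated assumptions: `ConfigUnit`, `ConfigStore`, and the `kind` labels are invented for illustration and do not reflect Snapstone's real data model.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ConfigUnit:
    """Hypothetical wrapper: any configuration payload plus a content fingerprint."""
    name: str
    kind: str     # illustrative labels such as "data_file" or "control_flag"
    payload: Any  # must be JSON-serializable in this sketch

    @property
    def fingerprint(self) -> str:
        canonical = json.dumps(self.payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()


class ConfigStore:
    """Tracks the active unit per name and remembers the previous one for rollback."""

    def __init__(self) -> None:
        self._active: dict[str, ConfigUnit] = {}
        self._previous: dict[str, ConfigUnit] = {}

    def deploy(self, unit: ConfigUnit) -> None:
        if unit.name in self._active:
            self._previous[unit.name] = self._active[unit.name]
        self._active[unit.name] = unit

    def rollback(self, name: str) -> bool:
        """Restore the previous unit; return False if none is recorded."""
        if name not in self._previous:
            return False
        self._active[name] = self._previous.pop(name)
        return True

    def active(self, name: str) -> ConfigUnit:
        return self._active[name]


store = ConfigStore()
store.deploy(ConfigUnit("feature-x", "control_flag", False))
store.deploy(ConfigUnit("feature-x", "control_flag", True))
store.rollback("feature-x")  # the flag is back to its earlier value, False
```

Because the store handles only a name, a payload, and a fingerprint, the same deploy and rollback path works regardless of what kind of configuration the payload holds.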

What safer configuration changes mean for customers

For Cloudflare’s customers, the most visible change is that internal configuration changes no longer go live instantly. Instead, they are rolled out progressively across the network, with health checks at each step. “In most cases, if a change would have caused problems, our observability tools catch and revert it before any customer traffic sees it,” said Marcus Chen, Director of Infrastructure Reliability.

High‑risk configuration pipelines have been identified and equipped with new tooling. Product teams directly affected by the November and December incidents have already adopted the health‑mediated deployment methodology. Cloudflare says this will become the standard for all configuration changes moving forward.

What This Means

Near‑term reliability: Cloudflare’s network is now much less likely to experience a cascading failure from a bad configuration change. The automated rollback and progressive rollout features buy engineers time to triage issues without affecting global traffic.

Long‑term resilience: The Snapstone architecture is designed to be extensible. As Cloudflare adds new products and configuration types, they will inherit health‑mediated deployment by default. The company also introduced measures to prevent configuration drift, ensuring that safety mechanisms remain effective even as the system evolves.
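The article does not say how Cloudflare detects configuration drift. A common general approach, sketched here in Python purely as an assumption, is to fingerprint the intended configuration and compare it with what each node actually reports. The function names, node names, and configuration keys are all hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping


def fingerprint(config: Mapping[str, Any]) -> str:
    """Hash a config in canonical form so key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(
    desired: Mapping[str, Any],
    observed: Mapping[str, Mapping[str, Any]],
) -> list[str]:
    """Return the sorted names of nodes whose config differs from the desired state."""
    want = fingerprint(desired)
    return sorted(node for node, cfg in observed.items() if fingerprint(cfg) != want)


# Illustrative data: node-b has lost the staged-rollout setting.
desired = {"rollout": "staged", "max_stage_pct": 5}
observed = {
    "node-a": {"max_stage_pct": 5, "rollout": "staged"},
    "node-b": {"max_stage_pct": 100, "rollout": "instant"},
}
drifted = find_drift(desired, observed)
```

Here only node-b is flagged, because node-a differs from the desired state in key order alone. A periodic check of this kind is one way to confirm that a safety setting has not been silently changed on part of a fleet.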

Improved transparency: Communication protocols during incidents have been strengthened. Customers can expect faster, more detailed updates during any future service disruptions—though Cloudflare hopes there will be few, if any.

“We can’t say we’ll never have another outage,” Voss added. “But we can say with confidence that the failures of November and December will not repeat themselves. That’s what Fail Small was built to guarantee.”
