
Postmortem: Cloudflare’s November 2025 Outage

Introduction

On 18 November 2025 at 11:20 UTC, many websites and services across the internet started failing to load and perform correctly. The issue was ultimately traced to a simple filter missing from a Cloudflare database query.

High Level Architecture

Cloudflare uses ClickHouse, a columnar database built for fast analytical processing that supports data replication and sharding, to store the features used by its Bot Management machine learning system. That system generates “bot scores” that indicate whether traffic comes from a real human or from automated bots.

The ClickHouse clusters are sharded for scalability and performance. User-facing distributed tables (think of them as virtual tables) live in the `default` database and route subqueries to all shards; the shards that hold the data `default` queries from live in the `r0` database.
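For context, a ClickHouse distributed table holds no data of its own; it is a routing layer that fans each query out to the shard-local tables. A minimal sketch of what such a setup can look like, with an illustrative cluster name rather than Cloudflare’s actual schema:

CREATE TABLE default.http_requests_features
AS r0.http_requests_features
-- 'feature_cluster' is an illustrative cluster name, not Cloudflare's
ENGINE = Distributed('feature_cluster', 'r0', 'http_requests_features', rand());

A SELECT against default.http_requests_features is then rewritten into subqueries that run against r0.http_requests_features on every shard.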

The Change That Caused The Outage

The issue with the existing distributed subqueries in ClickHouse was that queries executed by users were forwarded to the shards under a shared system account rather than under the account of the user who ran the query. That system account had far more permissions than the individual user, so a malicious or accidental query could affect other Cloudflare users, and it was also harder to audit who did what when everything ran under the same account. Cloudflare called this out:

“Before today, ClickHouse users would only see the tables in the default database when querying metadata…

the change was made so that distributed subqueries can run under the initial user account, allowing fine-grained evaluation of limits and access grants.”

This is what drove the infrastructure improvement: permissions were changed so that users could also see metadata for the `r0` tables, making database access in queries more explicit.
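The shape of that change can be pictured as an extra grant on the `r0` database; the role name below is hypothetical and the statement is only an illustration of the idea, not Cloudflare’s actual migration:

-- hypothetical role; illustrates granting visibility into the r0 shard tables
GRANT SELECT ON r0.* TO bot_management_user;

Once users hold such a grant, metadata queries against system tables such as system.columns return entries for the `r0` tables in addition to the `default` ones.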

The outage was triggered when a SQL query, possibly a legacy one, did not specify which database it should be executed against:

SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

After the permission change, this query returned doubled results: one set of rows coming from `default` and the other from the sharded `r0` database.
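The fix for this class of problem is to make the query explicit about which database it targets. A sketch of the corrected metadata query, shown as an illustration of the general fix rather than Cloudflare’s exact patched statement:

SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
  AND database = 'default' -- restrict results to the user-facing database
ORDER BY name;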

The duplicated rows produced a feature file with roughly twice the expected amount of data. When that exceeded the file’s preset limit, the Rust code enforcing the limit panicked, the core proxy that depends on it began returning 500 errors, and roughly 20% of the internet was affected.

Once the engineering team found the issue using their observability tools, they fixed it by taking the following steps:

  • Stop generating new Bot Management configuration files
  • Manually inject a known-good version of the file
  • Restart the core proxy systems that were broken
  • Gradually bring services back online

Lessons Learned

Missing Test Case: A missed test case covering how the permission change would affect downstream systems is not uncommon when dealing with distributed systems.

Single Point Of Failure: This outage will make teams think about how they handle Cloudflare, or similar providers, as a single point of failure, so that when such a provider goes down it does not disrupt their business.

Observability: Set up your observability tools to catch errors before they hit production and to alert engineers so they can resolve them.

Design For Failure: Systems fail; always assume this, plan accordingly, and think about how you will roll back changes efficiently when something goes wrong.

Sources:

https://blog.cloudflare.com/20-percent-internet-upgrade

https://blog.cloudflare.com/18-november-2025-outage