Cloudflare Engineers Uncover Hidden ClickHouse Bottleneck Threatening Billion-Dollar Billing Pipeline
Billing Pipeline Grinds to a Crawl
Cloudflare’s daily billing aggregation jobs, which underpin hundreds of millions of dollars in usage revenue, unexpectedly slowed to a crawl after a recent migration. The delay threatened to disrupt invoice reconciliation and downstream systems, including fraud detection.

“It was a big problem when daily aggregation jobs slowed down,” said a Cloudflare engineer who worked on the fix. “Everything we normally check—I/O, memory, rows scanned, parts read—appeared normal. That’s when we knew it was something deeper.”
Hidden Bottleneck Discovered Inside ClickHouse
The bottleneck was traced to a subtle inefficiency in ClickHouse’s internals, specifically in how the database handles per-namespace data sorting. The affected platform, Ready-Analytics, stores petabytes of data from hundreds of applications in a single massive table sorted by namespace, indexID, and timestamp.
“We had to dig deep into ClickHouse’s query execution logic to find the culprit,” another engineer explained. “It wasn’t a resource issue—it was a design flaw in our own schema and retention policy.”
Background: The Rise of Ready-Analytics
Cloudflare built Ready-Analytics in early 2022 to simplify data onboarding for internal teams. Instead of creating custom tables, teams stream data into one unified table with a standard schema of 20 float fields, 20 string fields, a timestamp, and an indexID. The indexID is a string that forms part of the primary key, allowing each namespace’s data to be sorted optimally for its queries.
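The row layout and ordering described above can be sketched in a few lines of Python. This is only an illustration: the class name `ReadyAnalyticsRow` and the field names `floats`, `strings`, and `index_id` are hypothetical and are not taken from Cloudflare's actual schema. It shows how a namespace-chosen indexID string slots into a (namespace, indexID, timestamp) ordering.

```python
from dataclasses import dataclass, field
from datetime import datetime

NUM_FLOATS = 20   # field counts from the standard schema described above
NUM_STRINGS = 20


@dataclass
class ReadyAnalyticsRow:
    """Hypothetical in-memory model of one row; all names are illustrative."""
    namespace: str
    index_id: str
    timestamp: datetime
    floats: list = field(default_factory=lambda: [0.0] * NUM_FLOATS)
    strings: list = field(default_factory=lambda: [""] * NUM_STRINGS)

    def sort_key(self):
        # Order by namespace first, then by the namespace-chosen index_id,
        # then by time.
        return (self.namespace, self.index_id, self.timestamp)


rows = [
    ReadyAnalyticsRow("billing", "account-42", datetime(2024, 12, 2)),
    ReadyAnalyticsRow("billing", "account-07", datetime(2024, 12, 3)),
    ReadyAnalyticsRow("analytics", "zone-9", datetime(2024, 12, 1)),
]
ordered = sorted(rows, key=ReadyAnalyticsRow.sort_key)
ordered_ids = [(r.namespace, r.index_id) for r in ordered]
```

Because the namespace leads the key, each namespace's rows end up contiguous, and within a namespace the indexID decides which rows sit next to each other.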
By December 2024, Ready-Analytics held over 2 petabytes of data and ingested millions of rows per second. But its retention policy—dropping partitions older than 31 days—was a blunt instrument. Teams requiring longer retention had to skip Ready-Analytics entirely, opting for a much more complex conventional setup.

The Problem: One-Size-Fits-All Retention
Cloudflare’s use of ClickHouse predates the database’s native Time-to-Live (TTL) features, so the company built its own retention system based on daily partitioning. The Ready-Analytics table used a global 31-day retention period, which forced teams with legal or contractual obligations to store data for years to build separate infrastructure.
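As a rough illustration of retention by daily partition, the sketch below picks the partitions that fall outside a single global window. It is an assumption-laden model rather than Cloudflare's tooling: the function name `partitions_to_drop` is hypothetical, and partitions are represented simply as calendar dates.

```python
from datetime import date, timedelta

RETENTION_DAYS = 31  # global retention window stated in the article


def partitions_to_drop(partition_days, today, retention_days=RETENTION_DAYS):
    """Return the daily partitions older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    # A partition exactly `retention_days` old is kept; anything older goes.
    return sorted(day for day in partition_days if day < cutoff)


today = date(2024, 12, 31)
existing = [today - timedelta(days=n) for n in range(35)]  # 35 daily partitions
expired = partitions_to_drop(existing, today)
```

With one cutoff for the whole table, every namespace loses its data on the same day, which is the limitation the article describes.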
“This restriction meant many use cases couldn’t use Ready-Analytics,” a product manager noted. “We needed a per-namespace retention solution that didn’t require abandoning the platform.”
What This Means for Cloudflare and Users
The three patches written to fix the bottleneck not only restored billing pipeline performance but also enabled per-namespace retention, opening Ready-Analytics to teams that previously had to avoid it. The engineers have documented their approach to share with the ClickHouse community.
“The fix eliminated the hidden bottleneck and gave us the flexibility we needed,” said a lead engineer. “Now teams can set their own retention periods without impacting the entire cluster.”
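The article does not describe how per-namespace retention is implemented, so the following is purely a conceptual sketch of the idea: each namespace may override a default retention window. The override table, the namespace names, and the function `is_expired` are all hypothetical.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 31
# Hypothetical per-namespace overrides; names and values are made up.
RETENTION_OVERRIDES = {"audit-logs": 365 * 7}


def is_expired(namespace, partition_day, today):
    """True if this namespace's data for partition_day is past retention."""
    days = RETENTION_OVERRIDES.get(namespace, DEFAULT_RETENTION_DAYS)
    return partition_day < today - timedelta(days=days)


today = date(2024, 12, 31)
old_day = today - timedelta(days=90)
```

Here data that is 90 days old would be expired for a namespace on the default window but retained for the hypothetical `audit-logs` namespace.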
Cloudflare expects the improvements to accelerate onboarding for internal teams and reduce operational overhead. Users will benefit from more accurate and timely billing, while the company avoids revenue reconciliation headaches.