When Slower is Better: How a Costco Queue Revolutionized Our DR Strategy (Part 1)
From pandemic queues to production systems - discovering the counter-intuitive solution to disaster recovery costs
📖 This is Part 1 of a 2-part series. Read Part 2: Implementation and Results for the technical deep dive and results.
The Costco Revelation
It was the summer of 2020, and the COVID pandemic had just turned the world upside down. Like millions of others, I found myself standing in an impossibly long line outside a Costco warehouse store, watching the queue snake around the building and into the parking lot. After nearly an hour of inching forward at a glacial pace, I made a decision that would later reshape how my team approached disaster recovery: I gave up and went home.
Walking back to my car, something clicked. What I had just done wasn't simply abandoning a shopping trip—it was a textbook example of "request backoff." The store hadn't asked me to leave or turned me away at the door. I had voluntarily decided that the wait time exceeded my tolerance, and I backed off to try again later. One fewer customer for the overwhelmed system to serve.
This phenomenon was nothing new to me as a technologist. In distributed systems, when response times increase, clients naturally time out and retry after a backoff period. It's a well-understood behavior that helps prevent system collapse when demand exceeds capacity. But standing there in that parking lot, I realized I had just experienced it firsthand in the physical world—and it got me thinking.
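As a rough illustration of that client behavior (a generic sketch, not Box's production code; the function and parameter names here are hypothetical), a typical timeout-and-retry loop with exponential backoff and jitter looks something like this:

```python
import random
import time

import requests  # assumes the widely used "requests" HTTP library is installed


def fetch_with_backoff(url, max_attempts=5, base_delay=0.5, timeout=2.0):
    """Retry a request with exponential backoff and full jitter.

    When the server slows down, calls start timing out and the client waits
    progressively longer between attempts -- effectively taking itself out
    of the queue, much like a shopper walking away from a long line.
    """
    for attempt in range(max_attempts):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            time.sleep(delay)
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

The jitter matters: if every client retried on the same schedule, the retries themselves would arrive in synchronized waves and make the overload worse.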

At the time, my team was grappling with a challenge that keeps engineering leaders awake at night: how to maintain robust disaster recovery capabilities without breaking the budget. We needed to find a way to reduce our computing resource footprint by 20% during disaster scenarios, allowing our DR region to be provisioned with significantly less capacity than our primary infrastructure.
In today's economic climate, this cost optimization has become even more critical. Everyone wants high availability, but no one wants to pay for it.
That's when the Costco experience sparked an idea. What if we could intentionally increase system latency during disaster scenarios to naturally reduce throughput, just like how increased wait times caused customers to voluntarily leave the queue? Instead of complex load balancing or expensive infrastructure changes, could we use this principle to solve our capacity problem?
The elegance lay in its simplicity. Sometimes the most effective solutions hide in plain sight, waiting for the right moment of inspiration to reveal themselves. In this case, that moment came while walking away from a very long line at a warehouse store during a global pandemic.
Setting the Stage: Understanding the Challenge
Before diving into our solution, it's worth establishing the key concepts and challenges that drive modern disaster recovery strategies. Understanding these fundamentals helps explain why traditional approaches often fall short and why innovative solutions become necessary.
What is Disaster Recovery?
Disaster Recovery (DR) is an organization's ability to restore access and functionality to tech infrastructure after a disruptive event. This could be anything from natural disasters like earthquakes and floods to human-caused incidents like cyberattacks, hardware failures, or even simple configuration errors that bring down critical systems.
Important Note: In this blog series, we're focusing specifically on the disaster recovery of compute resources—the infrastructure where services run, such as Kubernetes clusters, application servers, and processing capacity. We are not addressing the DR strategies for storage systems, databases, or persistent metadata infrastructure, which have their own distinct challenges and solutions.
Traditional DR Approaches and The Cost Problem
The disaster recovery landscape has evolved around two main patterns:
Active-Passive - Your primary environment serves all traffic while a secondary environment sits idle, ready to take over during a disaster.
Active-Active - Multiple environments serve traffic simultaneously. If one fails, the others continue operating.
Both approaches share a fundamental cost challenge. Let's define 1N as the compute resources required to serve 100% of your application's workload under normal conditions.
In a traditional active-passive setup, you provision:
Active region: 1N capacity (serving live traffic)
Passive region: 1N capacity (idle, waiting for disaster)
Total cost: 2N capacity
But what if you could reduce your application's resource requirements by 20% during a disaster? Then you'd only need:
Active region: 1N capacity (normal operations)
Passive region: 0.8N capacity (reduced load during disaster)
Total cost: 1.8N capacity
Moving from 2N to 1.8N represents a 10% reduction in total infrastructure costs. When you're managing hundreds of thousands of CPU cores, this translates to substantial savings.
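The arithmetic is worth a quick sanity check, since a 20% smaller DR region yields only a 10% total saving: the reduction applies to just one of the two regions. A few illustrative lines:

```python
# Define 1N = 1.0, the capacity needed to serve 100% of the normal workload.
active = 1.0           # primary region, serving live traffic
passive_full = 1.0     # traditional passive region, mirrors the primary
passive_reduced = 0.8  # passive region sized for a 20%-reduced DR workload

traditional_total = active + passive_full     # 2.0N
optimized_total = active + passive_reduced    # 1.8N

savings = 1 - optimized_total / traditional_total
print(f"{optimized_total}N vs {traditional_total}N -> {savings:.0%} lower total capacity")
# prints: 1.8N vs 2.0N -> 10% lower total capacity
```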
The DR Cost Problem
With the technical foundation established, let's examine the specific challenge we faced at Box and why traditional disaster recovery approaches weren't meeting our needs.
Our DR Challenge
At Box, we use both active-active and active-passive disaster recovery configurations across different services and flows. This approach provides robust protection but comes with substantial infrastructure costs.
When planning our DR compute capacity expansion, we saw an opportunity to challenge the traditional assumption that disaster recovery regions must mirror the full capacity of production environments. The question became: could we provision our DR region 20% smaller while maintaining acceptable service levels during a disaster scenario?
We're operating at significant scale—hundreds of thousands of CPU cores across our infrastructure. At this level, disaster recovery isn't just a technical challenge; it's a major financial commitment. We're talking about millions of dollars in compute infrastructure costs for our DR capabilities alone.
The Innovation Opportunity
Rather than simply accepting the standard DR cost structure, we had a six-month timeline to explore whether there was a better way. The goal wasn't to compromise our disaster recovery capabilities, but to find a method that could maintain service availability while requiring 20% less compute capacity in our DR regions.
Why This Mattered
The potential impact was substantial. In an infrastructure environment measured in hundreds of thousands of cores, a 20% reduction in DR capacity requirements could translate to millions of dollars in cost savings annually. But the benefits went beyond just the immediate financial impact.
Success would prove that disaster recovery could be both robust and cost-efficient, challenging the industry assumption that comprehensive DR requires full capacity duplication. It would also provide a repeatable approach that could scale with our continued growth without proportionally increasing DR costs.
Why Traditional Solutions Fall Short
Before we could justify pursuing an unconventional approach like latency injection, we needed to understand why existing disaster recovery strategies weren't solving our cost problem. The traditional toolkit for managing capacity during disasters had several options, but each came with significant limitations that made them unsuitable for our needs.
Load Shedding: The Blunt Instrument
Load shedding intentionally drops requests when systems become overloaded, typically by monitoring a resource such as CPU usage and rejecting traffic once a threshold is crossed. However, it has key limitations: most implementations drop requests indiscriminately, surface explicit errors to users in the middle of a disaster, offer only a loose and unpredictable relationship between the requests dropped and the capacity actually freed, and tend to be overly aggressive.
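To make the "blunt instrument" point concrete, here is a deliberately minimal, hypothetical sketch of threshold-based shedding (assuming a host CPU metric from a library like psutil, not any particular production system):

```python
import psutil  # assumes the third-party psutil library for host metrics

CPU_SHED_THRESHOLD = 85.0  # percent; above this, start shedding load


def handle_request(request, process):
    """Illustrative threshold-based load shedding.

    Note the blunt behavior described above: the request is not slowed down
    or queued, it is simply rejected with an error the user sees.
    """
    if psutil.cpu_percent(interval=None) > CPU_SHED_THRESHOLD:
        # Indiscriminate rejection: no notion of request priority, and no
        # guarantee of how much capacity each dropped request actually frees.
        return {"status": 503, "body": "Service overloaded, please retry later"}
    return process(request)
```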
Capacity Scaling Limitations
Cloud-based auto-scaling seemed appealing—let the system automatically provision capacity during disasters rather than pre-provisioning it. However, auto-scaling has critical limitations during disasters: slow activation times, limited available capacity when multiple customers scale simultaneously, and expensive on-demand pricing that compounds disaster costs.
The Gap in the Market
Each traditional approach optimized for a different dimension—infrastructure redundancy prioritized reliability over cost, load shedding prioritized system stability over user experience, and auto-scaling prioritized flexibility over predictability.
What we needed was an approach that could:
Predictably reduce capacity requirements during disasters
Maintain acceptable user experience rather than failing requests
Work with existing infrastructure without requiring expensive redundancy
Provide gradual, tunable control rather than binary on/off behavior
None of the existing solutions addressed all these requirements simultaneously. This gap led us to explore whether we could achieve capacity reduction through a fundamentally different mechanism—one that worked with natural client behavior rather than against it.
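To preview that mechanism before the Part 2 deep dive, here is a deliberately simplified, hypothetical sketch of latency injection: during a disaster, a small artificial delay is added to each response, and clients' own timeout-and-backoff behavior reduces the offered load.

```python
import asyncio

# Tunable knob: 0 in normal operation, raised gradually during a disaster.
INJECTED_DELAY_SECONDS = 0.0


async def handle_request(request, process):
    """Latency injection sketch: slow responses down instead of failing them.

    Unlike load shedding, every request still succeeds; the added delay lowers
    effective throughput because clients naturally back off on their own.
    """
    if INJECTED_DELAY_SECONDS > 0:
        await asyncio.sleep(INJECTED_DELAY_SECONDS)
    return await process(request)
```

Part 2 covers how the delay was actually tuned in practice and what it did to real traffic.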
📖 Continue Reading: Part 2: When Slower is Better: Implementation and Results
In Part 2, we'll dive deep into the implementation details of our latency injection approach, show you the exact results we achieved, and provide a practical framework for when this technique works (and when it doesn't).
About the author: Advait Mishra, Engineering Manager for the Core Storage team at Box.com, responsible for building and maintaining exascale storage infrastructure. An engineering leader with expertise in distributed systems, cloud computing, and data analytics.