Hot partitions anti-pattern

What is it?

A hot partition occurs when a disproportionate amount of read or write traffic is concentrated on a single partition key value. Since DynamoDB distributes data across partitions based on the partition key, having all your traffic hit one partition creates a bottleneck that prevents your application from scaling.

Why is it a problem?

DynamoDB's performance relies on its distributed architecture, and hot partitions undermine it:

  • Throttling: Individual partitions have hard throughput limits (3,000 RCU and 1,000 WCU)
  • Performance Degradation: Requests queue up, causing high latency
  • Scalability Ceiling: Can't scale beyond single partition limits
  • Wasted Capacity: Other partitions sit idle while one is overloaded
  • Cascading Failures: Throttling can cause retries, making the problem worse

The fundamental issue

Even if your table has 10,000 WCU provisioned, when all writes go to a single partition key you're limited to the 1,000 WCU that one partition can absorb. The other 9,000 WCU go unused.
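
As a back-of-envelope sketch (plain arithmetic, not a library feature): when a known share of traffic targets one key, the per-partition limit caps the table's usable throughput regardless of what is provisioned.

// Rough estimate only; real behavior also depends on adaptive capacity
// and on how DynamoDB splits partitions over time.
const provisionedWcu = 10000     // table-level provisioned writes
const partitionLimitWcu = 1000   // hard per-partition write limit
const hottestKeyShare = 1.0      // here, 100% of writes hit one partition key

// The hot key saturates its partition, so usable throughput is capped at
// partitionLimit / share (and never exceeds what is provisioned)
const usableWcu = Math.min(provisionedWcu, partitionLimitWcu / hottestKeyShare)
const wastedWcu = provisionedWcu - usableWcu

console.log({ usableWcu, wastedWcu })  // { usableWcu: 1000, wastedWcu: 9000 }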

Visual representation

Hot Partition Problem

graph TB
    subgraph "❌ Hot Partition"
        A[Application] -->|90% traffic| P1[Partition 1<br/>OVERLOADED<br/>Throttling]
        A -->|5% traffic| P2[Partition 2<br/>Idle]
        A -->|3% traffic| P3[Partition 3<br/>Idle]
        A -->|2% traffic| P4[Partition 4<br/>Idle]
        style P1 fill:#f44336
        style P2 fill:#ccc
        style P3 fill:#ccc
        style P4 fill:#ccc
    end

    subgraph "✅ Distributed Load"
        B[Application] -->|25% traffic| Q1[Partition 1<br/>Healthy]
        B -->|25% traffic| Q2[Partition 2<br/>Healthy]
        B -->|25% traffic| Q3[Partition 3<br/>Healthy]
        B -->|25% traffic| Q4[Partition 4<br/>Healthy]
        style Q1 fill:#4CAF50
        style Q2 fill:#4CAF50
        style Q3 fill:#4CAF50
        style Q4 fill:#4CAF50
    end

Common causes

1. Low-cardinality partition keys

Using attributes with few possible values as partition keys:

// ❌ BAD: Status as partition key
{
  pk: 'STATUS#ACTIVE',  // Most users are active!
  sk: `USER#${userId}`,
  ...userData
}

// All active users (99% of users) in ONE partition

2. Time-based keys without distribution

Using current date/time as partition key:

// ❌ BAD: Current date as partition key
{
  pk: `DATE#${new Date().toISOString().split('T')[0]}`,  // Today's date
  sk: `EVENT#${eventId}`,
  ...eventData
}

// All today's events in ONE partition
// Tomorrow, new hot partition is created

3. Viral content or trending items

A single popular item receiving most of the traffic:

// ❌ BAD: Popular post getting all the traffic
{
  pk: `POST#${viralPostId}`,  // This one post
  sk: `COMMENT#${commentId}`,
  ...commentData
}

// One viral post = one hot partition

4. Global counters

Single counter for entire application:

// ❌ BAD: Global counter
{
  pk: 'COUNTER#GLOBAL',
  sk: 'VIEWS',
  count: 1000000
}

// Every view increments the same item

Example of the problem

❌ Anti-pattern: Status-based partitioning

import { TableClient } from '@ddb-lib/client'

const table = new TableClient({
  tableName: 'Users',
  // ... config
})

// BAD: All active users in one partition
await table.put({
  pk: 'STATUS#ACTIVE',  // Same for millions of users!
  sk: `USER#${userId}`,
  email: user.email,
  name: user.name,
  status: 'ACTIVE'
})

// Query active users
const activeUsers = await table.query({
  keyCondition: {
    pk: 'STATUS#ACTIVE'  // Hot partition!
  }
})

// This partition gets hammered with:
// - All new user registrations
// - All active user queries
// - All active user updates

❌ Anti-pattern: Time-based hot partition

// BAD: Current hour as partition key
const currentHour = new Date().toISOString().slice(0, 13)

await table.put({
  pk: `HOUR#${currentHour}`,  // Same for all events this hour
  sk: `EVENT#${eventId}`,
  timestamp: Date.now(),
  ...eventData
})

// All events in the current hour hit the same partition
// Next hour, new hot partition is created

The solution

✅ Solution 1: Use high-cardinality keys

// GOOD: User ID as partition key (unique per user)
await table.put({
  pk: `USER#${userId}`,  // Unique partition per user
  sk: 'PROFILE',
  email: user.email,
  name: user.name,
  status: 'ACTIVE'
})

// Load is distributed across all users

✅ Solution 2: Distribute with sharding

import { PatternHelpers } from '@ddb-lib/core'

// GOOD: Distribute status across shards
const shardCount = 10
const pk = PatternHelpers.distributedKey('STATUS#ACTIVE', shardCount)
// Returns: 'STATUS#ACTIVE#SHARD#7' (random 0-9)

await table.put({
  pk,  // Distributed across 10 partitions
  sk: `USER#${userId}`,
  email: user.email,
  name: user.name,
  status: 'ACTIVE'
})

// Query all shards when needed
const allActiveUsers = []
for (let i = 0; i < shardCount; i++) {
  const result = await table.query({
    keyCondition: {
      pk: `STATUS#ACTIVE#SHARD#${i}`
    }
  })
  allActiveUsers.push(...result.items)
}
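
The loop above runs the shard queries sequentially. Because the shards are independent, a common refinement is to fan the queries out in parallel. This is a sketch against the same table.query API used above; queryAllShards is a hypothetical helper name, and pagination of each shard's results is omitted.

// Scatter-gather across all shards in parallel
// (hypothetical helper, not part of @ddb-lib)
async function queryAllShards(basePk: string, shardCount: number) {
  const shardQueries = Array.from({ length: shardCount }, (_, shard) =>
    table.query({
      keyCondition: { pk: `${basePk}#SHARD#${shard}` }
    })
  )

  // All shard queries run concurrently; merge their items into one array
  const results = await Promise.all(shardQueries)
  return results.flatMap(result => result.items)
}

const allActiveUsersParallel = await queryAllShards('STATUS#ACTIVE', shardCount)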

✅ Solution 3: Add a random suffix to time-based keys

// GOOD: Distribute time-based data
const currentHour = new Date().toISOString().slice(0, 13)
const shard = Math.floor(Math.random() * 10)

await table.put({
  pk: `HOUR#${currentHour}#SHARD#${shard}`,
  sk: `EVENT#${eventId}`,
  timestamp: Date.now(),
  ...eventData
})

// Load distributed across 10 partitions per hour

✅ Solution 4: Use write sharding for counters

// GOOD: Distributed counter
const shardCount = 10
const shard = Math.floor(Math.random() * shardCount)

// Increment random shard
await table.update({
  pk: `COUNTER#VIEWS#SHARD#${shard}`,
  sk: 'COUNT',
  updates: {
    count: { increment: 1 }
  }
})

// Read all shards to get total
let totalViews = 0
for (let i = 0; i < shardCount; i++) {
  const result = await table.get({
    pk: `COUNTER#VIEWS#SHARD#${i}`,
    sk: 'COUNT'
  })
  totalViews += result.item?.count || 0
}

✅ Solution 5: Use a GSI for alternative access

// GOOD: Use GSI for status queries
// Main table: pk = USER#{userId}, sk = PROFILE
// GSI: pk = STATUS#{status}#SHARD#{shard}, sk = USER#{userId}

await table.put({
  pk: `USER#${userId}`,
  sk: 'PROFILE',
  status: 'ACTIVE',
  statusShard: `STATUS#ACTIVE#SHARD#${Math.floor(Math.random() * 10)}`,
  ...userData
})

// Query the GSI with distributed load (each request reads a single shard)
const shard = Math.floor(Math.random() * 10)
const activeUsersInShard = await table.query({
  indexName: 'StatusIndex',
  keyCondition: {
    pk: `STATUS#ACTIVE#SHARD#${shard}`
  }
})

// To list every active user, query all 10 shards as in Solution 2

Impact metrics

Throughput comparison

Scenario                  Partition Strategy       Max Throughput   Throttling
Hot partition             Single partition         1,000 WCU        Frequent
Distributed (10 shards)   10 partitions            10,000 WCU       Rare
Distributed (100 shards)  100 partitions           100,000 WCU      Very rare
High-cardinality key      Millions of partitions   Unlimited*       None

*Subject to table-level limits

Latency impact

Metric          Hot Partition   Distributed
P50 latency     50ms            10ms
P99 latency     2,000ms         25ms
P99.9 latency   10,000ms        50ms
Error rate      15%             <0.1%

Cost impact

Hot partitions waste capacity:

Table Capacity: 10,000 WCU
Hot Partition Limit: 1,000 WCU
Wasted Capacity: 9,000 WCU (90%)
Wasted Cost: $4,500/month (at $0.50 per WCU/month)
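
The same figures as a quick calculation (illustrative pricing; check current DynamoDB provisioned-capacity rates for your region):

const provisionedWcu = 10000
const hotPartitionLimitWcu = 1000
const pricePerWcuMonth = 0.5  // USD, illustrative rate from the breakdown above

const wastedWcu = provisionedWcu - hotPartitionLimitWcu   // 9,000 WCU
const wastedCostPerMonth = wastedWcu * pricePerWcuMonth   // $4,500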

Detection

The anti-pattern detector can identify hot partitions:

import { StatsCollector, AntiPatternDetector } from '@ddb-lib/stats'

const stats = new StatsCollector()
const detector = new AntiPatternDetector(stats)

// After running operations
const issues = detector.detectHotPartitions()

for (const issue of issues) {
  console.log(issue.message)
  // "Partition STATUS#ACTIVE receiving 85% of traffic"
  // "Partition key 'POST#12345' has 1,000 requests/sec (hot partition)"
}

Warning signs

You might have hot partitions if:

  • You see "ProvisionedThroughputExceededException" despite having capacity
  • Latency is high even with low overall traffic
  • Some operations are fast while others are slow
  • CloudWatch Contributor Insights shows traffic concentrated on a few keys
  • Increasing table capacity doesn't help
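
One concrete check for the first warning sign is to read the table's throttle metrics from CloudWatch. Below is a minimal sketch using the AWS SDK v3 CloudWatch client; the 'Users' table name comes from the earlier examples, and the one-hour window is arbitrary. For per-key visibility, CloudWatch Contributor Insights for DynamoDB can report the most-accessed and most-throttled keys directly.

import { CloudWatchClient, GetMetricStatisticsCommand } from '@aws-sdk/client-cloudwatch'

const cloudwatch = new CloudWatchClient({})

// Sum write-throttle events for the table over the last hour, in 5-minute buckets
const { Datapoints } = await cloudwatch.send(new GetMetricStatisticsCommand({
  Namespace: 'AWS/DynamoDB',
  MetricName: 'WriteThrottleEvents',
  Dimensions: [{ Name: 'TableName', Value: 'Users' }],
  StartTime: new Date(Date.now() - 60 * 60 * 1000),
  EndTime: new Date(),
  Period: 300,
  Statistics: ['Sum']
}))

const throttleEvents = (Datapoints ?? []).reduce((sum, d) => sum + (d.Sum ?? 0), 0)

// Throttling while consumed capacity stays well below the provisioned level
// is the classic hot-partition signature
if (throttleEvents > 0) {
  console.warn(`Users table: ${throttleEvents} write-throttle events in the last hour`)
}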

Choosing shard count

How many shards should you use?

// Calculate required shards
const targetThroughput = 5000  // WCU needed
const partitionLimit = 1000     // WCU per partition
const minShards = Math.ceil(targetThroughput / partitionLimit)
// Result: 5 shards minimum

// Add buffer for growth
const recommendedShards = minShards * 2
// Result: 10 shards recommended

Shard count guidelines

Traffic Level              Recommended Shards   Notes
Low (<500 req/s)           5-10                 Minimal overhead
Medium (500-2000 req/s)    10-20                Good balance
High (2000-5000 req/s)     20-50                Scales well
Very High (>5000 req/s)    50-100+              Maximum distribution

Trade-offs

Sharding has trade-offs to consider:

Pros

  • ✅ Eliminates hot partitions
  • ✅ Enables horizontal scaling
  • ✅ Reduces throttling
  • ✅ Improves latency

Cons

  • ❌ Queries require multiple requests (scatter-gather)
  • ❌ Slightly more complex code
  • ❌ Aggregations require reading all shards
  • ❌ More storage for shard metadata

When to shard

Use sharding when:

  • Traffic exceeds 1,000 WCU or 3,000 RCU per partition key
  • You have low-cardinality partition keys
  • You need to scale beyond single-partition limits
  • You're experiencing throttling

Don't shard when:

  • Traffic is low (<100 req/s per partition)
  • You have naturally high-cardinality keys (user IDs, order IDs)
  • Query performance is more important than write throughput
  • Your access pattern requires strong consistency across items
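
A tiny illustrative helper (hypothetical, not part of @ddb-lib) that restates those thresholds in code, assuming you can estimate peak traffic for your hottest partition key:

// Hypothetical helper: decide whether a partition key needs write sharding.
// The limits restate the per-partition numbers above; the 20% headroom is an assumption.
interface HotKeyEstimate {
  peakWcuPerKey: number  // peak writes expected against the hottest key value
  peakRcuPerKey: number  // peak reads expected against the hottest key value
}

function shouldShard({ peakWcuPerKey, peakRcuPerKey }: HotKeyEstimate): boolean {
  const writeLimit = 1000
  const readLimit = 3000
  const headroom = 0.8  // start sharding before the hard limit is reached
  return peakWcuPerKey > writeLimit * headroom || peakRcuPerKey > readLimit * headroom
}

// A status key whose 'ACTIVE' value takes 5,000 WCU clearly needs sharding
shouldShard({ peakWcuPerKey: 5000, peakRcuPerKey: 500 })  // true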

Summary

The Problem: Concentrating traffic on a single partition key creates a bottleneck that prevents scaling and causes throttling.

The Solution: Use high-cardinality partition keys or distribute load across multiple shards using helper functions.

The Impact: Proper distribution can increase throughput by 10-100x and eliminate throttling.

Remember: DynamoDB's power comes from its distributed architecture. Hot partitions prevent you from leveraging that power. Design your keys to distribute load evenly across partitions.