Table scans anti-pattern

What is it?

The table scans anti-pattern occurs when developers use the Scan operation to find items in a DynamoDB table when a Query operation would be more appropriate. A scan examines every single item in your table, regardless of whether it matches your criteria, and then filters the results afterward.
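
One way to see the difference directly is to compare Count and ScannedCount on a scan response. The sketch below uses the raw AWS SDK v3 document client (the table name, attribute names, and filter values are assumptions):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, ScanCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// A filtered scan still reads every item; the filter only trims the response
const response = await client.send(new ScanCommand({
  TableName: 'Users',                                  // assumed table name
  FilterExpression: '#status = :active',
  ExpressionAttributeNames: { '#status': 'status' },   // "status" is a reserved word
  ExpressionAttributeValues: { ':active': 'ACTIVE' }
}))

// Count = items that passed the filter; ScannedCount = items actually read (and paid for)
console.log(response.Count, response.ScannedCount)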

Why is it a problem?

Scans are one of the most expensive and slowest operations in DynamoDB:

  • Performance: Scans have O(n) complexity - they get slower as your table grows
  • Cost: You pay for reading every item examined, not just items returned
  • Throughput: Scans consume massive amounts of read capacity
  • Latency: Response times increase linearly with table size
  • Scalability: A scan can't target a single partition, so it can't exploit DynamoDB's key-based partitioning the way a query can

The hidden cost

If you have a table with 1 million items and you scan for 10 specific items:

  • Items examined: 1,000,000
  • Items returned: 10
  • RCU consumed: up to ~1,000,000, depending on item size and read consistency (you pay for every item examined, not just the 10 returned)
  • Time: Several seconds to minutes
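
As a rough back-of-the-envelope check of those numbers (assuming ~4 KB items and strongly consistent reads, where one RCU covers a single 4 KB read):

// Capacity math for the scenario above (item size and consistency are assumptions)
const itemsInTable = 1_000_000
const itemSizeKb = 4                        // assumed average item size
const rcuPerItem = itemSizeKb / 4           // 1 RCU per 4 KB, strongly consistent

const scanRcu = itemsInTable * rcuPerItem   // ~1,000,000 RCUs: every item is read
const queryRcu = 10 * rcuPerItem            // ~10 RCUs: only the matching items are read

console.log({ scanRcu, queryRcu })          // { scanRcu: 1000000, queryRcu: 10 }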

Visual representation

Scan vs Query Operation

graph TB
    subgraph "❌ Scan Operation"
        S[Scan Request] --> P1[Partition 1<br/>Read ALL items]
        S --> P2[Partition 2<br/>Read ALL items]
        S --> P3[Partition 3<br/>Read ALL items]
        S --> P4[Partition 4<br/>Read ALL items]
        P1 --> F[Filter Results]
        P2 --> F
        P3 --> F
        P4 --> F
        F --> R1[Return 10 items<br/>Examined 1M items]
        style R1 fill:#f44336
    end

    subgraph "✅ Query Operation"
        Q[Query Request] --> T[Target Partition]
        T --> R2[Return 10 items<br/>Examined 10 items]
        style R2 fill:#4CAF50
    end

Example of the problem

❌ anti-pattern: using scan

import { TableClient } from '@ddb-lib/client'

const table = new TableClient({
  tableName: 'Users',
  // ... config
})

// BAD: Scanning entire table to find active users
const result = await table.scan({
  filter: {
    status: { eq: 'ACTIVE' },
    createdAt: { gt: '2024-01-01' }
  }
})

// This examines EVERY item in the table!
// If you have 1M users, this reads 1M items
// Even if only 100 are active and match the date

Common scan mistakes

// BAD: Scanning to find a specific user
const user = await table.scan({
  filter: {
    email: { eq: 'user@example.com' }
  }
})

// BAD: Scanning to get recent orders
const orders = await table.scan({
  filter: {
    userId: { eq: '123' },
    orderDate: { gt: '2024-01-01' }
  }
})

// BAD: Scanning to count items
const count = await table.scan({
  filter: {
    type: { eq: 'ORDER' }
  },
  select: 'COUNT'
})
// Still reads every item to count!

The solution

✅ use query with proper keys

// GOOD: Query with partition key
const result = await table.query({
  keyCondition: {
    pk: 'STATUS#ACTIVE',
    sk: { beginsWith: 'USER#' }
  },
  filter: {
    createdAt: { gt: '2024-01-01' }
  }
})

// Only reads items in the STATUS#ACTIVE partition
// Filter is applied after reading, but only on relevant items

✅ design keys for your access patterns

// Design your data model to support queries
interface User {
  pk: string    // STATUS#ACTIVE or STATUS#INACTIVE
  sk: string    // USER#{userId}
  email: string
  createdAt: string
  // ... other attributes
}

// Now you can query efficiently
const activeUsers = await table.query({
  keyCondition: {
    pk: 'STATUS#ACTIVE'
  }
})

// Or query a specific user
const user = await table.query({
  keyCondition: {
    pk: 'STATUS#ACTIVE',
    sk: 'USER#123'
  }
})

✅ use GSI for alternative access patterns

// If you need to query by email, create a GSI
// GSI: email (PK), userId (SK)

const userByEmail = await table.query({
  indexName: 'EmailIndex',
  keyCondition: {
    email: 'user@example.com'
  }
})

// Efficient query on GSI instead of table scan
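
How that index gets created depends on your deployment tooling. As a sketch using the AWS SDK v3 directly (the index name, key schema, and projection are assumptions; add ProvisionedThroughput if the table uses provisioned capacity):

import { DynamoDBClient, UpdateTableCommand } from '@aws-sdk/client-dynamodb'

const ddb = new DynamoDBClient({})

// One-time, infrastructure-level change: add an EmailIndex GSI to the Users table
await ddb.send(new UpdateTableCommand({
  TableName: 'Users',
  AttributeDefinitions: [
    { AttributeName: 'email', AttributeType: 'S' },
    { AttributeName: 'userId', AttributeType: 'S' }
  ],
  GlobalSecondaryIndexUpdates: [{
    Create: {
      IndexName: 'EmailIndex',
      KeySchema: [
        { AttributeName: 'email', KeyType: 'HASH' },
        { AttributeName: 'userId', KeyType: 'RANGE' }
      ],
      Projection: { ProjectionType: 'ALL' }
    }
  }]
}))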

✅ use get for single items

// If you know the exact key, use get
const user = await table.get({
  pk: 'USER#123',
  sk: 'PROFILE'
})

// Single-item read, most efficient operation

Impact metrics

Performance comparison

| Operation | Items in Table | Items Returned | Items Examined | RCU Consumed | Latency |
| --- | --- | --- | --- | --- | --- |
| Scan | 1,000,000 | 10 | 1,000,000 | ~1,000,000 | 5-30 seconds |
| Query | 1,000,000 | 10 | 10 | ~10 | 10-50 ms |
| Get | 1,000,000 | 1 | 1 | 1 | 5-20 ms |

Cost impact

Assuming $0.25 per million read requests:

| Operation | Monthly Requests | Items Examined | Monthly Cost |
| --- | --- | --- | --- |
| Scan | 10,000 | 10 billion | $2,500 |
| Query | 10,000 | 100,000 | $0.025 |
| Savings | - | - | $2,499.98 (99.99%) |
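
The dollar figures follow directly from the request volumes above (the price per million reads and the request counts are the stated assumptions):

// Deriving the cost table (assumed $0.25 per million reads, 10,000 requests per month)
const pricePerMillionReads = 0.25
const requestsPerMonth = 10_000

const scanCost = (requestsPerMonth * 1_000_000 / 1_000_000) * pricePerMillionReads   // $2,500
const queryCost = (requestsPerMonth * 10 / 1_000_000) * pricePerMillionReads         // $0.025

console.log({ scanCost, queryCost, savings: scanCost - queryCost })                  // savings ≈ $2,499.98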

When scans are acceptable

Scans are appropriate in limited scenarios:

✅ acceptable use cases

  1. One-time data migrations

    // Migrating data structure (run once)
    const allItems = await table.scan({})
    

  2. Analytics on small tables (<1,000 items)

    // Small lookup table
    const allCategories = await table.scan({})
    

  3. Admin operations during maintenance windows

    // Cleanup job during low-traffic period
    const oldItems = await table.scan({
      filter: {
        expiresAt: { lt: Date.now() }
      }
    })
    

  4. Exporting entire table for backup (see the pagination sketch below)

    // Full table backup
    const backup = await table.scan({})
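
Keep in mind that a single scan call returns at most 1 MB of data per page, so full-table jobs like migrations and backups have to follow the pagination token. A sketch using the AWS SDK v3 document client (the @ddb-lib client may expose its own pagination helper):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, ScanCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Follow LastEvaluatedKey until DynamoDB stops returning one
const allItems: Record<string, any>[] = []
let lastKey: Record<string, any> | undefined

do {
  const page = await client.send(new ScanCommand({
    TableName: 'Users',              // assumed table name
    ExclusiveStartKey: lastKey
  }))
  allItems.push(...(page.Items ?? []))
  lastKey = page.LastEvaluatedKey
} while (lastKey)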
    

❌ never use scans for

  • Finding specific items by attribute
  • Filtering by non-key attributes in production
  • Counting items (use an aggregated counter instead, as sketched below)
  • Regular application queries
  • User-facing operations
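
For counts specifically, the usual alternative is to maintain an aggregated counter item that is bumped atomically on every write, then read that single item instead of scanning. A minimal sketch with the AWS SDK v3 document client (the table name, counter keys, and attribute name are assumptions):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb'
import { DynamoDBDocumentClient, UpdateCommand, GetCommand } from '@aws-sdk/lib-dynamodb'

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// Whenever an order is created, atomically bump a dedicated counter item
await client.send(new UpdateCommand({
  TableName: 'Orders',                          // assumed table name
  Key: { pk: 'COUNTER#ORDER', sk: 'TOTAL' },    // assumed counter item keys
  UpdateExpression: 'ADD itemCount :one',       // ADD creates the attribute if missing
  ExpressionAttributeValues: { ':one': 1 }
}))

// Reading the count is now a single-item get, not a full-table scan
const counter = await client.send(new GetCommand({
  TableName: 'Orders',
  Key: { pk: 'COUNTER#ORDER', sk: 'TOTAL' }
}))
console.log(counter.Item?.itemCount)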

Detection

The anti-pattern detector can identify excessive scan usage:

import { StatsCollector, AntiPatternDetector } from '@ddb-lib/stats'

const stats = new StatsCollector()
const detector = new AntiPatternDetector(stats)

// After running operations
const issues = detector.detectExcessiveScans()

for (const issue of issues) {
  console.log(issue.message)
  // "High scan usage detected: 45% of operations are scans"
  // "Scan on table 'Users' examined 1M items but returned only 10"
}

Warning signs

You might have this anti-pattern if:

  • Your DynamoDB costs are unexpectedly high
  • Queries get slower as your table grows
  • You see "ProvisionedThroughputExceededException" errors
  • Response times are in seconds instead of milliseconds
  • Your read capacity is constantly maxed out

How to fix

Step 1: identify your access patterns

// What queries do you need to support?
// - Get user by ID
// - Get user by email
// - Get active users
// - Get users created after date

Step 2: design keys and indexes

// Main table: pk = USER#{id}, sk = PROFILE
// GSI1: pk = EMAIL#{email}, sk = USER#{id}
// GSI2: pk = STATUS#{status}, sk = CREATED#{timestamp}
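
To make that concrete, each user item carries both the table keys and the GSI keys. A sketch following the comments above (the GSI attribute names gsi1pk/gsi1sk/gsi2pk/gsi2sk are assumptions):

// Shape of a user item that supports the access patterns from Step 1
interface UserItem {
  pk: string        // USER#{id}            -> get user by ID
  sk: string        // PROFILE
  gsi1pk: string    // EMAIL#{email}        -> get user by email (GSI1)
  gsi1sk: string    // USER#{id}
  gsi2pk: string    // STATUS#{status}      -> get active users (GSI2)
  gsi2sk: string    // CREATED#{timestamp}  -> get users created after a date
  email: string
  status: 'ACTIVE' | 'INACTIVE'
  createdAt: string
}

const user: UserItem = {
  pk: 'USER#123',
  sk: 'PROFILE',
  gsi1pk: 'EMAIL#user@example.com',
  gsi1sk: 'USER#123',
  gsi2pk: 'STATUS#ACTIVE',
  gsi2sk: 'CREATED#2024-03-15T10:00:00Z',
  email: 'user@example.com',
  status: 'ACTIVE',
  createdAt: '2024-03-15T10:00:00Z'
}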

Step 3: replace scans with queries

// Before: Scan
const users = await table.scan({
  filter: { status: { eq: 'ACTIVE' } }
})

// After: Query on GSI
const users = await table.query({
  indexName: 'StatusIndex',
  keyCondition: {
    pk: 'STATUS#ACTIVE'
  }
})

Step 4: monitor and validate

// Verify improvement
const stats = collector.getStats()
console.log(`Scan operations: ${stats.scanCount}`)
console.log(`Query operations: ${stats.queryCount}`)
console.log(`Avg scan latency: ${stats.avgScanLatency}ms`)
console.log(`Avg query latency: ${stats.avgQueryLatency}ms`)

Summary

The Problem: Using Scan when a Query would work forces DynamoDB to examine every item in your table, causing slow performance and high costs.

The Solution: Design your keys and indexes to support your access patterns, then use Query operations to target specific partitions.

The Impact: Switching from Scan to Query can improve performance by 100x and reduce costs by 99%.

Remember: If you find yourself using Scan in production code for user-facing operations, it's almost always a sign that your data model needs improvement.