System Design Notes

Fundamentals

Interview Toolkit & Cheat Sheet

The cross-cutting building blocks every design on this site leans on — a repeatable delivery framework, the non-functional qualities you're scored on, CAP, clean REST modeling, and when to reach for PostgreSQL vs DynamoDB. Behavioral prep now has its own dedicated pages.

The delivery framework

Almost every design here follows the same six steps. The first half nails what the system does; the second half defends how it holds up under load and failure.

flowchart LR
    R["1 Requirements"] --> E["2 Core Entities"]
    E --> A["3 API / Interface"]
    A --> D["4 Data Flow (optional)"]
    D --> H["5 High-Level Design"]
    H --> DD["6 Deep Dives"]
      
  1. Requirements — functional (features) + non-functional (qualities); state what's out of scope.
  2. Core entities — the nouns the system stores (User, Post, Ride…).
  3. API / interface — the contract; REST resources or event/stream shapes.
  4. Data flow — optional; trace a request end-to-end for infra-heavy problems.
  5. High-level design — the boxes-and-arrows that satisfy the functional requirements.
  6. Deep dives — the hard problems and trade-offs that satisfy the non-functional requirements.

Mental model

High-level design = "does it work?" Deep dives = "does it still work at 100× scale, during a failure, and under contention?" Spend your interview minutes in proportion to where the difficulty actually is.

Functional vs non-functional requirements

Functional = what the system does (post a tweet, book a seat). Non-functional = the qualities it must exhibit. Pick the 2–3 that actually dominate the problem and let them drive the deep dives — don't recite the whole list.

| Quality | What you're really being asked |
| --- | --- |
| Scalability | Does it hold up as users/data/QPS grow by orders of magnitude? |
| Availability | Uptime target (how many 9s); graceful behavior during failures. |
| Operational characteristics | Latency, throughput, monitoring, deploy/rollback: running it in production. |
| Security | AuthN/Z, encryption in transit/at rest, input validation, rate limiting. |
| Testability | Can components be verified in isolation and end-to-end? |
| Usability | Is the API/UX clear and hard to misuse? |
| Extensibility | Can new features/entities be added without a rewrite? |
| Portability | Can it move across environments/clouds without deep coupling? |

Quantify them: "p99 read < 200 ms," "99.99% availability," "eventual consistency within 1 minute." A number turns a buzzword into a design constraint that justifies caching, replication, or async pipelines.
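To make an availability target concrete, it helps to convert the number of 9s into a downtime budget. A minimal sketch of that arithmetic follows; the function name `downtime_budget_minutes` is illustrative and the 365-day year is an assumption.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed over the period for a target.

    availability_pct is a percentage such as 99.99; period_days defaults to
    a 365-day year (an assumption, leap years are ignored).
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

At 99.99% the budget is roughly 52.6 minutes per year, which is usually too little for any recovery that needs a human in the loop and is one way to justify automated failover and replication.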

CAP theorem

Brewer's theorem: a distributed data store can guarantee at most two of three properties: Consistency, Availability, and Partition tolerance. Networks partition in the real world, so P is non-negotiable, which means that under a partition you must choose between consistency and availability.

flowchart TD
    P{"Network partition happens"}
    P -->|"choose Consistency"| CP["CP: reject/block until consistent"]
    P -->|"choose Availability"| AP["AP: always answer, may be stale"]
    CP --> CPe["Ticketmaster booking, Auction bids, Uber matching"]
    AP --> APe["News Feed, Yelp search, Web Crawler, Tinder stack"]
      
| Term | Meaning | Example on this site |
| --- | --- | --- |
| Consistency | Every read sees the most recent write | Auction highest bid, seat booking |
| Availability | Every request gets a (non-error) response | Feed reads, search, redirects |
| Partition tolerance | Works despite dropped/delayed network messages | Any multi-node system |

In practice you choose per data path, not per system: Ticketmaster runs a CP transaction plane (booking) alongside an AP discovery plane (search). "Availability >> consistency" or the reverse is the single most useful NFR to state up front. PACELC extends this: if there is a Partition, you trade Availability vs Consistency; Else (no partition), you still trade Latency vs Consistency.
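The CP/AP choice can be illustrated with a toy replica that is cut off from its leader. This is a simplified sketch, not a real replication protocol; the `Replica` class, the `mode` labels, and the `Unavailable` exception are all hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


@dataclass
class Replica:
    mode: str                      # "CP" or "AP" (labels used only in this sketch)
    value: Optional[str] = None    # last value replicated to this node
    partitioned: bool = False      # True while cut off from the leader

    def replicate(self, value: str) -> bool:
        """Apply a write from the leader; the write is lost while partitioned."""
        if self.partitioned:
            return False
        self.value = value
        return True

    def read(self) -> Optional[str]:
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk returning a stale value.
            raise Unavailable("cannot confirm the latest write during a partition")
        # AP (or a healthy CP node): answer with what we have, possibly stale.
        return self.value
```

If both replicas receive `bid=100`, get partitioned, and miss `bid=120`, the AP replica keeps answering `bid=100` while the CP replica raises `Unavailable`. That is the seat-booking versus feed-read distinction in miniature.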

REST resource modeling

Clean REST falls out of your core entities: each entity becomes a resource named with a plural noun, and the HTTP method is the verb. Never put verbs in the path.

GET  /events                # get all events
GET  /events/{id}           # get a specific event
GET  /venues/{id}           # get a specific venue
GET  /events/{id}/tickets   # available tickets for an event
POST /events/{id}/bookings  # create a booking for an event
GET  /bookings/{id}         # get a specific booking

# NOT:
POST /events/create         # no verbs in the path!
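One way to see why the method carries the verb is a route table keyed on (method, path template). The sketch below uses only the standard library; the handler names, the in-memory `EVENTS` and `BOOKINGS` dictionaries, and the `dispatch` function are hypothetical stand-ins, not a real framework's API.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Handler = Callable[[Dict[str, str]], dict]

# Hypothetical in-memory data standing in for a real store.
EVENTS: Dict[str, dict] = {"42": {"id": "42", "name": "Concert"}}
BOOKINGS: Dict[str, dict] = {}


def list_events(params: Dict[str, str]) -> dict:
    return {"events": list(EVENTS.values())}


def get_event(params: Dict[str, str]) -> dict:
    return EVENTS[params["id"]]


def create_booking(params: Dict[str, str]) -> dict:
    booking_id = str(len(BOOKINGS) + 1)
    BOOKINGS[booking_id] = {"id": booking_id, "event_id": params["id"]}
    return BOOKINGS[booking_id]


# The method is the verb; the path names only resources.
ROUTES: List[Tuple[str, str, Handler]] = [
    ("GET", "/events", list_events),
    ("GET", "/events/{id}", get_event),
    ("POST", "/events/{id}/bookings", create_booking),
]


def _compile(template: str) -> re.Pattern:
    # Turn "/events/{id}" into a regex with one named group per placeholder.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile(f"^{pattern}$")


def dispatch(method: str, path: str) -> Optional[dict]:
    """Return the handler result, or None when no route matches."""
    for route_method, template, handler in ROUTES:
        match = _compile(template).match(path)
        if route_method == method and match:
            return handler(match.groupdict())
    return None  # a real server would map this to 404 or 405
```

Here `dispatch("POST", "/events/create")` returns `None`: creation is expressed by `POST` on a collection, so no verb-shaped path needs to exist.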

PostgreSQL — when & why

The default relational choice across these designs (YouTube metadata, Yelp, Ticketmaster, auctions). An object-relational, open-source database with broad SQL compliance.

Reach for it when you need transactions, ad-hoc queries and joins, range queries, or a single store that does geo + full-text + ACID (geo via the PostGIS extension, full-text search built in). Scale reads with replicas; scale writes by sharding on a high-cardinality key. Trade-off: horizontal write scaling is manual, whereas a natively partitioned NoSQL store handles it for you.
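Sharding on a high-cardinality key usually comes down to a stable hash of the key modulo the shard count. A minimal sketch, assuming a fixed number of shards and a string key such as a user ID; `shard_for` is an illustrative name, not part of PostgreSQL.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a high-cardinality key (for example a user ID) to a shard index.

    hashlib is used instead of the built-in hash() because hash() on strings
    is randomized per process, which would route the same key differently
    after a restart.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing remaps most keys when `num_shards` changes, which is part of why write scaling on SQL is described as manual; consistent hashing or a lookup directory are the usual ways to soften that.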

DynamoDB & TTL — when & why

The default when the access pattern is key-based at massive, predictable scale (Dropbox metadata, FB feed tables, Uber rides). A managed key-value / wide-column store partitioned by key.

TTL is a system-design primitive

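The auto-expiring item powers several designs on this site: Uber/Ticketmaster locks (a seat/driver hold that releases itself if checkout is abandoned), and WhatsApp inbox cleanup. "Write a row with a TTL" replaces a whole background-deletion service. Note that TTL deletion is eventual: DynamoDB's documentation says expired items are typically deleted within a few days of expiry, not instantly, so an expired item can still be returned by reads. Guard the hot path with a condition or filter expression on the expiry timestamp too.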
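The hold-with-expiry pattern can be sketched in memory to show why the hot path needs its own expiry check. This is an illustration under stated assumptions, not DynamoDB code: `HoldTable`, `try_hold`, and `sweep` are hypothetical names, and in DynamoDB the same guard would be expressed as a condition expression on the write.

```python
import time
from typing import Dict, Optional, Tuple


class HoldTable:
    """In-memory stand-in for a table of seat holds with a TTL attribute."""

    def __init__(self) -> None:
        # seat_id -> (holder, expires_at in epoch seconds)
        self._items: Dict[str, Tuple[str, float]] = {}

    def try_hold(self, seat_id: str, holder: str, ttl_seconds: float,
                 now: Optional[float] = None) -> bool:
        """Conditional write: succeed only if no unexpired hold exists."""
        now = time.time() if now is None else now
        existing = self._items.get(seat_id)
        if existing is not None and existing[1] > now:
            return False  # an unexpired hold is still in place
        # Either no item, or an expired item the sweeper has not removed yet.
        self._items[seat_id] = (holder, now + ttl_seconds)
        return True

    def sweep(self, now: Optional[float] = None) -> int:
        """Background cleanup; may run long after expiry, like a TTL sweeper."""
        now = time.time() if now is None else now
        expired = [k for k, (_, exp) in self._items.items() if exp <= now]
        for k in expired:
            del self._items[k]
        return len(expired)
```

Because `try_hold` compares the stored expiry against the current time, correctness never depends on when `sweep` runs; the sweeper only reclaims storage.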

Trade-off: no rich joins/ad-hoc queries (model for known access patterns), and you design around hot partitions. Pick it for write-heavy, key-addressable workloads where you'd otherwise shard SQL by hand.
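One common way to design around a hot partition is write sharding: spread writes for a single hot key across several suffixed keys and fan reads out over all of them. A minimal sketch, assuming a fixed suffix count; `sharded_key` and `all_sharded_keys` are illustrative helpers, not part of any SDK.

```python
import random
from typing import List, Optional


def sharded_key(base_key: str, num_suffixes: int,
                rng: Optional[random.Random] = None) -> str:
    """Pick one of num_suffixes partition keys for a write to a hot item."""
    if num_suffixes <= 0:
        raise ValueError("num_suffixes must be positive")
    if rng is None:
        rng = random.Random()
    return f"{base_key}#{rng.randrange(num_suffixes)}"


def all_sharded_keys(base_key: str, num_suffixes: int) -> List[str]:
    """List every suffixed key; a read queries each one and merges the results."""
    return [f"{base_key}#{i}" for i in range(num_suffixes)]
```

The cost is on the read side: every read of the logical item becomes `num_suffixes` queries, so this fits workloads where writes dominate.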

Behavioral interview

Senior/staff loops grade behavioral signal as heavily as the technical rounds. That material now has its own dedicated pages: