Why Netflix Built Its Own Graph Abstraction

Netflix runs a wide variety of graph use cases, from social graphs in gaming to real-time service topology analysis. These fall into two camps: OLAP (analytical, open-ended queries) and OLTP (high-throughput, low-latency traversals). The OLTP category demands millions of operations per second with millisecond response times—something off-the-shelf graph databases often can't deliver without heavy trade-offs.

This post is the first in a series that unpacks the architecture of Netflix's custom Graph Abstraction, which currently handles ~10 million ops/sec across 650 TB of graph data with p99 latencies in the single-digit milliseconds.

Source: Netflix Tech Blog – High-Throughput Graph Abstraction at Netflix: Part I

Netflix Graph Abstraction architecture diagram showing nodes and edges connected across distributed servers System Abstract Visual

Architecture: Building on Existing Abstractions

Instead of reinventing the storage layer, Netflix built the Graph Abstraction on top of its existing Key-Value (KV) Abstraction for real-time indexes and TimeSeries (TS) Abstraction for historical views. EVCache provides low-millisecond caching, and the Data Gateway Control Plane manages schemas and provisioning.

Property Graph Model with Strongly Typed Schemas

Nodes and edges are stored as property graphs with explicit schemas defined in the control plane. Here's an example edge mapping configuration:

{
  "edgeConfig": {
    "edgeMappings": [
      {
        "edgeMappingKey": {
          "fromNodeType": "account",
          "edgeType": "owns",
          "toNodeType": "profile"
        },
        "directionType": "UNIDIRECTIONAL"
      },
      {
        "edgeMappingKey": {
          "fromNodeType": "profile",
          "edgeType": "linked_to",
          "toNodeType": "device"
        },
        "directionType": "BIDIRECTIONAL"
      }
    ]
  }
}

Property schemas extend edge mappings with allowed property names and types:

{
  "edgeMappingKey": {
    "fromNodeType": "profile",
    "edgeType": "linked_to",
    "toNodeType": "device"
  },
  "propertySchema": {
    "propertyMappings": [
      { "propertyKey": "registration_time", "propertyValueType": "TIMESTAMP" },
      { "propertyKey": "status", "propertyValueType": "STRING" }
    ]
  }
}

The schema enables data quality enforcement, query planning, deduplication of traversed edges, and elimination of impossible traversal paths.

Real-Time Index: Key-Value Storage

Nodes are stored per type in dedicated KV namespaces, enabling efficient single-partition lookups. Edges use two separate indexes:

  • Links index: adjacency list mapping source nodes to neighbors.
  • Property index: stores edge properties separately.

This separation allows efficient property upserts without wide rows, at the cost of non-atomic writes across namespaces.

Caching Strategies

  • Write-aside caching of edge links reduces write amplification by avoiding redundant writes.
  • Read-aside caching of properties (via EVCache) reduces read amplification during traversals.
  • Write-through caching (WIP) will organize indexes by different sort orders for further performance gains.

Consistency Enforcement

Multi-region replication is asynchronous, leading to eventual consistency. An entropy repair mechanism uses Kafka to retry failed writes across indexes. Node deletions are asynchronous with LWW conflict resolution.

Traversal API

The custom gRPC traversal API (inspired by Gremlin) supports chained traversals, filters, sorting, and limiting. Here's a hypothetical query to recommend shows to users on a shared device:

TraversalRequest.newBuilder()
    .setNamespace("")
    .setTraversalQuery(
        TraversalQuery.newBuilder()
            .setStartNode(node("device", "my-device-id"))
            .setTraversal(
                Traversal.newBuilder()
                    .setEdgeLimit(5)
                    .setDirectionTraversal(
                        DirectionTraversal.newBuilder()
                            .setDirection(IN)
                            .addNodePropertiesSelections(propSelection("account", "created_at"))
                            .addNodePropertiesSelections(propSelection("profile", "last_active"))
                            .setDirectionFilter(
                                DirectionFilter.newBuilder()
                                    .setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
                                    .addAllNodeFilters(typeFilters("account", "profile"))))
                    .addNextTraversals(
                        Traversal.newBuilder()
                            .setOrder(LATEST)
                            .setEdgeLimit(200)
                            .setDirectionTraversal(
                                DirectionTraversal.newBuilder()
                                    .setDirection(OUT)
                                    .addEdgePropertiesSelections(propSelection("watched", "view_time"))
                                    .addEdgePropertiesSelections(propSelection("has_plan", "active"))
                                    .setDirectionFilter(
                                        DirectionFilter.newBuilder()
                                            .setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
                                            .addAllNodeFilters(typeFilters("title", "plan"))))))
    .build();

Real-time graph traversal flow for OLTP use cases with low-latency caching layers Programming Illustration

Performance & Real-World Metrics

  • Throughput: up to 10 million ops/sec at peak.
  • Latency: single-digit ms for point reads and 1-hop traversals (p99).
  • 2-hop traversals (used by RDG): p90 under 50ms.
  • Data volume: ~650 TB globally.

Limitations & Caveats

  • Eventual consistency is a deliberate trade-off for high availability. Applications must tolerate short-lived stale reads.
  • Non-atomic writes across multiple namespaces require careful retry logic.
  • Traversal depth is limited to keep latency predictable; deep traversals (3+ hops) may require different strategies.

Next Steps

Part II of this series will dive into traversal planning, execution, and counting mechanisms. Part III will cover the temporal index and Time Series integration. If you're building a high-throughput graph system, consider starting with a clear separation of link and property storage, and invest in a schema-driven query planner.

Related Reading

Performance metrics dashboard for graph database operations showing p50 p90 p99 latencies Development Concept Image

Conclusion

Netflix's Graph Abstraction is a masterclass in building a purpose-built OLTP graph store on top of existing infrastructure. By separating link and property indexes, leveraging schema-driven optimizations, and employing layered caching, they achieve remarkable throughput and latency at scale. The trade-offs—eventual consistency, non-atomic writes, limited traversal depth—are clearly documented and acceptable for their use cases.

If you're architecting a graph system for high-throughput OLTP workloads, take inspiration from Netflix's approach: don't build everything from scratch; compose existing abstractions wisely.

This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.