Chapter 1: Why Graphs?

Relational databases revolutionized data management in the 1970s, but they carry a fundamental assumption: data lives in flat, rectangular tables. When your data is defined by its connections—who knows whom, which server talks to which service, how one transaction links to another—those tables start to buckle under the weight of recursive JOINs. This chapter explains why graphs are the natural data structure for connected data and introduces the core vocabulary you will use throughout this guide.

1.1 The Limits of Tables

Consider a simple social network. You have a users table and a friendships table. Alice wants to know: "Who are my friends-of-friends that I don't already know?" This is a two-hop query—traverse from Alice to her friends, then from those friends to their friends, excluding anyone Alice is already connected to.

The SQL approach

In a relational database, you would write something like this:

-- Find Alice's friends-of-friends (2 hops)
SELECT DISTINCT fof.name
FROM users alice
JOIN friendships f1
  ON alice.id = f1.user_id
JOIN friendships f2
  ON f1.friend_id = f2.user_id
JOIN users fof
  ON f2.friend_id = fof.id
WHERE alice.name = 'Alice'
  AND fof.id != alice.id
  AND fof.id NOT IN (
    SELECT f3.friend_id
    FROM friendships f3
    WHERE f3.user_id = alice.id
  );

This works for two hops, but notice the pattern: every additional hop requires another JOIN. For three hops you add another JOIN pair. For four hops, another. By the time you reach six hops—which is common in fraud detection and cybersecurity analysis—the query becomes nearly unreadable and the query planner is performing nested index lookups that scale poorly.

The graph approach

In AstraeaDB's GQL dialect, the same query is a single declarative pattern:

MATCH (alice:Person {name: "Alice"})
      -[:KNOWS]->(friend)
      -[:KNOWS]->(fof)
WHERE fof <> alice
RETURN DISTINCT fof.name

Adding more hops is trivial—just extend the pattern. And the execution engine does not perform index lookups at each step; it follows direct pointers from node to node (more on this in Section 1.3).

SQL vs. Graph: query complexity comparison

Query Type	SQL Approach	Graph Approach
Direct friends (1 hop)	1 JOIN — straightforward	`MATCH (a)-[:KNOWS]->(b)`
Friends-of-friends (2 hops)	2 JOINs + subquery exclusion	`MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c)`
3-hop recommendation	3 JOINs + deduplication + exclusion	Extend the pattern by one arrow
Shortest path (variable depth)	Recursive CTE — hard to optimize	Built-in `shortest_path()`
6-hop fraud ring	6+ JOINs, often impractical	BFS with depth limit of 6

Key takeaway SQL databases are optimized for filtering and aggregating rows within a table. Graph databases are optimized for navigating relationships between entities. They are complementary tools, but when your questions center on connections, a graph database is the right choice.

1.2 Nodes, Edges, and Properties

Every graph database is built on three fundamental primitives. Understanding these will give you the vocabulary to work with any graph system, not just AstraeaDB.

Nodes (Vertices)

A node represents an entity—a person, a server, a transaction, a document. Each node has:

A unique identifier (an internal 64-bit integer in AstraeaDB)
One or more labels that classify the node (e.g., Person, Server, Account)
A set of properties—key-value pairs stored as JSON (e.g., {"name": "Alice", "age": 30})
An optional embedding vector—a fixed-size float32 array for AI/vector operations (unique to AstraeaDB's Vector-Property Graph model)

Edges (Relationships)

An edge connects two nodes with a directed, typed relationship. Each edge has:

A source node and a target node (edges are directed: Alice KNOWS Bob does not automatically mean Bob KNOWS Alice)
A relationship type (e.g., KNOWS, ACTED_IN, CONNECTS_TO)
A set of properties (e.g., {"since": 2019, "weight": 0.85})
An optional validity interval for temporal graphs (covered in Chapter 9)

Properties

Properties are key-value pairs attached to either nodes or edges. In AstraeaDB, property values are JSON-compatible: strings, numbers, booleans, arrays, and nested objects are all supported. Properties allow you to store rich metadata without needing separate lookup tables.

Putting it all together

Here is a small social graph expressed as ASCII art:

(Alice:Person) (Carol:Person) { name: "Alice", { name: "Carol", age: 30 } age: 28 } | ^ | :KNOWS | :KNOWS | { since: 2019 } | { since: 2021 } v | (Bob:Person) ----------------------+ { name: "Bob", age: 32 } Legend: ( ) = Node, with Label and Properties ----> = Directed Edge, with Type and Properties

In GQL, you would create this graph like so:

-- Create nodes
CREATE (alice:Person {name: "Alice", age: 30})
CREATE (bob:Person   {name: "Bob",   age: 32})
CREATE (carol:Person {name: "Carol", age: 28})

-- Create edges
MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"})
CREATE (a)-[:KNOWS {since: 2019}]->(b)

MATCH (b:Person {name: "Bob"}), (c:Person {name: "Carol"})
CREATE (b)-[:KNOWS {since: 2021}]->(c)

Labels vs. types In AstraeaDB terminology, labels classify nodes (a node can have multiple labels, e.g., Person and Employee) while types classify edges (each edge has exactly one type, e.g., KNOWS). This convention aligns with the ISO GQL standard and the broader property graph community.

1.3 Thinking in Connections

The performance advantage of graph databases over relational databases for connected queries comes down to a single architectural decision: index-free adjacency.

How relational databases traverse

When a relational database follows a foreign key relationship, it performs an index lookup. Even with a B-tree index, each lookup costs O(log N) where N is the total number of rows in the target table. For a single hop this is fast. But the cost multiplies with each additional hop:

1 hop: O(log N)
2 hops: O(k * log N) where k is the average number of matches at hop 1
d hops: O(k^d-1 * log N) — the log N factor applies at every level

How graph databases traverse

In a native graph storage engine like AstraeaDB's, each node stores direct pointers to its neighbors. There is no index lookup; you simply follow the pointer. The cost of visiting one neighbor is O(1)—constant time. Expanding to all neighbors of a node costs O(k) where k is that node's degree (number of neighbors). Crucially, this cost is independent of the total graph size. Whether your graph has 1,000 nodes or 1 billion, traversing from one node to its neighbor takes the same time.

AstraeaDB takes this further with pointer swizzling: when a subgraph is "hot" (frequently accessed), the storage engine promotes 64-bit disk page IDs into direct memory pointers. This eliminates even the page-table indirection, achieving nanosecond-level traversal for active working sets.

Depth vs. time: a comparison

The following table illustrates how traversal time scales with depth for a graph of 10 million nodes, assuming an average degree of 50 neighbors per node:

Traversal Depth	Nodes Visited	SQL (B-tree JOINs)	Graph (pointer chasing)
1 hop	50	~0.5 ms (50 index lookups)	~0.005 ms (50 pointer follows)
2 hops	2,500	~25 ms (2,500 index lookups)	~0.25 ms
3 hops	125,000	~1.25 sec	~12.5 ms
4 hops	6,250,000	~62 sec	~625 ms
5 hops	Exceeds table size	Minutes+ (often times out)	~31 sec (with deduplication)

These numbers are illustrative Real-world performance depends on many factors: hardware, caching, query complexity, and graph topology. The key insight is the scaling pattern: SQL's per-hop cost grows with the table size (log N), while graph traversal cost grows only with the local neighborhood size (k). The gap widens as depth increases and as the dataset grows.

The implication

Graphs do not "replace" relational databases. If your workload is primarily analytical aggregation over flat records (sums, averages, group-by), a relational database or columnar store remains the best tool. But when your workload is dominated by path traversal, pattern matching, or neighborhood exploration—finding fraud rings, tracing network intrusions, recommending connections, answering multi-hop knowledge questions—a graph database provides an order-of-magnitude advantage.

1.4 Real-World Use Cases

Graph databases are not an academic curiosity. They power critical systems across industries. Here are the domains where AstraeaDB's combination of graph traversal, vector search, and AI integration delivers the most value:

Social Networks

Model users as nodes and relationships as edges. Friend-of-friend recommendations become simple 2-hop traversals. Influence analysis uses PageRank. Community detection finds interest clusters. AstraeaDB's vector embeddings add semantic similarity—recommend connections who are both structurally close and share similar interests.

Fraud Detection

Fraudsters hide in the connections between accounts, not in the accounts themselves. A single suspicious account looks normal in a table. But when you traverse outward and find it shares a phone number with three other accounts that share a device ID with ten more—all created within 48 hours—the fraud ring becomes visible. Multi-hop traversal with temporal filtering is the core technique.

Cybersecurity

Model your network as a graph: hosts, services, firewall rules, user accounts. When an intrusion is detected, trace the attacker's lateral movement by following CONNECTS_TO and AUTHENTICATED_AS edges. Identify which critical assets are reachable from a compromised host. Shortest-path algorithms reveal the most dangerous attack vectors.

Knowledge Graphs

Connect facts, entities, and concepts into a web of knowledge. "Albert Einstein" is connected to "General Relativity" via a DISCOVERED edge, which connects to "Gravitational Waves" via PREDICTED. Combined with vector embeddings, this enables GraphRAG: retrieve structurally relevant context and feed it to an LLM for precise, grounded answers.

Recommendation Engines

"People who bought X also bought Y" is a graph query: find nodes co-connected to the same purchase nodes. Collaborative filtering becomes a traversal problem. AstraeaDB's hybrid search blends structural co-occurrence with vector similarity to produce recommendations that are both popular and personally relevant.

Life Sciences

Protein interaction networks, gene regulatory pathways, and drug-target relationships are naturally graph-shaped. Researchers use graph algorithms to identify key proteins (centrality), functional modules (community detection), and potential drug targets (shortest path to disease-associated genes). AstraeaDB's GNN support enables learning directly on molecular graphs.

Supply Chain

Model suppliers, manufacturers, distributors, and retailers as nodes connected by SUPPLIES edges. When a disruption occurs (a factory goes offline, a shipping route is blocked), traverse the graph to instantly identify all downstream impacts. Temporal edges track how supply relationships change over time.

Which use case fits you? If your data has a high "relationship-to-entity" ratio—that is, the connections between records are as important as the records themselves—a graph database is likely the right foundation. In the next chapter, we survey the graph database landscape and explain where AstraeaDB fits.

← Table of Contents Chapter 2: The Graph Database Landscape →