Chapter 1: Why Graphs?
Relational databases revolutionized data management in the 1970s, but they carry a fundamental assumption: data lives in flat, rectangular tables. When your data is defined by its connections—who knows whom, which server talks to which service, how one transaction links to another—those tables start to buckle under the weight of recursive JOINs. This chapter explains why graphs are the natural data structure for connected data and introduces the core vocabulary you will use throughout this guide.
1.1 The Limits of Tables
Consider a simple social network. You have a users table and a friendships table. Alice wants to know: "Who are my friends-of-friends that I don't already know?" This is a two-hop query—traverse from Alice to her friends, then from those friends to their friends, excluding anyone Alice is already connected to.
The SQL approach
In a relational database, you would write something like this:
-- Find Alice's friends-of-friends (2 hops) SELECT DISTINCT fof.name FROM users alice JOIN friendships f1 ON alice.id = f1.user_id JOIN friendships f2 ON f1.friend_id = f2.user_id JOIN users fof ON f2.friend_id = fof.id WHERE alice.name = 'Alice' AND fof.id != alice.id AND fof.id NOT IN ( SELECT f3.friend_id FROM friendships f3 WHERE f3.user_id = alice.id );
This works for two hops, but notice the pattern: every additional hop requires another JOIN. For three hops you add another JOIN pair. For four hops, another. By the time you reach six hops—which is common in fraud detection and cybersecurity analysis—the query becomes nearly unreadable and the query planner is performing nested index lookups that scale poorly.
The graph approach
In AstraeaDB's GQL dialect, the same query is a single declarative pattern:
MATCH (alice:Person {name: "Alice"}) -[:KNOWS]->(friend) -[:KNOWS]->(fof) WHERE fof <> alice RETURN DISTINCT fof.name
Adding more hops is trivial—just extend the pattern. And the execution engine does not perform index lookups at each step; it follows direct pointers from node to node (more on this in Section 1.3).
SQL vs. Graph: query complexity comparison
| Query Type | SQL Approach | Graph Approach |
|---|---|---|
| Direct friends (1 hop) | 1 JOIN — straightforward | MATCH (a)-[:KNOWS]->(b) |
| Friends-of-friends (2 hops) | 2 JOINs + subquery exclusion | MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c) |
| 3-hop recommendation | 3 JOINs + deduplication + exclusion | Extend the pattern by one arrow |
| Shortest path (variable depth) | Recursive CTE — hard to optimize | Built-in shortest_path() |
| 6-hop fraud ring | 6+ JOINs, often impractical | BFS with depth limit of 6 |
1.2 Nodes, Edges, and Properties
Every graph database is built on three fundamental primitives. Understanding these will give you the vocabulary to work with any graph system, not just AstraeaDB.
Nodes (Vertices)
A node represents an entity—a person, a server, a transaction, a document. Each node has:
- A unique identifier (an internal 64-bit integer in AstraeaDB)
- One or more labels that classify the node (e.g.,
Person,Server,Account) - A set of properties—key-value pairs stored as JSON (e.g.,
{"name": "Alice", "age": 30}) - An optional embedding vector—a fixed-size float32 array for AI/vector operations (unique to AstraeaDB's Vector-Property Graph model)
Edges (Relationships)
An edge connects two nodes with a directed, typed relationship. Each edge has:
- A source node and a target node (edges are directed: Alice KNOWS Bob does not automatically mean Bob KNOWS Alice)
- A relationship type (e.g.,
KNOWS,ACTED_IN,CONNECTS_TO) - A set of properties (e.g.,
{"since": 2019, "weight": 0.85}) - An optional validity interval for temporal graphs (covered in Chapter 9)
Properties
Properties are key-value pairs attached to either nodes or edges. In AstraeaDB, property values are JSON-compatible: strings, numbers, booleans, arrays, and nested objects are all supported. Properties allow you to store rich metadata without needing separate lookup tables.
Putting it all together
Here is a small social graph expressed as ASCII art:
In GQL, you would create this graph like so:
-- Create nodes CREATE (alice:Person {name: "Alice", age: 30}) CREATE (bob:Person {name: "Bob", age: 32}) CREATE (carol:Person {name: "Carol", age: 28}) -- Create edges MATCH (a:Person {name: "Alice"}), (b:Person {name: "Bob"}) CREATE (a)-[:KNOWS {since: 2019}]->(b) MATCH (b:Person {name: "Bob"}), (c:Person {name: "Carol"}) CREATE (b)-[:KNOWS {since: 2021}]->(c)
Person and Employee) while types classify edges (each edge has exactly one type, e.g., KNOWS). This convention aligns with the ISO GQL standard and the broader property graph community.
1.3 Thinking in Connections
The performance advantage of graph databases over relational databases for connected queries comes down to a single architectural decision: index-free adjacency.
How relational databases traverse
When a relational database follows a foreign key relationship, it performs an index lookup. Even with a B-tree index, each lookup costs O(log N) where N is the total number of rows in the target table. For a single hop this is fast. But the cost multiplies with each additional hop:
- 1 hop: O(log N)
- 2 hops: O(k * log N) where k is the average number of matches at hop 1
- d hops: O(kd-1 * log N) — the log N factor applies at every level
How graph databases traverse
In a native graph storage engine like AstraeaDB's, each node stores direct pointers to its neighbors. There is no index lookup; you simply follow the pointer. The cost of visiting one neighbor is O(1)—constant time. Expanding to all neighbors of a node costs O(k) where k is that node's degree (number of neighbors). Crucially, this cost is independent of the total graph size. Whether your graph has 1,000 nodes or 1 billion, traversing from one node to its neighbor takes the same time.
AstraeaDB takes this further with pointer swizzling: when a subgraph is "hot" (frequently accessed), the storage engine promotes 64-bit disk page IDs into direct memory pointers. This eliminates even the page-table indirection, achieving nanosecond-level traversal for active working sets.
Depth vs. time: a comparison
The following table illustrates how traversal time scales with depth for a graph of 10 million nodes, assuming an average degree of 50 neighbors per node:
| Traversal Depth | Nodes Visited | SQL (B-tree JOINs) | Graph (pointer chasing) |
|---|---|---|---|
| 1 hop | 50 | ~0.5 ms (50 index lookups) | ~0.005 ms (50 pointer follows) |
| 2 hops | 2,500 | ~25 ms (2,500 index lookups) | ~0.25 ms |
| 3 hops | 125,000 | ~1.25 sec | ~12.5 ms |
| 4 hops | 6,250,000 | ~62 sec | ~625 ms |
| 5 hops | Exceeds table size | Minutes+ (often times out) | ~31 sec (with deduplication) |
The implication
Graphs do not "replace" relational databases. If your workload is primarily analytical aggregation over flat records (sums, averages, group-by), a relational database or columnar store remains the best tool. But when your workload is dominated by path traversal, pattern matching, or neighborhood exploration—finding fraud rings, tracing network intrusions, recommending connections, answering multi-hop knowledge questions—a graph database provides an order-of-magnitude advantage.
1.4 Real-World Use Cases
Graph databases are not an academic curiosity. They power critical systems across industries. Here are the domains where AstraeaDB's combination of graph traversal, vector search, and AI integration delivers the most value:
Social Networks
Model users as nodes and relationships as edges. Friend-of-friend recommendations become simple 2-hop traversals. Influence analysis uses PageRank. Community detection finds interest clusters. AstraeaDB's vector embeddings add semantic similarity—recommend connections who are both structurally close and share similar interests.
Fraud Detection
Fraudsters hide in the connections between accounts, not in the accounts themselves. A single suspicious account looks normal in a table. But when you traverse outward and find it shares a phone number with three other accounts that share a device ID with ten more—all created within 48 hours—the fraud ring becomes visible. Multi-hop traversal with temporal filtering is the core technique.
Cybersecurity
Model your network as a graph: hosts, services, firewall rules, user accounts. When an intrusion is detected, trace the attacker's lateral movement by following CONNECTS_TO and AUTHENTICATED_AS edges. Identify which critical assets are reachable from a compromised host. Shortest-path algorithms reveal the most dangerous attack vectors.
Knowledge Graphs
Connect facts, entities, and concepts into a web of knowledge. "Albert Einstein" is connected to "General Relativity" via a DISCOVERED edge, which connects to "Gravitational Waves" via PREDICTED. Combined with vector embeddings, this enables GraphRAG: retrieve structurally relevant context and feed it to an LLM for precise, grounded answers.
Recommendation Engines
"People who bought X also bought Y" is a graph query: find nodes co-connected to the same purchase nodes. Collaborative filtering becomes a traversal problem. AstraeaDB's hybrid search blends structural co-occurrence with vector similarity to produce recommendations that are both popular and personally relevant.
Life Sciences
Protein interaction networks, gene regulatory pathways, and drug-target relationships are naturally graph-shaped. Researchers use graph algorithms to identify key proteins (centrality), functional modules (community detection), and potential drug targets (shortest path to disease-associated genes). AstraeaDB's GNN support enables learning directly on molecular graphs.
Supply Chain
Model suppliers, manufacturers, distributors, and retailers as nodes connected by SUPPLIES edges. When a disruption occurs (a factory goes offline, a shipping route is blocked), traverse the graph to instantly identify all downstream impacts. Temporal edges track how supply relationships change over time.