Chapter 2: The Graph Database Landscape

Not all graph databases are alike. They differ in their data models, query languages, storage architectures, and intended workloads. This chapter surveys the landscape so you can make informed decisions—and understand exactly where AstraeaDB fits.

2.1 Property Graphs vs. RDF

The graph database world is divided into two major data model families. Understanding the difference is important because it shapes the query language, the API, and the kinds of problems each model handles naturally.

The Property Graph model

In a Property Graph, nodes carry labels and key-value properties, and edges are directed, typed, and can also carry properties. This is the model used by Neo4j, Memgraph, TigerGraph, and AstraeaDB. It is the dominant model in the industry for transactional and analytical graph workloads.

A Property Graph representation of "Alice knows Bob since 2019":

// Two nodes with labels and properties
(alice:Person {name: "Alice", age: 30})
(bob:Person   {name: "Bob",   age: 32})

// One directed edge with a type and properties
(alice)-[:KNOWS {since: 2019}]->(bob)

Key characteristics:

Nodes and edges are first-class citizens with their own identity
Properties are native key-value pairs—no need for reification or blank nodes
Edge properties allow rich metadata (weight, timestamp, confidence score) directly on relationships
Intuitive for application developers—maps naturally to objects and references in code
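
To make the "objects and references" point concrete, here is a minimal Python sketch of the same Alice and Bob graph. The Node and Edge classes are illustrative names invented for this example; they are not AstraeaDB's API or storage layout.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    id: int
    labels: set[str]
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    type: str
    source: Node  # direct object reference to the start node
    target: Node  # direct object reference to the end node
    properties: dict[str, Any] = field(default_factory=dict)


alice = Node(1, {"Person"}, {"name": "Alice", "age": 30})
bob = Node(2, {"Person"}, {"name": "Bob", "age": 32})

# The edge has its own identity and carries its own properties.
knows = Edge(1, "KNOWS", alice, bob, {"since": 2019})

# Edge metadata is read straight off the relationship; no helper node is needed.
print(knows.source.properties["name"], knows.type,
      knows.target.properties["name"], "since", knows.properties["since"])
```

The edge holds its metadata directly, which is the property that the RDF model below has to emulate with reification.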

The RDF model

RDF (Resource Description Framework) represents everything as subject-predicate-object triples. Each element is identified by a URI. Relationships do not carry properties directly; you must use reification (creating a node to represent the relationship itself) to attach metadata.

The same "Alice knows Bob since 2019" in RDF (Turtle syntax):

# Prefix declarations
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Basic triple: Alice knows Bob
ex:alice foaf:knows ex:bob .

# Node properties require separate triples
ex:alice foaf:name "Alice" .
ex:alice ex:age "30"^^xsd:integer .
ex:bob   foaf:name "Bob" .
ex:bob   ex:age "32"^^xsd:integer .

# Edge properties require reification (a new node for the relationship)
ex:friendship1 a rdf:Statement ;
    rdf:subject   ex:alice ;
    rdf:predicate foaf:knows ;
    rdf:object    ex:bob ;
    ex:since      "2019"^^xsd:integer .

Key characteristics:

Universal: everything is a URI, enabling global data interchange across organizations
Standardized: W3C standards (RDF, RDFS, OWL) provide formal semantics and reasoning
Verbose: simple relationships require multiple triples; edge properties need reification
Strongest in academic, governmental, and linked data/semantic web applications
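
The cost of reification is easier to see when the triples are treated as plain data. The following Python sketch stores the Turtle example as (subject, predicate, object) tuples and looks up the "since" value. The match and since helpers are hypothetical names for this illustration, not part of any RDF library.

```python
# Each triple is a (subject, predicate, object) tuple; prefixed names are
# kept as plain strings for brevity.
triples = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "ex:age", 30),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:bob", "ex:age", 32),
    # Reification: five extra triples describe the one "knows" statement.
    ("ex:friendship1", "rdf:type", "rdf:Statement"),
    ("ex:friendship1", "rdf:subject", "ex:alice"),
    ("ex:friendship1", "rdf:predicate", "foaf:knows"),
    ("ex:friendship1", "rdf:object", "ex:bob"),
    ("ex:friendship1", "ex:since", 2019),
}


def match(s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def since(subject, predicate, obj):
    """Find the ex:since value attached to a reified statement, if any."""
    for stmt, _, _ in match(p="rdf:subject", o=subject):
        if match(stmt, "rdf:predicate", predicate) and match(stmt, "rdf:object", obj):
            found = match(stmt, "ex:since")
            if found:
                return found[0][2]
    return None


print(since("ex:alice", "foaf:knows", "ex:bob"))
```

Reading one edge property takes four pattern lookups here, compared with a single attribute access in the Property Graph model.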

Comparison

Aspect	Property Graph	RDF
Node identity	Internal ID + labels	URI (globally unique)
Edge properties	Native key-value pairs	Requires reification
Schema	Optional (schema-free or enforced)	RDFS/OWL ontologies
Query language	Cypher, GQL, Gremlin	SPARQL
Reasoning	Not built-in	OWL inference, entailment
Developer ergonomics	Intuitive for app developers	Steeper learning curve
Data interchange	Vendor-specific formats	Universal (URIs, standards)
Primary audience	Application development, analytics	Linked data, semantic web, research

AstraeaDB's choice: AstraeaDB uses the Property Graph model. This decision reflects its primary design goals: developer ergonomics, performance for deep traversals, and seamless integration with AI/ML workflows. The Property Graph model maps naturally to JSON documents, making it straightforward to attach rich metadata to both nodes and edges without reification overhead.

2.2 Query Languages

The query language you use determines how you express graph patterns, traversals, and mutations. Here are the four major graph query languages and how they compare:

Cypher (Neo4j)

Cypher pioneered ASCII art pattern matching: you draw the graph pattern you want to find using parentheses for nodes and arrows for edges. It is declarative—you describe what you want, not how to get it.

// Find Alice's friends who are older than 25
MATCH (a:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
WHERE friend.age > 25
RETURN friend.name, friend.age
ORDER BY friend.age DESC

Cypher is the most widely adopted graph query language. Its pattern syntax is intuitive and readable, even for developers new to graph databases. However, it was developed by Neo4j and, until recently, lacked formal standardization.

Gremlin (Apache TinkerPop)

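Gremlin takes an imperative, step-based approach. You compose a traversal by chaining steps that describe how to walk the graph. It is the traversal language of the Apache TinkerPop framework; the reference implementation runs on the JVM, and official language variants exist for Python, JavaScript, .NET, and Go.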

// Same query in Gremlin
g.V().has('Person', 'name', 'Alice')
 .out('KNOWS')
 .hasLabel('Person')
 .has('age', gt(25))
 .order().by('age', desc)
 .valueMap('name', 'age')

Gremlin's imperative style gives fine-grained control over traversal execution and is well-suited for procedural graph algorithms. However, complex patterns are harder to read than Cypher's visual syntax, and performance depends on step ordering.
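
To show what "chaining steps" means mechanically, here is a toy Python imitation of the traversal above. It is not TinkerPop and does not use the gremlin-python API; the Traversal class, its methods, and the sample data (including Carol and Dave) are invented for this sketch.

```python
VERTICES = {
    1: {"label": "Person", "name": "Alice", "age": 30},
    2: {"label": "Person", "name": "Bob", "age": 32},
    3: {"label": "Person", "name": "Carol", "age": 22},
    4: {"label": "Person", "name": "Dave", "age": 41},
}
EDGES = [(1, "KNOWS", 2), (1, "KNOWS", 3), (1, "KNOWS", 4)]


class Traversal:
    """Holds the current vertex IDs; every step returns a new Traversal."""

    def __init__(self, ids):
        self.ids = list(ids)

    def has(self, key, test):
        # Keep only vertices whose property passes the test.
        return Traversal(
            i for i in self.ids
            if key in VERTICES[i] and test(VERTICES[i][key])
        )

    def out(self, edge_type):
        # Move from each current vertex along outgoing edges of one type.
        return Traversal(
            dst for i in self.ids
            for (src, etype, dst) in EDGES
            if src == i and etype == edge_type
        )

    def order_by(self, key, descending=False):
        return Traversal(
            sorted(self.ids, key=lambda i: VERTICES[i][key], reverse=descending)
        )

    def values(self, *keys):
        return [tuple(VERTICES[i][k] for k in keys) for i in self.ids]


result = (
    Traversal(VERTICES)                      # like g.V(): start from all vertices
    .has("name", lambda v: v == "Alice")
    .out("KNOWS")
    .has("label", lambda v: v == "Person")
    .has("age", lambda v: v > 25)
    .order_by("age", descending=True)
    .values("name", "age")
)
print(result)
```

Each step consumes the output of the previous one, which is why step order matters: filtering to Alice before calling out() means only her edges are expanded.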

SPARQL (W3C)

SPARQL is the W3C standard for querying RDF data. It uses triple pattern matching with a SQL-like SELECT syntax.

# Same query in SPARQL (RDF world)
PREFIX ex: <http://example.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?friendName ?friendAge
WHERE {
    ex:alice foaf:knows ?friend .
    ?friend  foaf:name  ?friendName .
    ?friend  ex:age     ?friendAge .
    FILTER(?friendAge > 25)
}
ORDER BY DESC(?friendAge)

SPARQL excels in federated queries across distributed RDF endpoints and in environments with formal ontologies. It is the standard in academic, governmental, and linked open data communities, but it is rarely used for application-level graph databases.

GQL (ISO/IEC 39075: The New Standard)

GQL (Graph Query Language) is the new ISO standard, officially published in 2024. It unifies the best ideas from Cypher's pattern matching syntax and SQL's clause structure into a vendor-neutral specification. Key design goals include:

Declarative pattern matching with ASCII art syntax (inherited from Cypher)
Composable graph queries that can be chained like SQL subqueries
First-class path expressions for variable-length traversals
Integration with SQL for hybrid relational/graph queries
Prevention of vendor lock-in through formal standardization

// GQL syntax (AstraeaDB implements this)
MATCH (a:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
WHERE friend.age > 25
RETURN friend.name, friend.age
ORDER BY friend.age DESC

If you know Cypher, GQL will look immediately familiar. The core pattern matching syntax is compatible, while the standard adds formal grammar rules, composability features, and a path toward SQL integration.

Language comparison at a glance

Language	Paradigm	Data Model	Standardized	Primary Ecosystem
Cypher	Declarative, pattern-based	Property Graph	openCypher (community)	Neo4j, Memgraph, RedisGraph
Gremlin	Imperative, step-based	Property Graph	Apache TinkerPop	JanusGraph, Amazon Neptune, CosmosDB
SPARQL	Declarative, triple-pattern	RDF	W3C Standard	Blazegraph, GraphDB, Stardog
GQL	Declarative, pattern-based	Property Graph	ISO Standard (2024)	AstraeaDB, emerging adoption

AstraeaDB's query language: AstraeaDB implements a GQL/Cypher-compatible syntax through a hand-written recursive-descent parser. The full execution pipeline supports: MATCH with node and edge pattern matching, WHERE filtering with boolean expressions, CREATE and DELETE for mutations, RETURN with expressions, ORDER BY, LIMIT, and aggregation functions (count(), sum(), avg(), min(), max()). Chapter 5 covers the query language in detail.
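
For readers unfamiliar with recursive descent, the technique can be sketched in a few dozen lines of Python. The grammar below covers only a small subset of a MATCH pattern (nodes and right-pointing edges, each with an optional variable and label, and no property maps). It is an illustration of the parsing style, not AstraeaDB's actual parser, and every name in it is an assumption made for this example.

```python
import re

# One token per match: the arrow head, a punctuation mark, or an identifier.
TOKEN = re.compile(r"\s*(->|[()\[\]:-]|[A-Za-z_]\w*)")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """One method per grammar rule:
         pattern := node (edge node)*
         node    := '(' [var] [':' label] ')'
         edge    := '-' '[' [var] [':' type] ']' '->'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def identifier(self):
        tok = self.peek()
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            return None
        self.pos += 1
        return tok

    def var_and_label(self):
        var = self.identifier()
        label = None
        if self.peek() == ":":
            self.pos += 1
            label = self.identifier()
            if label is None:
                raise SyntaxError("expected a name after ':'")
        return var, label

    def node(self):
        self.expect("(")
        var, label = self.var_and_label()
        self.expect(")")
        return {"kind": "node", "var": var, "label": label}

    def edge(self):
        self.expect("-")
        self.expect("[")
        var, etype = self.var_and_label()
        self.expect("]")
        self.expect("->")
        return {"kind": "edge", "var": var, "type": etype}

    def pattern(self):
        elements = [self.node()]
        while self.peek() == "-":
            elements.append(self.edge())
            elements.append(self.node())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return elements


def parse_pattern(text):
    return Parser(tokenize(text)).pattern()


for element in parse_pattern("(a:Person)-[:KNOWS]->(friend:Person)"):
    print(element)
```

Because each grammar rule is an ordinary method, error messages and new syntax can be added rule by rule, which is a common reason to hand-write a parser instead of generating one.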

2.3 What Makes AstraeaDB Different

AstraeaDB is not another Neo4j clone. It was designed from scratch to address the shortcomings of existing graph databases, especially in the areas of AI integration, cloud-native storage, and performance. Here are the key differentiators:

The Vector-Property Graph

Most graph databases treat vector search as a bolt-on feature—an afterthought added to an existing architecture. In AstraeaDB, embeddings are first-class citizens. Every node can carry a float32 embedding vector alongside its labels and JSON properties. The HNSW (Hierarchical Navigable Small World) vector index is integrated directly into the graph structure: the navigation links in the vector index are graph edges. This unified architecture enables:

Hybrid search: blend graph proximity (structural distance) with vector similarity (semantic distance) using a configurable alpha parameter
Semantic traversal: walk the graph greedily toward a target concept embedding, combining structural and semantic intelligence at each hop
Zero-overhead embeddings: no separate vector database, no ETL pipeline, no synchronization headaches
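
One way to picture the alpha blend in hybrid search is as a weighted sum of two normalized scores. The sketch below is an assumption about the general shape of such a formula, not AstraeaDB's documented scoring function: the function names, the 1 / (1 + hops) proximity mapping, and the convention that alpha = 1.0 means "vector similarity only" are all choices made for this example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def graph_proximity(hops):
    """Map hop distance into (0, 1]: 0 hops gives 1.0, farther nodes score lower."""
    return 1.0 / (1.0 + hops)


def hybrid_score(query_vec, node_vec, hops, alpha=0.5):
    """Blend semantic and structural closeness.

    Assumed convention: alpha=1.0 ranks purely by vector similarity,
    alpha=0.0 purely by graph proximity.
    """
    semantic = cosine_similarity(query_vec, node_vec)
    structural = graph_proximity(hops)
    return alpha * semantic + (1.0 - alpha) * structural


query = [1.0, 0.0]
candidates = {
    "near_but_unrelated": ([0.0, 1.0], 1),  # 1 hop away, orthogonal embedding
    "far_but_similar": ([1.0, 0.0], 4),     # 4 hops away, identical embedding
}
for alpha in (0.0, 0.5, 1.0):
    ranked = sorted(
        candidates,
        key=lambda name: hybrid_score(query, *candidates[name], alpha=alpha),
        reverse=True,
    )
    print(alpha, ranked)
```

Sweeping alpha from 0 to 1 moves the ranking from "closest in the graph" to "closest in meaning", which is the trade-off the parameter exposes.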

AI-First Architecture

AstraeaDB is built for the AI era. Beyond vector search, it includes:

GraphRAG: a built-in Retrieval-Augmented Generation engine that finds an anchor node via vector search, extracts a subgraph with BFS, linearizes it to text, and feeds it to an LLM—all in one atomic operation
GNN training: differentiable tensors and message-passing layers built directly into the database, enabling node classification training without exporting data
Semantic walk: a traversal primitive that combines vector similarity with edge traversal, letting you explore the graph in the direction of a concept
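
The GraphRAG steps described above (anchor lookup, BFS extraction, linearization) can be sketched outside the database to show how they fit together. This is a simplified illustration, not AstraeaDB's engine: it uses a brute-force cosine scan where the text describes an HNSW index, it stops at building the prompt instead of calling an LLM, and all function names and sample data are hypothetical.

```python
import math
from collections import deque

NODES = {
    "alice": {"name": "Alice", "embedding": [0.9, 0.1]},
    "bob": {"name": "Bob", "embedding": [0.2, 0.8]},
    "acme": {"name": "Acme Corp", "embedding": [0.5, 0.5]},
}
EDGES = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_anchor(query_vec):
    """Step 1: pick the node whose embedding is most similar to the query."""
    return max(NODES, key=lambda n: cosine(query_vec, NODES[n]["embedding"]))


def bfs_subgraph(anchor, max_hops):
    """Step 2: collect outgoing edges reachable within max_hops of the anchor."""
    seen = {anchor}
    queue = deque([(anchor, 0)])
    edges = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for src, etype, dst in EDGES:
            if src == node:
                edges.append((src, etype, dst))
                if dst not in seen:
                    seen.add(dst)
                    queue.append((dst, depth + 1))
    return edges


def linearize(edges):
    """Step 3: turn the subgraph into plain text an LLM can read."""
    return "\n".join(
        f"{NODES[s]['name']} -[{t}]-> {NODES[d]['name']}" for s, t, d in edges
    )


def build_prompt(question, query_vec, max_hops=1):
    anchor = find_anchor(query_vec)
    context = linearize(bfs_subgraph(anchor, max_hops))
    return f"Context:\n{context}\n\nQuestion: {question}"


print(build_prompt("Where does Alice work?", [1.0, 0.0]))
```

In a real deployment the returned prompt would be passed to the LLM; keeping retrieval and linearization next to the data is what removes the separate vector store and export step.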

Rust Performance

AstraeaDB is written entirely in Rust, delivering:

Zero garbage collection pauses: unlike Java-based graph databases (Neo4j, JanusGraph), Rust's ownership model eliminates GC stop-the-world events
Memory safety without overhead: Rust's borrow checker prevents use-after-free, double-free, and data race bugs at compile time—critical for a database that manages memory-mapped pages and concurrent traversals
Fearless concurrency: Rust's type system rules out data races at compile time, so the storage engine, query executor, and network server can share data structures safely and avoid locks wherever the access pattern allows
Predictable latency: no GC pauses, no JIT warmup, no interpreted overhead—consistent sub-millisecond response times for hot-path queries

Three-Tier Storage

AstraeaDB solves the "Memory Wall" problem—the tension between needing random-access speed for traversals and wanting cloud-native separation of compute and storage:

Tier 1 — Cold (Object Storage): data persists in JSON, Apache Parquet, or cloud object stores (S3, GCS, Azure). Open formats ensure long-term interoperability.
Tier 2 — Warm (NVMe Buffer Pool): an LRU buffer pool caches 8 KiB pages from disk with pin/unpin semantics. Pluggable I/O backends support memmap2 (cross-platform) and io_uring (Linux async I/O).
Tier 3 — Hot (Pointer Swizzling): frequently accessed subgraphs are promoted into RAM. 64-bit disk page IDs are replaced with direct memory pointers for nanosecond-level traversal. The HNSW vector index lives entirely in this tier.
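
The pin/unpin behavior of the warm tier can be illustrated with a small LRU buffer pool. The Python sketch below follows the general textbook design and assumes nothing about AstraeaDB's real implementation beyond the 8 KiB page size mentioned above. The BufferPool class, its methods, and the fake_disk reader are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 8 * 1024  # 8 KiB pages, as described in the text


class BufferPool:
    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # callable: page_id -> bytes
        self.frames = OrderedDict()  # page_id -> [data, pin_count], least recent first

    def pin(self, page_id):
        """Return the page bytes and mark the page as in use."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)  # cache hit: now most recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = [self.read_page(page_id), 0]
        frame = self.frames[page_id]
        frame[1] += 1
        return frame[0]

    def unpin(self, page_id):
        """Release one pin; a page with zero pins may be evicted."""
        frame = self.frames[page_id]
        if frame[1] == 0:
            raise RuntimeError(f"page {page_id} is not pinned")
        frame[1] -= 1

    def _evict(self):
        # Evict the least recently used page that nobody has pinned.
        victim = next(
            (pid for pid, (_, pins) in self.frames.items() if pins == 0), None
        )
        if victim is None:
            raise RuntimeError("all pages are pinned; nothing can be evicted")
        del self.frames[victim]


reads = []


def fake_disk(page_id):
    """Stand-in for disk I/O that records which pages were read."""
    reads.append(page_id)
    return bytes(PAGE_SIZE)


pool = BufferPool(capacity=2, read_page=fake_disk)
pool.pin(1)
pool.unpin(1)
pool.pin(2)      # page 2 stays pinned
pool.pin(3)      # pool is full: evicts page 1 (unpinned, least recently used)
pool.unpin(3)
pool.pin(1)      # evicts page 3, because page 2 is still pinned
print(reads, list(pool.frames))
```

Pinning is what lets a traversal hold a page safely while it reads it: the eviction policy may only reclaim pages whose pin count has dropped to zero.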

Three Transport Protocols

No single protocol fits every use case. AstraeaDB offers three:

JSON-TCP (port 7687): simple, human-readable, zero-dependency. Ideal for getting started, scripting, and lightweight clients.
gRPC (port 7688): strongly typed, protobuf-serialized, bidirectional streaming. Best for production microservices that need schema enforcement and code generation.
Apache Arrow Flight (port 7689): zero-copy columnar data transfer. Stream query results directly into Pandas or Polars DataFrames without serialization overhead. The natural choice for data science and ML workflows.

Full comparison: AstraeaDB vs. the field

Capability	AstraeaDB	Neo4j	TigerGraph	ArangoDB	Memgraph
Language	Rust	Java	C++	C++	C++
Data model	Vector-Property Graph	Property Graph	Property Graph	Multi-model (Doc+Graph)	Property Graph
Query language	GQL / Cypher	Cypher	GSQL	AQL	Cypher
Vector search	Built-in HNSW (first-class)	Vector index (added 2023)	No	No	No
GraphRAG	Built-in engine	Plugin (LangChain)	No	No	No
GNN training	Built-in tensors + message passing	Export to PyTorch Geometric	Export to DGL	No	No
Temporal graphs	Native validity intervals	Manual (property-based)	Manual	Manual	Manual
Storage tiers	Cold / Warm / Hot with pointer swizzling	Page cache	Distributed in-memory	RocksDB	In-memory only
Transport protocols	JSON-TCP + gRPC + Arrow Flight	Bolt	REST	HTTP / VelocyStream	Bolt
Encryption	Homomorphic (FHE on labels)	TLS + at-rest	TLS + at-rest	TLS + at-rest	TLS
GC pauses	None (Rust)	JVM GC pauses	None (C++)	None (C++)	None (C++)
Graph algorithms	PageRank, Louvain, Centrality, Components	GDS library (paid)	Built-in (extensive)	Pregel-based	MAGE library
License	MIT (open source)	GPL / Commercial	Commercial	Apache 2.0	BSL / Commercial

The unified advantage: The defining characteristic of AstraeaDB is unification. Rather than bolting vector search onto a graph engine, or exporting graph data to a separate ML framework, AstraeaDB treats graph traversal, vector similarity, temporal queries, and neural network training as facets of a single system. This eliminates data movement, reduces operational complexity, and enables novel operations like semantic traversal and in-database GNN training that are impossible when these capabilities live in separate systems.

When to choose AstraeaDB

AstraeaDB is the strongest fit when your workload requires two or more of the following:

Deep graph traversals (3+ hops) with low latency
Vector similarity search integrated with graph structure
LLM-powered question answering grounded in graph knowledge (GraphRAG)
Time-varying relationships that need historical queries
GNN model training without data export
Predictable latency without GC pauses
An open-source, MIT-licensed foundation with no enterprise-gated features

If your workload is purely document-oriented (no meaningful relationships), a document database like MongoDB is a better fit. If your focus is RDF and ontology reasoning, a triple store like Stardog or GraphDB is more appropriate. AstraeaDB excels specifically where connections, semantics, and computation converge.
