Chapter 2: The Graph Database Landscape
Not all graph databases are alike. They differ in their data models, query languages, storage architectures, and intended workloads. This chapter surveys the landscape so you can make informed decisions—and understand exactly where AstraeaDB fits.
2.1 Property Graphs vs. RDF
The graph database world is divided into two major data model families. Understanding the difference is important because it shapes the query language, the API, and the kinds of problems each model handles naturally.
The Property Graph model
In a Property Graph, nodes carry labels and key-value properties, and edges are directed, typed, and can also carry properties. This is the model used by Neo4j, Memgraph, TigerGraph, and AstraeaDB. It is the dominant model in the industry for transactional and analytical graph workloads.
A Property Graph representation of "Alice knows Bob since 2019":
    // Two nodes with labels and properties
    (alice:Person {name: "Alice", age: 30})
    (bob:Person {name: "Bob", age: 32})

    // One directed edge with a type and properties
    (alice)-[:KNOWS {since: 2019}]->(bob)
Key characteristics:
- Nodes and edges are first-class citizens with their own identity
- Properties are native key-value pairs—no need for reification or blank nodes
- Edge properties allow rich metadata (weight, timestamp, confidence score) directly on relationships
- Intuitive for application developers—maps naturally to objects and references in code
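To make the "objects and references" point concrete, here is a minimal in-memory sketch of the property graph model in Python. The class and method names (`Node`, `Edge`, `PropertyGraph`, `add_node`, `add_edge`, `out_edges`) are illustrative assumptions, not AstraeaDB's API or storage layout.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Node:
    node_id: int                      # internal identity, independent of properties
    labels: set[str]
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    edge_id: int                      # edges have their own identity too
    source: int                       # node_id of the start node
    target: int                       # node_id of the end node
    edge_type: str
    properties: dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.nodes: dict[int, Node] = {}
        self.edges: dict[int, Edge] = {}

    def add_node(self, labels, **properties) -> Node:
        node = Node(len(self.nodes), set(labels), properties)
        self.nodes[node.node_id] = node
        return node

    def add_edge(self, source: Node, target: Node, edge_type: str, **properties) -> Edge:
        edge = Edge(len(self.edges), source.node_id, target.node_id, edge_type, properties)
        self.edges[edge.edge_id] = edge
        return edge

    def out_edges(self, node: Node, edge_type: str) -> list[Edge]:
        """Outgoing edges of one type; direction matters."""
        return [e for e in self.edges.values()
                if e.source == node.node_id and e.edge_type == edge_type]


# "Alice knows Bob since 2019"
graph = PropertyGraph()
alice = graph.add_node(["Person"], name="Alice", age=30)
bob = graph.add_node(["Person"], name="Bob", age=32)
knows = graph.add_edge(alice, bob, "KNOWS", since=2019)
```

The `since` value lives directly on the edge object, which is the property the RDF model lacks.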
The RDF model
RDF (Resource Description Framework) represents everything as subject-predicate-object triples. Resources and predicates are identified by URIs, while objects may also be literal values such as strings or integers. Relationships do not carry properties directly; to attach metadata you must use reification (creating a node that represents the relationship itself).
The same "Alice knows Bob since 2019" in RDF (Turtle syntax):
    # Prefix declarations
    @prefix ex: <http://example.org/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Basic triple: Alice knows Bob
    ex:alice foaf:knows ex:bob .

    # Node properties require separate triples
    ex:alice foaf:name "Alice" .
    ex:alice ex:age "30"^^xsd:integer .
    ex:bob foaf:name "Bob" .
    ex:bob ex:age "32"^^xsd:integer .

    # Edge properties require reification (a new node for the relationship)
    ex:friendship1 a rdf:Statement ;
        rdf:subject ex:alice ;
        rdf:predicate foaf:knows ;
        rdf:object ex:bob ;
        ex:since "2019"^^xsd:integer .
Key characteristics:
- Universal: resources are identified by URIs, enabling global data interchange across organizations
- Standardized: W3C standards (RDF, RDFS, OWL) provide formal semantics and reasoning
- Verbose: simple relationships require multiple triples; edge properties need reification
- Strongest in academic, governmental, and linked data/semantic web applications
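The cost of reification is easy to see once triples are treated as plain data. The sketch below models a triple store as a Python set of `(subject, predicate, object)` tuples; the helpers `reify` and `match` are hypothetical, and literals are simplified to ordinary Python values rather than typed RDF literals.

```python
EX = "http://example.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A triple store reduced to its essence: a set of (subject, predicate, object).
triples = {
    (EX + "alice", FOAF + "knows", EX + "bob"),
    (EX + "alice", FOAF + "name", "Alice"),
    (EX + "bob", FOAF + "name", "Bob"),
}


def reify(store, statement_id, triple, **metadata):
    """Describe `triple` with a statement node so metadata can be attached to it."""
    subject, predicate, obj = triple
    store.update({
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    })
    for key, value in metadata.items():
        store.add((statement_id, EX + key, value))


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


before = len(triples)
reify(triples, EX + "friendship1",
      (EX + "alice", FOAF + "knows", EX + "bob"), since=2019)
extra_triples = len(triples) - before  # 4 reification triples + 1 for `since`
```

One edge property costs five additional triples here, against a single key-value pair on the edge in a property graph.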
Comparison
| Aspect | Property Graph | RDF |
|---|---|---|
| Node identity | Internal ID + labels | URI (globally unique) |
| Edge properties | Native key-value pairs | Requires reification |
| Schema | Optional (schema-free or enforced) | RDFS/OWL ontologies |
| Query language | Cypher, GQL, Gremlin | SPARQL |
| Reasoning | Not built-in | OWL inference, entailment |
| Developer ergonomics | Intuitive for app developers | Steeper learning curve |
| Data interchange | Vendor-specific formats | Universal (URIs, standards) |
| Primary audience | Application development, analytics | Linked data, semantic web, research |
2.2 Query Languages
The query language you use determines how you express graph patterns, traversals, and mutations. Here are the four major graph query languages and how they compare:
Cypher (Neo4j)
Cypher pioneered ASCII art pattern matching: you draw the graph pattern you want to find using parentheses for nodes and arrows for edges. It is declarative—you describe what you want, not how to get it.
    // Find Alice's friends who are older than 25
    MATCH (a:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
    WHERE friend.age > 25
    RETURN friend.name, friend.age
    ORDER BY friend.age DESC
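Cypher is the most widely adopted graph query language. Its pattern syntax is intuitive and readable, even for developers new to graph databases. However, it originated at Neo4j and has never been an official standard in its own right: the openCypher project publishes a community specification, and formal standardization arrived only with GQL (described below).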
Gremlin (Apache TinkerPop)
Gremlin takes an imperative, step-based approach. You compose a traversal by chaining steps that describe how to walk the graph. It is the traversal language of the Apache TinkerPop framework; the reference implementation runs on the JVM, with language variants for Python, JavaScript, .NET, and Go.
    // Same query in Gremlin
    g.V().has('Person', 'name', 'Alice')
      .out('KNOWS')
      .hasLabel('Person')
      .has('age', gt(25))
      .order().by('age', desc)
      .valueMap('name', 'age')
Gremlin's imperative style gives fine-grained control over traversal execution and is well-suited for procedural graph algorithms. However, complex patterns are harder to read than Cypher's visual syntax, and performance depends on step ordering.
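The step-based idea can be sketched in a few lines of Python: each step consumes the current set of vertices and produces the next one. The `Traversal` class below is a toy written for this illustration and is not the TinkerPop API; Carol and Dave are made-up sample data.

```python
class Traversal:
    """Toy step-based traversal over dict-based vertices (not the TinkerPop API)."""

    def __init__(self, vertices, edges, current=None):
        self.vertices = vertices   # id -> {"label": ..., property: value, ...}
        self.edges = edges         # list of (source_id, edge_label, target_id)
        self.current = list(vertices) if current is None else current

    def _next(self, ids):
        return Traversal(self.vertices, self.edges, ids)

    def has(self, key, predicate):
        """Keep vertices whose property `key` satisfies `predicate`."""
        return self._next([v for v in self.current
                           if key in self.vertices[v]
                           and predicate(self.vertices[v][key])])

    def out(self, edge_label):
        """Move along outgoing edges with the given label."""
        return self._next([target for v in self.current
                           for (source, label, target) in self.edges
                           if source == v and label == edge_label])

    def order_by(self, key, descending=False):
        return self._next(sorted(self.current,
                                 key=lambda v: self.vertices[v][key],
                                 reverse=descending))

    def values(self, *keys):
        return [tuple(self.vertices[v][k] for k in keys) for v in self.current]


vertices = {
    1: {"label": "Person", "name": "Alice", "age": 30},
    2: {"label": "Person", "name": "Bob", "age": 32},
    3: {"label": "Person", "name": "Carol", "age": 22},
    4: {"label": "Person", "name": "Dave", "age": 41},
}
edges = [(1, "KNOWS", 2), (1, "KNOWS", 3), (1, "KNOWS", 4)]

result = (Traversal(vertices, edges)
          .has("name", lambda name: name == "Alice")  # narrow to one start vertex first
          .out("KNOWS")
          .has("age", lambda age: age > 25)
          .order_by("age", descending=True)
          .values("name", "age"))
```

Because each step works on the output of the previous one, placing the selective `has` filter before `out` keeps the intermediate set small, which is why step ordering affects performance.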
SPARQL (W3C)
SPARQL is the W3C standard for querying RDF data. It uses triple pattern matching with a SQL-like SELECT syntax.
    # Same query in SPARQL (RDF world)
    PREFIX ex: <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    SELECT ?friendName ?friendAge
    WHERE {
      ex:alice foaf:knows ?friend .
      ?friend foaf:name ?friendName .
      ?friend ex:age ?friendAge .
      FILTER(?friendAge > 25)
    }
    ORDER BY DESC(?friendAge)
SPARQL excels in federated queries across distributed RDF endpoints and in environments with formal ontologies. It is the standard in academic, governmental, and linked open data communities, but it is rarely used for application-level graph databases.
GQL (ISO/IEC 39075: The New Standard)
GQL (Graph Query Language) is the ISO standard for querying property graphs, published in April 2024 as ISO/IEC 39075. It combines Cypher-style pattern matching with SQL's clause structure in a vendor-neutral specification. Key design goals include:
- Declarative pattern matching with ASCII art syntax (inherited from Cypher)
- Composable graph queries that can be chained like SQL subqueries
- First-class path expressions for variable-length traversals
- Integration with SQL for hybrid relational/graph queries
- Prevention of vendor lock-in through formal standardization
    // GQL syntax (AstraeaDB implements this)
    MATCH (a:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
    WHERE friend.age > 25
    RETURN friend.name, friend.age
    ORDER BY friend.age DESC
If you know Cypher, GQL will look immediately familiar. The core pattern matching syntax is compatible, while the standard adds formal grammar rules, composability features, and a path toward SQL integration.
Language comparison at a glance
| Language | Paradigm | Data Model | Standardized | Primary Ecosystem |
|---|---|---|---|---|
| Cypher | Declarative, pattern-based | Property Graph | openCypher (community) | Neo4j, Memgraph, RedisGraph |
| Gremlin | Imperative, step-based | Property Graph | Apache TinkerPop | JanusGraph, Amazon Neptune, Azure Cosmos DB |
| SPARQL | Declarative, triple-pattern | RDF | W3C Standard | Blazegraph, GraphDB, Stardog |
| GQL | Declarative, pattern-based | Property Graph | ISO Standard (2024) | AstraeaDB, emerging adoption |
AstraeaDB's GQL implementation supports MATCH with node and edge pattern matching, WHERE filtering with boolean expressions, CREATE and DELETE for mutations, RETURN with expressions, ORDER BY, LIMIT, and aggregation functions (count(), sum(), avg(), min(), max()). Chapter 5 covers the query language in detail.
2.3 What Makes AstraeaDB Different
AstraeaDB is not another Neo4j clone. It was designed from scratch to address the shortcomings of existing graph databases, especially in the areas of AI integration, cloud-native storage, and performance. Here are the key differentiators:
The Vector-Property Graph
Most graph databases treat vector search as a bolt-on feature—an afterthought added to an existing architecture. In AstraeaDB, embeddings are first-class citizens. Every node can carry a float32 embedding vector alongside its labels and JSON properties. The HNSW (Hierarchical Navigable Small World) vector index is integrated directly into the graph structure: the navigation links in the vector index are graph edges. This unified architecture enables:
- Hybrid search: blend graph proximity (structural distance) with vector similarity (semantic distance) using a configurable alpha parameter
- Semantic traversal: walk the graph greedily toward a target concept embedding, combining structural and semantic intelligence at each hop
- Zero-overhead embeddings: no separate vector database, no ETL pipeline, no synchronization headaches
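As a rough illustration of how an alpha parameter can blend the two signals, the sketch below scores candidates with a weighted sum of cosine similarity and a hop-based proximity term. The formula, the direction of alpha (1.0 meaning purely semantic), and the function names are all assumptions made for this example; AstraeaDB's actual scoring may differ.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def graph_proximity(hops):
    """Map hop distance to (0, 1]: 0 hops gives 1.0, farther nodes score lower."""
    return 1.0 / (1.0 + hops)


def hybrid_score(query_vec, node_vec, hops, alpha):
    """Assumed convention: alpha=1.0 is purely semantic, alpha=0.0 purely structural."""
    return (alpha * cosine_similarity(query_vec, node_vec)
            + (1.0 - alpha) * graph_proximity(hops))


def rank(query_vec, candidates, alpha):
    """Order candidate names by descending hybrid score."""
    scored = [(hybrid_score(query_vec, vec, hops, alpha), name)
              for name, (vec, hops) in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


query = [1.0, 0.0]
candidates = {
    # name: (embedding, hops from the start node)
    "near_but_unrelated": ([0.0, 1.0], 1),
    "far_but_similar": ([1.0, 0.0], 4),
}

semantic_first = rank(query, candidates, alpha=0.9)
structural_first = rank(query, candidates, alpha=0.1)
```

With a high alpha the distant but semantically similar node ranks first; with a low alpha the nearby node wins, which is the trade-off the parameter exposes.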
AI-First Architecture
AstraeaDB is built for the AI era. Beyond vector search, it includes:
- GraphRAG: a built-in Retrieval-Augmented Generation engine that finds an anchor node via vector search, extracts a subgraph with BFS, linearizes it to text, and feeds it to an LLM—all in one atomic operation
- GNN training: differentiable tensors and message-passing layers built directly into the database, enabling node classification training without exporting data
- Semantic walk: a traversal primitive that combines vector similarity with edge traversal, letting you explore the graph in the direction of a concept
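The GraphRAG steps listed above (anchor lookup, BFS extraction, linearization) can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: a brute-force nearest-neighbour search stands in for the HNSW index, the LLM call is omitted, all function names are hypothetical, and "Acme Corp" and `WORKS_AT` are made-up sample data.

```python
import math
from collections import deque

nodes = {
    "alice": {"name": "Alice", "embedding": [0.9, 0.1]},
    "bob": {"name": "Bob", "embedding": [0.2, 0.8]},
    "acme": {"name": "Acme Corp", "embedding": [0.5, 0.5]},
}
edges = [("alice", "KNOWS", "bob"), ("bob", "WORKS_AT", "acme")]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_anchor(query_embedding):
    """Step 1: nearest node by cosine similarity (brute force, standing in for HNSW)."""
    return max(nodes, key=lambda n: cosine(query_embedding, nodes[n]["embedding"]))


def bfs_subgraph(anchor, max_hops):
    """Step 2: collect outgoing edges reachable from the anchor within max_hops."""
    seen, queue, collected = {anchor}, deque([(anchor, 0)]), []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for source, edge_type, target in edges:
            if source == node:
                collected.append((source, edge_type, target))
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return collected


def linearize(subgraph):
    """Step 3: turn edges into plain text that a language model can read."""
    return "\n".join(f"{nodes[s]['name']} -[{t}]-> {nodes[o]['name']}"
                     for s, t, o in subgraph)


def build_prompt(question, query_embedding, max_hops=2):
    """Step 4: assemble the prompt; sending it to an LLM is left out of this sketch."""
    context = linearize(bfs_subgraph(find_anchor(query_embedding), max_hops))
    return f"Context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("Where does Alice's friend work?", [1.0, 0.0])
```

The hop limit bounds how much context reaches the model, so it acts as the main knob for trading answer coverage against prompt size.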
Rust Performance
AstraeaDB is written entirely in Rust, delivering:
- Zero garbage collection pauses: unlike Java-based graph databases (Neo4j, JanusGraph), Rust's ownership model eliminates GC stop-the-world events
- Memory safety without overhead: Rust's borrow checker prevents use-after-free, double-free, and data race bugs at compile time—critical for a database that manages memory-mapped pages and concurrent traversals
- Fearless concurrency: Rust's type system rejects data races in safe code at compile time, so the storage engine, query executor, and network server can share data structures across threads and avoid mutex contention where the design allows
- Predictable latency: no GC pauses, no JIT warmup, no interpreted overhead—consistent sub-millisecond response times for hot-path queries
Three-Tier Storage
AstraeaDB solves the "Memory Wall" problem—the tension between needing random-access speed for traversals and wanting cloud-native separation of compute and storage:
- Tier 1 — Cold (Object Storage): data persists in JSON, Apache Parquet, or cloud object stores (S3, GCS, Azure). Open formats ensure long-term interoperability.
- Tier 2 — Warm (NVMe Buffer Pool): an LRU buffer pool caches 8 KiB pages from disk with pin/unpin semantics. Pluggable I/O backends support memmap2 (cross-platform) and io_uring (Linux async I/O).
- Tier 3 — Hot (Pointer Swizzling): frequently accessed subgraphs are promoted into RAM. 64-bit disk page IDs are replaced with direct memory pointers for nanosecond-level traversal. The HNSW vector index lives entirely in this tier.
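The pin/unpin semantics of the warm tier can be illustrated with a toy LRU buffer pool. This is a conceptual sketch in Python, not AstraeaDB's Rust implementation: the `BufferPool` class, its method names, and the `read_page` callback are assumptions made for the example. A pinned page is in use and must not be evicted; an unpinned page becomes an eviction candidate in least-recently-used order.

```python
from collections import OrderedDict

PAGE_SIZE = 8 * 1024  # 8 KiB pages, matching the page size described above


class BufferPool:
    """Toy LRU buffer pool with pin/unpin semantics (illustrative only)."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # callable: page_id -> bytes (the "disk")
        self.frames = OrderedDict()  # page_id -> [data, pin_count], LRU first

    def pin(self, page_id):
        """Return the page's bytes, loading it if needed, and mark it in use."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)  # now the most recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[page_id] = [self.read_page(page_id), 0]
        self.frames[page_id][1] += 1
        return self.frames[page_id][0]

    def unpin(self, page_id):
        """Release one pin; the page may be evicted once its count reaches zero."""
        frame = self.frames[page_id]
        if frame[1] == 0:
            raise ValueError(f"page {page_id} is not pinned")
        frame[1] -= 1

    def _evict(self):
        """Drop the least recently used page that nobody has pinned."""
        victim = next((pid for pid, (_, pins) in self.frames.items() if pins == 0), None)
        if victim is None:
            raise RuntimeError("all pages are pinned; cannot evict")
        del self.frames[victim]


reads = []  # records which pages were fetched from the simulated disk


def read_page(page_id):
    reads.append(page_id)
    return bytes(PAGE_SIZE)


pool = BufferPool(capacity=2, read_page=read_page)
pool.pin(1)      # page 1 stays pinned
pool.pin(2)
pool.unpin(2)    # page 2 is now evictable
pool.pin(3)      # pool is full: page 2 is evicted because page 1 is pinned
```

Pointer swizzling in the hot tier goes one step further than this cache: instead of looking a page up by ID on every hop, the ID stored in the structure is replaced with a direct in-memory reference.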
Three Transport Protocols
No single protocol fits every use case. AstraeaDB offers three:
- JSON-TCP (port 7687): simple, human-readable, zero-dependency. Ideal for getting started, scripting, and lightweight clients.
- gRPC (port 7688): strongly typed, protobuf-serialized, bidirectional streaming. Best for production microservices that need schema enforcement and code generation.
- Apache Arrow Flight (port 7689): zero-copy columnar data transfer. Stream query results directly into Pandas or Polars DataFrames without serialization overhead. The natural choice for data science and ML workflows.
Full comparison: AstraeaDB vs. the field
| Capability | AstraeaDB | Neo4j | TigerGraph | ArangoDB | Memgraph |
|---|---|---|---|---|---|
| Language | Rust | Java | C++ | C++ | C++ |
| Data model | Vector-Property Graph | Property Graph | Property Graph | Multi-model (Doc+Graph) | Property Graph |
| Query language | GQL / Cypher | Cypher | GSQL | AQL | Cypher |
| Vector search | Built-in HNSW (first-class) | Vector index (added 2023) | No | No | No |
| GraphRAG | Built-in engine | Plugin (LangChain) | No | No | No |
| GNN training | Built-in tensors + message passing | Export to PyTorch Geometric | Export to DGL | No | No |
| Temporal graphs | Native validity intervals | Manual (property-based) | Manual | Manual | Manual |
| Storage tiers | Cold / Warm / Hot with pointer swizzling | Page cache | Distributed in-memory | RocksDB | In-memory only |
| Transport protocols | JSON-TCP + gRPC + Arrow Flight | Bolt | REST | HTTP / VelocyStream | Bolt |
| Encryption | Homomorphic (FHE on labels) | TLS + at-rest | TLS + at-rest | TLS + at-rest | TLS |
| GC pauses | None (Rust) | JVM GC pauses | None (C++) | None (C++) | None (C++) |
| Graph algorithms | PageRank, Louvain, Centrality, Components | GDS library (paid) | Built-in (extensive) | Pregel-based | MAGE library |
| License | MIT (open source) | GPL / Commercial | Commercial | Apache 2.0 | BSL / Commercial |
When to choose AstraeaDB
AstraeaDB is the strongest fit when your workload requires two or more of the following:
- Deep graph traversals (3+ hops) with low latency
- Vector similarity search integrated with graph structure
- LLM-powered question answering grounded in graph knowledge (GraphRAG)
- Time-varying relationships that need historical queries
- GNN model training without data export
- Predictable latency without GC pauses
- An open-source, MIT-licensed foundation with no enterprise-gated features
If your workload is purely document-oriented (no meaningful relationships), a document database like MongoDB is a better fit. If your focus is RDF and ontology reasoning, a triple store like Stardog or GraphDB is more appropriate. AstraeaDB excels specifically where connections, semantics, and computation converge.