Chapter 8: Vector Search and Semantic Queries
AstraeaDB is not just a graph database—it is a Vector-Property Graph. Every node can carry a dense embedding alongside its properties and labels, enabling a family of queries that blend structural graph traversal with semantic similarity.
8.1 What Are Vector Embeddings?
A vector embedding is a fixed-size array of floating-point numbers (typically 128 to 1536 dimensions) that represents the "meaning" of a piece of content—a sentence, a document, an image, a user profile, or any concept you want to reason about computationally.
The core insight
Embedding models are trained so that similar things have similar vectors. If you embed the sentences "The cat sat on the mat" and "A kitten rested on the rug," their vectors will be very close together in high-dimensional space, even though the sentences share almost no words. Conversely, "Stock prices fell sharply" will produce a vector far from both.
Distance metrics
"Closeness" is measured by a distance function. AstraeaDB supports three:
| Metric | Formula | Range | When to use |
|---|---|---|---|
| Cosine distance | 1 - (A . B) / (\|A\| * \|B\|) | 0 (identical) to 2 (opposite) | Text embeddings (most common) |
| Euclidean (L2) | sqrt(sum((a_i - b_i)^2)) | 0 to infinity | Image embeddings, spatial data |
| Dot product | A . B | -infinity to infinity | Pre-normalized vectors, ranking |
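The three metrics are simple enough to compute by hand, which helps build intuition for what each one rewards. A minimal sketch in plain Python (no AstraeaDB involved) on toy 2D vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - (A . B) / (|A| * |B|): 0 for identical direction, 2 for opposite
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # sqrt(sum((a_i - b_i)^2)): straight-line distance between the points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [0.0, 1.0]   # orthogonal to a
c = [2.0, 0.0]   # same direction as a, larger magnitude

print(cosine_distance(a, c))  # 0.0 -- direction matters, magnitude does not
print(cosine_distance(a, b))  # 1.0 -- orthogonal vectors
print(euclidean(a, c))        # 1.0 -- magnitude matters here
print(dot(a, c))              # 2.0 -- rewards both alignment and magnitude
```

Note how `a` and `c` are identical under cosine distance but a full unit apart under Euclidean distance; this is why cosine is preferred for text embeddings, where vector length is usually an artifact of the encoder rather than meaning.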
Visual intuition
Imagine a 2D scatter plot (a simplified projection of the actual high-dimensional space): sentences about cats and kittens cluster in one region, sentences about stock prices in another, and the distance between any two points tracks how different their meanings are.
Why this matters for graph databases
Traditional graph queries answer structural questions: "Who is connected to whom?" Vector search answers semantic questions: "What is conceptually similar to X?" AstraeaDB lets you ask both at once: "Among Alice's 2-hop neighbors, which ones are most semantically related to 'machine learning'?" This combination—called hybrid search—is the foundation for GraphRAG, recommendation engines, and knowledge discovery.
8.2 Adding Embeddings to Nodes
When you create a node, you can attach an embedding by passing a float32 array. The embedding is stored alongside the node's labels and properties, and is automatically indexed for fast approximate nearest-neighbor search.
You can generate embeddings with any model you like: OpenAI's text-embedding-3-small, Sentence Transformers, CLIP, or any other encoder. The only requirement is that all embeddings in a given search share the same dimensionality.
```python
from astraeadb.client import JsonClient

client = JsonClient("localhost")
client.connect()

# Generate an embedding with your model of choice
# (here we use a placeholder 128-dim vector for illustration)
embedding = [0.12, -0.45, 0.78, 0.03, -0.91] + [0.0] * 123  # 128 dims

# Create a node with labels, properties, AND an embedding
node_id = client.create_node(
    ["Document"],
    {"title": "Graph Databases", "content": "An introduction to..."},
    embedding=embedding
)
print("Created node:", node_id)

# You can also update an embedding on an existing node
new_embedding = [0.22, -0.33, 0.55] + [0.0] * 125
client.set_embedding(node_id, new_embedding)
```
```r
library(astraea)

client <- AstraeaClient$new("localhost")
client$connect()

# Generate an embedding (placeholder 128-dim vector)
embedding <- c(0.12, -0.45, 0.78, 0.03, -0.91, rep(0, 123))

# Create a node with labels, properties, and embedding
node_id <- client$create_node(
  labels = c("Document"),
  properties = list(title = "Graph Databases", content = "An introduction to..."),
  embedding = embedding
)
cat("Created node:", node_id, "\n")

# Update an embedding on an existing node
new_embedding <- c(0.22, -0.33, 0.55, rep(0, 125))
client$set_embedding(node_id, new_embedding)
```
```go
package main

import (
	"fmt"
	"log"

	astraea "github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
	client, err := astraea.NewJSONClient("localhost", 7687)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Generate an embedding (placeholder 128-dim vector)
	embedding := make([]float32, 128)
	embedding[0] = 0.12
	embedding[1] = -0.45
	embedding[2] = 0.78

	// Create a node with labels, properties, and embedding
	props := map[string]interface{}{
		"title":   "Graph Databases",
		"content": "An introduction to...",
	}
	nodeID, _ := client.CreateNode(
		[]string{"Document"},
		props,
		astraea.WithEmbedding(embedding),
	)
	fmt.Println("Created node:", nodeID)
}
```
```java
import com.astraeadb.client.JsonClient;
import java.util.*;

public class EmbeddingExample {
    public static void main(String[] args) {
        JsonClient client = new JsonClient("localhost", 7687);
        client.connect();

        // Generate an embedding (placeholder 128-dim vector)
        float[] embedding = new float[128];
        embedding[0] = 0.12f;
        embedding[1] = -0.45f;
        embedding[2] = 0.78f;

        // Create a node with labels, properties, and embedding
        Map<String, Object> props = Map.of(
            "title", "Graph Databases",
            "content", "An introduction to..."
        );
        String nodeId = client.createNode(
            List.of("Document"),
            props,
            embedding
        );
        System.out.println("Created node: " + nodeId);

        client.close();
    }
}
```
8.3 Vector Search (k-NN)
Once nodes carry embeddings, you can find the k most similar nodes to a query vector. This is called k-nearest-neighbor (k-NN) search. AstraeaDB uses an HNSW (Hierarchical Navigable Small World) index to answer these queries in sub-millisecond time, even on millions of vectors.
How HNSW works (in brief)
HNSW builds a multi-layer graph over the vectors. The top layer is sparse and enables long-range jumps; the bottom layer is dense and enables fine-grained search. A query starts at the top, greedily walks toward the nearest vector, then descends to the next layer and repeats. This achieves logarithmic search complexity—dramatically faster than scanning every vector.
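The greedy walk at the core of HNSW can be illustrated on a single layer. The toy sketch below (not AstraeaDB's actual index, and omitting the multi-layer descent and candidate lists of real HNSW) starts at an entry point and repeatedly hops to whichever neighbor is closer to the query, stopping at a local minimum:

```python
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(vectors, graph, entry, query):
    """Greedy walk on one layer: hop to the closest neighbor until stuck."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: l2(vectors[n], query))
        if l2(vectors[best], query) >= l2(vectors[current], query):
            return current  # local minimum: no neighbor is closer
        current = best

# A tiny 2D "index": five points on a line, chained, plus one long-range edge
vectors = {0: [0, 0], 1: [1, 0], 2: [2, 0], 3: [3, 0], 4: [4, 0]}
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3]}

# The long-range edge 0 -> 3 lets the walk skip nodes 1 and 2 entirely,
# which is exactly the role of the sparse upper layers in real HNSW.
print(greedy_search(vectors, graph, entry=0, query=[3.9, 0]))  # 4
```

The walk reaches node 4 in two hops instead of four, showing why long-range shortcuts make the search roughly logarithmic rather than linear.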
API
Call vector_search(query_vector, k) and receive a list of results, each containing the node ID, distance, labels, and properties.
```python
# Suppose we have an embedding for the query "machine learning"
query_vec = model.encode("machine learning")  # returns a list of floats

# Find the 10 most similar nodes
results = client.vector_search(query_vec, k=10)

for r in results:
    print(f"Node {r['node_id']} dist={r['distance']:.4f} "
          f"labels={r['labels']} title={r['properties']['title']}")

# Example output:
# Node 42 dist=0.0812 labels=['Document'] title=Introduction to ML
# Node 17 dist=0.1034 labels=['Document'] title=Deep Learning Basics
# Node 91 dist=0.1567 labels=['Document'] title=Neural Network Architectures
```
```r
# Suppose we have an embedding for the query "machine learning"
query_vec <- model$encode("machine learning")

# Find the 10 most similar nodes
results <- client$vector_search(query_vec, k = 10)

for (r in results) {
  cat(sprintf("Node %s dist=%.4f title=%s\n",
              r$node_id, r$distance, r$properties$title))
}
```
```go
// Suppose queryVec is a []float32 from your embedding model
results, _ := client.VectorSearch(queryVec, 10)

for _, r := range results {
	fmt.Printf("Node %s dist=%.4f labels=%v title=%s\n",
		r.NodeID, r.Distance, r.Labels, r.Properties["title"])
}
```
```java
// Suppose queryVec is a float[] from your embedding model
List<VectorResult> results = client.vectorSearch(queryVec, 10);

for (VectorResult r : results) {
    System.out.printf("Node %s dist=%.4f labels=%s title=%s%n",
        r.getNodeId(), r.getDistance(), r.getLabels(),
        r.getProperties().get("title"));
}
```
8.4 Hybrid Search
Pure vector search ignores graph structure. Pure graph traversal ignores semantic meaning. Hybrid search combines both: it finds nodes that are close in the graph to an anchor node and close in vector space to a query concept.
How it works
Hybrid search performs a breadth-first traversal outward from an anchor node, bounded by max_hops, and scores every node it reaches by blending two signals: graph proximity (fewer hops from the anchor is better) and vector similarity to the query (smaller embedding distance is better). The alpha parameter controls the blend.
The alpha parameter
| Alpha | Behavior | Use case |
|---|---|---|
| 0.0 | Pure graph proximity (closest by hops) | Structural exploration |
| 0.5 | Balanced blend | General-purpose (recommended default) |
| 1.0 | Pure vector similarity (ignores graph distance) | Semantic search within a subgraph |
Example: "Find nodes near Alice that relate to machine learning"
```python
# Embed the concept we are searching for
ml_vector = model.encode("machine learning")

# Hybrid search: anchor on Alice, blend graph + vector
results = client.hybrid_search(
    anchor=alice_id,           # start node
    query_vector=ml_vector,    # semantic target
    max_hops=3,                # graph radius
    k=10,                      # number of results
    alpha=0.5                  # blend factor
)

for r in results:
    print(f"Node {r['node_id']} hops={r['hops']} "
          f"vector_dist={r['vector_distance']:.4f} "
          f"score={r['score']:.4f} "
          f"title={r['properties']['title']}")

# Example output:
# Node 88 hops=2 vector_dist=0.0912 score=0.7877 title=ML Pipeline Design
# Node 45 hops=1 vector_dist=0.2134 score=0.7266 title=Data Science Team
# Node 67 hops=3 vector_dist=0.0501 score=0.7249 title=Deep Learning Lab
```
```r
# Embed the concept we are searching for
ml_vector <- model$encode("machine learning")

# Hybrid search: anchor on Alice, blend graph + vector
results <- client$hybrid_search(
  anchor = alice_id,
  query_vector = ml_vector,
  max_hops = 3,
  k = 10,
  alpha = 0.5
)

for (r in results) {
  cat(sprintf("Node %s hops=%d vector_dist=%.4f score=%.4f title=%s\n",
              r$node_id, r$hops, r$vector_distance, r$score,
              r$properties$title))
}
```
```go
// Embed the concept we are searching for
mlVector := model.Encode("machine learning")

// Hybrid search: anchor on Alice, blend graph + vector
results, _ := client.HybridSearch(astraea.HybridSearchParams{
	Anchor:      aliceID,
	QueryVector: mlVector,
	MaxHops:     3,
	K:           10,
	Alpha:       0.5,
})

for _, r := range results {
	fmt.Printf("Node %s hops=%d vector_dist=%.4f score=%.4f title=%s\n",
		r.NodeID, r.Hops, r.VectorDistance, r.Score, r.Properties["title"])
}
```
```java
// Embed the concept we are searching for
float[] mlVector = model.encode("machine learning");

// Hybrid search: anchor on Alice, blend graph + vector
List<HybridResult> results = client.hybridSearch(
    aliceId,   // anchor
    mlVector,  // query vector
    3,         // max hops
    10,        // k
    0.5        // alpha
);

for (HybridResult r : results) {
    System.out.printf(
        "Node %s hops=%d vector_dist=%.4f score=%.4f title=%s%n",
        r.getNodeId(), r.getHops(), r.getVectorDistance(),
        r.getScore(), r.getProperties().get("title"));
}
```
8.5 Semantic Neighbors and Semantic Walk
Beyond hybrid search, AstraeaDB offers two additional semantic operations that combine vector similarity with the graph's edge structure in different ways.
Semantic Neighbors
Given a node and a concept vector, semantic neighbors ranks that node's actual graph neighbors (connected by edges) by how similar their embeddings are to the concept. Unlike hybrid search, this does not perform a BFS—it only looks at direct neighbors.
```python
# Which of Alice's outgoing neighbors are most related to "risk"?
risk_vec = model.encode("risk management")

neighbors = client.semantic_neighbors(
    node_id=alice_id,
    concept_vector=risk_vec,
    direction="outgoing",   # "outgoing", "incoming", or "both"
    k=10
)

for n in neighbors:
    print(f"{n['node_id']} similarity={n['similarity']:.4f} "
          f"edge={n['edge_label']} title={n['properties']['title']}")

# Example output:
# abc123 similarity=0.8921 edge=AUTHORED title=Risk Assessment Framework
# def456 similarity=0.7134 edge=REVIEWED title=Compliance Guidelines
```
```r
# Which of Alice's outgoing neighbors are most related to "risk"?
risk_vec <- model$encode("risk management")

neighbors <- client$semantic_neighbors(
  node_id = alice_id,
  concept_vector = risk_vec,
  direction = "outgoing",
  k = 10
)

for (n in neighbors) {
  cat(sprintf("%s similarity=%.4f edge=%s title=%s\n",
              n$node_id, n$similarity, n$edge_label, n$properties$title))
}
```
```go
// Which of Alice's outgoing neighbors are most related to "risk"?
riskVec := model.Encode("risk management")

neighbors, _ := client.SemanticNeighbors(astraea.SemanticNeighborsParams{
	NodeID:        aliceID,
	ConceptVector: riskVec,
	Direction:     "outgoing",
	K:             10,
})

for _, n := range neighbors {
	fmt.Printf("%s similarity=%.4f edge=%s title=%s\n",
		n.NodeID, n.Similarity, n.EdgeLabel, n.Properties["title"])
}
```
```java
// Which of Alice's outgoing neighbors are most related to "risk"?
float[] riskVec = model.encode("risk management");

List<SemanticNeighborResult> neighbors = client.semanticNeighbors(
    aliceId,     // node ID
    riskVec,     // concept vector
    "outgoing",  // direction
    10           // k
);

for (SemanticNeighborResult n : neighbors) {
    System.out.printf("%s similarity=%.4f edge=%s title=%s%n",
        n.getNodeId(), n.getSimilarity(), n.getEdgeLabel(),
        n.getProperties().get("title"));
}
```
Semantic Walk
A semantic walk is a greedy traversal: starting from a node, at each hop the algorithm chooses the neighbor whose embedding is most similar to the concept vector, then continues from that neighbor. The walk terminates after max_hops or when no neighbor improves the similarity score.
Think of it as "follow the gradient toward a concept through the graph's structure." The result is the path taken.
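The greedy rule is simple enough to sketch in a few lines. This is a toy client-side reimplementation over plain dicts, not AstraeaDB's server-side algorithm; the graph, embeddings, and node names are invented for illustration:

```python
import math

def cosine_sim(a, b):
    # Cosine similarity: 1 for identical direction, 0 for orthogonal
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_walk(embeddings, neighbors, start, concept, max_hops):
    """Greedy walk: at each hop, move to the neighbor most similar to concept.

    Stops after max_hops, or when no neighbor beats the current similarity.
    Returns the path as (node, similarity) pairs, including the start node.
    """
    path = [(start, cosine_sim(embeddings[start], concept))]
    current = start
    for _ in range(max_hops):
        candidates = neighbors.get(current, [])
        if not candidates:
            break
        best = max(candidates, key=lambda n: cosine_sim(embeddings[n], concept))
        best_sim = cosine_sim(embeddings[best], concept)
        if best_sim <= path[-1][1]:
            break  # no neighbor improves on the current node
        path.append((best, best_sim))
        current = best
    return path

# Toy chain n1 -> n2 -> n3 whose embeddings drift toward the concept [1, 0]
embeddings = {"n1": [0.0, 1.0], "n2": [0.5, 0.5], "n3": [1.0, 0.1]}
neighbors = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}

for node, sim in semantic_walk(embeddings, neighbors, "n1", [1.0, 0.0], 3):
    print(node, round(sim, 3))
# n1 0.0
# n2 0.707
# n3 0.995
```

Note that the greedy rule can get stuck in a local maximum: if no direct neighbor improves the similarity, the walk stops even when a better node sits two hops away.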
```python
# Starting from a "Database" topic node, walk toward "Machine Learning"
ml_vec = model.encode("machine learning")

path = client.semantic_walk(
    start=database_node_id,
    concept_vector=ml_vec,
    max_hops=3
)

print("Semantic walk path:")
for i, step in enumerate(path):
    print(f"  Hop {i}: {step['node_id']} "
          f"similarity={step['similarity']:.4f} "
          f"title={step['properties']['title']}")

# Example output:
# Semantic walk path:
#   Hop 0: n001 similarity=0.3210 title=Databases
#   Hop 1: n047 similarity=0.5892 title=Data Mining
#   Hop 2: n088 similarity=0.7831 title=Statistical Learning
#   Hop 3: n112 similarity=0.9145 title=Machine Learning Fundamentals
```
```r
# Starting from a "Database" topic node, walk toward "Machine Learning"
ml_vec <- model$encode("machine learning")

path <- client$semantic_walk(
  start = database_node_id,
  concept_vector = ml_vec,
  max_hops = 3
)

cat("Semantic walk path:\n")
for (i in seq_along(path)) {
  step <- path[[i]]
  cat(sprintf("  Hop %d: %s similarity=%.4f title=%s\n",
              i - 1, step$node_id, step$similarity, step$properties$title))
}
```
```go
// Starting from a "Database" topic node, walk toward "Machine Learning"
mlVec := model.Encode("machine learning")

path, _ := client.SemanticWalk(astraea.SemanticWalkParams{
	Start:         databaseNodeID,
	ConceptVector: mlVec,
	MaxHops:       3,
})

fmt.Println("Semantic walk path:")
for i, step := range path {
	fmt.Printf("  Hop %d: %s similarity=%.4f title=%s\n",
		i, step.NodeID, step.Similarity, step.Properties["title"])
}
```
```java
// Starting from a "Database" topic node, walk toward "Machine Learning"
float[] mlVec = model.encode("machine learning");

List<WalkStep> path = client.semanticWalk(
    databaseNodeId,  // start
    mlVec,           // concept vector
    3                // max hops
);

System.out.println("Semantic walk path:");
for (int i = 0; i < path.size(); i++) {
    WalkStep step = path.get(i);
    System.out.printf("  Hop %d: %s similarity=%.4f title=%s%n",
        i, step.getNodeId(), step.getSimilarity(),
        step.getProperties().get("title"));
}
```