Chapter 8: Vector Search and Semantic Queries

AstraeaDB is not just a graph database—it is a Vector-Property Graph. Every node can carry a dense embedding alongside its properties and labels, enabling a family of queries that blend structural graph traversal with semantic similarity.

Prerequisites

This chapter assumes you have completed Chapters 3–6 and have a running AstraeaDB instance with the Python, R, Go, or Java client installed. Familiarity with basic machine-learning concepts (embeddings, cosine similarity) is helpful but not required; we explain everything from first principles.

8.1 What Are Vector Embeddings?

A vector embedding is a fixed-size array of floating-point numbers (typically 128 to 1536 dimensions) that represents the "meaning" of a piece of content—a sentence, a document, an image, a user profile, or any concept you want to reason about computationally.

The core insight

Embedding models are trained so that similar things have similar vectors. If you embed the sentences "The cat sat on the mat" and "A kitten rested on the rug," their vectors will be very close together in high-dimensional space, even though the sentences share almost no words. Conversely, "Stock prices fell sharply" will produce a vector far from both.

Distance metrics

"Closeness" is measured by a distance function. AstraeaDB supports three:

Metric            Formula                        Range                           When to use
Cosine distance   1 - (A . B) / (|A| * |B|)      0 (identical) to 2 (opposite)   Text embeddings (most common)
Euclidean (L2)    sqrt(sum((a_i - b_i)^2))       0 to infinity                   Image embeddings, spatial data
Dot product       A . B                          -infinity to infinity           Pre-normalized vectors, ranking
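The formulas in the table can be checked directly. Here is a minimal pure-Python sketch of each metric (for illustration only; AstraeaDB computes these server-side):

```python
import math

def dot(a, b):
    # A . B
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # |A|
    return math.sqrt(dot(a, a))

def cosine_distance(a, b):
    # 1 - (A . B) / (|A| * |B|): 0 for identical direction, 2 for opposite
    return 1 - dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    # sqrt(sum((a_i - b_i)^2))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
b = [0.0, 1.0]
print(cosine_distance(a, a))   # 0.0 (identical)
print(cosine_distance(a, b))   # 1.0 (orthogonal)
print(euclidean(a, b))         # ~1.4142
print(dot(a, b))               # 0.0
```

Note that cosine distance ignores vector length and compares direction only, which is why it suits text embeddings whose magnitudes carry little meaning.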

Visual intuition

Imagine a 2D scatter plot (a simplified projection of the actual high-dimensional space):

        "queen" *       * "princess"
                 \     /
                  \   /
        "king" *---+---* "prince"
                   |
       "man" *     |     * "woman"
                   |
     "apple" *     |     * "banana"
                   |
       "car" *     |     * "truck"
    ───────────────────────────────────────

Royalty words cluster together. Fruit words cluster together. Vehicle words cluster together. "king" - "man" + "woman" ≈ "queen"

Why this matters for graph databases

Traditional graph queries answer structural questions: "Who is connected to whom?" Vector search answers semantic questions: "What is conceptually similar to X?" AstraeaDB lets you ask both at once: "Among Alice's 2-hop neighbors, which ones are most semantically related to 'machine learning'?" This combination—called hybrid search—is the foundation for GraphRAG, recommendation engines, and knowledge discovery.

8.2 Adding Embeddings to Nodes

When you create a node, you can attach an embedding by passing a float32 array. The embedding is stored alongside the node's labels and properties, and is automatically indexed for fast approximate nearest-neighbor search.

Embeddings come from external models

AstraeaDB stores and indexes embeddings but does not generate them. You produce embeddings using a model of your choice (OpenAI's text-embedding-3-small, Sentence Transformers, CLIP, or any other encoder). The only requirement is that all embeddings in a given search share the same dimensionality.

Python:

from astraeadb.client import JsonClient

client = JsonClient("localhost")
client.connect()

# Generate an embedding with your model of choice
# (here we use a placeholder 128-dim vector for illustration)
embedding = [0.12, -0.45, 0.78, 0.03, -0.91] + [0.0] * 123  # 128 dims

# Create a node with labels, properties, AND an embedding
node_id = client.create_node(
    ["Document"],
    {"title": "Graph Databases", "content": "An introduction to..."},
    embedding=embedding
)
print("Created node:", node_id)

# You can also update an embedding on an existing node
new_embedding = [0.22, -0.33, 0.55] + [0.0] * 125
client.set_embedding(node_id, new_embedding)

R:

library(astraea)

client <- AstraeaClient$new("localhost")
client$connect()

# Generate an embedding (placeholder 128-dim vector)
embedding <- c(0.12, -0.45, 0.78, 0.03, -0.91, rep(0, 123))

# Create a node with labels, properties, and embedding
node_id <- client$create_node(
  labels     = c("Document"),
  properties = list(title = "Graph Databases", content = "An introduction to..."),
  embedding  = embedding
)
cat("Created node:", node_id, "\n")

# Update an embedding on an existing node
new_embedding <- c(0.22, -0.33, 0.55, rep(0, 125))
client$set_embedding(node_id, new_embedding)

Go:

package main

import (
    "fmt"
    "log"
    astraea "github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
    client, err := astraea.NewJSONClient("localhost", 7687)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Generate an embedding (placeholder 128-dim vector)
    embedding := make([]float32, 128)
    embedding[0] = 0.12
    embedding[1] = -0.45
    embedding[2] = 0.78

    // Create a node with labels, properties, and embedding
    props := map[string]interface{}{
        "title":   "Graph Databases",
        "content": "An introduction to...",
    }
    nodeID, err := client.CreateNode(
        []string{"Document"},
        props,
        astraea.WithEmbedding(embedding),
    )
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Created node:", nodeID)
}

Java:

import com.astraeadb.client.JsonClient;
import java.util.*;

public class EmbeddingExample {
    public static void main(String[] args) {
        JsonClient client = new JsonClient("localhost", 7687);
        client.connect();

        // Generate an embedding (placeholder 128-dim vector)
        float[] embedding = new float[128];
        embedding[0] = 0.12f;
        embedding[1] = -0.45f;
        embedding[2] = 0.78f;

        // Create a node with labels, properties, and embedding
        Map<String, Object> props = Map.of(
            "title", "Graph Databases",
            "content", "An introduction to..."
        );
        String nodeId = client.createNode(
            List.of("Document"),
            props,
            embedding
        );
        System.out.println("Created node: " + nodeId);

        client.close();
    }
}

8.3 Vector Search (k-NN)

Once nodes carry embeddings, you can find the k most similar nodes to a query vector. This is called k-nearest-neighbor (k-NN) search. AstraeaDB uses an HNSW (Hierarchical Navigable Small World) index to answer these queries in sub-millisecond time, even on millions of vectors.

How HNSW works (in brief)

HNSW builds a multi-layer graph over the vectors. The top layer is sparse and enables long-range jumps; the bottom layer is dense and enables fine-grained search. A query starts at the top, greedily walks toward the nearest vector, then descends to the next layer and repeats. This achieves logarithmic search complexity—dramatically faster than scanning every vector.
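The layer-by-layer descent can be sketched in a few lines. The toy "index" below (hand-built layers over ten points on a line) is a simplified illustration of the greedy idea, not AstraeaDB's actual index implementation:

```python
import math

def dist(a, b):
    return math.dist(a, b)

# Toy index: vectors plus per-layer adjacency (layer 0 = dense bottom layer)
vectors = {i: (float(i), 0.0) for i in range(10)}   # points on a line
layers = [
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)},        # bottom: step 1
    {i: [j for j in (i - 3, i + 3) if 0 <= j < 10] for i in range(0, 10, 3)},  # top: step 3
]

def greedy_search(query, entry=0):
    """HNSW-style descent: refine the current best node layer by layer."""
    cur = entry
    for layer in reversed(layers):            # start at the sparse top layer
        improved = True
        while improved:
            improved = False
            for nb in layer.get(cur, []):
                if dist(vectors[nb], query) < dist(vectors[cur], query):
                    cur, improved = nb, True  # greedily move to the closer neighbor
    return cur

print(greedy_search((7.2, 0.0)))  # 7
```

A real HNSW index additionally keeps a beam of candidates (the ef parameter) rather than a single current node, which is what makes recall tunable at query time.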

API

Call vector_search(query_vector, k) and receive a list of results, each containing the node ID, distance, labels, and properties.

Python:

# Suppose we have an embedding for the query "machine learning"
query_vec = model.encode("machine learning")  # returns a list of floats

# Find the 10 most similar nodes
results = client.vector_search(query_vec, k=10)

for r in results:
    print(f"Node {r['node_id']}  dist={r['distance']:.4f}  "
          f"labels={r['labels']}  title={r['properties']['title']}")

# Example output:
# Node 42  dist=0.0812  labels=['Document']  title=Introduction to ML
# Node 17  dist=0.1034  labels=['Document']  title=Deep Learning Basics
# Node 91  dist=0.1567  labels=['Document']  title=Neural Network Architectures

R:

# Suppose we have an embedding for the query "machine learning"
query_vec <- model$encode("machine learning")

# Find the 10 most similar nodes
results <- client$vector_search(query_vec, k = 10)

for (r in results) {
  cat(sprintf("Node %s  dist=%.4f  title=%s\n",
              r$node_id, r$distance, r$properties$title))
}

Go:

// Suppose queryVec is a []float32 from your embedding model
results, _ := client.VectorSearch(queryVec, 10)

for _, r := range results {
    fmt.Printf("Node %s  dist=%.4f  labels=%v  title=%s\n",
        r.NodeID, r.Distance, r.Labels, r.Properties["title"])
}

Java:

// Suppose queryVec is a float[] from your embedding model
List<VectorResult> results = client.vectorSearch(queryVec, 10);

for (VectorResult r : results) {
    System.out.printf("Node %s  dist=%.4f  labels=%s  title=%s%n",
        r.getNodeId(), r.getDistance(),
        r.getLabels(), r.getProperties().get("title"));
}

8.4 Hybrid Search

Pure vector search ignores graph structure. Pure graph traversal ignores semantic meaning. Hybrid search combines both: it finds nodes that are close in the graph to an anchor node and close in vector space to a query concept.

How it works

Hybrid Search

  1. Start from anchor node (e.g., "Alice")
          │
          ▼
  2. BFS: collect all nodes within max_hops
          │
          ▼
  3. For each candidate, compute:
       graph_score  = 1 / (1 + hop_distance)
       vector_score = 1 - cosine_distance(candidate, query_vector)
          │
          ▼
  4. Blend:
       final_score = alpha * vector_score + (1 - alpha) * graph_score
          │
          ▼
  5. Return top-k by final_score
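The blend in steps 3–5 is easy to reproduce. A sketch of the stated formulas, using made-up candidate data:

```python
def hybrid_score(hop_distance, cosine_dist, alpha=0.5):
    # Steps 3-4: blend graph proximity with vector similarity
    graph_score = 1 / (1 + hop_distance)
    vector_score = 1 - cosine_dist
    return alpha * vector_score + (1 - alpha) * graph_score

# Hypothetical candidates: node -> (hop_distance, cosine distance to query)
candidates = {"a": (1, 0.21), "b": (2, 0.09), "c": (3, 0.05)}

# Step 5: rank by the blended score, best first
ranked = sorted(candidates, key=lambda n: hybrid_score(*candidates[n]), reverse=True)
print(ranked)  # ['a', 'b', 'c']
```

Note how the blend trades off the two signals: with alpha=0.5 the well-connected node "a" wins, but raising alpha to 1.0 reverses the order to ['c', 'b', 'a'] because only vector similarity then counts.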

The alpha parameter

Alpha   Behavior                                          Use case
0.0     Pure graph proximity (closest by hops)            Structural exploration
0.5     Balanced blend                                    General-purpose (recommended default)
1.0     Pure vector similarity (ignores graph distance)   Semantic search within a subgraph

Example: "Find nodes near Alice that relate to machine learning"

Python:

# Embed the concept we are searching for
ml_vector = model.encode("machine learning")

# Hybrid search: anchor on Alice, blend graph + vector
results = client.hybrid_search(
    anchor=alice_id,          # start node
    query_vector=ml_vector,   # semantic target
    max_hops=3,               # graph radius
    k=10,                     # number of results
    alpha=0.5                 # blend factor
)

for r in results:
    print(f"Node {r['node_id']}  hops={r['hops']}  "
          f"vector_dist={r['vector_distance']:.4f}  "
          f"score={r['score']:.4f}  "
          f"title={r['properties']['title']}")

# Example output:
# Node 88  hops=2  vector_dist=0.0912  score=0.7877  title=ML Pipeline Design
# Node 45  hops=1  vector_dist=0.2134  score=0.7266  title=Data Science Team
# Node 67  hops=3  vector_dist=0.0501  score=0.7249  title=Deep Learning Lab

R:

# Embed the concept we are searching for
ml_vector <- model$encode("machine learning")

# Hybrid search: anchor on Alice, blend graph + vector
results <- client$hybrid_search(
  anchor       = alice_id,
  query_vector = ml_vector,
  max_hops     = 3,
  k            = 10,
  alpha        = 0.5
)

for (r in results) {
  cat(sprintf("Node %s  hops=%d  vector_dist=%.4f  score=%.4f  title=%s\n",
              r$node_id, r$hops, r$vector_distance,
              r$score, r$properties$title))
}

Go:

// Embed the concept we are searching for
mlVector := model.Encode("machine learning")

// Hybrid search: anchor on Alice, blend graph + vector
results, _ := client.HybridSearch(astraea.HybridSearchParams{
    Anchor:      aliceID,
    QueryVector: mlVector,
    MaxHops:     3,
    K:           10,
    Alpha:       0.5,
})

for _, r := range results {
    fmt.Printf("Node %s  hops=%d  vector_dist=%.4f  score=%.4f  title=%s\n",
        r.NodeID, r.Hops, r.VectorDistance,
        r.Score, r.Properties["title"])
}

Java:

// Embed the concept we are searching for
float[] mlVector = model.encode("machine learning");

// Hybrid search: anchor on Alice, blend graph + vector
List<HybridResult> results = client.hybridSearch(
    aliceId,      // anchor
    mlVector,     // query vector
    3,            // max hops
    10,           // k
    0.5           // alpha
);

for (HybridResult r : results) {
    System.out.printf(
        "Node %s  hops=%d  vector_dist=%.4f  score=%.4f  title=%s%n",
        r.getNodeId(), r.getHops(), r.getVectorDistance(),
        r.getScore(), r.getProperties().get("title"));
}

8.5 Semantic Neighbors and Semantic Walk

Beyond hybrid search, AstraeaDB offers two additional semantic operations that combine vector similarity with the graph's edge structure in different ways.

Semantic Neighbors

Given a node and a concept vector, semantic neighbors ranks that node's actual graph neighbors (connected by edges) by how similar their embeddings are to the concept. Unlike hybrid search, this does not perform a BFS—it only looks at direct neighbors.
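Conceptually, the operation is just "score each adjacent node by cosine similarity and sort." The toy sketch below (hypothetical data; the real ranking runs server-side) shows the idea:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_neighbors(neighbors, concept_vector, k):
    """neighbors: (node_id, embedding) pairs for directly connected nodes only."""
    scored = [(nid, cosine_similarity(emb, concept_vector)) for nid, emb in neighbors]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

neighbors = [("a", [1.0, 0.0]), ("b", [0.6, 0.8]), ("c", [0.0, 1.0])]
top = rank_neighbors(neighbors, [1.0, 0.0], k=2)
print([(n, round(s, 4)) for n, s in top])  # [('a', 1.0), ('b', 0.6)]
```

Because only direct neighbors are scored, the cost is proportional to the node's degree, not to the size of the graph.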

Python:

# Which of Alice's outgoing neighbors are most related to "risk"?
risk_vec = model.encode("risk management")

neighbors = client.semantic_neighbors(
    node_id=alice_id,
    concept_vector=risk_vec,
    direction="outgoing",   # "outgoing", "incoming", or "both"
    k=10
)

for n in neighbors:
    print(f"{n['node_id']}  similarity={n['similarity']:.4f}  "
          f"edge={n['edge_label']}  title={n['properties']['title']}")

# Example output:
# abc123  similarity=0.8921  edge=AUTHORED  title=Risk Assessment Framework
# def456  similarity=0.7134  edge=REVIEWED  title=Compliance Guidelines

R:

# Which of Alice's outgoing neighbors are most related to "risk"?
risk_vec <- model$encode("risk management")

neighbors <- client$semantic_neighbors(
  node_id        = alice_id,
  concept_vector = risk_vec,
  direction      = "outgoing",
  k              = 10
)

for (n in neighbors) {
  cat(sprintf("%s  similarity=%.4f  edge=%s  title=%s\n",
              n$node_id, n$similarity, n$edge_label,
              n$properties$title))
}

Go:

// Which of Alice's outgoing neighbors are most related to "risk"?
riskVec := model.Encode("risk management")

neighbors, _ := client.SemanticNeighbors(astraea.SemanticNeighborsParams{
    NodeID:        aliceID,
    ConceptVector: riskVec,
    Direction:     "outgoing",
    K:             10,
})

for _, n := range neighbors {
    fmt.Printf("%s  similarity=%.4f  edge=%s  title=%s\n",
        n.NodeID, n.Similarity, n.EdgeLabel, n.Properties["title"])
}

Java:

// Which of Alice's outgoing neighbors are most related to "risk"?
float[] riskVec = model.encode("risk management");

List<SemanticNeighborResult> neighbors = client.semanticNeighbors(
    aliceId,        // node ID
    riskVec,        // concept vector
    "outgoing",     // direction
    10              // k
);

for (SemanticNeighborResult n : neighbors) {
    System.out.printf("%s  similarity=%.4f  edge=%s  title=%s%n",
        n.getNodeId(), n.getSimilarity(),
        n.getEdgeLabel(), n.getProperties().get("title"));
}

Semantic Walk

A semantic walk is a greedy traversal: starting from a node, at each hop the algorithm chooses the neighbor whose embedding is most similar to the concept vector, then continues from that neighbor. The walk terminates after max_hops or when no neighbor improves the similarity score.

Think of it as "follow the gradient toward a concept through the graph's structure." The result is the path taken.
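To make the greedy rule concrete, here is a toy in-memory sketch of such a walk (hypothetical node names and embeddings; the real walk runs server-side):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_walk(start, adjacency, embeddings, concept, max_hops):
    """Greedy walk: at each hop move to the neighbor most similar to `concept`;
    stop when no neighbor beats the current node's similarity."""
    path = [start]
    current = start
    for _ in range(max_hops):
        best = max(adjacency.get(current, []),
                   key=lambda n: cosine_similarity(embeddings[n], concept),
                   default=None)
        if best is None or (cosine_similarity(embeddings[best], concept)
                            <= cosine_similarity(embeddings[current], concept)):
            break   # no neighbor improves the score
        path.append(best)
        current = best
    return path

embeddings = {
    "db":   [1.0, 0.0],   # databases
    "dm":   [0.8, 0.6],   # data mining
    "stat": [0.5, 0.87],  # statistical learning
    "ml":   [0.1, 1.0],   # machine learning
}
adjacency = {"db": ["dm"], "dm": ["db", "stat"], "stat": ["dm", "ml"], "ml": ["stat"]}
concept = [0.0, 1.0]      # direction of "machine learning"
print(semantic_walk("db", adjacency, embeddings, concept, max_hops=3))
# ['db', 'dm', 'stat', 'ml']
```

Because the walk is greedy, it can stall at a local optimum: if no neighbor improves the similarity, it stops early even when a better node lies two hops away.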

Python:

# Starting from a "Database" topic node, walk toward "Machine Learning"
ml_vec = model.encode("machine learning")

path = client.semantic_walk(
    start=database_node_id,
    concept_vector=ml_vec,
    max_hops=3
)

print("Semantic walk path:")
for i, step in enumerate(path):
    print(f"  Hop {i}: {step['node_id']}  "
          f"similarity={step['similarity']:.4f}  "
          f"title={step['properties']['title']}")

# Example output:
# Semantic walk path:
#   Hop 0: n001  similarity=0.3210  title=Databases
#   Hop 1: n047  similarity=0.5892  title=Data Mining
#   Hop 2: n088  similarity=0.7831  title=Statistical Learning
#   Hop 3: n112  similarity=0.9145  title=Machine Learning Fundamentals

R:

# Starting from a "Database" topic node, walk toward "Machine Learning"
ml_vec <- model$encode("machine learning")

path <- client$semantic_walk(
  start          = database_node_id,
  concept_vector = ml_vec,
  max_hops       = 3
)

cat("Semantic walk path:\n")
for (i in seq_along(path)) {
  step <- path[[i]]
  cat(sprintf("  Hop %d: %s  similarity=%.4f  title=%s\n",
              i - 1, step$node_id, step$similarity,
              step$properties$title))
}

Go:

// Starting from a "Database" topic node, walk toward "Machine Learning"
mlVec := model.Encode("machine learning")

path, _ := client.SemanticWalk(astraea.SemanticWalkParams{
    Start:         databaseNodeID,
    ConceptVector: mlVec,
    MaxHops:       3,
})

fmt.Println("Semantic walk path:")
for i, step := range path {
    fmt.Printf("  Hop %d: %s  similarity=%.4f  title=%s\n",
        i, step.NodeID, step.Similarity, step.Properties["title"])
}

Java:

// Starting from a "Database" topic node, walk toward "Machine Learning"
float[] mlVec = model.encode("machine learning");

List<WalkStep> path = client.semanticWalk(
    databaseNodeId,   // start
    mlVec,            // concept vector
    3                 // max hops
);

System.out.println("Semantic walk path:");
for (int i = 0; i < path.size(); i++) {
    WalkStep step = path.get(i);
    System.out.printf("  Hop %d: %s  similarity=%.4f  title=%s%n",
        i, step.getNodeId(), step.getSimilarity(),
        step.getProperties().get("title"));
}
When to use which?

Use vector search when you want globally similar nodes regardless of graph structure. Use hybrid search when you want results that are both structurally connected and semantically relevant. Use semantic neighbors when you want to rank a specific node's direct connections. Use semantic walk when you want to discover a path through the graph toward a concept.