Chapter 10: Graph Algorithms

AstraeaDB includes production-ready implementations of key graph algorithms that run server-side for maximum efficiency. This chapter covers the built-in algorithms, their use cases, and how to invoke them from your application code.

Graph algorithms transform the raw structure of your data into actionable insights. Rather than exporting your graph to an external analytics tool, AstraeaDB executes these algorithms directly on the storage engine, avoiding serialization overhead and leveraging the index-free adjacency model for optimal traversal performance. These algorithms are currently accessed via the Rust API, with client wrappers available for common operations through GQL extensions or direct API calls.

10.1 PageRank

What It Does

PageRank is the algorithm that originally powered Google's search engine. It ranks nodes by their importance based on the quantity and quality of incoming connections. A node is considered important if it is pointed to by other important nodes. This recursive definition produces a global ranking of every node in the graph.

How It Works

PageRank models a "random surfer" who starts at a random node and follows outgoing edges. At each step, there is a small probability (controlled by the damping factor) that the surfer jumps to a completely random node instead of following an edge. After many iterations, the fraction of time the surfer spends at each node converges to that node's PageRank score.

The algorithm iterates until scores stabilize (converge) or a maximum number of iterations is reached. Nodes with more high-quality incoming links accumulate higher scores.
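
The loop just described can be sketched as a plain power iteration in Python. This is an illustrative, from-scratch version, not the AstraeaDB implementation; the `pagerank` function and adjacency dict are hypothetical:

```python
def pagerank(adj, damping=0.85, max_iter=100, tol=1e-6):
    """Naive PageRank over an adjacency dict {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every node gets the "random jump" share up front
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, neighbors in adj.items():
            if neighbors:
                # Split this node's rank evenly across its out-edges
                share = damping * scores[v] / len(neighbors)
                for w in neighbors:
                    new[w] += share
            else:
                # Dangling node: redistribute its rank uniformly
                for w in nodes:
                    new[w] += damping * scores[v] / n
        # Stop once the total change falls below the threshold
        if sum(abs(new[v] - scores[v]) for v in nodes) < tol:
            scores = new
            break
        scores = new
    return scores

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

The three parameters map directly onto damping_factor, max_iterations, and convergence_threshold below; handling dangling nodes (no outgoing edges) keeps the scores summing to 1.0.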

Configuration Parameters

Parameter              Default  Description
damping_factor         0.85     Probability of following an edge vs. jumping randomly. Higher values emphasize link structure; lower values distribute rank more evenly.
max_iterations         100      Maximum number of iterations before stopping, even if convergence has not been reached.
convergence_threshold  1e-6     The algorithm stops when the total change in scores between iterations falls below this value.

Example: Ranking Influence in a Social Network

Consider a social network where edges represent "follows" relationships. PageRank will identify the most influential people -- those who are followed by other influential people, not just those with the most followers.

Conceptual Rust API

use std::collections::HashMap;

use astraeadb::algorithms::{PageRank, PageRankConfig};

// Configure the algorithm
let config = PageRankConfig {
    damping_factor: 0.85,
    max_iterations: 100,
    convergence_threshold: 1e-6,
};

// Run PageRank on the entire graph
let scores: HashMap<NodeId, f64> = graph.run_algorithm(PageRank::new(config))?;

// Print the top 5 most important nodes
let mut ranked: Vec<_> = scores.iter().collect();
ranked.sort_by(|a, b| b.1.partial_cmp(a.1).unwrap());
for (node_id, score) in ranked.iter().take(5) {
    println!("Node {}: score {:.4}", node_id, score);
}

Calling from Python

The Python client wraps the algorithm endpoint, making it accessible as a single method call:

from astraeadb import AstraeaClient

with AstraeaClient("127.0.0.1", 7687) as client:
    # Run PageRank with default parameters
    scores = client.pagerank(
        damping_factor=0.85,
        max_iterations=100,
        convergence_threshold=1e-6
    )

    # Sort by score descending
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for node_id, score in ranked[:5]:
        print(f"Node {node_id}: {score:.4f}")

Calling from R

source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Run PageRank
scores <- client$pagerank(
  damping_factor = 0.85,
  max_iterations = 100,
  convergence_threshold = 1e-6
)

# Sort and display top results
ranked <- scores[order(unlist(scores), decreasing = TRUE)]
head(ranked, 5)

client$close()

Calling from Go

package main

import (
    "context"
    "fmt"
    "sort"
    astraeadb "github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
    client := astraeadb.NewClient(astraeadb.WithAddress("127.0.0.1", 7687))
    ctx := context.Background()
    client.Connect(ctx)
    defer client.Close()

    scores, _ := client.PageRank(ctx, &astraeadb.PageRankConfig{
        DampingFactor:        0.85,
        MaxIterations:        100,
        ConvergenceThreshold: 1e-6,
    })

    // Sort and print top 5
    type entry struct { ID string; Score float64 }
    var entries []entry
    for id, s := range scores {
        entries = append(entries, entry{id, s})
    }
    sort.Slice(entries, func(i, j int) bool {
        return entries[i].Score > entries[j].Score
    })
    for _, e := range entries[:5] {
        fmt.Printf("Node %s: %.4f\n", e.ID, e.Score)
    }
}

Calling from Java

import com.astraeadb.unified.UnifiedClient;
import java.util.Map;

try (var client = UnifiedClient.builder()
        .host("127.0.0.1").port(7687).build()) {
    client.connect();

    Map<String, Double> scores = client.pagerank(
        0.85,   // damping factor
        100,    // max iterations
        1e-6   // convergence threshold
    );

    // Sort by score descending and print top 5
    scores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .limit(5)
        .forEach(e -> System.out.printf("Node %s: %.4f%n", e.getKey(), e.getValue()));
}

Understanding the Output

PageRank returns a map of node_id -> score, where scores are floating-point values between 0.0 and 1.0. The scores across all nodes sum to 1.0 (they represent a probability distribution). A higher score means greater importance. In practice, most nodes will have very small scores, while a few highly connected "hub" nodes will dominate.

Tip: Interpreting Scores

PageRank scores are relative, not absolute. A score of 0.02 does not mean "2% important" in isolation -- it means this node captures 2% of the total "importance budget" in the graph. Compare scores to each other to find the most influential nodes.

10.2 Connected and Strongly Connected Components

Connected Components

A connected component is a group of nodes where every node can reach every other node through some path, ignoring edge direction. If you have an undirected graph (or treat a directed graph as undirected), connected components tell you how many separate "islands" exist in your data.

This is computed using a simple breadth-first or depth-first traversal: start at any unvisited node, visit all reachable nodes (marking them as the same component), then repeat for any remaining unvisited nodes.
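
That traversal fits in a few lines of Python (illustrative only; `adj` is a plain adjacency dict standing in for the stored graph, with both directions listed for undirected edges):

```python
from collections import deque

def connected_components(adj):
    """Label each node with a component ID via BFS.
    `adj` maps node -> list of neighbors (undirected: list both directions)."""
    component = {}
    next_id = 0
    for start in adj:
        if start in component:
            continue
        # BFS from this unvisited node, labeling everything reachable
        queue = deque([start])
        component[start] = next_id
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in component:
                    component[w] = next_id
                    queue.append(w)
        next_id += 1
    return component

# Two islands: {a, b} and {c, d}
labels = connected_components({"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]})
```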

Strongly Connected Components (SCCs)

Strongly connected components are the directed-graph equivalent. A set of nodes forms an SCC if every node can reach every other node following edge direction. AstraeaDB uses Tarjan's algorithm, which finds all SCCs in a single depth-first traversal with O(V + E) complexity.

SCCs are strictly more restrictive than connected components. Every SCC is contained within a connected component, but a connected component may contain many SCCs -- a node that lies on no directed cycle forms a singleton SCC by itself.
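
As a sketch of how Tarjan's algorithm finds SCCs in one depth-first pass (a from-scratch Python version for illustration, not the server implementation):

```python
def tarjan_scc(adj):
    """Tarjan's algorithm: returns {node: scc_id} in O(V + E)."""
    index = {}        # discovery order of each node
    lowlink = {}      # smallest index reachable from the node's DFS subtree
    on_stack = set()
    stack = []
    scc_of = {}
    counter = {"idx": 0, "scc": 0}

    def strongconnect(v):
        index[v] = lowlink[v] = counter["idx"]
        counter["idx"] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop the whole component off the stack
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc_of[w] = counter["scc"]
                if w == v:
                    break
            counter["scc"] += 1

    for v in adj:
        if v not in index:
            strongconnect(v)
    return scc_of

# One weakly connected component, but two SCCs: the cycle {a, b} and {c}
sccs = tarjan_scc({"a": ["b"], "b": ["a", "c"], "c": []})
```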

Use Cases

Connected components answer the question "how many separate islands does my graph have?" -- useful for spotting orphaned records, isolated user groups, or network partitions. SCCs expose cycles in directed graphs, such as circular dependencies between modules or loops in a transaction network.

Example: Finding Disconnected Communities

from astraeadb import AstraeaClient
from collections import Counter

with AstraeaClient("127.0.0.1", 7687) as client:
    # Find connected components (undirected)
    components = client.connected_components()
    # Returns: {"nd-1": 0, "nd-2": 0, "nd-3": 1, "nd-4": 1, ...}

    # Count nodes per component
    counts = Counter(components.values())
    print(f"Found {len(counts)} connected components")
    for comp_id, size in counts.most_common():
        print(f"  Component {comp_id}: {size} nodes")

    # Find strongly connected components (directed)
    sccs = client.strongly_connected_components()
    scc_counts = Counter(sccs.values())
    print(f"Found {len(scc_counts)} strongly connected components")

Calling from R

source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Connected components
components <- client$connected_components()
comp_table <- table(unlist(components))
cat("Found", length(comp_table), "connected components\n")

# Strongly connected components
sccs <- client$strongly_connected_components()
scc_table <- table(unlist(sccs))
cat("Found", length(scc_table), "strongly connected components\n")

client$close()

Calling from Go

package main

import (
    "context"
    "fmt"
    astraeadb "github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
    client := astraeadb.NewClient(astraeadb.WithAddress("127.0.0.1", 7687))
    ctx := context.Background()
    client.Connect(ctx)
    defer client.Close()

    // Connected components
    components, _ := client.ConnectedComponents(ctx)
    counts := make(map[int]int)
    for _, compID := range components {
        counts[compID]++
    }
    fmt.Printf("Found %d connected components\n", len(counts))

    // Strongly connected components
    sccs, _ := client.StronglyConnectedComponents(ctx)
    sccCounts := make(map[int]int)
    for _, compID := range sccs {
        sccCounts[compID]++
    }
    fmt.Printf("Found %d strongly connected components\n", len(sccCounts))
}

Calling from Java

import com.astraeadb.unified.UnifiedClient;
import java.util.Map;
import java.util.stream.Collectors;

try (var client = UnifiedClient.builder()
        .host("127.0.0.1").port(7687).build()) {
    client.connect();

    // Connected components
    Map<String, Integer> components = client.connectedComponents();
    var counts = components.values().stream()
        .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    System.out.printf("Found %d connected components%n", counts.size());

    // Strongly connected components
    Map<String, Integer> sccs = client.stronglyConnectedComponents();
    var sccCounts = sccs.values().stream()
        .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    System.out.printf("Found %d strongly connected components%n", sccCounts.size());
}

Output Format

Both algorithms return a map of node_id -> component_id. Nodes with the same component ID belong to the same component. Component IDs are arbitrary integers -- only equality matters, not the specific values.

Note: Connected vs. Strongly Connected

If your graph is undirected (or you treat it as undirected), connected components and SCCs produce the same result. SCCs become meaningful only for directed graphs, where the direction of edges matters for reachability.

10.3 Centrality Measures

Centrality measures quantify how "central" or "important" a node is within the graph. Different centrality metrics capture different aspects of importance.

Degree Centrality

The simplest centrality measure: count the number of connections a node has. In a directed graph, you can distinguish between:

  - In-degree: the number of incoming edges.
  - Out-degree: the number of outgoing edges.
  - Total degree: incoming and outgoing edges combined.

Normalized degree centrality divides the raw degree by (N - 1), where N is the total number of nodes. This produces a value between 0.0 and 1.0, making it comparable across graphs of different sizes.
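
The computation is simple enough to sketch directly (illustrative Python over a plain adjacency dict, not the client API):

```python
def degree_centrality(adj, normalized=True):
    """Total degree per node for an undirected adjacency dict,
    optionally normalized by (N - 1)."""
    n = len(adj)
    degrees = {v: len(neighbors) for v, neighbors in adj.items()}
    if normalized and n > 1:
        return {v: d / (n - 1) for v, d in degrees.items()}
    return {v: float(d) for v, d in degrees.items()}

# A "star": the hub connects to every other node
adj = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
centrality = degree_centrality(adj)
# hub touches all 3 other nodes -> 3 / (4 - 1) = 1.0
```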

Calling from Python

from astraeadb import AstraeaClient

with AstraeaClient("127.0.0.1", 7687) as client:
    # Degree centrality (normalized by default)
    centrality = client.degree_centrality(direction="both", normalized=True)

    # Find top connectors
    top = sorted(centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    for node_id, score in top:
        print(f"{node_id}: centrality = {score:.4f}")

Calling from R

source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

centrality <- client$degree_centrality(direction = "both", normalized = TRUE)
ranked <- centrality[order(unlist(centrality), decreasing = TRUE)]
head(ranked, 5)

client$close()

Calling from Go

scores, _ := client.DegreeCentrality(ctx, "both", true)
for id, score := range scores {
    fmt.Printf("%s: centrality = %.4f\n", id, score)
}

Calling from Java

Map<String, Double> centrality = client.degreeCentrality("both", true);
centrality.entrySet().stream()
    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
    .limit(5)
    .forEach(e -> System.out.printf("%s: %.4f%n", e.getKey(), e.getValue()));

Betweenness Centrality

Betweenness centrality measures how often a node lies on the shortest paths between pairs of other nodes. It is computed using Brandes' algorithm, which efficiently calculates the shortest-path dependency for all node pairs.

A node with high betweenness centrality acts as a "bridge" or "broker" in the network. Removing it would significantly disrupt communication between other parts of the graph. This makes betweenness centrality essential for identifying critical infrastructure, key intermediaries, or bottlenecks.

Brandes' algorithm runs in O(V * E) time on unweighted graphs, which makes it considerably more expensive than degree centrality. For very large graphs (millions of nodes), consider sampling or approximate methods.
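
A from-scratch sketch of Brandes' algorithm for unweighted graphs (illustrative Python, not the server implementation; scores are unnormalized, and each unordered pair is counted from both endpoints in an undirected graph):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for unweighted graphs: {node: betweenness}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS phase: count shortest paths from s and record predecessors
        sigma = {v: 0 for v in adj}    # number of shortest s->v paths
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulation phase: walk back in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Path a - b - c (undirected): b sits on the only a<->c shortest path
scores = betweenness({"a": ["b"], "b": ["a", "c"], "c": ["b"]})
```

The two phases mirror the structure of Brandes' method: one BFS per source to count shortest paths, then a reverse sweep to accumulate each node's dependency.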

Example: Identifying Critical Routers

In a network topology graph, betweenness centrality reveals which routers carry the most traffic between different network segments. A router with high betweenness is a single point of failure.

Calling from Python

from astraeadb import AstraeaClient

with AstraeaClient("127.0.0.1", 7687) as client:
    # Betweenness centrality (Brandes' algorithm)
    betweenness = client.betweenness_centrality(normalized=True)

    # Identify the most critical "bridge" nodes
    bridges = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)
    print("Most critical nodes (highest betweenness):")
    for node_id, score in bridges[:5]:
        print(f"  {node_id}: {score:.4f}")

Calling from R

betweenness <- client$betweenness_centrality(normalized = TRUE)
ranked <- betweenness[order(unlist(betweenness), decreasing = TRUE)]
cat("Most critical bridge nodes:\n")
head(ranked, 5)

Calling from Go

betweenness, _ := client.BetweennessCentrality(ctx, true)
// Sort and print top bridges
type entry struct { ID string; Score float64 }
var entries []entry
for id, s := range betweenness {
    entries = append(entries, entry{id, s})
}
sort.Slice(entries, func(i, j int) bool {
    return entries[i].Score > entries[j].Score
})
fmt.Println("Most critical bridge nodes:")
for _, e := range entries[:5] {
    fmt.Printf("  %s: %.4f\n", e.ID, e.Score)
}

Calling from Java

Map<String, Double> betweenness = client.betweennessCentrality(true);
System.out.println("Most critical bridge nodes:");
betweenness.entrySet().stream()
    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
    .limit(5)
    .forEach(e -> System.out.printf("  %s: %.4f%n", e.getKey(), e.getValue()));

Warning: Computational Cost

Betweenness centrality has O(V * E) complexity. On a graph with 1 million nodes and 10 million edges, this can take significant time. For large graphs, consider running this algorithm during off-peak hours or using approximate sampling methods.

10.4 Community Detection (Louvain)

What It Does

The Louvain algorithm discovers natural groupings (communities) within a graph. Nodes within the same community are densely connected to each other but sparsely connected to nodes in other communities. This is useful for understanding the macro-structure of complex networks.

How It Works

Louvain optimizes a metric called modularity through two alternating phases:

  1. Local optimization: Each node is moved to the neighboring community that produces the largest increase in modularity. This repeats until no move improves modularity.
  2. Aggregation: The communities found in phase 1 are collapsed into single "super-nodes," and the process repeats on the coarsened graph.

This continues until no further improvement is possible.

Understanding Modularity

Modularity is a score between -0.5 and 1.0 that measures how well a graph divides into communities:

Modularity Score  Interpretation
< 0.0             Worse than random assignment; no meaningful community structure.
0.0               Equivalent to random assignment of nodes to communities.
0.3 - 0.7         Good community structure detected. Most real-world networks fall in this range.
> 0.7             Very strong community structure.
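
The quantity being optimized can be computed directly from its definition: Q = (1/2m) * sum over same-community pairs (i, j) of [A_ij - k_i * k_j / 2m], where m is the edge count and k_i is node i's degree. A naive Python sketch (illustrative, not the server implementation):

```python
def modularity(adj, community):
    """Modularity Q of a partition of an undirected graph.
    `adj` maps node -> neighbor list (both directions listed);
    `community` maps node -> community ID."""
    two_m = sum(len(neighbors) for neighbors in adj.values())  # = 2 * |E|
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue
            a_ij = adj[i].count(j)  # 1 if edge present, else 0 (simple graph)
            q += a_ij - degree[i] * degree[j] / two_m
    return q / two_m

# Two triangles joined by a single bridge edge c - d
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
part = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(adj, part)
```

For this graph, the partition that puts each triangle in its own community scores Q = 5/14, roughly 0.357, inside the "good structure" band above.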

Example: Co-Authorship Networks

In a graph of academic papers where edges represent co-authorship, Louvain naturally discovers research groups -- clusters of authors who frequently collaborate.

Calling from Python

from astraeadb import AstraeaClient
from collections import Counter

with AstraeaClient("127.0.0.1", 7687) as client:
    # Run Louvain community detection
    communities = client.louvain_communities()
    # Returns: {"nd-1": 0, "nd-2": 0, "nd-3": 1, ...}

    # Analyze the results
    counts = Counter(communities.values())
    print(f"Detected {len(counts)} communities")
    for comm_id, size in counts.most_common(10):
        print(f"  Community {comm_id}: {size} members")

    # Find which community a specific author belongs to
    alice_id = "nd-42"
    print(f"Alice is in community {communities[alice_id]}")

    # Find all members of Alice's community
    alice_comm = communities[alice_id]
    peers = [nid for nid, cid in communities.items() if cid == alice_comm]
    print(f"Alice's community has {len(peers)} members")

Calling from R

source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Run Louvain community detection
communities <- client$louvain_communities()

# Analyze results
comm_table <- table(unlist(communities))
cat("Detected", length(comm_table), "communities\n")
print(sort(comm_table, decreasing = TRUE))

client$close()

Calling from Go

communities, _ := client.LouvainCommunities(ctx)

// Count nodes per community
counts := make(map[int]int)
for _, commID := range communities {
    counts[commID]++
}
fmt.Printf("Detected %d communities\n", len(counts))
for commID, size := range counts {
    fmt.Printf("  Community %d: %d members\n", commID, size)
}

Calling from Java

Map<String, Integer> communities = client.louvainCommunities();

// Count nodes per community
var counts = communities.values().stream()
    .collect(Collectors.groupingBy(c -> c, Collectors.counting()));

System.out.printf("Detected %d communities%n", counts.size());
counts.entrySet().stream()
    .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
    .limit(10)
    .forEach(e -> System.out.printf("  Community %d: %d members%n",
        e.getKey(), e.getValue()));

Tip: Non-Determinism

The Louvain algorithm is non-deterministic. Different runs may discover slightly different community assignments, especially for nodes that sit on the boundary between two communities. The overall structure (number of communities, approximate sizes) will be consistent, but individual node assignments at the margins may vary. If reproducibility is required, set a random seed in the algorithm configuration.

10.5 Algorithm Comparison

The following table summarizes all the graph algorithms available in AstraeaDB, their purpose, computational complexity, and output format.

Algorithm               Purpose                          Complexity               Output
PageRank                Node importance ranking          O(V + E) per iteration   Scores (0.0 to 1.0)
Connected Components    Cluster detection (undirected)   O(V + E)                 Component IDs
SCCs (Tarjan)           Directed cluster detection       O(V + E)                 Component IDs
Degree Centrality       Connection counting              O(V + E)                 Scores (0.0 to 1.0)
Betweenness Centrality  Bridge / broker detection        O(V * E)                 Scores (0.0 to 1.0)
Louvain                 Community detection              O(V log V) average       Community IDs

Note: GPU Acceleration

For large-scale analytical workloads, AstraeaDB can offload graph algorithms to the GPU via the cuGraph backend. Algorithms like PageRank and Louvain are fundamentally matrix operations and benefit enormously from GPU parallelism. See Chapter 14 (Performance) for details on enabling GPU acceleration.