Chapter 10: Graph Algorithms
AstraeaDB includes production-ready implementations of key graph algorithms that run server-side for maximum efficiency. This chapter covers the built-in algorithms, their use cases, and how to invoke them from your application code.
Graph algorithms transform the raw structure of your data into actionable insights. Rather than exporting your graph to an external analytics tool, AstraeaDB executes these algorithms directly on the storage engine, avoiding serialization overhead and leveraging the index-free adjacency model for optimal traversal performance. These algorithms are currently accessed via the Rust API, with client wrappers available for common operations through GQL extensions or direct API calls.
10.1 PageRank
What It Does
PageRank is the algorithm that originally powered Google's search engine. It ranks nodes by their importance based on the quantity and quality of incoming connections. A node is considered important if it is pointed to by other important nodes. This recursive definition produces a global ranking of every node in the graph.
How It Works
PageRank models a "random surfer" who starts at a random node and follows outgoing edges. At each step, there is a small probability (controlled by the damping factor) that the surfer jumps to a completely random node instead of following an edge. After many iterations, the fraction of time the surfer spends at each node converges to that node's PageRank score.
The algorithm iterates until scores stabilize (converge) or a maximum number of iterations is reached. Nodes with more high-quality incoming links accumulate higher scores.
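To make the iteration concrete, here is a minimal power-iteration sketch in plain Python, independent of AstraeaDB. The function and the three-node toy graph are illustrative only; the parameters mirror the damping factor, iteration cap, and convergence threshold described in this section.

```python
# Illustrative pure-Python PageRank by power iteration (not the server code).
def pagerank(edges, damping=0.85, max_iters=100, tol=1e-6):
    nodes = sorted({n for e in edges for n in e})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(max_iters):
        # Every node starts each round with the "random jump" share
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in out_links.items():
            if targets:
                # Distribute this node's rank evenly along its out-edges
                share = damping * scores[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                for t in nodes:
                    new[t] += damping * scores[node] / n
        delta = sum(abs(new[v] - scores[v]) for v in nodes)
        scores = new
        if delta < tol:  # converged
            break
    return scores

# "B" and "C" both point at "A", so "A" accumulates the highest rank.
scores = pagerank([("B", "A"), ("C", "A"), ("A", "B")])
```

Note that the scores always sum to 1.0: each iteration redistributes a probability mass rather than creating or destroying it.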
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| damping_factor | 0.85 | Probability of following an edge vs. jumping randomly. Higher values emphasize link structure; lower values distribute rank more evenly. |
| max_iterations | 100 | Maximum number of iterations before stopping, even if convergence has not been reached. |
| convergence_threshold | 1e-6 | The algorithm stops when the total change in scores between iterations falls below this value. |
Example: Ranking Influence in a Social Network
Consider a social network where edges represent "follows" relationships. PageRank will identify the most influential people -- those who are followed by other influential people, not just those with the most followers.
Conceptual Rust API
```rust
use std::collections::HashMap;

use astraeadb::algorithms::{PageRank, PageRankConfig};

// Configure the algorithm
let config = PageRankConfig {
    damping_factor: 0.85,
    max_iterations: 100,
    convergence_threshold: 1e-6,
};

// Run PageRank on the entire graph
let scores: HashMap<NodeId, f64> = graph.run_algorithm(PageRank::new(config))?;

// Print the top 5 most important nodes
let mut ranked: Vec<_> = scores.iter().collect();
ranked.sort_by(|a, b| b.1.partial_cmp(a.1).unwrap());
for (node_id, score) in ranked.iter().take(5) {
    println!("Node {}: score {:.4}", node_id, score);
}
```
Calling from Python
The Python client wraps the algorithm endpoint, making it accessible as a single method call:
```python
from astraeadb import AstraeaClient

with AstraeaClient("127.0.0.1", 7687) as client:
    # Run PageRank with default parameters
    scores = client.pagerank(
        damping_factor=0.85,
        max_iterations=100,
        convergence_threshold=1e-6,
    )

    # Sort by score descending
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for node_id, score in ranked[:5]:
        print(f"Node {node_id}: {score:.4f}")
```
Calling from R

```r
source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Run PageRank
scores <- client$pagerank(
  damping_factor = 0.85,
  max_iterations = 100,
  convergence_threshold = 1e-6
)

# Sort and display top results
ranked <- scores[order(unlist(scores), decreasing = TRUE)]
head(ranked, 5)

client$close()
```
Calling from Go

```go
package main

import (
	"context"
	"fmt"
	"sort"

	"github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
	client := astraeadb.NewClient(astraeadb.WithAddress("127.0.0.1", 7687))
	ctx := context.Background()
	client.Connect(ctx)
	defer client.Close()

	scores, _ := client.PageRank(ctx, &astraeadb.PageRankConfig{
		DampingFactor:        0.85,
		MaxIterations:        100,
		ConvergenceThreshold: 1e-6,
	})

	// Sort and print top 5
	type entry struct {
		ID    string
		Score float64
	}
	var entries []entry
	for id, s := range scores {
		entries = append(entries, entry{id, s})
	}
	sort.Slice(entries, func(i, j int) bool { return entries[i].Score > entries[j].Score })
	for _, e := range entries[:5] {
		fmt.Printf("Node %s: %.4f\n", e.ID, e.Score)
	}
}
```
Calling from Java

```java
import com.astraeadb.unified.UnifiedClient;
import java.util.Map;

try (var client = UnifiedClient.builder()
        .host("127.0.0.1").port(7687).build()) {
    client.connect();

    Map<String, Double> scores = client.pagerank(
        0.85,  // damping factor
        100,   // max iterations
        1e-6   // convergence threshold
    );

    // Sort by score descending and print top 5
    scores.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
        .limit(5)
        .forEach(e -> System.out.printf("Node %s: %.4f%n",
            e.getKey(), e.getValue()));
}
```
Understanding the Output
PageRank returns a map of node_id -> score, where scores are floating-point values between 0.0 and 1.0. The scores across all nodes sum to 1.0 (they represent a probability distribution). A higher score means greater importance. In practice, most nodes will have very small scores, while a few highly connected "hub" nodes will dominate.
10.2 Connected and Strongly Connected Components
Connected Components
A connected component is a group of nodes where every node can reach every other node through some path, ignoring edge direction. If you have an undirected graph (or treat a directed graph as undirected), connected components tell you how many separate "islands" exist in your data.
This is computed using a simple breadth-first or depth-first traversal: start at any unvisited node, visit all reachable nodes (marking them as the same component), then repeat for any remaining unvisited nodes.
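The traversal just described can be sketched in a few lines of plain Python. This is illustrative only; the `connected_components` function and the adjacency-list toy graph are not part of any client API.

```python
from collections import deque

# BFS over an undirected adjacency list, labeling each node with the
# component it was first reached from.
def connected_components(adj):
    component = {}
    next_id = 0
    for start in adj:
        if start in component:
            continue  # already labeled by an earlier traversal
        component[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in component:
                    component[nbr] = next_id
                    queue.append(nbr)
        next_id += 1
    return component

# Two islands: {a, b, c} and {d, e}.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labels = connected_components(adj)
```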
Strongly Connected Components (SCCs)
Strongly connected components are the directed-graph equivalent. A set of nodes forms an SCC if every node can reach every other node following edge direction. AstraeaDB uses Tarjan's algorithm, which finds all SCCs in a single depth-first traversal with O(V + E) complexity.
SCCs are strictly more restrictive than connected components. Every SCC is contained within a single connected component, but a connected component may contain many SCCs; a node that lies on no directed cycle forms a singleton SCC by itself.
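For reference, here is a compact plain-Python sketch of Tarjan's algorithm (illustrative, not the server implementation). It finds every SCC in a single depth-first pass, as described above.

```python
# Tarjan's strongly connected components in one DFS pass.
def tarjan_scc(adj):
    index = {}       # discovery order of each node
    lowlink = {}     # smallest index reachable from the node's DFS subtree
    on_stack = set()
    stack = []
    sccs = []
    counter = [0]

    def strongconnect(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj.get(v, []):
            if w not in index:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:  # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in adj:
        if v not in index:
            strongconnect(v)
    return sccs

# a -> b -> c -> a forms a directed cycle (one SCC); d is a singleton SCC.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
sccs = tarjan_scc(adj)
```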
Use Cases
- Finding isolated subgraphs: Identify disconnected portions of your data that have no connections to the rest of the graph.
- Detecting clusters: Connected components can reveal natural groupings without any algorithm parameters to tune.
- Data quality: A graph that should be fully connected but has multiple components may indicate missing data or broken relationships.
- Cycle detection: SCCs with more than one node contain at least one directed cycle, which is useful for detecting circular dependencies.
Example: Finding Disconnected Communities
```python
from astraeadb import AstraeaClient
from collections import Counter

with AstraeaClient("127.0.0.1", 7687) as client:
    # Find connected components (undirected)
    components = client.connected_components()
    # Returns: {"nd-1": 0, "nd-2": 0, "nd-3": 1, "nd-4": 1, ...}

    # Count nodes per component
    counts = Counter(components.values())
    print(f"Found {len(counts)} connected components")
    for comp_id, size in counts.most_common():
        print(f"  Component {comp_id}: {size} nodes")

    # Find strongly connected components (directed)
    sccs = client.strongly_connected_components()
    scc_counts = Counter(sccs.values())
    print(f"Found {len(scc_counts)} strongly connected components")
```
In R:

```r
source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Connected components
components <- client$connected_components()
comp_table <- table(unlist(components))
cat("Found", length(comp_table), "connected components\n")

# Strongly connected components
sccs <- client$strongly_connected_components()
scc_table <- table(unlist(sccs))
cat("Found", length(scc_table), "strongly connected components\n")

client$close()
```
In Go:

```go
package main

import (
	"context"
	"fmt"

	"github.com/AstraeaDB/AstraeaDB-Official"
)

func main() {
	client := astraeadb.NewClient(astraeadb.WithAddress("127.0.0.1", 7687))
	ctx := context.Background()
	client.Connect(ctx)
	defer client.Close()

	// Connected components
	components, _ := client.ConnectedComponents(ctx)
	counts := make(map[int]int)
	for _, compID := range components {
		counts[compID]++
	}
	fmt.Printf("Found %d connected components\n", len(counts))

	// Strongly connected components
	sccs, _ := client.StronglyConnectedComponents(ctx)
	sccCounts := make(map[int]int)
	for _, compID := range sccs {
		sccCounts[compID]++
	}
	fmt.Printf("Found %d strongly connected components\n", len(sccCounts))
}
```
In Java:

```java
import com.astraeadb.unified.UnifiedClient;
import java.util.Map;
import java.util.stream.Collectors;

try (var client = UnifiedClient.builder()
        .host("127.0.0.1").port(7687).build()) {
    client.connect();

    // Connected components
    Map<String, Integer> components = client.connectedComponents();
    var counts = components.values().stream()
        .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    System.out.printf("Found %d connected components%n", counts.size());

    // Strongly connected components
    Map<String, Integer> sccs = client.stronglyConnectedComponents();
    var sccCounts = sccs.values().stream()
        .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
    System.out.printf("Found %d strongly connected components%n", sccCounts.size());
}
```
Output Format
Both algorithms return a map of node_id -> component_id. Nodes with the same component ID belong to the same component. Component IDs are arbitrary integers -- only equality matters, not the specific values.
10.3 Centrality Measures
Centrality measures quantify how "central" or "important" a node is within the graph. Different centrality metrics capture different aspects of importance.
Degree Centrality
The simplest centrality measure: count the number of connections a node has. In a directed graph, you can distinguish between:
- In-degree centrality: Number of incoming edges. High in-degree means many nodes point to this one (e.g., a popular page).
- Out-degree centrality: Number of outgoing edges. High out-degree means this node connects to many others (e.g., an active user).
- Total degree centrality: Sum of in-degree and out-degree.
Normalized degree centrality divides the raw degree by (N - 1), where N is the total number of nodes. This produces a value between 0.0 and 1.0, making it comparable across graphs of different sizes.
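The normalization is easy to verify with a few lines of plain Python (an illustrative sketch, not the client API):

```python
# Degree centrality from an edge list, normalized by (N - 1) as described.
def degree_centrality(edges, nodes, direction="both", normalized=True):
    deg = {n: 0 for n in nodes}
    for src, dst in edges:
        if direction in ("out", "both"):
            deg[src] += 1  # outgoing edge counts toward the source
        if direction in ("in", "both"):
            deg[dst] += 1  # incoming edge counts toward the target
    if normalized and len(nodes) > 1:
        return {n: d / (len(nodes) - 1) for n, d in deg.items()}
    return {n: float(d) for n, d in deg.items()}

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d")]  # "a" touches every other node
centrality = degree_centrality(edges, nodes)
# "a" has degree 3 out of a possible (4 - 1) = 3, so its score is 1.0
```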
```python
with AstraeaClient("127.0.0.1", 7687) as client:
    # Degree centrality (normalized by default)
    centrality = client.degree_centrality(direction="both", normalized=True)

    # Find top connectors
    top = sorted(centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    for node_id, score in top:
        print(f"{node_id}: centrality = {score:.4f}")
```
In R:

```r
client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

centrality <- client$degree_centrality(direction = "both", normalized = TRUE)

ranked <- centrality[order(unlist(centrality), decreasing = TRUE)]
head(ranked, 5)

client$close()
```
In Go:

```go
scores, _ := client.DegreeCentrality(ctx, "both", true)
for id, score := range scores {
	fmt.Printf("%s: centrality = %.4f\n", id, score)
}
```
In Java:

```java
Map<String, Double> centrality = client.degreeCentrality("both", true);
centrality.entrySet().stream()
    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
    .limit(5)
    .forEach(e -> System.out.printf("%s: %.4f%n", e.getKey(), e.getValue()));
```
Betweenness Centrality
Betweenness centrality measures how often a node lies on the shortest path between other pairs of nodes. It is computed using Brandes' algorithm, which efficiently calculates the shortest-path dependency for all node pairs.
A node with high betweenness centrality acts as a "bridge" or "broker" in the network. Removing it would significantly disrupt communication between other parts of the graph. This makes betweenness centrality essential for identifying critical infrastructure, key intermediaries, or bottlenecks.
Brandes' algorithm has a time complexity of O(V * E), which makes it more expensive than degree centrality. For very large graphs (millions of nodes), consider sampling or approximate methods.
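For readers who want to see the mechanics, here is a compact plain-Python sketch of Brandes' algorithm for unweighted graphs (illustrative only, not the server implementation): one BFS per source node, followed by back-propagation of pair dependencies along shortest paths.

```python
from collections import deque

# Brandes' betweenness centrality for an unweighted directed adjacency list.
def betweenness_centrality(adj):
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, recording shortest-path counts (sigma) and predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Path a - b - c with edges in both directions: "b" sits on every a<->c path.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
bc = betweenness_centrality(adj)
```

In this toy graph, `b` receives the dependency for the ordered pairs (a, c) and (c, a), while the endpoints score zero.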
Example: Identifying Critical Routers
In a network topology graph, betweenness centrality reveals which routers carry the most traffic between different network segments. A router with high betweenness is a single point of failure.
```python
with AstraeaClient("127.0.0.1", 7687) as client:
    # Betweenness centrality (Brandes' algorithm)
    betweenness = client.betweenness_centrality(normalized=True)

    # Identify the most critical "bridge" nodes
    bridges = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)
    print("Most critical nodes (highest betweenness):")
    for node_id, score in bridges[:5]:
        print(f"  {node_id}: {score:.4f}")
```
In R:

```r
betweenness <- client$betweenness_centrality(normalized = TRUE)

ranked <- betweenness[order(unlist(betweenness), decreasing = TRUE)]
cat("Most critical bridge nodes:\n")
head(ranked, 5)
```
In Go:

```go
betweenness, _ := client.BetweennessCentrality(ctx, true)

// Sort and print top bridges
type entry struct {
	ID    string
	Score float64
}
var entries []entry
for id, s := range betweenness {
	entries = append(entries, entry{id, s})
}
sort.Slice(entries, func(i, j int) bool { return entries[i].Score > entries[j].Score })
fmt.Println("Most critical bridge nodes:")
for _, e := range entries[:5] {
	fmt.Printf("  %s: %.4f\n", e.ID, e.Score)
}
```
In Java:

```java
Map<String, Double> betweenness = client.betweennessCentrality(true);

System.out.println("Most critical bridge nodes:");
betweenness.entrySet().stream()
    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
    .limit(5)
    .forEach(e -> System.out.printf("  %s: %.4f%n", e.getKey(), e.getValue()));
```
10.4 Community Detection (Louvain)
What It Does
The Louvain algorithm discovers natural groupings (communities) within a graph. Nodes within the same community are densely connected to each other but sparsely connected to nodes in other communities. This is useful for understanding the macro-structure of complex networks.
How It Works
Louvain optimizes a metric called modularity through two alternating phases:
1. Local optimization: Each node is moved to the neighboring community that produces the largest increase in modularity. This repeats until no move improves modularity.
2. Aggregation: The communities found in phase 1 are collapsed into single "super-nodes," and the process repeats on the coarsened graph.
This continues until no further improvement is possible.
Understanding Modularity
Modularity is a score between -0.5 and 1.0 that measures how well a graph divides into communities:
| Modularity Score | Interpretation |
|---|---|
| < 0.0 | Worse than random assignment; no meaningful community structure. |
| 0.0 | Equivalent to random assignment of nodes to communities. |
| 0.3 - 0.7 | Good community structure detected. Most real-world networks fall in this range. |
| > 0.7 | Very strong community structure. |
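Modularity itself is straightforward to compute for a fixed partition. The following plain-Python sketch (illustrative, not the client API) evaluates Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j) for an undirected edge list:

```python
# Modularity of a given partition of an undirected graph.
def modularity(edges, community):
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    # Edge term: fraction of edges that fall inside a community
    # (each undirected edge contributes A_ij twice to the double sum)
    for u, v in edges:
        if community[u] == community[v]:
            q += 1.0 / m
    # Subtract the expected in-community fraction under random rewiring
    for u in degree:
        for v in degree:
            if community[u] == community[v]:
                q -= degree[u] * degree[v] / (4.0 * m * m)
    return q

# Two triangles joined by a single edge: a clear two-community structure.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"),
         ("c", "d")]
good = modularity(edges, {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1})
flat = modularity(edges, {n: 0 for n in "abcdef"})
# Splitting along the bridge scores higher; one big community gives Q = 0
```

Putting everything in one community always yields Q = 0, which matches the table above: the trivial partition is no better than random.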
Example: Co-Authorship Networks
In a graph of academic papers where edges represent co-authorship, Louvain naturally discovers research groups -- clusters of authors who frequently collaborate.
```python
from astraeadb import AstraeaClient
from collections import Counter

with AstraeaClient("127.0.0.1", 7687) as client:
    # Run Louvain community detection
    communities = client.louvain_communities()
    # Returns: {"nd-1": 0, "nd-2": 0, "nd-3": 1, ...}

    # Analyze the results
    counts = Counter(communities.values())
    print(f"Detected {len(counts)} communities")
    for comm_id, size in counts.most_common(10):
        print(f"  Community {comm_id}: {size} members")

    # Find which community a specific author belongs to
    alice_id = "nd-42"
    print(f"Alice is in community {communities[alice_id]}")

    # Find all members of Alice's community
    alice_comm = communities[alice_id]
    peers = [nid for nid, cid in communities.items() if cid == alice_comm]
    print(f"Alice's community has {len(peers)} members")
```
In R:

```r
source("r_client.R")

client <- AstraeaClient$new("127.0.0.1", 7687)
client$connect()

# Run Louvain community detection
communities <- client$louvain_communities()

# Analyze results
comm_table <- table(unlist(communities))
cat("Detected", length(comm_table), "communities\n")
print(sort(comm_table, decreasing = TRUE))

client$close()
```
In Go:

```go
communities, _ := client.LouvainCommunities(ctx)

// Count nodes per community
counts := make(map[int]int)
for _, commID := range communities {
	counts[commID]++
}
fmt.Printf("Detected %d communities\n", len(counts))
for commID, size := range counts {
	fmt.Printf("  Community %d: %d members\n", commID, size)
}
```
In Java:

```java
Map<String, Integer> communities = client.louvainCommunities();

// Count nodes per community
var counts = communities.values().stream()
    .collect(Collectors.groupingBy(c -> c, Collectors.counting()));
System.out.printf("Detected %d communities%n", counts.size());
counts.entrySet().stream()
    .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
    .limit(10)
    .forEach(e -> System.out.printf("  Community %d: %d members%n",
        e.getKey(), e.getValue()));
```
10.5 Algorithm Comparison
The following table summarizes all the graph algorithms available in AstraeaDB, their purpose, computational complexity, and output format.
| Algorithm | Purpose | Complexity | Output |
|---|---|---|---|
| PageRank | Node importance ranking | O(V + E) per iteration | Scores (0.0 to 1.0) |
| Connected Components | Cluster detection (undirected) | O(V + E) | Component IDs |
| SCCs (Tarjan) | Directed cluster detection | O(V + E) | Component IDs |
| Degree Centrality | Connection counting | O(V + E) | Scores (0.0 to 1.0) |
| Betweenness Centrality | Bridge / broker detection | O(V * E) | Scores (0.0 to 1.0) |
| Louvain | Community detection | O(V * log V) avg | Community IDs |