
Background

The modern world is built on graphs, from flight networks and navigation systems to the vast web of social connections. As these real-world graphs swell to include millions of vertices, finding optimal paths becomes a computational bottleneck. While sequential implementations running on supercomputer-class hardware perform well, that hardware is prohibitively expensive. This project demonstrates how leveraging the power of General-Purpose Graphics Processing Units (GPGPU) can deliver massive performance improvements for fundamental graph pathfinding algorithms.


The Parallel Engine: The CUDA Programming Model

The key to unlocking performance on a GPU is Nvidia’s CUDA programming model. CUDA allows us to distribute a massive workload across the GPU’s thousands of cores using a software environment that supports C/C++.

The model organizes work using three core abstractions:

  1. Threads: the smallest unit of execution; each thread runs the same kernel function on its own piece of data.
  2. Blocks: groups of threads that run on the same streaming multiprocessor and can cooperate through fast on-chip shared memory and barrier synchronization.
  3. Grids: the collection of blocks launched for a single kernel invocation, sized to cover the entire workload.

The GPU hardware follows a Single-Instruction, Multiple-Thread (SIMT) architecture, similar to SIMD: groups of threads execute the same instruction at the same time, each on its own data. The main challenge in GPU programming is effectively mapping an irregular problem, like graph traversal, onto this highly restrictive, yet massively parallel, hardware.
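As a concrete, minimal illustration of this hierarchy (a toy kernel of my own, not part of the project code), the example below scales an array: each thread derives a global index from its block and thread coordinates and handles exactly one element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its global index from the block/thread hierarchy
// and processes one element of the input array.
__global__ void scale(const float* in, float* out, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against running past the array
        out[i] = in[i] * factor;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    // Launch a grid of blocks, each containing 256 threads.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(in, out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```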


Representing Massive Graphs for GPU Performance

For optimal GPU processing, the choice of data structure is crucial. I opted for an Adjacency List due to its efficient space complexity, $O(V+E)$, where $V$ is the number of vertices and $E$ is the number of edges.

To facilitate efficient, contiguous GPU memory access, this list, $A$, is stored along with two essential auxiliary arrays:

  1. Offsets: `offset[v]` holds the index in $A$ where vertex $v$'s neighbor list begins.
  2. Outdegrees: `outdegree[v]` holds the number of neighbors of $v$, so the neighbors of $v$ occupy $A[offset[v]]$ through $A[offset[v] + outdegree[v] - 1]$.
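As a small, self-contained illustration of this layout (a toy graph of my own, with array names chosen to match the arrays referenced later in this post):

```cuda
#include <cstdio>

int main() {
    // Undirected 4-vertex square graph: edges 0-1, 0-2, 1-3, 2-3, each stored twice.
    const int V = 4;
    int A[]         = {1, 2,  0, 3,  0, 3,  1, 2};   // all neighbor lists concatenated
    int offset[]    = {0, 2, 4, 6};                  // where each vertex's list starts in A
    int outdegree[] = {2, 2, 2, 2};                  // number of neighbors per vertex

    // Neighbors of v occupy A[offset[v]] .. A[offset[v] + outdegree[v] - 1].
    for (int v = 0; v < V; ++v) {
        printf("vertex %d:", v);
        for (int e = offset[v]; e < offset[v] + outdegree[v]; ++e)
            printf(" %d", A[e]);
        printf("\n");
    }
    return 0;
}
```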


Case Study 1: Breadth-First Search (BFS) Parallelization

Breadth-First Search (BFS), defined on an undirected, unweighted graph $G(V,E)$, finds the minimum distance from a source vertex $S$ to every other vertex.

The serial approach uses a queue to ensure level-wise progression, which guarantees the shortest path is found. This sequential implementation has a time complexity of $O(V+E)$.
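For reference, a minimal serial implementation over the adjacency-list layout above could look like the following sketch (a standard textbook formulation, not necessarily the exact CPU code used in this project):

```cuda
#include <queue>
#include <vector>

// Serial queue-based BFS over the adjacency-list layout described earlier.
// dist[v] == -1 marks an unvisited vertex. Runs in O(V + E).
std::vector<int> bfs_cpu(int V, const int* A, const int* offset,
                         const int* outdegree, int source) {
    std::vector<int> dist(V, -1);
    std::queue<int> q;
    dist[source] = 0;
    q.push(source);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int e = offset[v]; e < offset[v] + outdegree[v]; ++e) {
            int u = A[e];
            if (dist[u] == -1) {            // first time u is reached: record its level
                dist[u] = dist[v] + 1;
                q.push(u);
            }
        }
    }
    return dist;
}
```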

The Challenge of Level Synchronization

In a parallel BFS implementation, the only viable traversal method is level-synchronous. This requires distributing each vertex in the current queue level to a thread to explore its neighbors in parallel.

This method creates several performance bottlenecks, visible in the naive kernel sketch after this list:

  1. Race conditions: many threads discover new vertices at the same time and must append them to a shared next-level queue, which requires atomic operations.
  2. Memory contention and uncoalesced access: the neighbor lists of frontier vertices are scattered across global memory, so both reads and queue writes are irregular.
  3. Load imbalance and thread divergence: vertices have different outdegrees, so threads in the same warp finish their work at different times.
  4. Per-level synchronization: each level must fully complete before the next can begin, typically enforced by a separate kernel launch.
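A hedged sketch of such a level-synchronous expansion kernel, in the spirit of the naive GPU-N approach rather than the project's exact code, is shown below. Variable names are illustrative, and `dist` is assumed to be initialized to $-1$ for every vertex except the source.

```cuda
// One BFS level: every thread takes one vertex from the current frontier,
// visits its neighbors, and appends newly discovered vertices to the next
// frontier. The atomicAdd on next_size is the main point of contention.
__global__ void bfs_level(const int* A, const int* offset, const int* outdegree,
                          int* dist, int level,
                          const int* frontier, int frontier_size,
                          int* next_frontier, int* next_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v = frontier[i];
    for (int e = offset[v]; e < offset[v] + outdegree[v]; ++e) {
        int u = A[e];
        // Claim the neighbor only if it has not been visited yet.
        if (atomicCAS(&dist[u], -1, level + 1) == -1) {
            int pos = atomicAdd(next_size, 1);   // reserve a slot in the global queue
            next_frontier[pos] = u;
        }
    }
}
```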

Solutions: Blocked and Hierarchical Queues

To mitigate race conditions and optimize memory access, I explored using the fast on-chip shared memory available to each thread block.

  1. Blocked Queue Approach (GPU-B): Shared memory is used to maintain a block-level queue. Threads write to this queue first, and only when the block-level queue is full, or the block finishes exploration, is the data coalesced and written to the global queue (a sketch follows this list).
  2. Hierarchical Queue Approach (GPU-H): This adds an additional level of specialization for warps. Threads are mapped to sub-queues within shared memory, further reducing collisions before writing directly to the global queue when the sub-queue fills up.
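The following is a hedged sketch of the blocked-queue idea, not the project's exact kernel: the per-block capacity is illustrative, and overflow is handled by falling back to the global queue.

```cuda
#define BQ_CAPACITY 1024   // per-block queue capacity (illustrative)

__global__ void bfs_level_blocked(const int* A, const int* offset, const int* outdegree,
                                  int* dist, int level,
                                  const int* frontier, int frontier_size,
                                  int* next_frontier, int* next_size) {
    __shared__ int block_queue[BQ_CAPACITY];
    __shared__ int block_count;
    __shared__ int global_base;

    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < frontier_size) {
        int v = frontier[i];
        for (int e = offset[v]; e < offset[v] + outdegree[v]; ++e) {
            int u = A[e];
            if (atomicCAS(&dist[u], -1, level + 1) == -1) {
                // Cheap atomic on shared memory instead of global memory.
                int pos = atomicAdd(&block_count, 1);
                if (pos < BQ_CAPACITY) {
                    block_queue[pos] = u;
                } else {
                    // Block queue full: fall back to the global queue directly.
                    next_frontier[atomicAdd(next_size, 1)] = u;
                }
            }
        }
    }
    __syncthreads();

    // One global atomic per block reserves space, then the block queue is
    // copied out to the global next-level frontier in a coalesced fashion.
    if (threadIdx.x == 0) {
        int count = min(block_count, BQ_CAPACITY);
        global_base = atomicAdd(next_size, count);
        block_count = count;
    }
    __syncthreads();
    for (int j = threadIdx.x; j < block_count; j += blockDim.x)
        next_frontier[global_base + j] = block_queue[j];
}
```

The hierarchical variant applies the same idea one level lower, giving each warp its own sub-queue in shared memory before merging into the block-level or global queue.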

The Performance Gains

The table below shows the speedup achieved against the CPU execution using real-world sparse road graphs from the DIMACS shortest path datasets.

| Graph | V | E | CPU time | GPU-N speedup | GPU-B speedup | GPU-H speedup |
|---|---|---|---|---|---|---|
| New York | 264,346 | 733,846 | 29.9261 ms | $1.21\times$ | $0.98\times$ | $0.92\times$ |
| California | 1,890,815 | 4,657,742 | 237.658 ms | $2.62\times$ | $2.38\times$ | $2.10\times$ |
| Western USA | 6,262,104 | 15,248,146 | 857.753 ms | $4.38\times$ | $4.09\times$ | $3.97\times$ |
| Central USA | 14,081,816 | 34,292,496 | 2625.37 ms | $\mathbf{6.52\times}$ | $6.11\times$ | $6.15\times$ |
| Whole USA | 23,947,347 | 58,333,344 | 3400.58 ms | $5.80\times$ | $5.35\times$ | $5.30\times$ |

The results clearly demonstrate the power of GPGPU processing, achieving a peak speedup of $6.52\times$ on the Central USA road network. Counterintuitively, the Naive Parallel Approach (GPU-N) often outperformed the shared memory approaches (GPU-B and GPU-H), possibly because the overhead of managing the block and warp queues outweighed the reduction in global memory contention for these graphs. The additional write operations and thread divergence also contributed to the 6–8% performance deterioration of the shared memory approaches relative to the naive BFS.


Case Study 2: Single-Source Shortest Path (SSSP)

Single-Source Shortest Path (SSSP) extends BFS to weighted graphs $G(V,E,W)$, finding the path with the minimum total weight from a source vertex $S$ to all others.

The classic serial Dijkstra’s Algorithm typically uses a priority queue to efficiently extract the vertex with the minimum tentative distance. To avoid the parallelization complexity and $O(\log n)$ overhead of a binary heap, the CPU implementation here avoids the priority queue and instead sorts the queue array at each iteration using a hybrid approach (introsort), resulting in a running time of $O(VE + V\log V)$.

Parallel SSSP Adaptation

The Parallel SSSP approach closely mirrors the Parallel BFS. Since all vertices in the current level are processed in parallel, there’s no need for sorting or minimum extraction. The parallelization focuses on exploring neighbors and ensuring that distance updates are reflected when a shorter path to an already visited vertex is found. This was implemented using the Hierarchical-queue approach.
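A minimal sketch of one such relaxation pass is shown below. It is illustrative rather than the project's implementation: it writes to a single global queue instead of the hierarchical queues actually used, assumes integer edge weights stored in an array `W` parallel to $A$, and assumes distances are initialized to a large sentinel value so that `atomicMin` can perform a race-free update.

```cuda
// One SSSP iteration: relax all outgoing edges of the current frontier.
// W[e] is the weight of the edge stored at A[e]; distances are integers so
// that atomicMin can be used for race-free updates.
__global__ void sssp_relax(const int* A, const int* W,
                           const int* offset, const int* outdegree,
                           int* dist,
                           const int* frontier, int frontier_size,
                           int* next_frontier, int* next_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;

    int v = frontier[i];
    int dv = dist[v];
    for (int e = offset[v]; e < offset[v] + outdegree[v]; ++e) {
        int u = A[e];
        int candidate = dv + W[e];
        // atomicMin returns the old value; if we improved it, u must be
        // (re-)explored in the next iteration, even if it was visited before.
        if (atomicMin(&dist[u], candidate) > candidate) {
            next_frontier[atomicAdd(next_size, 1)] = u;
        }
    }
}
```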


Lessons Learned: Pitfalls of GPU Optimization

Optimizing graph algorithms on the GPU is an exercise in managing highly irregular and memory-bound workloads. Not every standard GPU optimization technique translated into a performance gain. Here are the crucial lessons learned from what didn’t work:

1. Atomic Operations: The Cost of Synchronization

To ensure consistency in updating the global and local queues, we can use CUDA’s Atomic Operations. These functions provide exclusive access to a memory location, preventing threads from overwriting each other.

While atomic operations guarantee correctness, they effectively serialize the contending threads into a waiting line. The performance hit from this serialization often outweighs the benefit of concurrency, negatively impacting the overall execution time.
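Reduced to its simplest form, the pattern is a single shared counter that every discovering thread must increment (an illustrative helper, not the project's exact code):

```cuda
// Every discovered vertex reserves its slot in the global queue with one
// atomicAdd. Correct, but when many threads hit the same counter at the same
// time the hardware serializes them, which is the cost discussed above.
__device__ void enqueue(int u, int* next_frontier, int* next_size) {
    int pos = atomicAdd(next_size, 1);   // exclusive reservation of one slot
    next_frontier[pos] = u;
}
```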

2. Shared Memory: Balancing Complexity and Contention

The use of fast, on-chip Shared Memory was explored to reduce global memory contention via Blocked and Hierarchical queue approaches.

These measurements, taken on regular or near-regular graphs with an average outdegree between 2 and 3, show execution time to be roughly proportional to the number of vertices.

3. Kernel Launch and Dynamic Parallelism

The level-synchronous nature of both BFS and SSSP mandates inter-level synchronization. The straightforward way to obtain it is to launch one kernel per level from the host, since a kernel launch acts as an implicit global barrier; the price is a launch overhead that grows with the diameter of the graph. CUDA Dynamic Parallelism, which lets a kernel launch follow-up kernels from the device, is an alternative way to drive the level loop, though it introduces launch overhead of its own.
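A sketch of such a host-side driver loop, assuming the device arrays are already allocated and initialized and reusing the `bfs_level` kernel sketched earlier, could look like this:

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Host-side BFS driver: one kernel launch per level acts as the implicit
// inter-level barrier. Assumes d_dist is -1 everywhere except the source and
// d_frontier initially contains only the source vertex.
void bfs_gpu(int* d_A, int* d_offset, int* d_outdegree, int* d_dist,
             int* d_frontier, int* d_next_frontier, int* d_next_size) {
    int frontier_size = 1;                        // frontier starts with the source only
    for (int level = 0; frontier_size > 0; ++level) {
        cudaMemset(d_next_size, 0, sizeof(int));
        int threads = 256;
        int blocks  = (frontier_size + threads - 1) / threads;
        bfs_level<<<blocks, threads>>>(d_A, d_offset, d_outdegree, d_dist, level,
                                       d_frontier, frontier_size,
                                       d_next_frontier, d_next_size);
        cudaDeviceSynchronize();                  // wait for the whole level to finish
        cudaMemcpy(&frontier_size, d_next_size, sizeof(int),
                   cudaMemcpyDeviceToHost);
        std::swap(d_frontier, d_next_frontier);   // next level becomes the current one
    }
}
```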

4. Read-Only Caches: Lack of Locality

I investigated using CUDA’s aggressively cached, read-only texture memory path for read-only graph arrays, such as the offsets. Texture memory excels when data exhibits spatial locality, meaning nearby threads read nearby addresses.

However, the irregular access patterns in pathfinding on large, sparse graphs do not offer the required spatial locality. Attempting to use this cache actually decreased performance by $11\%$ in some cases, and even doubled execution times in others, likely due to a high rate of cache misses.
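For reference, one common way to exercise this read-only path is to mark the pointers `const __restrict__` and load through the `__ldg()` intrinsic (compute capability 3.5+). The trimmed kernel below illustrates the mechanism only; it is not the texture-memory binding used in the project.

```cuda
// Gather frontier offsets through the read-only/texture cache path.
__global__ void gather_offsets(const int* __restrict__ offset,
                               const int* frontier, int frontier_size,
                               int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int v = frontier[i];
    out[i] = __ldg(&offset[v]);   // load routed through the read-only cache
}
```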

5. Prefix-Sum: Computational Trade-off

Another technique to avoid atomic operations is using Prefix-Sum (or parallel scan). This calculates the exact global queue offset for each thread based on the prefix-sum of their outdegrees.

While the Thrust library offers parallel implementations, this method requires computing a prefix-sum for the current queue at every outer loop iteration. The trade-off here is computational: does the cost of performing a parallel prefix-sum at each level outweigh the benefit of avoiding serialized atomic operations? This avenue was not fully explored but remains a high-level design question for graph parallelism.
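A hedged sketch of how such offsets could be computed with Thrust (illustrative function and variable names, assuming the outdegrees of the current frontier have already been gathered into a device vector):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Given the outdegree of every vertex in the current frontier, an exclusive
// prefix-sum yields each thread's private write offset into the next-level
// queue, so no atomics are needed to reserve slots.
void compute_write_offsets(const thrust::device_vector<int>& frontier_degrees,
                           thrust::device_vector<int>& write_offsets,
                           int& total_next_size) {
    write_offsets.resize(frontier_degrees.size());
    thrust::exclusive_scan(frontier_degrees.begin(), frontier_degrees.end(),
                           write_offsets.begin());
    // Total queue length = last offset + last degree (an upper bound, since
    // some neighbors may turn out to be already visited).
    total_next_size = frontier_degrees.empty()
        ? 0
        : write_offsets.back() + frontier_degrees.back();
}
```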


Alternative Algorithms: Why They Were Not Prioritized

While BFS and SSSP were the core focus, we considered the feasibility of applying GPGPU to other common graph algorithms.

Depth-First Search (DFS)

DFS is inherently asynchronous, continuing the search from a single vertex until all its children are “relaxed”. This contrasts sharply with the level-synchronous nature of parallel BFS and leaves little work to distribute across thousands of threads.

A* Search

A* Search is essentially an extension of SSSP that uses heuristics to guide the search, making it goal-directed.


Computation Environment

All experimental computations were performed on the following local machine specifications:

| Component | Specification |
|---|---|
| CPU | Intel i5-7300HQ @ 2.50 GHz (4 cores) |
| CPU RAM | 8 GB |
| GPU | Nvidia GeForce GTX 1050 (mobile) |
| GPU Clock | 1.49 GHz |
| GPU RAM | 4 GB @ 3.5 GHz |
| CUDA Cores | 640 |

Conclusion

This project successfully demonstrated that leveraging GPGPUs can achieve substantial performance gains for graph algorithms. We established the performance gaps between the different parallel approaches (Naive, Blocked, Hierarchical) and provided insights into pre-existing optimization challenges.

While optimizing graph algorithms remains inherently difficult due to the irregular and unpredictable nature of graph structure, we obtained substantial speedups. With further fine-tuning and algorithm-specific adjustments, these performance gains could likely be marginally increased.