Background

Over the past decade, the Graphics Processing Unit (GPU) has become an important part of computation, and not just in gaming. GPU computing is the use of GPUs to accelerate general-purpose tasks that would otherwise be executed on the CPU. GPUs are used for applications with compute-intensive, time-consuming operations. From a user-level abstraction, it is simply an application that runs faster with the help of parallel processing. This advantage of having a massive number of cores can make applications more efficient and seamless.

The graph is one of the most important data structures for representing real-world scenarios such as routing networks, flight networks, and pathfinding for navigation systems. In these settings we are talking about millions of vertices that need to pass information back and forth through their links, so the running time grows with the number of vertices. Sequential implementations running on supercomputers perform well, but the hardware is very expensive, whereas GPUs are much cheaper by comparison. The trade-off is that GPUs are highly restrictive to program due to the nature of their hardware.

This project demonstrates the use of GPGPU to improve the performance of pathfinding algorithms, specifically Breadth-First Search and Single-Source Shortest Path. I will present GPU implementations of these algorithms along with the different approaches taken. Experimental results show a speedup of up to 6.52x over the CPU counterpart.

CUDA Programming Model

CUDA is a general-purpose programming model by Nvidia that leverages GPUs to solve many complex problems more efficiently than a CPU can. It comes with a software environment that lets developers build applications using C/C++. The main challenge is to exploit the ever-increasing number of GPU cores by parallelizing parts of the application; the idea is to distribute the workload among all the cores by assigning a task to each thread. The model has three main abstractions: grids, blocks, and, at the lowest granularity, threads. These partitions divide a problem into subproblems that can be solved independently. The number of cores and the compute capability vary with the GPU model. Nvidia GPUs have a SIMT architecture, similar to SIMD, where all threads execute the same instruction on different data in parallel.
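To make the hierarchy concrete, here is a minimal, hypothetical kernel that assigns one array element to each thread; the launch configuration is illustrative, not tuned for any particular GPU:

```cuda
#include <cuda_runtime.h>

// Each thread scales one element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));   // (data initialization omitted)
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // cover all n
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```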

Graph Representation

A graph G(V,E) can be represented as an adjacency matrix or as an adjacency list. Here, I have used an adjacency list due to its efficient use of space, O(V+E). The list A is stored contiguously along with two more arrays: E, which holds the offsets, and O, which holds the outdegree of each vertex, so that A[E[v]] to A[E[v]+O[v]-1] holds all the children of a vertex v. For weighted graphs there is an additional array W, stored contiguously as well, holding the weight of each edge in A.
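As a small illustration (the graph here is made up), a triangle graph with edges 0-1, 0-2, 1-2 would be stored as:

```cpp
// Undirected triangle graph, so each edge appears twice.
// The children of v are A[E[v]] .. A[E[v] + O[v] - 1].
int   E[] = {0, 2, 4};               // offsets into A
int   O[] = {2, 2, 2};               // outdegree of each vertex
int   A[] = {1, 2, 0, 2, 0, 1};      // adjacency list, stored contiguously
float W[] = {1, 1, 1, 1, 1, 1};      // optional edge weights, parallel to A
```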

Breadth-First Search

Definition

Given an undirected, unweighted graph G(V,E) and a source S, find the minimum distance from S to each vertex in G such that all vertices at the same depth are visited before moving to the next level. Hence it is guaranteed to find the shortest path to each vertex.

Serial Approach

The BFS algorithm is:
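(A minimal serial sketch over the A/E/O arrays; the names are illustrative, not the project's exact code.)

```cpp
#include <vector>
#include <queue>
#include <climits>

// Serial BFS. The distance array doubles as the visited marker:
// INT_MAX means "not yet reached".
std::vector<int> bfs(int n, const std::vector<int> &A,
                     const std::vector<int> &E, const std::vector<int> &O,
                     int source) {
    std::vector<int> dist(n, INT_MAX);
    std::queue<int> q;
    dist[source] = 0;
    q.push(source);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int i = E[u]; i < E[u] + O[u]; ++i) {
            int v = A[i];
            if (dist[v] == INT_MAX) {        // not visited yet
                dist[v] = dist[u] + 1;
                q.push(v);
            }
        }
    }
    return dist;
}
```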

The queue ensures level-wise progression. This serial implementation has a time complexity of O(V+E). As the density of the graph increases, it becomes challenging for the CPU execution to finish in reasonable time. Note that I have not used extra space to keep track of visited vertices, since the array of distances is sufficient to check whether a vertex has been visited. Also, I do not store any predecessor vertices, because in a parallel implementation the current vertex may be reached via a different path.

Parallel Approach (Naive)

This approach is somewhat similar to the serial implementation. When parallelizing BFS, the only way to traverse is level by level. So, as suggested in [B4], we perform level-synchronous BFS, where the current queue represents the current level. Instead of visiting only the vertex at the front of the queue, each vertex in the current queue is assigned to a thread, and the threads fill the next level's queue in parallel. Even if a vertex can be reached via different paths, the level-synchronous approach ensures that concurrent writes by different threads always store the same distance.
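A minimal sketch of such a level kernel (array names and the atomicAdd-based enqueue are illustrative; dist is initialized to INT_MAX on the host):

```cuda
#include <climits>

// One thread per vertex in the current frontier.
__global__ void bfs_level(const int *A, const int *E, const int *O,
                          int *dist, const int *curr, int currSize,
                          int *next, int *nextSize, int level) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= currSize) return;
    int u = curr[t];
    for (int i = E[u]; i < E[u] + O[u]; ++i) {
        int v = A[i];
        if (dist[v] == INT_MAX) {
            dist[v] = level + 1;               // benign race: same value
            int pos = atomicAdd(nextSize, 1);  // claim a slot in next queue
            next[pos] = v;
        }
    }
}
```

The host loop launches this kernel once per level, swapping curr and next and resetting nextSize until the next frontier is empty. Two threads may both see a vertex as unvisited and enqueue it twice, but since both write the same distance the result is still correct, at the cost of some redundant work.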

Possible Problems

Since many threads append to the next-level queue concurrently, two threads may try to write to the same position, so the queue index must be claimed atomically (see Atomic Operations below). With thousands of threads contending for a single global counter, these atomic accesses become serialized and hurt performance. The following approaches try to reduce this contention.

Blocked Queue Approach

CUDA provides on-chip shared memory for each block, which only the threads running in that block can access. Because it is on-chip, access times are low thanks to high bandwidth and low latency. Depending on the amount of shared memory allocated per block, execution times may vary, since shared memory is a limited resource that constrains how many blocks can run concurrently on the GPU. Regardless, this shared memory can be used to maintain a block-level queue so that threads write to the global queue if and only if the block-level queue is full. This helps avoid collisions at the global queue. Once all threads have explored their adjacent vertices, the block-level queue can be coalesced out to global memory, provided a sufficient number of threads are launched.
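A sketch of the block-level queue, with an illustrative capacity and the same array names as before:

```cuda
#include <climits>

#define BQ_CAPACITY 1024  // illustrative; bounded by available shared memory

__global__ void bfs_block_queue(const int *A, const int *E, const int *O,
                                int *dist, const int *curr, int currSize,
                                int *next, int *nextSize, int level) {
    __shared__ int bq[BQ_CAPACITY];   // block-level queue
    __shared__ int bqSize;            // its fill count
    __shared__ int bqBase;            // this block's range in `next`
    if (threadIdx.x == 0) bqSize = 0;
    __syncthreads();

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < currSize) {
        int u = curr[t];
        for (int i = E[u]; i < E[u] + O[u]; ++i) {
            int v = A[i];
            if (dist[v] == INT_MAX) {
                dist[v] = level + 1;
                int pos = atomicAdd(&bqSize, 1);       // shared-memory atomic
                if (pos < BQ_CAPACITY)
                    bq[pos] = v;                       // stay in block queue
                else
                    next[atomicAdd(nextSize, 1)] = v;  // full: go global
            }
        }
    }
    __syncthreads();

    // One thread reserves a contiguous range in the global queue; then all
    // threads copy the block queue out cooperatively (coalesced writes).
    if (threadIdx.x == 0)
        bqBase = atomicAdd(nextSize, min(bqSize, BQ_CAPACITY));
    __syncthreads();
    for (int i = threadIdx.x; i < min(bqSize, BQ_CAPACITY); i += blockDim.x)
        next[bqBase + i] = bq[i];
}
```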

Hierarchical Queue Approach

As shown in [B1], it is possible to specialize the queue further by adding another level for warps, thus avoiding collisions at the block-level queue as well. Each thread is mapped to its respective sub-queue according to its thread index. The authors of the paper copy these sub-queues into the block-level queue first, which seems unnecessary because both the block-level queue and the sub-queues reside in shared memory; I have skipped the block-level copy and coalesce the sub-queues directly to global memory, as sketched below. As in the previous approach, when a sub-queue becomes full, the thread writes to the global queue instead.
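A sketch with warp-level sub-queues (the sizes and the thread-to-queue mapping are illustrative):

```cuda
#include <climits>

#define NUM_SQ 8          // number of warp-level sub-queues (illustrative)
#define SQ_CAPACITY 128   // capacity of each sub-queue

__global__ void bfs_hier_queue(const int *A, const int *E, const int *O,
                               int *dist, const int *curr, int currSize,
                               int *next, int *nextSize, int level) {
    __shared__ int sq[NUM_SQ][SQ_CAPACITY];  // warp-level sub-queues
    __shared__ int sqSize[NUM_SQ];
    __shared__ int sqBase[NUM_SQ];           // per-sub-queue global offsets
    if (threadIdx.x < NUM_SQ) sqSize[threadIdx.x] = 0;
    __syncthreads();

    int q = (threadIdx.x / warpSize) % NUM_SQ;  // map thread to a sub-queue
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < currSize) {
        int u = curr[t];
        for (int i = E[u]; i < E[u] + O[u]; ++i) {
            int v = A[i];
            if (dist[v] == INT_MAX) {
                dist[v] = level + 1;
                int pos = atomicAdd(&sqSize[q], 1);
                if (pos < SQ_CAPACITY)
                    sq[q][pos] = v;                    // local enqueue
                else
                    next[atomicAdd(nextSize, 1)] = v;  // overflow to global
            }
        }
    }
    __syncthreads();

    // Skip the block-level queue: reserve global space per sub-queue, then
    // copy each sub-queue straight out to global memory.
    if (threadIdx.x < NUM_SQ)
        sqBase[threadIdx.x] =
            atomicAdd(nextSize, min(sqSize[threadIdx.x], SQ_CAPACITY));
    __syncthreads();
    for (int s = 0; s < NUM_SQ; ++s)
        for (int i = threadIdx.x; i < min(sqSize[s], SQ_CAPACITY);
             i += blockDim.x)
            next[sqBase[s] + i] = sq[s][i];
}
```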

Graph       | V          | E          | CPU        | GPU-N | GPU-B | GPU-H
New York    | 264,346    | 733,846    | 29.9261 ms | 1.21x | 0.98x | 0.92x
California  | 1,890,815  | 4,657,742  | 237.658 ms | 2.62x | 2.38x | 2.10x
Western USA | 6,262,104  | 15,248,146 | 857.753 ms | 4.38x | 4.09x | 3.97x
Central USA | 14,081,816 | 34,292,496 | 2625.37 ms | 6.52x | 6.11x | 6.15x
Whole USA   | 23,947,347 | 58,333,344 | 3400.58 ms | 5.80x | 5.35x | 5.30x

The table shows the speedup of the naive (GPU-N), blocked-queue (GPU-B), and hierarchical-queue (GPU-H) implementations against CPU execution, over real-world sparse road graphs from the 9th DIMACS Shortest Path Challenge datasets of USA roads.

Single-Source Shortest Path

Definition

Given a weighted, undirected graph G(V,E,W) and a source vertex S, find the shortest path from S to every vertex, such that the distance between any pair of vertices (u,v) is the minimum over all paths from u to v.

Serial Approach

Dijkstra's algorithm is:
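(A minimal serial sketch of the sort-based variant discussed below; names are illustrative.)

```cpp
#include <vector>
#include <algorithm>
#include <limits>

// Serial Dijkstra over the A/E/O/W arrays. Instead of a priority queue,
// the frontier lives in a vector that is re-sorted each iteration so the
// vertex with the smallest tentative distance sits at the back.
std::vector<float> dijkstra(int n, const std::vector<int> &A,
                            const std::vector<int> &E,
                            const std::vector<int> &O,
                            const std::vector<float> &W, int source) {
    const float INF = std::numeric_limits<float>::max();
    std::vector<float> dist(n, INF);
    std::vector<int> queue = {source};
    dist[source] = 0.0f;
    while (!queue.empty()) {
        // Sort descending by distance so the minimum pops off in O(1).
        std::sort(queue.begin(), queue.end(),
                  [&](int a, int b) { return dist[a] > dist[b]; });
        int u = queue.back(); queue.pop_back();
        for (int i = E[u]; i < E[u] + O[u]; ++i) {
            int v = A[i];
            if (dist[u] + W[i] < dist[v]) {     // relax the edge (u, v)
                if (dist[v] == INF) queue.push_back(v);
                dist[v] = dist[u] + W[i];
            }
        }
    }
    return dist;
}
```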

Now, this queue could be a priority queue, which lets us read the minimum in constant time. I avoid the priority queue because it is maintained as a binary heap under the hood, so insertion and deletion each require O(log n), which on large arrays could make performance really slow, at least in my case. It is better to perform insertions in constant time. So, instead, I sort the array at each iteration using the STL's sort method, which uses the Introsort algorithm, a hybrid of Quicksort, Heapsort, and Insertion sort; this gives the implementation a worst-case running time of roughly O(V² log V + E). A boost for the CPU would be a Fibonacci heap, where insertion and decrease-key take amortized O(1) and extract-min takes amortized O(log V), bringing Dijkstra's down to O(E + V log V). But I have not explored this, because it would not change the comparison with the parallel approach by much.

Parallel Approach

BFS and Dijkstra's follow very similar approaches, except that Dijkstra's needs to extract the minimum value from the queue. It was straightforward to convert the parallel BFS into a shortest-path computation: there is no need for sorting or extracting the minimum when all vertices in the current queue are processed in parallel. It is, however, important to propagate the updated distance when a vertex is revisited via a shorter path. The essence of both parallel approaches remains the same. I have implemented this algorithm using the hierarchical-queue approach.
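A sketch of the relaxation step, shown here in the naive global-queue form for brevity (the actual implementation uses the hierarchical queue; integer weights keep the sketch simple, since atomicMin has no float overload):

```cuda
#include <climits>

// Frontier-based SSSP step adapted from the parallel BFS kernel: instead
// of writing level+1, each thread relaxes edge weights with atomicMin, so
// a vertex revisited via a shorter path gets its distance updated and is
// re-enqueued.
__global__ void sssp_step(const int *A, const int *E, const int *O,
                          const int *W, int *dist, const int *curr,
                          int currSize, int *next, int *nextSize) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= currSize) return;
    int u = curr[t];
    for (int i = E[u]; i < E[u] + O[u]; ++i) {
        int v = A[i];
        int nd = dist[u] + W[i];             // candidate distance via u
        int old = atomicMin(&dist[v], nd);   // keep the smaller distance
        if (nd < old)                        // improved: revisit v next round
            next[atomicAdd(nextSize, 1)] = v;
    }
}
```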

Attempted Optimizations

Atomic Operations

To avoid inconsistencies at the global queue as well as the local queue(s), CUDA threads can use atomic operations to fill the queues without overwriting the same location. An atomic operation takes the address at which the thread makes its modification, and the calling thread gets exclusive access to that memory location. The downside is that access becomes serialized, which affects performance.
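Concretely, claiming a queue slot boils down to a single atomicAdd; the returned old value is the thread's exclusive index:

```cuda
// atomicAdd returns the counter's previous value, so no two threads are
// ever handed the same slot.
__device__ void enqueue(int *queue, int *size, int v) {
    int pos = atomicAdd(size, 1);  // reserve slot `pos` exclusively
    queue[pos] = v;
}
```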

Shared Memory

The graphs shown above have an average outdegree between 2 and 3. Therefore, the execution time is directly proportional to the number of vertices in such regular or near-regular graphs.

Kernel Launch and Dynamic Parallelism

Read-only caches

Prefix-Sum

To avoid atomic operations when calculating a thread's write position, it is possible to compute the global queue offset for each thread from a prefix-sum of the frontier's outdegrees. This way, threads do not wait on other threads while filling the next queue. A prefix-sum over the current queue would then be required at each iteration of the outer loop; the Thrust library provides predefined implementations of these scan algorithms. But would this not incur additional cost, even when computed in parallel on the GPU? It might be possible to precompute all the prefix-sums, but that would require knowing, before the main execution, which vertices form the current queue at every iteration, which seems unreasonable if the algorithm is to perform well on unknown graphs. I have not explored this.
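A sketch of the scan step using Thrust (the vector names are illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Exclusive prefix-sum of the frontier's outdegrees: offsets[t] is then the
// exact position where thread t starts writing in the next queue, with no
// atomics required.
void compute_offsets(const thrust::device_vector<int> &frontierOutdeg,
                     thrust::device_vector<int> &offsets) {
    thrust::exclusive_scan(frontierOutdeg.begin(), frontierOutdeg.end(),
                           offsets.begin());
}
```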

Other Algorithms

DFS

Environment

All computations were done on my local machine: an Intel i5-7300HQ CPU @ 2.50 GHz with 8 GB RAM and 4 cores, and an Nvidia GeForce GTX 1050 mobile GPU @ 1.49 GHz with 4 GB of memory @ 3.5 GHz and 640 CUDA cores.

Closure

I have shown that GPGPU can be used to achieve high performance on graph algorithms. I have given my opinions on pre-existing approaches and shown the performance gaps between them. Granted, it is difficult to optimize these algorithms due to the nature and uncertainty of the graph itself; be that as it may, substantial speedups were obtained, and with some fine-tuning they could be pushed further still.

References

[B1] Lijuan Luo, Martin Wong, Wen-mei Hwu, “An Effective GPU Implementation of Breadth-First Search”

[B2] Duane Merrill, Michael Garland, Andrew Grimshaw, “Scalable GPU Graph Traversal”

[B3] Rudolf Berrendorf and Matthias Makulla, “Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems”

[B4] Pawan Harish and P. J. Narayanan, “Accelerating large graph algorithms on the GPU using CUDA”

[D1] Peter Zhang, John Matty, Eric Holk, Marcin Zalewski, Jonathan Chu, Samantha Misurda, Scott McMillan, Andrew Lumsdaine, “Dynamic Parallelism for Simple and Efficient GPU Graph Algorithms”

[D2] Pedro J. Martín, Roberto Torres, and Antonio Gavilanes, “CUDA Solutions for the SSSP Problem”

[D3] Michael Kainer, Jesper Larsson Träff, “More Parallelism in Dijkstra’s Single-Source Shortest Path Algorithm”