Tasks completed in weeks 3 to 4
- Polished Pull Requests on Greedy Heuristics.
- Implemented the Parallel Floyd-Warshall All-Pairs Shortest Paths algorithm.
- Implemented Karger's Global Minimum Cut algorithm.
- Implemented the Parallel Bellman-Ford Single-Source Shortest Paths algorithm.
Details
1. Polished Pull Requests on Greedy Heuristics
Several changes were made to the two Pull Requests, which consist of functions to solve 8 different problems. For example, we decided to use Bit Arrays instead of Boolean Arrays. A Boolean Array requires 8x more memory than a Bit Array but has faster access time. However, as the number of entries grows and it is no longer possible to fit the entire array in the cache, Bit Arrays become much faster because they incur fewer cache misses. We also decided not to push Steiner Tree and Travelling Salesman into the LightGraphs repository, as the two problems are only defined for complete graphs. The code will still be publicly available on my repository.
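A quick way to see the memory difference, assuming one million entries (sizes are approximate):

```julia
bool_arr = Vector{Bool}(undef, 10^6)   # one byte per entry
bit_arr  = falses(10^6)                # BitVector: one bit per entry

Base.summarysize(bool_arr)  # ≈ 1 MB
Base.summarysize(bit_arr)   # ≈ 125 KB
```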
2. Parallel Floyd-Warshall All-Pairs Shortest Paths
Floyd-Warshall is an algorithm used to find the shortest paths between all pairs of vertices in a graph that does not have any negative-weight cycles.
The bottleneck portion of the algorithm can be described roughly as:
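A minimal sketch, assuming dists and parents are the |V| x |V| distance and parent matrices:

```julia
for pivot in vertices(g)
    for v in vertices(g)
        for u in vertices(g)
            # Relax the path u -> pivot -> v.
            d = dists[u, pivot] + dists[pivot, v]
            if d < dists[u, v]
                dists[u, v] = d
                parents[u, v] = parents[pivot, v]
            end
        end
    end
end
```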
I wrote the parallel version roughly as:
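A minimal sketch of the parallel version, threading the u loop and skipping the pivot row and column:

```julia
using Base.Threads

for pivot in vertices(g)
    @threads for u in vertices(g)
        u == pivot && continue          # the modification: never write to the pivot row
        for v in vertices(g)
            v == pivot && continue      # ... or to the pivot column
            d = dists[u, pivot] + dists[pivot, v]
            if d < dists[u, v]
                dists[u, v] = d
                parents[u, v] = parents[pivot, v]
            end
        end
    end
end
```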
The modification is needed to prevent a race condition that would have occurred if u or v was the pivot.
For example, suppose pivot = 1, Thread 1 has u = 1 and v = 2, and Thread 2 has u = 100 and v = 2. If Thread 1 writes to dists[u, v] (which is dists[1, 2], i.e. dists[pivot, 2]) while Thread 2 reads dists[pivot, v], we have a concurrent read-write.
The modification does not change the correctness of the algorithm. Since there is no negative-weight cycle, dists[pivot, pivot] >= 0. When v = pivot, the update condition dists[u, pivot] < dists[u, pivot] + dists[pivot, pivot] is false, and the symmetric argument holds when u = pivot. Hence the skipped iterations would never have performed an update anyway.
Benchmarks
nthreads() = 4
g = loadsnap(:facebook_combined) (|V| = 4039, |E| = 88234)

```julia
n = nv(g)   # assumed to be |V|; n was left undefined in the original snippet
distmx = sparse(g)
for e in edges(g)
    distmx[e.src, e.dst] = distmx[e.dst, e.src] = rand(0:n)
end
```
floyd_warshall_shortest_paths(g, distmx): 48.904 s (5 allocations: 124.46 MiB)
parallel_floyd_warshall_shortest_paths(g, distmx): 38.628 s (4044 allocations: 124.65 MiB)
3. Karger's Global Minimum Cut Algorithm
This is a simple randomized algorithm that can be described as follows: while the graph has more than two super-nodes, choose an edge at random and contract it. The two remaining super-nodes define the two sides of the cut. I implemented this algorithm using the disjoint-set data structure.
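Roughly, the idea looks like this (a simplified sketch; the function name, the edge-list input and the use of DataStructures.jl are illustrative, and the graph is assumed connected):

```julia
using Random
using DataStructures

function karger_min_cut(n::Int, edge_list::Vector{Tuple{Int,Int}})
    sets = IntDisjointSets(n)
    components = n
    # Contract random edges until only two super-nodes remain.
    for (u, v) in shuffle(edge_list)
        components == 2 && break
        if !in_same_set(sets, u, v)
            union!(sets, u, v)
            components -= 1
        end
    end
    # The cut is the set of edges crossing the two super-nodes.
    return count(e -> !in_same_set(sets, e[1], e[2]), edge_list)
end
```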
4. Parallel Bellman-Ford Single-Source Shortest Paths
The Bellman-Ford algorithm is used to find the shortest paths from a single source in a graph that does not have any negative-weight cycles.
The code can be described as:
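Roughly (a simplified sketch; active holds the vertices whose distance improved in the previous round):

```julia
function seq_bellman_ford(g, src, distmx)
    dists = fill(typemax(Int), nv(g))
    parents = zeros(Int, nv(g))
    dists[src] = 0
    active = [src]
    for _ in 1:nv(g)
        isempty(active) && break
        new_active = Int[]
        for u in active, v in outneighbors(g, u)
            # Relax the edge u -> v.
            d = dists[u] + distmx[u, v]
            if d < dists[v]
                dists[v] = d
                parents[v] = u
                push!(new_active, v)
            end
        end
        active = unique(new_active)
    end
    return dists, parents
end
```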
The parallel version can be described as:
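A simplified sketch; each thread relaxes the edges out of a share of the active vertices, writing only to its own _t arrays (details may differ from the actual implementation):

```julia
using Base.Threads

function parallel_bellman_ford(g, src, distmx)
    nvg, nt = nv(g), nthreads()
    dists = fill(typemax(Int), nvg); dists[src] = 0
    parents = zeros(Int, nvg)
    active = falses(nvg); active[src] = true
    # Thread-local memory, suffixed with _t.
    dists_t   = [copy(dists) for _ in 1:nt]
    parents_t = [zeros(Int, nvg) for _ in 1:nt]
    active_t  = [falses(nvg) for _ in 1:nt]
    for _ in 1:nvg
        any(active) || break
        # Multi-threaded loop: only the thread-local variables are written.
        @threads for u in 1:nvg
            active[u] || continue
            t = threadid()
            for v in outneighbors(g, u)
                d = dists[u] + distmx[u, v]
                if d < dists_t[t][v]
                    dists_t[t][v] = d
                    parents_t[t][v] = u
                    active_t[t][v] = true
                end
            end
        end
        # Sequential loop: merge the thread-local results into the shared variables.
        fill!(active, false)
        for t in 1:nt, v in 1:nvg
            if active_t[t][v] && dists_t[t][v] < dists[v]
                dists[v] = dists_t[t][v]
                parents[v] = parents_t[t][v]
                active[v] = true
            end
            active_t[t][v] = false
        end
        foreach(d -> copyto!(d, dists), dists_t)   # resynchronize the local copies
    end
    return dists, parents
end
```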
In this implementation, every thread is provided its own local memory (suffixed with _t). In the multi-threaded loop, the thread-local variables (dists_t, parents_t and active_t) are updated concurrently. In the next loop, the shared variables (dists, parents and active) are updated sequentially.
Benchmarks
nthreads() = 4
src = 1

g = loadsnap(:ego_twitter_u) (|V| = 81306, |E| = 1342310)
distmx = rand(-nv(g):nv(g), nv(g), nv(g))
seq_bellman_ford_shortest_paths(g, src, distmx): 30.670 s (92897 allocations: 764.58 MiB)
parallel_bellman_ford_shortest_paths(g, src, distmx): 22.544 s (298931 allocations: 3.04 GiB)
g = CompleteGraph(1000)
seq_bellman_ford_shortest_paths(g, src, distmx): 34.048 s (18012 allocations: 48.46 MiB)
parallel_bellman_ford_shortest_paths(g, src, distmx): 10.983 s (69028 allocations: 196.66 MiB)
This code scales well with the number of cores when the graph is dense. However, memory usage is proportional to the number of cores, which is expected since every thread is given its own local memory.
Lessons Learned
- Smaller Pull Requests: This makes your code and pull requests easier to maintain. You can also detect your mistakes earlier and avoid repeating them in future code.
- Gather performance tips before coding: You may be unaware of existing performance bugs and end up believing there is a flaw in the design rather than in the implementation. I should have figured this out when I noticed that Base.Threads is marked Experimental. E.g., when using a multi-threaded loop (@threads for), the code must be in the outer-most scope of the function.
Must be written as:
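A minimal sketch (update_all! is a hypothetical function, used only to illustrate the pattern):

```julia
using Base.Threads

# The @threads loop sits at the outer-most scope of the function body,
# not inside a nested block or closure.
function update_all!(xs::Vector{Int})
    @threads for i in eachindex(xs)
        xs[i] += 1
    end
    return xs
end
```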
- Save your build before updating: My build was broken for two days after updating to the Julia 0.7 alpha, and I could not revert to the previous version either. I eventually traced the problem to one of the dependencies of LightGraphs.