Tasks completed in weeks 3 to 4

  1. Polished Pull Requests on Greedy Heuristics.
  2. Implemented Parallel Floyd-Warshall All Pairs Shortest Paths algorithm.
  3. Implemented Karger Global Minimum Cut algorithm.
  4. Implemented Parallel Bellman-Ford Single Source Shortest Paths algorithm.

Details

1. Polished Pull Requests on Greedy Heuristics

Several changes were made to the two Pull Requests, which consist of functions to solve 8 different problems. For example, we decided to use bit arrays instead of boolean arrays. A boolean array requires 8x more memory than a bit array but offers faster access to individual entries. However, once the number of entries grows to the point where the array no longer fits in the cache, the bit array becomes much faster because it incurs far fewer cache misses. We also decided not to push Steiner Tree and Travelling Salesman into the LightGraphs repository, as the two problems are only defined for complete graphs. The code will still be publicly available on my repository.
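
As a quick, illustrative check of the memory difference (not part of the Pull Requests; the variable names are only for demonstration), the two representations can be compared in the REPL:

n = 10^6
bool_arr = fill(false, n)      # Vector{Bool}: one byte per entry
bit_arr  = falses(n)           # BitVector: one bit per entry
Base.summarysize(bool_arr)     # roughly 1 MB
Base.summarysize(bit_arr)      # roughly 125 KB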

2. Parallel Floyd-Warshall All Pairs Shortest Paths

Floyd-Warshall is an algorithm that finds the shortest paths between all pairs of vertices in a graph that does not contain any negative weight cycles.

The bottleneck portion of the algorithm can be described roughly as:

 
for pivot in vertices(g)
    for u in vertices(g)
        for v in vertices(g)
            if dists[u, v] > dists[u, pivot] + dists[pivot, v]
                dists[u, v] = dists[u, pivot] + dists[pivot, v]
                parents[u, v] = parents[pivot, v]
            end
        end
    end
end

I wrote the parallel version roughly as:

 
for pivot in vertices(g)
    @threads for u in vertices(g)
        if u != pivot
            for v in vertices(g)
                v == pivot && continue
                
                if dists[u, v] > dists[u, pivot] + dists[pivot, v]
                    dists[u, v] = dists[u, pivot] + dists[pivot, v]
                    parents[u, v] = parents[pivot, v]
                end
            end
        end
    end
end

The modification is needed to prevent a race condition which would have occurred if u or v were equal to the pivot.

For example:

pivot = 1.

Thread 1: u = 1 and v = 2

Thread 2: u = 100 and v = 2

If Thread 1 writes to dists[u, v] (= dists[1, 2]) while Thread 2 reads dists[pivot, v] (= dists[1, 2]), then we have a concurrent read and write.

The modification does not change the correctness of the algorithm. Since there is no negative weight cycle, dists[pivot, pivot] >= 0. When v = pivot, the condition dists[u, pivot] > dists[u, pivot] + dists[pivot, pivot] is false, so the update would never have been executed anyway. The case u = pivot is symmetric.

Benchmarks


nthreads() = 4

g = loadsnap(:facebook_combined) (|V| = 4039, |E| = 88234)

distmx = sparse(g)

for e in edges(g)
    distmx[e.src, e.dst] = distmx[e.dst, e.src] = rand(0:nv(g))
end

floyd_warshall_shortest_paths(g, distmx): 48.904 s (5 allocations: 124.46 MiB)

parallel_floyd_warshall_shortest_paths(g, distmx): 38.628 s (4044 allocations: 124.65 MiB)


3. Karger Global Minimum Cut algorithm

This is a simple randomized algorithm that can be described as follows: while the graph has more than two super-nodes, choose an edge at random and contract it. The two remaining super-nodes define the two sides of the cut. I implemented this algorithm using the disjoint set data structure, roughly as sketched below.
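
A minimal sketch of that idea, using IntDisjointSets from DataStructures.jl; the function name and details here are illustrative and not necessarily identical to the code in the Pull Request:

using LightGraphs, DataStructures
using Random: shuffle

function karger_min_cut_sketch(g::AbstractGraph)
    ds = IntDisjointSets(nv(g))   # one set (super-node) per vertex
    remaining = nv(g)
    es = collect(edges(g))

    # Contract random edges until only two super-nodes remain.
    for e in shuffle(es)
        remaining <= 2 && break
        if !in_same_set(ds, src(e), dst(e))
            union!(ds, src(e), dst(e))
            remaining -= 1
        end
    end

    # The cut consists of every edge whose endpoints lie in different super-nodes.
    return [e for e in es if !in_same_set(ds, src(e), dst(e))]
end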

4. Parallel Bellman-Ford Single Source Shortest Paths

The Bellman-Ford algorithm finds the shortest paths from a single source vertex in a graph that does not contain any negative weight cycles.

The code can be described as:

 
function seq_bellman_ford(g::AbstractGraph, src, distmx::AbstractMatrix)
    dists = fill(typemax(eltype(g)), nv(g))
    dists[src] = 0
    active = Set([src])

    for _ in 1:nv(g)
        new_active = Set([])
        for u in active
            for v in outneighbors(g, u)
                relax_dist = distmx[u, v] + dists[u]
                if dists[v] > relax_dist
                    dists[v] = relax_dist
                    push!(new_active, v)
                end
            end
        end
        active = new_active
    end
    return dists
end

The parallel version can be described as:

 
using Base.Threads

function parallel_bellman_ford(g::AbstractGraph, src, distmx::AbstractMatrix)
    dists = fill(typemax(eltype(g)), nv(g))
    dists[src] = 0
    active = Set([src])

    dists_t = [deepcopy(dists) for _ in 1:nthreads()]
    active_t = [Set([]) for _ in 1:nthreads()]

    for _ in 1:nv(g)

        # Update dists_t, active_t using dists, active in parallel
        @threads for u in collect(active)
            local_dists = dists_t[threadid()]
            local_active = active_t[threadid()]

            for v in outneighbors(g, u)
                relax_dist = dists[u] + distmx[u, v]
                if dists[v] > relax_dist && local_dists[v] > relax_dist
                    local_dists[v] = relax_dist
                    push!(local_active, v)
                end
            end
        end

        # Update dists, active using dists_t, active_t in sequence
        for thread_id in 1:nthreads()
            local_dists = dists_t[thread_id]
            local_active = active_t[thread_id]

            for v in local_active
                if local_dists[v] < dists[v]
                    dists[v] = local_dists[v]
                end
            end
        end
        active = reduce(union, active_t)
        foreach(empty!, active_t)
    end
    return dists
end

In this implementation, every thread is provided with its own local memory (variables suffixed with _t).

In the multi-threaded loop, the thread-local variables (dists_t and active_t, as well as parents_t in the full implementation) are updated concurrently.

In the next loop, the shared variables (dists, active and parents) are updated sequentially.

Benchmarks


nthreads() = 4


distmx = rand(-nv(g):nv(g), nv(g), nv(g))


src = 1


g = loadsnap(:ego_twitter_u) (|V| = 81306, |E| = 1342310)

seq_bellman_ford_shortest_paths(g, src, distmx): 30.670 s (92897 allocations: 764.58 MiB)

parallel_bellman_ford_shortest_paths(g, src, distmx): 22.544 s (298931 allocations: 3.04 GiB)


g = CompleteGraph(1000)

seq_bellman_ford_shortest_paths(g, src, distmx): 34.048 s (18012 allocations: 48.46 MiB)

parallel_bellman_ford_shortest_paths(g, src, distmx): 10.983 s (69028 allocations: 196.66 MiB)


This code scales well with the number of cores when the graph is dense. However, the memory usage is proportional to the number of threads, which is expected since every thread is given its own memory.

Lessons Learned

  • Smaller Pull Requests: This makes your code and pull requests easier to review and maintain. It also helps you detect your mistakes earlier and avoid repeating them in future code.

  • Gather performance tips before coding: You may be unaware of existing performance pitfalls and end up believing there is a flaw in the design rather than in the implementation. I should have figured this out when I noticed that Base.Threads is marked as experimental. For example, a multi-threaded @threads for loop must be placed at the outer-most scope of a function, so code like:

for pivot in vertices(g)
    @threads for (u, v) in collect(Iterators.product(vertices(g), vertices(g)))
        # do something with dists and parents
    end
end

Must be written as:

function _loopbody!(g, pivot, dists, parents)
    @threads for (u, v) in collect(Iterators.product(vertices(g), vertices(g)))
        # do something with dists and parents
    end
end

for pivot in vertices(g)
    _loopbody!(g, pivot, dists, parents)
end

  • Save your build before updating: My build was broken for 2 days after updating to the Julia 0.7 alpha, and I could not revert to the previous version either. I eventually diagnosed the problem to be in one of the dependencies of LightGraphs.