## A Distributed Multi-GPU System for Fast Graph Processing

Zhihao Jia et al. (2018)

Presented by Edward Fan

## Another graph processing framework

- Frameworks we've seen so far:
  - Shared memory / disk-based:
    - Ligra, GraphChi, X-Stream
  - Distributed:
    - Pregel, PowerGraph, GraphX
  - Single-machine GPU:
    - Garaph, CuSha, MapGraph

## Another graph processing framework: Lux

- Lux: a distributed multi-GPU framework
- Three interesting components:
  - Execution model: push vs pull
  - Use of GPU-specific memory hierarchy
  - Dynamic load balancing based on runtime performance

Figure 1: Multi-GPU node architecture.

## Programmer interface

- init, compute, update
- Somewhat similar to PowerGraph's gather-apply-scatter

Figure 3: All Lux programs must implement the state-less init, compute and update functions.

## Push execution

- Maintains a frontier F of vertices to compute on
- Used by many distributed systems; minimizes work

#### **Algorithm 2** Pseudocode for generic push-based execution.

```
while F ≠ {} do
    for all v ∈ V do in parallel
        init(v, v^old)
    end for
    ▷ synchronize(V)
    for all u ∈ F do in parallel
        for all v ∈ N⁺(u) do in parallel
            compute(v, u^old, (u, v))
        end for
    end for
    ▷ synchronize(V)
    F = {}
    for all v ∈ V do in parallel
        if update(v, v^old) then
            F = F ∪ {v}
        end if
    end for
end while
```

## Pull execution

- Processes all vertices and edges at each iteration
- Faster on GPUs (except for very sparse updates)

#### **Algorithm 1** Pseudocode for generic pull-based execution.

```
while not halt do
    halt = true    ▷ halt is a global variable
    for all v ∈ V do in parallel
        init(v, v^old)
        for all u ∈ N⁻(v) do in parallel
            compute(v, u^old, (u, v))
        end for
        if update(v, v^old) then
            halt = false
        end if
    end for
end while
```

## GPU memory hierarchy

- Three major types of memory:
  - Zero-copy memory: pinned region of host DRAM that GPUs can access directly
  - GPU device memory: main GPU memory
  - GPU shared memory: small, fast on-chip memory shared by all threads in a thread block (think of an L1 cache, but shared across CPU cores)

Figure 9: Data flow for one iteration.

## GPU memory hierarchy

- Goal is to:
  - Minimize transfers from zero-copy memory to device memory
  - Use shared memory as much as possible
- Two optimizations:
  - Load and update vertices only once per iteration
  - Pull execution can put all updates in shared memory

## GPU memory hierarchy

- Coalesced memory access
  - When multiple GPU threads access *consecutive* addresses, the hardware combines them into a single memory transaction.
  - Next section: assigning consecutive vertices to each GPU means that accesses are consecutive
  - Additional optimization: copy a block to shared memory using coalesced loads

## Dynamic load balancing

- To start: simple edge partitioning (assign a roughly equal number of edges to each GPU; sequentially pick boundary vertices through CSR)
- During each iteration: observe the actual runtime to see how much work is in each partition
- Then, run the model to see if inter-node or local repartitioning is worthwhile
- Seems to converge quickly

Figure 8: The estimates of f over three iterations. The blue squares indicate the actual execution times, while the red circles indicate the split points returned for a partitioning among 4 GPUs at the end of each iteration.

## Performance

- Pretty good!
- Outperforms single-CPU and multi-CPU systems
- Competitive with single-GPU systems when run on just one GPU
- Arguably, the deck is stacked against CPU systems: similar "cost efficiency" numbers, but lots more hardware for Lux

## Performance

Figure 20: Performance model for different executions. Panels: (a) pull-based executions (PR); (b) push-based executions (CC). Axes: configurations vs. run time (seconds). Legend: real vs. model load time, compute time, and transfer time, plus real workload imbalance.

## Questions?

Thanks!