challenges in terms of parallelization using GPUs. It is a highly dynamic and data-dependent
problem that can induce control-flow divergence and inefficient data-access patterns. We
present a simple solution that uses the bulk-synchronous parallel model to ensure a uniform
mode of execution and balanced workloads across GPU threads. The method is easy to
implement and fast, and it operates entirely on the GPU by relying on a topology-centred work …