> I find that CUDA's cub library is better if you're doing prefix-sums within a kernel.
Thrust doesn't have a __device__ prefix-sum iirc, just the global call /laugh
> "Thrust" is more of a quick-and-dirty prototype kind of code, which has weaker performance than people expect.
Yes, absolutely, they're complementary and in many cases CUB does things slightly better, or does things that Thrust doesn't support.
But Thrust is fantastic for "I want to allocate some GPU arrays, set up some data, run sort+prefix sum, then hand it off to something else to run the actual algorithm". It's glue that helps you get started (eg see those quickstarts - those are very short even by CUDA standards, let alone OpenCL) and figure out if your idea is going to work. And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations. It's also easy to get Thrust containers to give you a real pointer, and at that point you can call CUB or real kernels or do whatever else you want.
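For example, that "fill this index-array with 0..N and then sort(arr1,arr2)" step is just a couple of Thrust calls - a rough sketch (the array names and types are made up, but thrust::sequence / thrust::sort_by_key / thrust::raw_pointer_cast are the real calls):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sort_indices_by_key(thrust::device_vector<float>& keys)
{
    // Fill an index array with 0..N-1, then sort it alongside the keys.
    thrust::device_vector<int> indices(keys.size());
    thrust::sequence(indices.begin(), indices.end());
    thrust::sort_by_key(keys.begin(), keys.end(), indices.begin());

    // Hand a raw pointer off to CUB or your own kernels for the real algorithm.
    int* d_indices = thrust::raw_pointer_cast(indices.data());
    (void)d_indices; // ...launch whatever comes next here
}
```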
As far as performance... eh, CUB is a little faster, but not incredibly so - maybe 10% from what I remember, it wasn't huge. Thrust algorithms are usually not in-place, so CUB's in-place variants can handle a slightly larger problem size in most situations (you don't have to allocate a scratch buffer). I actually found the CUB in-place sort was slower than the Thrust non-in-place one, though (understandable, that's a common penalty for in-place, and CUB's non-in-place sort might be even faster).
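The scratch-buffer point is just CUB's usual two-call pattern: Thrust's sort allocates its temporary storage internally, while CUB makes you query the size, allocate it yourself, and call again - which is where the memory headroom and reuse come from. A minimal sketch (function and variable names are mine):

```cpp
#include <cub/cub.cuh>

void cub_sort_keys(float* d_keys_in, float* d_keys_out, int num_items)
{
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null temp-storage pointer: only reports the size needed.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call actually sorts, using the scratch buffer you control.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}
```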
More fundamentally, Thrust really works at the level of iterators and not kernels/grids, so you can't really do warp-level operations at all using global "sort this shit" type commands. Thrust doesn't expose the grid information to you and doesn't make guarantees about what grid topology will be executed (there is an OpenMP backend!).
But if there is some general "per-item" function in your algorithm, you can call it using the map-iterator (can't remember what it's called, but it's the "apply this function to each element" thing) and either pass the object to work on directly, or have the value passed be the index of a work-item and let your functor load it (store a pointer to the array start in the functor). In that case you inherit some of the occupancy auto-tuning that Thrust does, which is nice just as a basic thing to get off the ground - it'll try to use as wide a grid as is feasible given the occupancy/utilization.
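I believe the thing I'm thinking of is just thrust::for_each (or thrust::transform) over a counting_iterator - something along these lines (the functor and the doubling operation are made-up examples):

```cpp
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

struct per_item_op
{
    float* data;  // pointer to the array start, stored in the functor

    __device__ void operator()(int i) const
    {
        // The value passed in is just the index of a work-item;
        // the functor loads whatever it needs from global memory.
        data[i] *= 2.0f;
    }
};

void run_per_item(thrust::device_vector<float>& v)
{
    // Thrust picks the grid/block configuration here, so you inherit
    // its occupancy auto-tuning instead of choosing launch bounds yourself.
    thrust::for_each(thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(static_cast<int>(v.size())),
                     per_item_op{thrust::raw_pointer_cast(v.data())});
}
```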
I seem to remember that I did find a way to kinda work around it, like what I was iterating over was grid launches instead of individual work-items, and obviously those can use warp-collective calls etc, but yeah, at some point you'll have to make the hop to a proper kernel launch - Thrust just lets you push it off a bit. I was just seeing if I could do it to leverage Thrust's occupancy auto-tuning.
Maybe it was that I'd stride the object space (eg launch an iterator for every 32 items) and do a kernel launch on each chunk, or something like that.
> And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations.
At a large granularity, yes if that's what you're doing.
But if you need to exit the kernel and return to the host side just to push/pop from a queue or allocate data on a stack (prefix-sum(sizes) -> reserve sum-of-(sizes) slots at the top of the stack) for a SIMD-stack push/pop operation, things will be quite slow.
SIMD-stack push/pop should be done at the block level and coordinated/synchronized across blocks using atomics (atomicAdd(stack_head) / atomicSub(stack_head)). Especially if you don't know how many times a particular routine will push to the top of the stack.
Note: the SIMD-stack is safe as long as all threads are pushing together, or all popping together. If you can split your algorithm into "push-only kernel" and "pop-only kernel" steps, you get a surprising amount of flexibility.
-------
Anyway, using a Thrust-level prefix sum will spin up an entire grid log(n) times each time you want to add/remove things from that shared stack. So you're really spawning too many grids IMO.
Instead, a CUB block-level prefix sum can compute each thread's offset and atomicAdd() / push onto the stack efficiently before the kernel exits, so you have far fewer kernel calls - something like the sketch below.
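Roughly what I mean, as a sketch (hypothetical kernel - the per-thread counts and payload are placeholders, but cub::BlockScan plus one atomicAdd per block is the real pattern):

```cpp
#include <cub/cub.cuh>

// Block-level push onto a shared global stack: a BlockScan gives each thread
// its offset within the block's chunk, and a single atomicAdd per block
// reserves that chunk on the stack head -- no extra kernel launches needed.
template <int BLOCK_THREADS>
__global__ void push_kernel(const int* counts, const int* payload,
                            int* stack, int* stack_head, int n)
{
    using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
    __shared__ typename BlockScan::TempStorage temp_storage;
    __shared__ int block_base;

    int tid = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int n_items = (tid < n) ? counts[tid] : 0;  // how many items this thread pushes

    int thread_offset, block_total;
    BlockScan(temp_storage).ExclusiveSum(n_items, thread_offset, block_total);

    // One atomic per block reserves space for the whole block's pushes.
    if (threadIdx.x == 0)
        block_base = atomicAdd(stack_head, block_total);
    __syncthreads();

    // Each thread writes its items into its reserved slots.
    for (int i = 0; i < n_items; ++i)
        stack[block_base + thread_offset + i] = payload[tid];
}
```

A pop-only kernel is the mirror image (atomicSub on the head, guarding against underflow), which is why splitting the algorithm into push-only and pop-only phases keeps the synchronization simple.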