> I find that CUDA's cub library is better if you're doing prefix-sums within a kernel.
Thrust doesn't have a __device__ prefix-sum iirc, just the global call /laugh
> "Thrust" is more of a quick-and-dirty prototype kind of code, which has weaker performance than people expect.
Yes, absolutely, they're complementary and in many cases CUB does things slightly better, or does things that Thrust doesn't support.
But Thrust is fantastic for "I want to allocate some GPU arrays, set up some data, run sort+prefix sum, then hand it off to something else to run the actual algorithm". It's glue that helps you get started (eg see those quickstarts - those are very short even by CUDA standards, let alone OpenCL) and figure out if your idea is going to work. And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations. It's also easy to get Thrust containers to give you a real pointer, and at that point you can call CUB or real kernels or do whatever else you want.
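For example, that "fill this index-array with 0..N and then sort(arr1,arr2)" step is just a couple of Thrust calls - a rough sketch (the array names and types are made up, but thrust::sequence / thrust::sort_by_key / thrust::raw_pointer_cast are the real calls):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void sort_indices_by_key(thrust::device_vector<float>& keys)
{
    // Fill an index array with 0..N-1, then sort it alongside the keys.
    thrust::device_vector<int> indices(keys.size());
    thrust::sequence(indices.begin(), indices.end());
    thrust::sort_by_key(keys.begin(), keys.end(), indices.begin());

    // Hand a raw pointer off to CUB or your own kernels for the real algorithm.
    int* d_indices = thrust::raw_pointer_cast(indices.data());
    (void)d_indices; // ...launch whatever comes next here
}
```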
As far as performance... eh, CUB is a little faster, but not incredibly so - maybe 10% from what I remember, it wasn't huge. Thrust algorithms are usually not in-place, so CUB's in-place variants can handle a slightly larger problem size in most situations (you don't have to allocate a scratch buffer). I actually found the CUB in-place sort was slower than the Thrust non-in-place one, though (understandable, that's a common penalty for in-place, and CUB's non-in-place sort might be even faster).
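The scratch-buffer point is just CUB's usual two-call pattern: Thrust's sort allocates its temporary storage internally, while CUB makes you query the size, allocate it yourself, and call again - which is where the memory headroom and reuse come from. A minimal sketch (function and variable names are mine):

```cpp
#include <cub/cub.cuh>

void cub_sort_keys(float* d_keys_in, float* d_keys_out, int num_items)
{
    void*  d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null temp-storage pointer: only reports the size needed.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call actually sorts, using the scratch buffer you control.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}
```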
More fundamentally, Thrust really works at the level of iterators and not kernels/grids, so you can't really do warp-level operations at all using global "sort this shit" type commands. Thrust doesn't expose the grid information to you and doesn't make guarantees about what grid topology will be executed (there is an OpenMP backend!).
But if there is some general "per-item" function in your algorithm, you can call it using the map-iterator (can't remember what it's called, but it's the "apply this function to each element" thing) and either pass the object to work on directly, or have the value passed be the index of a work-item and let your functor load it (store a pointer to the array start in the functor). In that case you inherit some of the occupancy auto-tuning that Thrust does, which is nice just as a basic thing to get off the ground - it'll try to use as wide a grid as is feasible given the occupancy/utilization.
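I believe the thing I'm thinking of is just thrust::for_each (or thrust::transform) over a counting_iterator - something along these lines (the functor and the doubling operation are made-up examples):

```cpp
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

struct per_item_op
{
    float* data;  // pointer to the array start, stored in the functor

    __device__ void operator()(int i) const
    {
        // The value passed in is just the index of a work-item;
        // the functor loads whatever it needs from global memory.
        data[i] *= 2.0f;
    }
};

void run_per_item(thrust::device_vector<float>& v)
{
    // Thrust picks the grid/block configuration here, so you inherit
    // its occupancy auto-tuning instead of choosing launch bounds yourself.
    thrust::for_each(thrust::counting_iterator<int>(0),
                     thrust::counting_iterator<int>(static_cast<int>(v.size())),
                     per_item_op{thrust::raw_pointer_cast(v.data())});
}
```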
I seem to remember that I did find a way to kinda work around it, like what I was iterating over was grid launches instead of individual work-items, and obviously those can use warp-collective calls etc, but yeah, at some point you'll have to make the hop to a proper kernel launch - Thrust just lets you push it off a bit. I was just seeing if I could do it to leverage Thrust's occupancy auto-tuning.
Maybe it was that I'd stride the object space (eg launch an iterator for every 32 items) and do a kernel launch on each chunk, or something like that.
> And there's very little penalty to keeping the "global steps" inside thrust, eg if you're just doing "fill this index-array with 0..N and then sort(arr1,arr2)" that is not much slower than doing everything raw, or writing one big function that tries to do everything without intermediate computations.
At a large granularity, yes if that's what you're doing.
But if you need to exit the kernel and return to the host side just to push/pop from a queue or allocate data on a stack (prefix-sum(sizes) -> reserve sum-of-(sizes) slots at the top of the stack) for a SIMD-stack push/pop operation, things will be quite slow.
SIMD-stack push/pop should be done at the block level and coordinated/synchronized across blocks using atomics (atomicAdd(stack_head) / atomicSub(stack_head)). Especially if you don't know how many times a particular routine will push to the top of the stack.
Note: the SIMD-stack is safe as long as all threads are pushing together, or all popping together. If you can split your algorithm into "push-only kernel" and "pop-only kernel" steps, you get a surprising amount of flexibility.
-------
Anyway, using a Thrust-level prefix sum will spin up an entire grid log(n) times each time you want to add/remove things from that shared stack. So you're really spawning too many grids IMO.
Instead, a CUB block-level prefix sum can compute each thread's offset and atomicAdd() / push onto the stack efficiently before the kernel exits, so you have far fewer kernel calls - something like the sketch below.
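Roughly what I mean, as a sketch (hypothetical kernel - the per-thread counts and payload are placeholders, but cub::BlockScan plus one atomicAdd per block is the real pattern):

```cpp
#include <cub/cub.cuh>

// Block-level push onto a shared global stack: a BlockScan gives each thread
// its offset within the block's chunk, and a single atomicAdd per block
// reserves that chunk on the stack head -- no extra kernel launches needed.
template <int BLOCK_THREADS>
__global__ void push_kernel(const int* counts, const int* payload,
                            int* stack, int* stack_head, int n)
{
    using BlockScan = cub::BlockScan<int, BLOCK_THREADS>;
    __shared__ typename BlockScan::TempStorage temp_storage;
    __shared__ int block_base;

    int tid = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int n_items = (tid < n) ? counts[tid] : 0;  // how many items this thread pushes

    int thread_offset, block_total;
    BlockScan(temp_storage).ExclusiveSum(n_items, thread_offset, block_total);

    // One atomic per block reserves space for the whole block's pushes.
    if (threadIdx.x == 0)
        block_base = atomicAdd(stack_head, block_total);
    __syncthreads();

    // Each thread writes its items into its reserved slots.
    for (int i = 0; i < n_items; ++i)
        stack[block_base + thread_offset + i] = payload[tid];
}
```

A pop-only kernel is the mirror image (atomicSub on the head, guarding against underflow), which is why splitting the algorithm into push-only and pop-only phases keeps the synchronization simple.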