A C++17 thread pool for high-performance scientific computing (arxiv.org)
137 points by pramodbiligiri on June 14, 2022 | hide | past | favorite | 50 comments


> In particular, our implementation does not utilize OpenMP or any other high-level multithreading APIs, and thus gives the programmer precise low-level control over the details of the parallelization, which permits more robust optimizations.

I totally disagree with this line of thinking. The person who knows most about the hardware on which the program will be run is the person running it, not the programmer. The OpenMP API is somewhat complicated in an attempt to allow the programmer to, at a high level, express ideas about data locality to the runtime.

Unless we're imagining a universe in which the programmer is around to tweak their source code for every hardware platform, the idea of "giving the programmer more control" is a dead end. The programmer must be given expressiveness.

Threading libraries are complicated because hardware is complicated. I mean, first-generation Threadrippers are a little old, but they still exist: do we really want to have everybody re-write the code to handle "I have multiple packages on this node, NUMA between the dies in a given package, NUMA between the packages in the node!"


Unless we're imagining a universe in which the programmer is around to tweak their source code for every hardware platform

It doesn't have to be the whole universe, just a region in which a particular kind of specialized scientific computing is done. I have no idea what this specific thing is good for, but looking at it through the lens of general-purpose computing truisms probably misses more than it illuminates.


There's some truth there, but I think it is actually flipped.

Scientific computing codes generally stick around a while (or at least they might, or at least we hope they will), they are run on complicated platforms by motivated people. There's incentive to do run-time tuning, and there's value to making them general.

On the other hand, you might have things like a videogame console, where the program is tuned for a particular platform. There's plenty of room there for programmer control I think (not sure though, that isn't my field -- in the Unity/Unreal era maybe they try and make things general? But I bet the programmer can exert great control if they want).


Scientific computing codes generally stick around a while

I don't think this is generally true and wasn't at all true in my limited exposure to scientific computing. There's lots of one-off, ad-hoc code that isn't building the next BLAS or whatever. The whole trope of 'scientists are terrible programmers' doesn't come from some universally prevalent incentive to write general, long-living code.


Scientific computing is not scientists doing computing. It's weather simulation, CFD, and other engineering FEA software. Lots of it has been around for decades and may even be written in Fortran.

Come to think of it, this may be targeted at scientists after all.


It says it is for "high-performance scientific computing" though. It could actually be mislabeled and intended for scientists who happen to be writing programs. But do they typically care about manual hardware tuning, the sort that this library claims to enable?


It's common among scientists who write low-level code. Also, the author of the library we are talking about is a physicist, and the library probably exists for a reason.


Totally agree. Plus, OpenMP is a standard that’s grown to take into account new hardware. It works in C, C++ and Fortran, which is a huge bonus. It’s stuck around for a long time for a reason - before it and MPI were standardised, every set of processor hardware had its own parallelism libraries and the programmer had to learn a new set.


> The person who knows most about the hardware on which the program will be run is the person running it, not the programmer.

Then again, in scientific computing, often these are the same person.


AFAIK in scientific computing, the research group is targeting their university's particular cluster and the cluster's admins/operators are there to maintain and protect the shared facility, not to get the research result.


I think it depends on the group.

For example, I have more-or-less free rein on a small cluster which belongs to my professor (my group is small enough to just manually coordinate to not annoy each other), then there's a university cluster if I feel like messing around with bsub and waiting in a queue. I'm also under the impression that it isn't super difficult to get a handful of XSEDE credits when starting a new project.

I prefer to debug on my local machine or the group cluster, so I don't have to risk putting pointless buggy runs into the shared queues (which might be costly, depending on policy, but even if they aren't costly I don't want to hog resources for no reason).

If I write the code that tunes performance for a particular machine, then I have codepaths that I wrote and can't test locally very easily. Better to use the existing OpenMP and MPI knobs to tune. Sure, you can end up with an evil thread interleaving that only shows up when you have 100 cores, but this should hopefully be rare.

That's my perspective at least. I do numerical stuff, so I'm shielded from some of the real nitty-gritty "perfectly synchronized threads" stuff by the fact that my costly operations are usually decent-sized linear algebra calls, so it is usually best to give the lowest NUMA level(s) to BLAS, LAPACK, or maybe some sparse solver (which should be reasonably scalable and pretty well tested).


The paper has only a single benchmark reported from a single system where they report

> In our performance test, we see a speedup of 18.2x, saturating and even surpassing this estimated theoretical upper bound.

IMHO if you exceed your theoretical bound, that's a sign you didn't do a good job analyzing the situation.


Yeah, I agree that they should have touched more on that. Something definitely is going on when your result exceeds what should be possible.

That said, I'm currently struggling with managing threads with my algorithm, so I'm looking forward to giving this a go.


Yet again another threading implementation which ignores the Actor programming model. This must be a systemic failure in both academia, for not teaching students about Actors, and the industry, for barely talking about Actors. It’s like the old saying: those that don’t understand Unix are doomed to reimplement it, poorly.

For those readers that haven’t encountered Actors, it’s more than just a threading model. It allows messaging (important for interaction). An Actor is essentially a class with a std::vector<std::function<void(...)>> member, and a public messaging function which pushes the message with variable arguments to the queue. The message invokes the scheduler, which sequentially invokes the queued functions. It eliminates locks for the client, which is the key benefit. It keeps Actors on a specific thread to use hot caches, unless work stealing moves the Actor to a free thread. The Actor abstraction is higher level than threads.
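In rough C++ terms, the shape described above looks something like this. This is a toy, single-threaded sketch for illustration only; the names (`Actor`, `post`, `drain`) are made up here, and a real implementation would pin actors to pool threads and run the scheduler concurrently:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>

// Each actor owns a queue of std::function<void()> messages; post()
// enqueues, and the scheduler drains each actor's queue sequentially,
// so the actor's own state never needs a lock.
class Actor {
public:
    // Post a message; the lock protects only the mailbox, not the state.
    void post(std::function<void()> msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        mailbox_.push_back(std::move(msg));
    }

    // Drain the mailbox sequentially (the "scheduler" of this toy).
    void drain() {
        for (;;) {
            std::function<void()> msg;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (mailbox_.empty()) return;
                msg = std::move(mailbox_.front());
                mailbox_.pop_front();
            }
            msg();  // runs without the lock; messages never overlap
        }
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> mailbox_;
};

// A counter wrapped in an actor: no lock around `count` itself, because
// messages execute one at a time.
struct Counter {
    Actor actor;
    int count = 0;
    void add(int n) { actor.post([this, n] { count += n; }); }
};
```

The client only ever calls `add`; serialization happens in the mailbox, which is the "eliminates locks for the client" point.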

I’m sure the author will come to realise the benefits of Actors once he has a decade more experience under his belt, after which just like me he’ll be disappointed at academia and the industry for not talking about the benefits of the Actor programming model.

Shameless plug : my own Actor implementation, used in real world 24h/365 embedded projects: https://github.com/smallstepforman/Medo/tree/main/Actor


Actors are fantastic and ideal for many situations.

However, this thread pool is specifically for _high performance_ computing which generally means working with shared memory. The actor paradigm generally assumes no shared memory and communications using message passing. It's harder to optimize performance and is an impedance mismatch for HPC.


Another important difference from the actor model is that the library uses a future interface with blocking wait to get the results. With actors there is no wait as the results are posted to the message queue of the caller.

But waiting makes the usage more similar to sequential computing code. When there is an opportunity, the code starts several jobs to work on presumably independent data sets, then waits for their results in shared memory and processes them sequentially until there is another opportunity for parallel computation.


Coming from the Scala world this comment confuses me.

We know actors very well and we've had a popular and solid implementation in Akka for years.

But that hasn't meant that every other concurrency model has gone away. If anything, with the advent of libraries like ZIO (http://zio.dev) we see concurrency being a hot space for interesting new takes and approaches. I feel like maybe I am missing something if it can be used as a one-stop solution for all concurrency needs.


Chromium uses the actor model extensively. Yet in Safari, Apple moves in the direction of multiple fine-grained locks, for performance reasons I presume. So it's unclear which model will dominate the industry. Locks allow for better performance, but the code becomes much harder to maintain and reason about, although even in C++, with the right API and proper discipline, the situation is not that bad.

Also it is interesting that Go does not use the actor model, with its strictly one polymorphic message queue per green thread. Rather it is based on CSP, or communicating sequential processes: green threads with multiple typed message queues, or channels, per thread. In my limited experience with Go I would rather use a third-party actor model implementation than native CSP, as the programming model is simpler.


That is the most C-like C++ I think I've ever seen.


Looking at the code, it seems like it is more of a class project rather than a highly optimized library.

It is using mutexes, condition variables, and futures to write a pretty much textbook implementation of a thread pool.

However, there will be significant contention, as all the workers are reading from the same queue, and the submissions are going to the same queue.

There are not multiple queues with work-stealing which is I think a minimum for a production level version of something like this.

EDIT:

IIRC C++ Concurrency in Action by Anthony Williams has a nice discussion of the issues of building a C++ thread pool using the C++ standard library. It does address things like contention and work-stealing.
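For reference, the "textbook" shape being described (a single shared queue guarded by one mutex and one condition variable) looks roughly like the following minimal sketch. This is illustrative code, not the library's actual implementation:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Textbook pool: one queue, one mutex, one condition variable. Every
// worker contends on the same lock, which is exactly the scalability
// problem noted above.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker(); });
    }
    ~ThreadPool() {  // drains remaining tasks, then joins
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void worker() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
        }
    }
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

Note that both `submit` and every worker touch the same `mutex_`, so with many cores and short tasks the lock itself becomes the bottleneck.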


"There are not multiple queues with work-stealing which is I think a minimum for a production level version of something like this."

100% agree. I also agree that would be a minimum requirement to call it "high-performance".


Multiple queues with work stealing is an optimization for really extreme cases (lots of tiny tasks). In most use cases, a multi-consumer lockless queue (eg, moodycamel::ConcurrentQueue) is more than adequate.


What I'm hearing is that in most cases, any one of the reams of existing code for this is adequate.


I generally agree with this, but one thing I will say is that if you work in a huge codebase with a lot of engineers, people are going to end up throwing all kinds of insane workloads into whatever thread pool you have. The old adage: if all you have is a hammer, everything looks like a nail. Typically in a big codebase like this it might be discouraged or forbidden to create lots of your own worker threads in your own pool, so in practice you're just going to end up throwing items into whatever thread pool you're allowed to use, which is the one that's going to end up with silly optimizations like work stealing (even if most sanely structured programs won't benefit from work stealing).


Is it? I thought that work stealing was a basic requirement for maintaining some semblance of cache locality.


It's also about prefetchability of the next task.

If all cores e.g. do an atomic add on a single counter variable to get their next task-id just in time, then there is no chance for the cores to do any reasonable prefetching. Because most of the time another core will come in and write to this variable that all prefetching is based on. With short tasks, you can end up with a full pipeline stall each time a new task is started.


For what kind of work?


If you have items running in the threadpool that are submitting more work to the threadpool, and you care about "high-performance", you probably want those items to run on the same thread for better cache (and memory locality) behavior. Really you want the same/nearby CPU, but in pure C++ "same thread" and relying on the OS is the closest you can get.

It's not that you need "work stealing" but rather that you need separate queues per thread rather than a single central queue. At that point you need a work stealing story to keep a single long running item from blocking the work queued behind it.

This matters less on systems which are "less NUMA" -- a single socket 6 core Intel home computer isn't going to care nearly as much if work migrates between CPUs unnecessarily, but the cache/NUMA effects are quite large when dealing with multi-socket servers, especially if you have a threadpool that wants to utilize the entire machine.
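A greatly simplified sketch of the per-thread-queue idea follows. This is hypothetical code: real work-stealing pools use lock-free deques and steal from the opposite end to the owner, while this toy just guards each queue with its own mutex:

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Each worker pops from its own queue first and only scans the other
// queues when it runs dry, so contention happens mostly when a worker
// is otherwise idle anyway.
class StealingPool {
public:
    explicit StealingPool(unsigned n) : queues_(n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { worker(i); });
    }
    ~StealingPool() {  // note: drops tasks still pending at shutdown
        done_ = true;
        for (auto& t : workers_) t.join();
    }
    // Submissions round-robin across the per-thread queues.
    void submit(std::function<void()> task) {
        unsigned i = next_++ % queues_.size();
        std::lock_guard<std::mutex> lock(queues_[i].mutex);
        queues_[i].tasks.push_back(std::move(task));
    }

private:
    struct Queue {
        std::mutex mutex;
        std::deque<std::function<void()>> tasks;
    };
    bool try_pop(unsigned i, std::function<void()>& task) {
        std::lock_guard<std::mutex> lock(queues_[i].mutex);
        if (queues_[i].tasks.empty()) return false;
        task = std::move(queues_[i].tasks.front());
        queues_[i].tasks.pop_front();
        return true;
    }
    void worker(unsigned me) {
        while (!done_) {
            std::function<void()> task;
            bool got = try_pop(me, task);            // own queue first
            for (unsigned j = 0; !got && j < queues_.size(); ++j)
                got = try_pop(j, task);              // then steal
            if (got) task();
            else std::this_thread::yield();
        }
    }
    std::vector<Queue> queues_;
    std::vector<std::thread> workers_;
    std::atomic<unsigned> next_{0};
    std::atomic<bool> done_{false};
};
```

The stealing pass is what keeps one long-running item from starving the work queued behind it on the same thread.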


“Lots of tiny tasks” workloads, mostly. For more coarse-grained parallelism the cache thing matters a lot less. I still favor work-stealing schedulers as the default, since robustness across different types of workload is a very good thing, but they’re definitely not always needed.


Sean Parent's Better Concurrency video(s) go over some of it too. Also, a good series of videos overall.



Code: https://github.com/bshoshany/thread-pool/blob/master/BS_thre...

https://github.com/bshoshany/thread-pool/blob/master/BS_thre...

I don't see how this is built for particularly good performance, or with many-core machines in mind.


Because it's not.

It's an undergrad assignment, basically. The writeup is nice, and it's great that it's on arXiv, but it's not novel nor publishable research. The implementation that ships with a given OS is sure to run circles around it.


A good paper will describe how the work is interesting or useful in the field and how it opens the door for future research.

Here the author acknowledges that other libraries may be more performant, but describes the motivation for the work as essentially something that’s a small piece of code to maintain, and easy to use.

I didn’t run the code but I read through the paper and it doesn’t seem like it’s made a convincing case for this.


Depending on which C++17 standard libraries are being used, performance efficiency may be an issue and platform dependent.


Writing a thread pool which is performant and scalable is surprisingly hard; I did it a couple of times. And now, whenever I can, I tend to avoid doing that. Instead, I prefer using the stuff already implemented by someone else, in OS userlands or runtime libraries.

• C and C++ runtime libraries have OpenMP. Level of support varies across compilers but still, all mainstream ones support at least OpenMP 2.0.

• .NET runtime comes with its own thread pool. The standard library provides both low-level APIs similar to the Windows’ built-in one, and higher-level things like Parallel.For remotely comparable to OpenMP.

• Windows userland has another one, see CreateThreadpoolWork and SubmitThreadpoolWork APIs introduced in Vista. The API is totally different though, a programmer needs to manually submit each task to the pool.

• iOS and OSX userlands also have a thread pool called Grand Central Dispatch. It’s very comparable to the Windows version.


The main issue with OpenMP is that Apple ships a crippled version of Clang that does not support it by default. If you are developing on a Mac, OpenMP should be considered an external dependency, with all the unpleasantness that means in a C++ project. If you have students writing code and the target environment is Linux, as is common in academia, replacing OpenMP with something portable can be a good idea.

Also, a thread pool that supports the equivalent of:

#pragma omp parallel for schedule(static)

and

#pragma omp parallel for schedule(dynamic, 1)

is often good enough for scientific computing. There is surprisingly little need for all the fancy things software engineers have invented because their threads have to communicate every minute.
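A sketch of what those two schedules amount to in plain C++ (not OpenMP's actual implementation, and the helper names here are made up): static pre-splits the index range into one contiguous chunk per thread, while dynamic with chunk size 1 has each thread atomically grab the next index as it finishes the previous one.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// schedule(static): pre-split [0, n) into one contiguous chunk per
// thread. Zero scheduling overhead, but uneven iterations can leave
// some threads idle.
void parallel_for_static(int n, unsigned nthreads,
                         const std::function<void(int)>& body) {
    std::vector<std::thread> threads;
    int chunk = (n + (int)nthreads - 1) / (int)nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        int lo = (int)t * chunk, hi = std::min(n, lo + chunk);
        threads.emplace_back([lo, hi, &body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& th : threads) th.join();
}

// schedule(dynamic, 1): each thread atomically grabs the next index,
// which balances uneven iterations at the cost of contention on `next`.
void parallel_for_dynamic(int n, unsigned nthreads,
                          const std::function<void(int)>& body) {
    std::atomic<int> next{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back([n, &next, &body] {
            for (int i = next++; i < n; i = next++) body(i);
        });
    for (auto& th : threads) th.join();
}
```

(A real implementation would reuse a pool instead of spawning threads per loop; this only illustrates the two scheduling policies.)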


Pretty much everyone in scientific computing who uses a Mac will use Homebrew or Macports to install a standard version of GCC or Clang in my experience.


What were your performance requirements, what were some issues that you ran into, and what did you do to improve?


It was years ago, but as far as I remember, here are the hard parts.

Efficient synchronization is hard. Tasks may spawn more tasks, i.e. the pool has multiple producers and consumers. With a few cores simple ways work fine, with a few dozen of cores a global queue is a bottleneck.

Dispatching too many tasks, fire-and-forget style, may consume too much memory with pending tasks. Thread pools often need backpressure, i.e. a way to stop posting tasks until some are finished.
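A common way to get that backpressure is a bounded queue whose push blocks when the queue is full; a minimal sketch:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Bounded queue: push() blocks once `capacity` items are pending, so
// producers cannot outrun consumers and exhaust memory.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) {}
    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(std::move(item));
        not_empty_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        not_full_.notify_one();  // wake a blocked producer
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> items_;
    size_t capacity_;
};
```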

Many articles on the internets promote lockless data structures; they have two issues: (1) they are very hard to implement and especially debug, and (2) the locks are still there in hardware, only their API is much worse (compare-and-exchange atomics), and the performance ain't necessarily better, because reading from a cache line recently modified by another core is expensive, even more so than a cache miss and system RAM latency.

Good automatic scaling is hard. All operating systems have APIs to find the count of hardware threads, which is good, but they all run multiple programs at once. Just because a computer has 8 hardware threads doesn't mean it's always a good idea for a particular program to spawn 8 threads to compute something; depending on the environment, that could be too many.

Another notable mention is the tradeoff between CPU overhead and latency. Ideally you'd want neither of these things, but that's often not really possible. What's worse, the optimal tradeoff varies: for a large count of short tasks latency can be important, while a background task which takes a minute to complete doesn't care about latency at all; on a laptop it should ideally depend on plugged/unplugged global state.

As to what helped, kernel APIs for sending messages across threads did, an important piece of the puzzle already implemented in the OSes. For Linux see mq_overview, poll, epoll; for Windows PostQueuedCompletionStatus and GetQueuedCompletionStatus.


I personally found that when implementing a thread pool just using a regular mutex/condition variable and std::queue was just as performant as boost::lockfree::queue, was a lot simpler, and introduced fewer dependencies. Also when using boost::lockfree::queue I got complaints from tsan about some of the things boost was doing. I didn't really dig in but even if you assume that the boost implementation is correct, the fact that it pollutes tsan output is really annoying.

There are probably cases where a lockfree queue makes sense, but I don't think a thread work pool is one of them. Using a mutex also gives you the option for doing batched inserts efficiently (acquire mutex, insert multiple items into the pool).
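The batching point is just that one lock acquisition can cover many pushes; a hypothetical sketch:

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

// Amortize the lock: insert a whole batch of tasks under one
// acquisition instead of locking once per task.
struct TaskQueue {
    std::mutex mutex;
    std::queue<std::function<void()>> tasks;

    void submit_batch(std::vector<std::function<void()>> batch) {
        std::lock_guard<std::mutex> lock(mutex);  // one lock for N tasks
        for (auto& task : batch) tasks.push(std::move(task));
    }
};
```

With a lock-free queue there is no equivalent way to make a multi-item insert atomic and cheap at the same time.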


This is a very primitive implementation. I like it, and it looks good, but using mutexes and condition variables will be comparatively slow. You really want, in theory, to abuse atomics to "simulate" condition variables, since you avoid the possible context switch into the kernel which condition variables may incur (afaik).

Still cool, but I know (from building and testing my own similar implementations) that a good bit of performance will be lost here due to these context switches. For an example of a more complex, but much more efficient, thread pool implementation, check out this article on thread pools in Zig [1] (the language is secondary, it's mostly pseudocode).

It goes over the need for an actual scheduler to make thread pools efficient.

1: https://zig.news/kprotty/resource-efficient-thread-pools-wit...
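A C++17-flavored sketch of the spin-before-sleep idea (the condition-variable fallback that a real pool would add after the spin budget is exhausted is omitted here):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Spin on an atomic flag for a bounded number of iterations before
// yielding, so a task that arrives quickly never pays for a
// kernel-level sleep/wake.
inline void spin_wait(std::atomic<bool>& ready) {
    int spins = 0;
    while (!ready.load(std::memory_order_acquire)) {
        if (++spins > 1000)
            std::this_thread::yield();  // be polite after a short busy-spin
    }
}
```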


In my experience, Eigen's thread pool is decent. But OpenMP (edit: I mean Intel's implementation donated to LLVM) is often faster, especially if threads are allowed to be affinitized to HW processors. Another promise of OpenMP that is not made by various thread pools is cooperative execution; in thread pools, tasks are usually independent.

However, if any part of an app uses affinitized threads, the whole app needs to be using the same thread pool, as otherwise perf will go down. In this regard, OpenMP is less composable.


It’s been a long time since I touched C++, so pardon my naïveté. I’d have assumed that optimized thread pools were a done thing. What’s new here, and why was there a gap?


I looked through the code and I don't see anything new at all. Looks similar to the dozens of other thread pools I've reviewed over the years.

The author is a physicist. Looks like he just decided to publish about his C++ code.


There are lots of them, and many are built into the OS (e.g. GCD on Macs, Windows has a thread pool API), plus TBB on all of them.

It would be neat if the github site https://github.com/bshoshany/thread-pool or the paper did some comparisons to the existing body out there.


Read 1.1 of the pdf: https://arxiv.org/pdf/2105.00613.pdf

It's too long to quote. I'm too lazy to summarise.


Nothing in that section explains why this is in any way new or innovative. The only interesting claim is "performance", but looking at the code, I don't see anything other than a simple naive thread-pool implementation.


Perhaps slightly off topic but playing with parallel numerical computations I found that C++17/20's "parallel for" loops are significantly faster than using manually created threads.

https://en.cppreference.com/w/cpp/algorithm/execution_policy...


Lots of comments but no-one found the logical error in the paper and README, which I just fixed.



