Node C++ module vs libuv thread pool size


I've written a Node.js C++ module that uses NAN's AsyncWorker to expose async module functionality. It works great. However, I understand that AsyncWorker makes use of libuv's thread pool, which defaults to just 4 threads.
While this (or a #-of-cores based limitation) might make sense for CPU-heavy functions, some of my exposed functions may run relatively long, even though they don't use the CPU (network activity, etc). Therefore the thread pool might get all used up even though no computation-intensive work is going on.
The easy solution is to increase the thread pool size (UV_THREADPOOL_SIZE). However, I am concerned that this thread pool is used for other things as well, which might suffer from a performance hit due to too much parallelization (the libuv documentation states, "The threadpool is global and shared across all event loops...").
Is my concern valid? Is there a way to make use of a separate, larger thread pool only for certain AsyncWorkers that are long-running but not CPU-intensive, while leaving the common thread pool untouched?

Related

Cancelling arbitrary jobs running in a thread_pool

Is there a way for a thread-pool to cancel a task underway? Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Extra
Bonus if the cancellation is immediate, but it's acceptable if the cancellation has some time constraint 'guarantees' (say cancellation within 0.1 execution seconds of the thread in question for example)
More details
I am not restricted to using Boost.Thread.thread_pool or any specific library. The only limitation is compatibility with C++14, and ability to work on at least BSD and Linux based OS.
The tasks are usually data-processing related, pre-compiled and loaded dynamically using C-API (extern "C") and thus are opaque entities. The aim is to perform compute intensive tasks with an option to cancel them when the user sends interrupts.
While launching, the thread_id for a specific task is known, and thus some API can be used to find more details if required.
Disclaimer
I know using native thread handles to cancel/exit threads is not recommended and is a sign of bad design. I also can't modify the functions using boost::this_thread::interrupt_point, but can wrap them in lambdas/other constructs if that helps. I feel like this is a rock and hard place situation, so alternate suggestions are welcome, but they need to be minimally intrusive in existing functionality, and can be dramatic in their scope for the feature-set being discussed.
EDIT:
Clarification
I guess this should have gone in the 'More Details' section, but I want it to remain separate to show that the existing 2 answers are based on limited information. After reading the answers, I went back to the drawing board and came up with the following "constraints", since the question I posed was overly generic. If I should post a new question, please let me know.
My interface promises a "const" input (functional programming style non-mutable input) by using mutexes/copy-by-value as needed and passing by const& (and expecting thread to behave well).
I also mis-used the term "arbitrary" since the jobs aren't arbitrary (empirically speaking) and have the following constraints:
some which download from "internet" already use a "condition variable"
not violate const correctness
can spawn other threads, but they must not outlast the parent
can use mutex, but those can't exist outside the function body
output is via atomic<shared_ptr> passed as argument
pure functions (no shared state with outside) **
** can be a lambda binding a functor, in which case the function needs to make sure its data structures aren't corrupted (which is the case, as usually the state is 1 or 2 atomic<inbuilt-type>). Usually the internal state is queried from an external db (similar architecture to cookie + web-server, and the tab/browser can be closed anytime)
These constraints aren't written down as a contract or anything, but rather I generalized based on the "modules" currently in use. The jobs are arbitrary in terms of what they can do: GPU/CPU/internet all are fair play.
It is infeasible to insert a periodic check because of heavy library usage. The libraries (not owned by us) haven't been designed to periodically check a condition variable since it'd incur a performance penalty for the general case and rewriting the libraries is not possible.

Is there a way for a thread-pool to cancel a task underway?
Not at that level of generality, no, and also not if the task running in the thread is implemented natively and arbitrarily in C or C++. You cannot terminate a running task prior to its completion without terminating its whole thread, except with the cooperation of the task.
Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
No. The only way to get (approximately) on-demand preemption of a specific thread is to deliver a signal to it (that it is not blocking or ignoring) via pthread_kill(). If such a signal terminates the thread but not the whole process then it does not automatically make any provision for freeing allocated objects or managing the state of mutexes or other synchronization objects. If the signal does not terminate the thread then the interruption can produce surprising and unwanted effects in code not designed to accommodate such signal usage.
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Note that pthread_cancel() can be blocked by the thread, and that even when not blocked, its effects may be deferred indefinitely. When the effects do occur, they do not necessarily include memory or synchronization-object cleanup. You need the thread to cooperate with its own cancellation to achieve these.
Just what a thread's cooperation with cancellation looks like depends in part on the details of the cancellation mechanism you choose.
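As a rough illustration of what such cooperation can look like, here is a minimal sketch using a shared atomic flag. CancelToken and run_chunks are hypothetical names invented for this example, and it assumes the task can be structured as bounded chunks of work with a check between them (which, as the question notes, is not always possible for opaque library code).

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical cancellation token: the owner of the task sets it,
// the task itself polls it at points where stopping is safe.
struct CancelToken {
    std::atomic<bool> requested{false};
    void cancel() { requested.store(true); }
    bool cancelled() const { return requested.load(); }
};

// Hypothetical cooperative task: performs up to max_chunks bounded units
// of work and checks the token before each one.
// Returns the number of chunks actually completed.
std::size_t run_chunks(const CancelToken& token, std::size_t max_chunks) {
    std::size_t done = 0;
    while (done < max_chunks && !token.cancelled()) {
        // ... one bounded unit of real work would go here ...
        ++done;
    }
    // Returning normally lets destructors release locks and memory.
    return done;
}
```

With this shape, another thread calls token.cancel() and the task stops at its next check, so the cancellation latency is bounded by the duration of one chunk rather than being immediate.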

Cancelling a non cooperative, not designed to be cancelled component is only possible if that component has limited, constrained, managed interactions with the rest of the system:
the resources owned by the components should be managed externally (the system knows which component uses what resources)
all accesses should be indirect
the modifications of shared resources should be safe and reversible until completion
That would allow the system to clean up resources, stop operations, cancel incomplete changes...
None of these properties are cheap; all the properties of threads are the exact opposite of these properties.
Threads only have an implied concept of ownership apparent in the running thread: for a deleted thread, determining what was owned by the thread is not possible.
Threads access shared objects directly. A thread can start modifications of shared objects; after cancellation, such modifications would be partial, ineffective, or incoherent if stopped in the middle of an operation.
Cancelled threads could leave locked mutexes around. At least subsequent accesses to these mutexes by other threads trying to access the shared object would deadlock.
Or they might find some data structure in a bad state.
Providing safe cancellation for arbitrary non cooperative threads is not doable even with very large scale changes to thread synchronization objects. Not even by a complete redesign of the thread primitives.
You would have to make threads almost like full processes to be able to do that; but it wouldn't be called a thread then!

Parallel programming with c++ async

Is there any way to set the maximum number of threads that can be created by the async function (from <future>)?
I prefer to use async/future.get because it can be translated to the sync/spawn multitasking model, which is common in textbooks on algorithms (e.g. Cormen). I want to be able to obtain T[p] (the time to finish the program using p processors/threads).

Unfortunately no. std::async is rather notoriously limited in the control it provides you over how threads are created.
You might consider using a Boost thread pool instead. This is (somewhat counterintuitively) part of Boost.Asio, and uses an io_service object, even when/if you're not actually using it for I/O.
With this it's fairly easy to control the number of threads used, including using only one.
Of course you could build your own thread pool class from the standard components. Certainly possible, but not an entirely trivial task.
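To give an idea of what building a pool from standard components involves, here is a minimal sketch of a fixed-size pool. ThreadPool and submit are hypothetical names for this example, not part of the standard library or Boost, and the sketch leaves out things a production pool would need (exception policy, bounded queues, shutdown options).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool with a fixed number of worker threads.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t threads) {
        assert(threads > 0);
        for (std::size_t i = 0; i < threads; ++i)
            workers_.emplace_back([this] { run(); });
    }

    // Drains the remaining jobs, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& w : workers_) w.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queue a nullary callable; the returned future yields its result.
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        ready_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};
```

Constructing the pool with p threads and collecting results with future.get() keeps the spawn/sync style of the textbook model while capping concurrency at p, so T[p] can be measured by varying the constructor argument.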

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealist. I have rarely seen my computer run at 100% CPU (even with a umptillion Chrome tabs open, Photoshop and what-not running simultaneously)
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's estimate a fictional user-base of around 5,000 users. Which approach would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread-pool (approximately, how big do you suggest the thread pool should be?)
Creating a thread per connection and then destroying the thread when the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.

Reactive Applications certainly scale better, when they are written correctly. This means
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of your server. You typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexes, since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so only one thread has access to the state.
All work in the reactive threads should be cpu bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies, this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain responsiveness of the server all tasks in the reactor must be small (bounded by time)
On an 8 core machine you cannot allow 8 long tasks to arrive at the same time, because no other work will start until they are complete
If a task could take a long time it must be broken up (cooperative multitasking)
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
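As a rough sketch of the "break long tasks up" rule, here is a toy single-threaded loop in which a long summation re-posts itself in slices so other tasks get a turn in between. EventLoop, post, and sum_in_slices are hypothetical names for this example; a real reactor would also wait on I/O readiness, which is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded task loop: tasks run one at a time,
// in the order they were posted, and must be short and non-blocking.
class EventLoop {
public:
    void post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

    // Run until no tasks remain.
    void run() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> tasks_;
};

// A long job split cooperatively: add up `slice` items, then re-post
// the remainder so tasks already in the queue can run first.
void sum_in_slices(EventLoop& loop, const std::vector<int>& data,
                   std::size_t begin, std::size_t slice, long& total) {
    assert(slice > 0);
    const std::size_t end = std::min(begin + slice, data.size());
    for (std::size_t i = begin; i < end; ++i) total += data[i];
    if (end < data.size()) {
        loop.post([&loop, &data, end, slice, &total] {
            sum_in_slices(loop, data, end, slice, total);
        });
    }
}
```

If a second task is posted right after the first slice is queued, it runs after that first slice and before the rest of the sum, which is the responsiveness the rule is meant to protect.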
I am a big fan of reactive architectures but they come with costs. I am not sure I would write my first c++ application as reactive, I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? Can it only come in from an external event (e.g. a network request)?
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++ and with whichever choice you make.

Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try the C++11 std thread library and std mutex (but Boost ASIO is better than mutexes in a lot of cases: just always call back to the same thread and you don't need protected regions). Stay away from std future, because it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.
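A small sketch of that sizing rule in standard C++ follows. pool_size and its parameter are hypothetical names for this example; note that std::thread::hardware_concurrency() is only a hint and may return 0 when the value cannot be determined, so the sketch falls back to 1 in that case.

```cpp
#include <thread>

// Hypothetical helper: one thread per reported core, plus an optional
// allowance for threads that may occasionally block.
unsigned pool_size(unsigned extra_for_blocking = 0) {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // the hint is unavailable on this platform
    return cores + extra_for_blocking;
}
```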

FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1

I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.

as the analogy you linked to (and its comments) suggest, this is somewhat application dependent. now what you are building here is a game server. let's analyze that.
game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100% CPU applications.
on the other hand they also usually change values in some database (a "game world" model). all players create reads and writes to this database. which is exactly the intersection problem in the analogy.
so while you may gain some from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting for its locks.
so either option 1 or 2 are acceptable in your situation. for scalability reasons I would not recommend option 3.

Using asynchronous method vs thread wait

I have 2 versions of a function which are available in a C++ library which do the same task. One is a synchronous function, and another is of asynchronous type which allows a callback function to be registered.
Which of the below strategies is preferable for giving a better memory and performance optimization?
Call the synchronous function in a worker thread, and use mutex synchronization to wait until I get the result
Do not create a thread, but call the asynchronous version and get the result in callback
I am aware that worker thread creation in option 1 will cause more overhead. I am wanting to know issues related to overhead caused by thread synchronization objects, and how it compares to overhead caused by asynchronous call. Does the asynchronous version of a function internally spin off a thread and use synchronization object, or does it uses some other technique like directly talk to the kernel?

"Profile, don't speculate." (DJB)
The answer to this question depends on too many things, and there is no general answer. The role of the developer is to be able to make these decisions. If you don't know, try the options and measure. In many cases, the difference won't matter and non-performance concerns will dominate.
"Premature optimisation is the root of all evil, say 97% of the time" (DEK)
Update in response to the question edit:
C++ libraries, in general, don't get to use magic to avoid synchronisation primitives. The asynchronous vs. synchronous interfaces are likely to be wrappers around things you would do anyway. Processing must happen in a context, and if completion is to be signalled to another context, a synchronisation primitive will be necessary to do that.
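To make that concrete, here is a minimal sketch of a callback-style call whose completion is signalled back to the caller with a mutex and a condition variable. compute_sync, compute_async, and call_and_wait are hypothetical stand-ins invented for this example; they are not the API of the library in the question, whose internals are unknown.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the library's blocking call.
int compute_sync(int x) { return x * 2; }

// Hypothetical asynchronous variant: does the work on another thread
// and invokes the callback on that thread when done.
std::thread compute_async(int x, std::function<void(int)> on_done) {
    return std::thread([x, on_done] { on_done(compute_sync(x)); });
}

// Getting the callback's result back into the calling context still
// needs a synchronisation primitive, here a mutex plus condition variable.
int call_and_wait(int x) {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int result = 0;

    std::thread worker = compute_async(x, [&](int value) {
        {
            std::lock_guard<std::mutex> lock(m);
            result = value;
            ready = true;
        }
        cv.notify_one();
    });

    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; });
    }
    worker.join();  // also guarantees notify_one() finished before cv dies
    return result;
}
```

Whether the primitive lives in your code or inside the library, something of this shape has to exist whenever one context waits for another, which is why the two options in the question tend to cost about the same unless the library has a cheaper completion mechanism.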
Of course, there might be other considerations. If your C++ library is talking to some piece of hardware that can do processing, things might be different. But you haven't told us about anything like that.
The answer to this question depends on context you haven't given us, including information about the library interface and the structure of your code.

Use the asynchronous function, because it will probably do what you would otherwise do manually with the synchronous one, but in a less error-prone way.
Asynchronous: creates a thread, does the work, and when done calls the callback.
Synchronous: create an event to wait for, create a thread for the work, wait for the event; on the thread, call the sync version, transfer the result, and signal the event.

You might consider that threads each have their own environment so they use more memory than a non threaded solution when all other things are equal.
Depending on your threading library there can also be significant overhead to starting and stopping threads.
If you need interprocess synchronization there can also be a lot of pain debugging threaded code.
If you're comfortable writing non threaded code (i.e. you won't burn a lot of time writing and debugging it) then that might be the best choice.

How to design multithreaded application

I have a multithreaded application. Each module is executed in a separate thread.
Modules are:
- network module - used to receive/send data from network
- parser module - encode/decode network data to internal presentation
- 2 application module - perform some application logic on the above data one after other
- counter module - used to gather statistics from other modules
- timer module - used to schedule timers
- and much more ...
All threads use message queues for inter-thread communication (std::deque synchronized by a condition variable and a mutex).
Some modules are used by other ones (e.g. all modules use the timer and counter), and this happens for each message received from the network, which must be handled at very high rates.
This is a pretty complex application and the design looks "reasonable". On the other hand, I'm not sure that such a design, thread per module, is the "best" one. In particular, I'm afraid that such a design "encourages" a lot of context switches.
What do you think?
Are there any good guidelines or open source projects to learn from on how to do a "correct" design of a threaded application?
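For reference, a queue of the kind described above (a std::deque guarded by a mutex and a condition variable) might look roughly like this sketch. MessageQueue, push, pop, and pop_all are hypothetical names for this example, not the actual code from the application; pop_all is an assumption added to show one way a consumer could take a whole batch per wakeup.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical inter-thread queue: FIFO, unbounded, multi-producer safe.
template <class T>
class MessageQueue {
public:
    void push(T msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(msg));
        }
        ready_.notify_one();
    }

    // Blocks until a message is available, then returns the oldest one.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

    // Non-blocking: takes everything currently queued in one lock
    // acquisition (possibly nothing), preserving order.
    std::deque<T> pop_all() {
        std::deque<T> batch;
        std::lock_guard<std::mutex> lock(mutex_);
        batch.swap(queue_);
        return batch;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> queue_;
};
```

With one such queue per module, every message hop between modules is a lock plus a potential wakeup of another thread, which is where the context-switch concern in the question comes from.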

Thread-per-function designs are just naive: they assume that by separating tasks - by module - onto threads, that some kind of scalability will be achieved.
This kind of design is inefficient, as very few task breakdowns yield exactly as many tasks as there are CPUs.
Far more rational designs are to break tasks down into 'jobs' - and then use thread pooling mechanisms to dispatch those jobs.
Advantages over the thread-per-module approach:
Thread pools take advantage of all cores. with thread-per-module if you have modules < cores you have cores sitting idle.
Thread pools minimize contention and resources by maintaining a parity between active threads, and cores. with thread-per-module, if modules > cores you incur needless extra context switches and (on some platforms) each thread exhausts other limited per process resources (like virtual memory).
Thread pools let a "module" do multiple jobs at a time. thread-per-module means that the busiest module still only gets one core.

I wouldn't call myself an expert on multi-threaded design. But I've at least worked with threads enough to have run into various issues trying to design them to work together (communication, locking resources, waiting for threads to end, etc).
At this point, my general rule of thumb is that I must justify the existence of each new thread. For example, if the network layer I'm using provides both a synchronous and an asynchronous API, can I really justify making the network code use synchronous calls in a new thread instead of just using the asynchronous calls in the main thread? In your case, how many modules actually need a thread of their own for a specific reason? Are there any that could instead just be called in turn from the main thread?
If some threads have no good reason for existing, then you might be able to save yourself some trouble and complexity by just putting that module in the main thread.
Now of course, there are good justifiable reasons for putting things in threads. Such as making synchronous calls that may block for a long time, keeping a GUI thread responsive while performing a long task, or being able to take advantage of parallel processing of a large task on a multi-core system.
I don't know of any particular "correct" way to do it. A lot of it really comes down to the details of what your application is actually supposed to do.

A good guideline is to put operations that might block (such as I/O) in its own thread. Your network module is a definite candidate here. Have your network thread use select (I assume UNIX here) to block on input.
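As a small sketch of blocking on input with select, assuming a POSIX system as above: wait_readable is a hypothetical helper name for this example, it watches a single descriptor, and it treats errors (including interruption by a signal) the same as a timeout, which a real network thread would want to handle separately.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: blocks until fd has data to read or the timeout
// expires. Returns true if fd is readable, false on timeout or error.
bool wait_readable(int fd, int timeout_ms) {
    assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch larger fds

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    // First argument is the highest watched descriptor plus one.
    const int n = select(fd + 1, &readfds, nullptr, nullptr, &tv);
    return n > 0 && FD_ISSET(fd, &readfds);
}
```

The network thread would call this (or select over the full set of sockets) in a loop, reading only from descriptors reported as ready and handing the data to the other modules.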
Asynchronous events are good in separate threads as well. Your timer module looks like a good candidate here.
You might want to put your other modules in one thread to decrease complexity of your application. BUT, you might want to split them up if you have a multi-processor system.
Have a good strategy for locking resources and mutex handling to prevent deadlocks. A dependency graph (using a whiteboard!) might help here to get your design correct.
Good luck! Sounds like a complex system which will cause many hours of fun development!

For what platform?
For instance, for Win32 applications the best model for back-end servers (like yours seems to be) is the thread pool and IO Completion Port. This is not just hearsay and opinion; there are strong facts behind this claim. Rick Vicik of the Windows Performance team has posted a series of articles describing in greater detail why high end servers need to follow this model, see High Performance Windows Programs.
There are other factors that come into play, like for instance the type of protocol your network module has to handle. Request-Response protocols are often handled by a one-thread-per-request metaphor and they do well enough, but high-throughput high-scale protocols don't fare well in that model, specifically because of boxcaring requirements.
Ultimately, whether your design is sound or not is hard to tell just from this brief description. Personally I tend to favor an IO completion driven threading model, as opposed to a logical-module driven one, but that's just me.

Just to add to the other answers, let's reason about every single thread in your design:
network module
Accepted.
parser module + 2 application module
Are you sure that these 3 threads can't be merged into one main data processing thread? If that were the case, you could then benefit from a thread pool like others suggested, having this processing performed by N threads.
timer module
This one probably is reasonable in most platforms, as you will need a message processing loop to dispatch timer events. Also, if you ever need a GUI that could be the place.
counter module
This is the one that most annoys me. I can't find the reason for having a separate thread for this. Depending on how much you increment it, it will be a nice bottleneck for the application.
I'll suggest keeping separate counters in each thread and poll(message queue) for them when you need it.
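A minimal sketch of the per-thread counter idea follows. ShardedCounter and its members are hypothetical names for this example, and it swaps in directly readable atomic slots instead of polling over a message queue: each worker increments only its own slot, and whoever needs the statistic sums the slots on demand.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical statistics counter with one slot per worker thread,
// so increments never contend on a single shared location or queue.
class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t threads) : slots_(threads) {}

    // Called by worker number thread_index; touches only its own slot.
    void add(std::size_t thread_index, std::uint64_t n = 1) {
        assert(thread_index < slots_.size());
        slots_[thread_index].fetch_add(n, std::memory_order_relaxed);
    }

    // Called by whoever needs the statistic; the sum is a snapshot and
    // may miss increments that happen while it is being computed.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& slot : slots_)
            sum += slot.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::vector<std::atomic<std::uint64_t>> slots_;
};
```

Adjacent slots can still share a cache line, so padding each slot would be a further refinement if the increment rate turns out to be high enough to matter.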
and much more ...
Hope not!
