I was using the reducers library in several places in my code on a production server with 32 cores to get some parallelism. But the Fork/Join framework seems to use the cores so heavily that other processes are starved and become unresponsive.
Is there some way to limit the number of cores used, or threads spawned, by the reducers library in a JVM instance?
It seems it isn't possible to adjust the size of the standard reducers fork/join thread pool through a function or configuration parameter. You need to change core.reducers itself.
From the core.reducers source:
(def pool (delay (java.util.concurrent.ForkJoinPool.)))
This corresponds to the default Java constructor without arguments:
ForkJoinPool()
Creates a ForkJoinPool with parallelism equal to Runtime.availableProcessors(), using the default thread factory, no UncaughtExceptionHandler, and non-async LIFO processing mode.
instead of
ForkJoinPool(int parallelism)
Creates a ForkJoinPool with the indicated parallelism level, the default thread factory, no UncaughtExceptionHandler, and non-async LIFO processing mode.
It would be a nice addition to have at least the option to control the number of cores (there's also an even more configurable version of the ForkJoinPool constructor), but for now the only option would be to fork core.reducers and change that line to the maximum number of cores you want used:
(def pool (delay (java.util.concurrent.ForkJoinPool. 28)))
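For comparison, the same cap can be written in plain Java, since the Clojure line above simply calls the ForkJoinPool(int parallelism) constructor. This is only a sketch: the class BoundedPoolSketch and the helpers cappedParallelism and newCappedPool are made-up names, and leaving a few cores free for other processes is an assumption about what you want, not something core.reducers offers.

```java
import java.util.concurrent.ForkJoinPool;

final class BoundedPoolSketch {
    // Hypothetical helper: leave `reserved` cores free for other processes,
    // but never go below one worker thread.
    static int cappedParallelism(int availableCores, int reserved) {
        return Math.max(1, availableCores - reserved);
    }

    // Hypothetical helper: build a pool whose parallelism is capped relative
    // to the number of processors the JVM reports.
    static ForkJoinPool newCappedPool(int reserved) {
        int cores = Runtime.getRuntime().availableProcessors();
        return new ForkJoinPool(cappedParallelism(cores, reserved));
    }
}
```

On the 32-core box from the question, newCappedPool(4) would give a pool with parallelism 28, the same value as the hard-coded line above.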
Currently I am building a TensorFlow graph and then calling graph->getSession()->Run(input,model,output) in C++ on the CPU.
I want to achieve concurrency. What are my options for executing in parallel so that I can serve multiple requests concurrently?
Can I run sessions in a multithreaded way?
If I run multiple sessions in parallel, will the processing time stay constant? Example: if one session takes 100 ms, will running 2 sessions concurrently also take approximately 100 ms?
Note: I want to run this on CPU
The first thing to note is that TensorFlow will use all cores for processing by default. You have a limited degree of control over this via inter-op and intra-op parallelism, discussed in this authoritative answer:
Tensorflow: executing an ops with a specific core of a CPU
The second point to note is that a session is thread safe. You can call it from multiple threads. Each call will see a consistent point-in-time snapshot of the variables as they were when the call began. This is a question I asked once upon a time:
How are variables shared between concurrent `session.run(...)` calls in tensorflow?
The moral:
If you are running lots of small, sequential operations, you can run them concurrently against one session and may be able to squeak out some improved performance if you limit TensorFlow's use of parallelism. If you are running large operations (such as large matrix multiplications) which benefit more from being spread across multiple cores, you don't need to deal with parallelism yourself: TensorFlow already distributes the work across all CPU cores by default.
Also, if your graph dependencies lend themselves to any amount of parallelization, TensorFlow handles this as well. You can set up profiling to see this in action.
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
We have a C++ program which, depending on the way the user configures it, may be CPU bound or IO bound. For the purpose of loose coupling with the program configuration, I'd like to have my thread pool automatically realize when the program would benefit from more threads (i.e. CPU bound). It would be nice if it realized when it was I/O bound and reduced the number of workers, but that would just be a bonus (i.e. I'd be happy with something that just automatically grows without automatic shrinkage).
We use Boost so if there's something there that would help we can use it. I realize that any solution would probably be platform specific, so we're mainly interested in Windows and Linux, with a tertiary interest in OS X or any other *nix.
Short answer: use distinct fixed-size thread pools for CPU-intensive operations and for IO. In addition to the pool sizes, further regulation of the number of active threads will be done by the bounded buffer (producer/consumer) that synchronizes the compute and IO steps of your workflow.
For compute- and data-intensive problems where the bottleneck is a moving target between different resources (e.g. CPU vs IO), it can be useful to make a clear distinction between kinds of threads. As a first approximation:
A thread that is created to use more CPU cycles ("CPU thread")
A thread that is created to handle an asynchronous IO operation ("IO thread")
More generally, threads should be segregated by the type of resource that they need. The aim should be to ensure that a single thread doesn't use more than one resource (e.g. avoid switching between reading data and processing data in the same thread). When a thread uses more than one resource, it should be split, and the two resulting threads should be synchronized through a bounded buffer.
Typically there should be exactly as many CPU threads as needed to saturate the instruction pipelines of all the cores available on the system. To ensure that, simply have a "CPU thread pool" with exactly that many threads that are dedicated to computational work only. That number would be boost::thread::hardware_concurrency() or std::thread::hardware_concurrency(), if it can be trusted. When the application needs less, there will simply be unused threads in the CPU thread pool. When it needs more, the work is queued. Instead of a "CPU thread pool", you could use C++11 std::async, but you would need to implement a thread throttling mechanism with your choice of synchronization tools (e.g. a counting semaphore).
In addition to the "CPU thread pool", there can be another thread pool (or several other thread pools) dedicated to asynchronous IO operations. In your case, it seems that IO resource contention is potentially a concern. If that's the case (e.g. a local hard drive) the maximum number of threads should be carefully controlled (e.g. at most 2 read and 2 write threads on a local hard drive). This is conceptually the same as with CPU threads and you should have one fixed size thread pool for reading and another one for writing. Unfortunately, there will probably not be any good primitive available to decide on the size of these thread pools (measuring might be simple though, if your IO patterns are very regular). If resource contention is not an issue (e.g. NAS or small HTTP requests) then boost::asio or c++11 std::async would probably be a better option than a thread pool; in which case, thread throttling can be entirely left to the bounded-buffers.
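The layout described above (one fixed pool per resource, coupled by a bounded buffer) can be sketched as follows. The question is about C++, but the sketch is in Java for brevity; the same shape carries over to Boost or std::thread. Everything here is illustrative: TwoPoolPipeline, run, and the POISON sentinel are made-up names, the "IO" step only fabricates integers instead of reading anything, and the "CPU" step just squares them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TwoPoolPipeline {
    private static final int POISON = -1; // sentinel telling a CPU worker to stop

    // "Reads" the values 1..items on IO threads, squares them on CPU threads,
    // and returns the sum of the squares.
    static long run(int items, int ioThreads, int cpuThreads, int bufferSize) throws Exception {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(bufferSize);
        ExecutorService ioPool = Executors.newFixedThreadPool(ioThreads);
        ExecutorService cpuPool = Executors.newFixedThreadPool(cpuThreads);
        try {
            // Consumers: each CPU worker drains the buffer until it sees POISON.
            List<Future<Long>> partials = new ArrayList<>();
            for (int c = 0; c < cpuThreads; c++) {
                Callable<Long> worker = () -> {
                    long sum = 0;
                    for (int v = buffer.take(); v != POISON; v = buffer.take()) {
                        sum += (long) v * v;
                    }
                    return sum;
                };
                partials.add(cpuPool.submit(worker));
            }
            // Producers: each IO worker handles an interleaved slice of the input.
            List<Future<Void>> reads = new ArrayList<>();
            for (int p = 0; p < ioThreads; p++) {
                final int first = p;
                Callable<Void> reader = () -> {
                    for (int i = first; i < items; i += ioThreads) {
                        buffer.put(i + 1); // blocks while the buffer is full
                    }
                    return null;
                };
                reads.add(ioPool.submit(reader));
            }
            for (Future<Void> r : reads) {
                r.get(); // wait until all input has been produced
            }
            for (int c = 0; c < cpuThreads; c++) {
                buffer.put(POISON); // one stop signal per CPU worker
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get();
            }
            return total;
        } finally {
            ioPool.shutdown();
            cpuPool.shutdown();
        }
    }
}
```

The bounded buffer is what does the throttling: when the CPU side falls behind, put blocks and the IO threads idle; when the IO side falls behind, take blocks and the CPU threads idle.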
I have seen some posts say that, to use multiple processor cores, you should use the Boost thread (multithreading) library. Usually threads are not visible to the operating system. So how can we be sure that multithreading will actually use multiple cores? Is there a difference between Java threads and Boost threads?
The operating system is also called a "supervisor" because it has access to everything. Since it is responsible for managing preemptive threads, it knows exactly how many a process has, and can inspect what they are doing at any time.
Java may add a layer of indirection (green threads) to make many threads look like one, depending on JVM and configuration. Boost does not do this, but instead only wraps the POSIX interface which usually communicates directly with the OS kernel.
Massively multithreaded applications may benefit from coalescing threads, so that the number of ready-to-run threads matches the number of logical CPU cores. Reducing everything to one thread may be going too far, though :v) and @Voo says that green threads are only a legacy technology. A good JVM should support true multithreading; check your configuration options. On the C++ side, there are libraries like Intel TBB and Apple GCD to help manage parallelism.
As far as I understand, the kernel has kernel threads for each core in a computer, and threads from userspace are scheduled onto these kernel threads (the OS decides which thread from an application gets connected to which kernel thread). Let's say I want to create an application that uses X cores on a computer with X cores. If I use regular pthreads, I think it would be possible for the OS to schedule all the threads I created onto a single core. How can I ensure that each thread maps one-to-one onto a kernel thread?
You should basically trust the kernel you are using (in particular, because there could be another heavy process running; the kernel scheduler will choose tasks to be run during a quantum of time).
Perhaps you are interested in CPU affinity, with non-portable functions like pthread_attr_setaffinity_np.
Your understanding is a bit off. "Kernel threads" on Linux are basically kernel tasks that are scheduled alongside other processes and threads. When the kernel's scheduler runs, the scheduling algorithm decides which process/thread, out of the pool of runnable threads, will be scheduled to run next on a given CPU core. As @Basile Starynkevitch mentioned, you can tell the kernel to pin individual threads from your application to a particular core, which means the operating system's scheduler will only consider running that thread on that core, along with other threads that are not pinned to a particular core.
In general with multithreading, you don't want your number of threads to be equal to your number of cores unless you're doing exclusively CPU-bound processing; otherwise you want number of threads > number of cores. When waiting for network or disk IO (i.e. when you're blocked in accept(2), recv(2), or read(2)), your thread is not considered runnable. If N threads > N cores, the operating system may be able to schedule a different thread of yours to do work while waiting for that IO.
What you mention is one possible model for implementing threading. But such a hierarchical model may not be followed at all by a given POSIX thread implementation. Since somebody already mentioned Linux: it doesn't have it. All threads are equal from the point of view of the scheduler there, and they compete for the same resources unless you specify something extra.
The last time I saw such a hierarchical model was on a machine running IRIX, a long time ago.
So in summary, there is no general rule under POSIX for that, you'd have to look up the documentation of your particular OS or ask a more specific question about it.
Goal
My goal is to better understand how concurrency works within a Java EE environment and how I can make better use of it.
General questions
Let's take a typical servlet container (Tomcat) as an example. It uses 1 thread to process each request. The thread pool is configured so that it can have at most 80 threads. Let's also take a simple webapp that does some processing and DB communication during each request.
At peak time I can see 80 threads running in parallel (plus several other infrastructure threads). Let's also assume I am running it on an 'm1.large' EC2 instance.
I don't think that all these threads can really run in parallel on this hardware. So the scheduler has to decide how best to split CPU time between them all. So the questions are: how big is the scheduler overhead in this case? How can I find the right balance between number of threads and processing speed?
Actors comparison
Having 80+ threads on a 4-core CPU doesn't sound healthy to me, especially if most of them are blocked on some kind of IO (DB, filesystem, socket): they just consume precious resources. What if we detach the request from the thread, keep only a reasonable number of threads (8, for instance), and just send processing tasks to them? Of course, in this case IO should also be non-blocking, so that I receive an event when some data that I need is available, and I send an event when I have some results.
As far as I understand, the actor model is all about this. Actors are not bound to threads (at least in Akka and Scala). So I have a reasonably sized thread pool and a bunch of actors with mailboxes that contain processing tasks.
Now the question is: how does the actor model compare to the traditional thread-per-request model in terms of performance, scheduler overhead, and resource (RAM, CPU) consumption?
Custom threads
I have some requests (only a few) that take too much time to process. I optimized the code and all the algorithms and added caches, but it still takes too much time. But I can see that the algorithm can be parallelized. It fits naturally in the actor model: I just split my big task into several tasks and then aggregate the results somehow (if needed). But in the thread-per-request model I need to spawn my own threads (or create my own small thread pool). As far as I know, that is not recommended practice within a Java EE environment. And, from my point of view, it doesn't fit naturally in the thread-per-request model. The question arises: how big should my thread pool be? Even if I make it reasonable in terms of hardware, I still have this bunch of threads managed by the servlet container. Thread management becomes decentralized and goes wild.
So my question is: what is the best way to deal with these situations in the thread-per-request model?
Having 80+ threads on a 4-core CPU doesn't sound healthy to me, especially if most of them are blocked on some kind of IO (DB, filesystem, socket): they just consume precious resources.
Wrong. Exactly in this scenario the processors can handle many more threads than the number of individual cores, since most of the threads at any point in time are blocked waiting for I/O. Fair enough, context switching takes time, but that overhead is usually irrelevant compared to file/network/DB latency.
The rule of thumb that the number of threads should be equal - or a little more than - the number of processor cores applies only for computation-intensive tasks when the cores are kept busy most of the time.
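To put rough numbers on this, a commonly cited sizing heuristic (not something stated in the answer above) is threads ≈ cores × (1 + wait time / compute time). A minimal sketch, where PoolSizing and suggestedThreads are hypothetical names and the timings are assumed to come from your own measurements:

```java
final class PoolSizing {
    // Heuristic only: threads ~= cores * (1 + wait / compute).
    // waitMs is the time a request spends blocked (DB, disk, network),
    // computeMs is the time it spends actually using the CPU.
    static int suggestedThreads(int cores, double waitMs, double computeMs) {
        if (cores < 1 || waitMs < 0 || computeMs <= 0) {
            throw new IllegalArgumentException("cores >= 1, waitMs >= 0, computeMs > 0 required");
        }
        return (int) Math.max(1, Math.round(cores * (1.0 + waitMs / computeMs)));
    }
}
```

With 4 cores and requests that compute for 10 ms but wait 190 ms on IO, suggestedThreads(4, 190, 10) gives 80, while a purely compute-bound load (no waiting) gives 4. Treat the result as a starting point for measurement, not as a guarantee.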
I have some requests (only a few) that take too much time to process. I optimized the code and all the algorithms and added caches, but it still takes too much time. But I can see that the algorithm can be parallelized. It fits naturally in the actor model: I just split my big task into several tasks and then aggregate the results somehow (if needed). But in the thread-per-request model I need to spawn my own threads (or create my own small thread pool). As far as I know, that is not recommended practice within a Java EE environment.
Never heard about that (but I don't claim to be the ultimate Java EE expert). IMHO there is nothing wrong with executing tasks associated with a single request in parallel using e.g. a ThreadPoolExecutor. Note that these threads are not request-handling threads, so they don't directly interfere with the thread pool used by the EJB container. They do of course compete for the same resources, so in a careless setup they may slow down or completely stall other request-processing threads.
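A minimal sketch of that idea follows: one shared, bounded ThreadPoolExecutor that a request handler uses to fan a big task out and gather the results. RequestFanOut, newWorkerPool, and sumOfSquares are hypothetical names, the summing work is a stand-in for the real algorithm, and the bounded queue with CallerRunsPolicy is one assumed way to keep a burst of requests from piling up unbounded work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class RequestFanOut {
    // One pool shared by all requests. When the queue is full, the submitting
    // (request) thread runs the task itself instead of queueing more work.
    static ThreadPoolExecutor newWorkerPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Splits the data into roughly `chunks` pieces, processes them in the
    // pool, and aggregates the partial results on the calling thread.
    static long sumOfSquares(ThreadPoolExecutor pool, int[] data, int chunks) throws Exception {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        List<Callable<Long>> tasks = new ArrayList<>();
        int chunkSize = (data.length + chunks - 1) / chunks;
        for (int start = 0; start < data.length; start += chunkSize) {
            final int from = start;
            final int to = Math.min(data.length, start + chunkSize);
            tasks.add(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += (long) data[i] * data[i];
                }
                return sum;
            });
        }
        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) { // waits for every chunk
            total += f.get();
        }
        return total;
    }
}
```

Creating the pool once and sharing it across requests keeps the total number of extra threads fixed, which addresses the worry about thread management becoming decentralized.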
what is the best way to deal with these situations in thread-per-request model?
In the end, you can't escape measuring concurrent performance and fine-tuning the size of your thread pool and other parameters for your own specific environment.
The whole point of Java EE is to put common architectural concerns like security, state, and concurrency into the framework and let you provide the bits of business logic or data mappings along with the wiring to connect them. As such, Java EE intentionally hides the nasty bits of concurrency (locking to read/write mutable state) in the framework.
This approach lets a much broader range of developers successfully write correct applications. A necessary side effect though is that these abstractions create overhead and remove control. That's both good (in making it simple and encoding policies as policies not code) and bad (if you know what you're doing and can make choices impossible in the framework).
It is not inherently bad to have 80 threads on a production box. Most will be blocked or waiting on I/O which is fine. There is a (tunable) pool of threads doing the actual computation and Java EE will give you external hooks to tune those knobs.
Actors are a different model. They also let you write islands of code (the actor body) that (can) avoid locking to modify state. You can write your actors to be stateless (capturing the state in the recursive function call parameters) or hide your state completely in an actor instance so the state is all confined (for react style actors you probably still need to explicitly lock around data access to ensure visibility on the next thread that runs your actor).
I can't say that one or the other is better. I think there is adequate proof that both models can be used to write safe, high-throughput systems. To make either perform well, you need to think hard about your problem and build apps that isolate parts of state and the computations on each kind of state. For code where you understand your data well and have a high potential for parallelism I think models outside Java EE make a lot of sense.
Generally, the rule of thumb in sizing compute-bound thread pools is that they should be approximately equal to the number of cores (N) + 2. Many frameworks size to that automatically. You can use Runtime.getRuntime().availableProcessors() to get N. If your problem decomposes into a divide-and-conquer style algorithm and the number of data items is large, I would strongly suggest checking out fork/join, which can be used now as a separate library and will be part of Java 7.
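For a feel of what divide-and-conquer looks like with fork/join, here is a small sketch using the java.util.concurrent classes that shipped with Java 7. The class name ForkJoinSum, the THRESHOLD value, and the array-summing task are all illustrative choices, not something taken from the answer.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

final class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this size, sum sequentially
    private final long[] data;
    private final int from;
    private final int to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // do the right half here, then combine
    }

    // Sums the array on a dedicated pool with the given parallelism.
    static long sum(long[] data, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new ForkJoinSum(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

A typical call would be ForkJoinSum.sum(values, Runtime.getRuntime().availableProcessors()); a real application would usually keep one long-lived pool rather than creating one per call.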
As far as how to manage this, you're not supposed to spawn threads as such inside Java EE (they want to control that) but you might investigate sending a request to your data-crunching thread pool via a message queue and handling that request via a return message. That can fit in the Java EE model (a bit clumsily of course).
I have a writeup of actors, fork/join, and some other concurrency models here that you might find interesting: http://tech.puredanger.com/2011/01/14/comparing-concurrent-frameworks/