I'm learning about multithreading and I understand the difference between parallelism and concurrency. My question is: if I create a second thread and detach it from the parent thread, do these two run concurrently or in parallel?
Typically (for a modern OS) there's about 100 processes with maybe an average of 2 threads each for a total of 200 threads just for background services, GUI, software updaters, etc. You start your process (1 more thread) and it spawns its second thread, so now there's 202 threads.
Out of those 202 threads almost all of them will be blocked waiting for something to happen most of the time. If 30 threads are not blocked, and you have 8 CPUs, then 30 threads compete for those 8 CPUs.
If 30 threads compete for 8 CPUs, then maybe 4 threads are high priority and get a whole CPU for themselves and 10 threads are low priority and don't get any CPU time because there's more important work for the CPUs to do; and maybe 12 threads are medium priority and share 4 CPUs by time multiplexing (frequently switching between threads). How this actually works depends on the OS and its scheduler (it's very different for different operating systems).
Of course the number of threads and their priorities changes often (e.g. as threads are created and terminated), and threads block and unblock very often, so how many threads are competing for CPUs (not blocked) is constantly changing. Maybe there's 30 threads competing for 8 CPUs at one point in time, and 2 milliseconds later there's 5 threads and 3 CPUs are idle.
My question is: if I create a second thread and detach it from the parent thread, do these two run concurrently or in parallel?
Yes; your 2 threads may either run concurrently (share a CPU with time multiplexing) or run in parallel (on different CPUs); and can do both at different times (concurrently for a while, then parallel for a while, then..).
if I create a second thread and detach it...
Detach is not something your program can do to a thread. It's merely something the program can do to a std::thread object. The object is not the thread. The object is just a handle that your program can use to talk about the thread, and "detach" just grants permission for the program to destroy the handle, while the thread continues to run.
...from the parent thread
So, "detach" doesn't detach one thread from another (e.g., a "child" from its "parent,") It detaches a thread from its std::thread handle.
FYI: Most programming environments do not recognize any parent/child relationship between threads. If thread A creates thread B, that does not give thread A any special privileges or capabilities with respect to thread B that threads C, D, and E don't also have.
That's not to say that you can't recognize the relationship. It might mean something in your program. It just doesn't mean anything to the OS.
...Parallel or concurrently
That's not an either-or choice. Parallel is concurrent.
"Concurrency" does not mean "threads context switching." Rather, it's a statement about the order in which the threads access and update shared variables.
When two threads run concurrently on typical multiprocessing hardware, their actions will be serializable. That means that the outcome of the program will be the same as if some single, imaginary thread did all the same things that the real program threads did, and it did them one-by-one, in some particular order.
Two threads are concurrent with each other if the serialization is non-deterministic. That is to say, if the "particular order" in which the things all seem to happen is not entirely determined by the program. They are concurrent if different runs of the program can behave as if the imaginary single thread chose different serializations.
There's also a simpler test, which usually amounts to the same thing: Two threads probably are concurrent with each other if both threads are started before either one of them finishes.
Any way you look at it though, if two threads are truly running in parallel with each other, then they must also be concurrent with each other.
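A small sketch of what that non-determinism looks like in practice: the mutex keeps the accesses serializable, but the resulting order is up to the scheduler and can differ from run to run.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> order;
    std::mutex m;
    auto record = [&](int id) {
        for (int i = 0; i < 3; ++i) {
            std::lock_guard<std::mutex> lock(m);   // accesses to the shared vector stay serializable
            order.push_back(id);
        }
    };
    std::thread a(record, 1), b(record, 2);
    a.join();
    b.join();
    for (int id : order) std::cout << id << ' ';   // e.g. "1 1 2 1 2 2" -- the serialization varies between runs
    std::cout << '\n';
}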
Do they run in parallel?
#Brendan already answered that one. TLDR: if the computer has more than one CPU, then they potentially can run in parallel. How much of the time they actually spend running in parallel depends on many things though, and many of those things are in the domain of the operating system.
We have a C++ program which, depending on the way the user configures it, may be CPU bound or IO bound. For the purpose of loose coupling with the program configuration, I'd like to have my thread pool automatically realize when the program would benefit from more threads (i.e. CPU bound). It would be nice if it realized when it was I/O bound and reduced the number of workers, but that would just be a bonus (i.e. I'd be happy with something that just automatically grows without automatic shrinkage).
We use Boost so if there's something there that would help we can use it. I realize that any solution would probably be platform specific, so we're mainly interested in Windows and Linux, with a tertiary interest in OS X or any other *nix.
Short answer: use distinct fixed-size thread pools for CPU-intensive operations and for IO. In addition to the pool sizes, further regulation of the number of active threads will be done by the bounded buffer (producer/consumer) that synchronizes the compute and IO steps of your workflow.
For compute- and data-intensive problems where the bottleneck is a moving target between different resources (e.g. CPU vs IO), it can be useful to make a clear distinction between two kinds of threads, particularly as a first approximation:
A thread that is created to use more CPU cycles ("CPU thread")
A thread that is created to handle an asynchronous IO operation ("IO thread")
More generally, threads should be segregated by the type of resources that they need. The aim should be to ensure that a single thread doesn't use more than one resource (e.g. avoiding switching between reading data and processing data in the same thread). When a thread uses more than one resource, it should be split, and the two resulting threads should be synchronized through a bounded buffer.
Typically there should be exactly as many CPU threads as needed to saturate the instruction pipelines of all the cores available on the system. To ensure that, simply have a "CPU thread pool" with exactly that many threads dedicated to computational work only. That would be boost::thread::hardware_concurrency() or std::thread::hardware_concurrency(), if that value can be trusted. When the application needs less, there will simply be unused threads in the CPU thread pool. When it needs more, the work is queued. Instead of a "CPU thread pool", you could use C++11 std::async, but you would need to implement a thread throttling mechanism with your selection of synchronization tools (e.g. a counting semaphore).
In addition to the "CPU thread pool", there can be another thread pool (or several other thread pools) dedicated to asynchronous IO operations. In your case, it seems that IO resource contention is potentially a concern. If that's the case (e.g. a local hard drive) the maximum number of threads should be carefully controlled (e.g. at most 2 read and 2 write threads on a local hard drive). This is conceptually the same as with CPU threads and you should have one fixed size thread pool for reading and another one for writing. Unfortunately, there will probably not be any good primitive available to decide on the size of these thread pools (measuring might be simple though, if your IO patterns are very regular). If resource contention is not an issue (e.g. NAS or small HTTP requests) then boost::asio or c++11 std::async would probably be a better option than a thread pool; in which case, thread throttling can be entirely left to the bounded-buffers.
I am new to Intel TBB. I am using concurrent_queue in order to achieve fine-grained parallelism in my project. I have a few doubts. This is how I am implementing it:
thread_fun(arguments) {
    while (concurrent_queue.try_pop(v))
        operation on v;   // each thread executes its own operation (separate v for each thread)
}

main() {
    for (..)
        concurrent_queue.push();   // fill the queue
    _beginthreadex(..);            // create 8 Win32 threads and pass concurrent_queue as an argument
}
I am explicitly specifying the number of threads. I read that TBB will create threads based on the processor core count. How can I achieve that, so that I don't need to create threads explicitly with the _beginthreadex function?
Am I achieving fine-grained parallelism by using the concurrent_queue?
What is meant by task-level parallelism? How do you achieve task-level parallelism with Intel TBB? I am popping elements from the queue. Is a pop operation considered a task? If so, all pop operations are considered different tasks. I am popping 8 elements at a time with 8 threads, which means I am achieving task-level parallelism. Am I correct?
If I increase the number of threads to 32 on a quad-core processor (supporting 8 hardware threads), how does the concurrent_queue work? Are only 8 threads executed concurrently on the queue, or are all 32 threads executed concurrently?
Please help me.
TBB has a lot of parts. Concurrent_queue is just a queue that can be safely accessed by multiple threads. Concurrent queue is not directly part of the "fine grain parallelism" or "task parallelism" part of TBB. (Although those parts can use concurrent_queue.)
You create fine-grained/task parallelism in TBB by using tbb::parallel_for. See the examples on that page or at this Stack Overflow question. When you use parallel_for, TBB will take care of all the thread creation and will also do dynamic load balancing.
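As a rough illustration (assuming a reasonably recent TBB; the loop body is just a placeholder computation):

#include <cstddef>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);
    tbb::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
        data[i] *= 2.0;   // TBB splits the index range into tasks and schedules them on its own worker threads
    });
}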
Concurrent_queue doesn't create any parallelism at all: it just allows multiple threads to safely (and relatively efficiently) access the same queue. It ensures there are no race conditions. In your example you are getting no more than 8 threads worth of concurrency. Actually: concurrent_queue is serializing accesses to the queue. Your parallelism is coming from your "operations on v". The accesses to the concurrent_queue should be thought of as overhead.
You are implementing a basic kind of task level parallelism in your example. I found this brilliant description of the difference between a task and a thread from a Stack Overflow answer by #Mitch Wheat:
A task is something you want doing.
A thread is one of possibly many workers who perform that task.
More technically: tasks must not block waiting for one another. Your tasks are the "operations on v" that you are performing. You are achieving task-level parallelism. The advantage of using something like tbb::parallel_for is that it (a) automatically starts the threads for you, (b) tries to automatically choose a number of threads most appropriate for the number of cores on your user's system, (c) is optimized to reduce the overhead of distributing tasks to threads, and (d) also does some pretty cool dynamic load balancing if your task operations aren't of equal length.
Let's distinguish between concurrency and parallelism. Your machine has 8 hardware threads, so you have a maximum CPU parallelism of 8. I would use the term concurrency to mean something like "the maximum parallelism you could achieve if you had a magical processor with an infinite number of cores." Typically the main reason for creating more threads than you have cores is that your threads are doing a lot of long-latency I/O operations. The operating system can switch to one of the extra threads to do CPU work while another thread is waiting for the disk/network.
As far as I understand, the kernel has kernelthreads for each core in a computer, and threads from userspace are scheduled onto these kernelthreads (the OS decides which thread from an application gets connected to which kernelthread). Let's say I want to create an application that uses X cores on a computer with X cores. If I use regular pthreads, I think it would be possible for the OS to schedule all the threads I created onto a single core. How can I ensure that each thread is one-to-one with the kernelthreads?
You should basically trust the kernel you are using (in particular, because there could be another heavy process running; the kernel scheduler will choose tasks to be run during a quantum of time).
Perhaps you are interested in CPU affinity, with non-portable functions like pthread_attr_setaffinity_np.
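For example, a minimal Linux-specific sketch (non-portable, and the choice of core 0 is arbitrary):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void* worker(void* arg) {
    (void)arg;          /* this thread can only ever be scheduled on core 0 */
    return NULL;
}

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow core 0 only */

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t t;
    pthread_create(&t, &attr, worker, NULL);
    pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}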
Your understanding is a bit off. 'kernelthreads' on Linux are basically kernel tasks that are scheduled alongside other processes and threads. When the kernel's scheduler runs, the scheduling algorithm decides which process/thread, out of the pool of runnable threads, will be scheduled to run next on a given CPU core. As #Basile Starynkevitch mentioned, you can tell the kernel to pin individual threads from your application to a particular core, which means the operating system's scheduler will only consider running them on that core, along with other threads that are not pinned to a particular core.
In general with multithreading, you don't want your number of threads to be equal to your number of cores unless you're doing exclusively CPU-bound processing; otherwise you want the number of threads to be greater than the number of cores. When waiting for network or disk IO (i.e. when you're blocked in an accept(2), recv(2), or read(2)), your thread is not considered runnable. If the number of threads exceeds the number of cores, the operating system may be able to schedule a different thread of yours to do work while waiting for that IO.
What you mention is one possible model for implementing threading, but such a hierarchical model may not be followed at all by a given POSIX thread implementation. Since somebody already mentioned Linux: it doesn't have such a hierarchy; all threads are equal from the scheduler's point of view there. They compete for the same resources unless you specify something extra.
The last time I saw such a hierarchical model was on a machine running IRIX, a long time ago.
So in summary, there is no general rule under POSIX for that, you'd have to look up the documentation of your particular OS or ask a more specific question about it.
I am developing a C++ application that needs to process a large amount of data. I am not in a position to partition the data so that multiple processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and work allocation among worker threads.
Thread management should include at least the functionality below.
1. Decide how many worker threads are required. We may need to provide a user-defined function to calculate the number of threads.
2. Create required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor healthiness of each worker thread.
Work allocation should include the functionality below.
1. Using callback functionality, the library should get a piece of work.
2. Allocate the work to available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.
Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's concurrency runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++0x this is std::thread::hardware_concurrency(). You can then use this in combination with information about the work to be done to pick a number of worker threads.
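For example (note that hardware_concurrency() is allowed to return 0 when the value cannot be determined, so a fallback is worth having):

#include <algorithm>
#include <thread>

unsigned pick_worker_count() {
    unsigned hw = std::thread::hardware_concurrency();   // may legitimately return 0
    return std::max(1u, hw);                              // fall back to at least one worker
}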
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
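A sketch of that single-shared-queue approach, with idle workers blocking on a condition variable as suggested above (names and structure are illustrative, not a production-ready pool):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimpleThreadPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stopping_ = false;
public:
    explicit SimpleThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m_);
                        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });   // idle workers just block here
                        if (stopping_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();   // run the task outside the lock
                }
            });
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
    ~SimpleThreadPool() {
        { std::lock_guard<std::mutex> lock(m_); stopping_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
};

Each worker takes the next task as soon as it finishes the previous one. Replacing the single tasks_ queue and lock with one queue per worker is the starting point for the work-stealing variant mentioned above.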
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.