Questions about multithreading (C++)

I am having trouble understanding some concepts of multithreading. I know the basic principles, but I am having trouble understanding when individual threads are dispatched to and run by cores.
I know that having multiple threads allows code to run in parallel. I think this would be a good addition to my archive extraction program, which could decompress blocks using multiple cores. It decompresses all of the files in a for loop, and I am hoping that each available core will work on a file.
Here are my questions:
Do I need to query or even consider the number of cores on a machine, or are running threads automatically scheduled onto free cores?
Can anyone show me an example of a for loop using threads, say one where each loop iteration calls a function on a different thread? I read that the ideal number of active threads is the number of cores. How do I know when a core is free? Should I check whether a thread has joined the main thread, and create a new thread when it has, to keep a certain number of threads running?
Am I overcomplicating things or are my questions indicative that I am not grasping the concepts?

If you're decompressing files then you'll probably want a bounded number of threads rather than one thread per file. Otherwise, if you're processing 1000 files you're going to create 1000 threads, which won't make efficient use of the CPU.
As you've mentioned, one approach is to create as many threads as there are cores, and this is a reasonable approach in your case, as decompression is reasonably CPU bound and therefore any threads you create will be active for most of their time slice. If your problem were IO bound, your threads would spend a lot of time waiting for IO to complete, and you could therefore spin up more threads than you have cores, within bounds.
For your application I'd probably look at spinning up one thread per core and having each thread process one file at a time. This will help keep your algorithm simple. If you had multiple threads working on one file, you'd have to synchronize between them to ensure that the blocks they processed were written out to the correct location in the uncompressed file, which would cause needless headaches.
C++11 includes a thread library which you can use to simplify working with threads.

No; you can use an API that keeps that detail transparent, for example POSIX threads on Linux (the pthread library).
This answer probably depends on what API you use, though many APIs share threading basics like mutexes. Here, however, is a pthreads example (since that's the only C/C++ threading API I know).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// Whatever other headers you need for your code.

#define MAX_NUM_THREADS 12

// Each thread will run this function.
void *worker( void *arg )
{
    // Do stuff here and it will be 'in parallel'.
    // Note: Threads can read from the same location concurrently
    // without issue, but writing to any shared resource that has not been
    // locked with, for example, a mutex, can cause pernicious bugs.

    // Call this when you're done.
    pthread_exit( NULL );
}

int main()
{
    // Each is a handle for one thread, with MAX_NUM_THREADS (12) in total.
    pthread_t myThreads[MAX_NUM_THREADS];

    // Create the worker threads.
    for(unsigned long i = 0; i < MAX_NUM_THREADS; i++)
    {
        // NULL thread attributes struct.
        // This initializes the threads with the default PTHREAD_CREATE_JOINABLE
        // attribute; we know a thread is finished when it joins, see below.
        pthread_create(&myThreads[i], NULL, worker, (void *)i);
    }

    void *status;
    // Wait for the threads to finish.
    for(unsigned int i = 0; i < MAX_NUM_THREADS; i++)
    {
        pthread_join(myThreads[i], &status);
    }

    // That's all, folks.
    pthread_exit(NULL);
}
Without too much detail, that's a pretty basic skeleton for a simple threaded application using pthreads.
Regarding your questions on the best way to go about applying this to your program:
I suggest one thread per file, using a Threadpool Pattern, and here's why:
A single thread per file is much simpler because there's no sharing, hence no synchronization. You can change the worker function to a decompressFile function, passing a filename each time you call pthread_create (see the sketch below). That's basically it; your thread pool pattern sort of falls into place here.
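A minimal sketch of that adaptation of the skeleton above, passing a filename to each worker. decompressFile here is a hypothetical stand-in with a placeholder body, and the file names are invented; in a real run with many files you would cap the number of concurrent threads at roughly the core count, as discussed.

#include <pthread.h>
#include <stdio.h>

// Hypothetical worker: one thread decompresses one whole file.
void *decompressFile(void *arg)
{
    const char *filename = (const char *)arg;
    printf("decompressing %s\n", filename);   // placeholder for the real work
    pthread_exit(NULL);
}

int main()
{
    const char *files[] = { "a.bin", "b.bin", "c.bin" };   // hypothetical inputs
    const int numFiles = sizeof(files) / sizeof(files[0]);
    pthread_t threads[numFiles];

    for (int i = 0; i < numFiles; i++)
        pthread_create(&threads[i], NULL, decompressFile, (void *)files[i]);

    for (int i = 0; i < numFiles; i++)
        pthread_join(threads[i], NULL);

    return 0;
}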
Multiple threads per file means synchronization, which means complexity because you have to manage access to shared resources. In order to speed up your algorithm, you'd have to isolate portions of it that can run in parallel. However, I would actually expect this method to run slower:
Imagine Thread A has File A open, and Thread B has File B open, but File A and File B are in completely different sectors of your disk. As your OS's scheduling algorithm switches between Thread A and Thread B, your hard drive has to spin like mad to keep up, making the CPU (hence your program) wait.

Since you are seemingly new to threading/parallelism, and you just want to get more performance out of multiple processors/cores, I suggest you look for libraries that deal with threading and allow you to enable parallelism without getting into thread management, work distribution, etc.
It sounds like all you need now is parallel loop execution. Nowadays there are plenty of C++ libraries that can ease this task for you, e.g. Intel's TBB, Microsoft's PPL, AMD's Bolt, Qualcomm's MARE, to name a few. You can compare licensing terms, supported platforms, and functionality, and make the choice that best fits your needs.
To be more specific and answer your questions:
1) Generally, you should have no need to know/consider the number of processors or cores. Choose a library that abstracts this detail away from you and your program. On the other hand, if you see that with default settings CPU is not fully utilized (e.g. due to a significant number of I/O operations), you may find it useful to ask for more threads, e.g. by multiplying the default by a certain factor.
2) A sketch of a for loop made parallel with tbb::parallel_for and C++11 lambda functions:
#include <tbb/tbb.h>
#include <vector>

void ParallelFoo( std::vector<MyDataType>& v ) {
    tbb::parallel_for( size_t(0), v.size(), [&](size_t i){
        Foo( v[i] );
    } );
}
Note that it is not guaranteed that each iteration is executed by a separate thread; but you should not actually worry about such details; all you need is available cores being busy with useful work.
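If the default degree of parallelism does leave cores idle (point 1 above), one way to ask for a different number of workers is to run the loop inside an explicitly sized arena. A minimal sketch, assuming a TBB version that provides tbb::task_arena, reusing Foo and MyDataType from the sketch above, and using a hypothetical oversubscription factor of 2:

#include <tbb/tbb.h>
#include <thread>
#include <vector>

void ParallelFooOversubscribed( std::vector<MyDataType>& v ) {
    // Hypothetical factor: request twice as many workers as hardware threads,
    // e.g. to compensate for time spent blocked on I/O.
    int workers = 2 * (int)std::thread::hardware_concurrency();
    if (workers <= 0) workers = 4;                 // fallback if unknown
    tbb::task_arena arena( workers );
    arena.execute( [&]{
        tbb::parallel_for( size_t(0), v.size(), [&](size_t i){
            Foo( v[i] );
        } );
    } );
}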
Disclaimer: I'm a developer of Intel's TBB library.

If you're on Windows, you could take a look at thread pools; a good description can be found here: http://msdn.microsoft.com/en-us/magazine/cc163327.aspx. An interesting feature of this facility is that it promises to manage the threads for you. It also selects the optimal number of threads depending on demand as well as on the available cores.

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
#include <iostream>
#include <mutex>
#include <thread>
using namespace std;

mutex mtx;

void hello(){
    mtx.lock();
    for(int i = 0; i < 10; i++){
        cout << "hello";
    }
    mtx.unlock();
}

void hi(){
    mtx.lock();
    for(int i = 0; i < 10; i++){
        cout << "hi";
    }
    mtx.unlock();
}

int main(){
    thread x(hello);
    thread y(hi);
    x.join();
    y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), takes less than 50 nanoseconds. Three orders of magnitude! So context switch vs. function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this cannot happen on a single-core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply to 1) increment a count and check a flag, 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task's signal. In that test, the thread known as main starts the thread sequencing by unlock()'ing one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), the Linux system monitor shows both cores are involved, and reports both cores at about 60% utilization. I expected both cores at 100% ... I don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank you!
I suppose the most conventional use of a std::mutex is to serialize access to a memory structure (perhaps a shared-access store or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes both read and write access need to be serialized. (See the dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutexes on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (Ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin).)
Devices of various kinds also might not function properly if you allow more than one thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke' before any additional action should be applied to the device. The device simply ignored my code's actions without the wait.
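As an illustration of that idea with just the standard library (not the vxWorks code, only a sketch): guard each output statement with a std::lock_guard so whole lines stay intact, while the two threads still make progress concurrently. The function names are made up for the example.

#include <iostream>
#include <mutex>
#include <thread>

std::mutex coutMutex;   // guards every use of std::cout

void sayHello()
{
    for (int i = 0; i < 10; i++) {
        std::lock_guard<std::mutex> guard(coutMutex);   // released at end of each iteration
        std::cout << "hello\n";
    }
}

void sayHi()
{
    for (int i = 0; i < 10; i++) {
        std::lock_guard<std::mutex> guard(coutMutex);
        std::cout << "hi\n";
    }
}

int main()
{
    std::thread x(sayHello);
    std::thread y(sayHi);
    x.join();
    y.join();
}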
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From Wikipedia: "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the OS provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC and is the only thread to access the shared structure or device. When only one thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially, out of the IPC.
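A rough sketch of that monitor idea using only standard C++ (a locked queue stands in for the IPC; the names Command, post and monitor are made up for the illustration). Note that the queue itself still needs a lock, but the shared value owned by the monitor does not.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One "monitor" thread is the only code that touches `total`; other threads
// request changes by posting commands to its inbox.
struct Command { int delta; bool stop; };

std::queue<Command> inbox;
std::mutex inboxMutex;
std::condition_variable inboxCv;

void post(Command c)
{
    { std::lock_guard<std::mutex> g(inboxMutex); inbox.push(c); }
    inboxCv.notify_one();
}

void monitor()
{
    long total = 0;                       // only this thread ever reads/writes it
    for (;;) {
        std::unique_lock<std::mutex> lk(inboxMutex);
        inboxCv.wait(lk, []{ return !inbox.empty(); });
        Command c = inbox.front(); inbox.pop();
        lk.unlock();                      // handle the request outside the queue lock
        if (c.stop) break;
        total += c.delta;
    }
    std::cout << "final total: " << total << "\n";
}

int main()
{
    std::thread owner(monitor);
    std::vector<std::thread> clients;
    for (int i = 0; i < 4; i++)
        clients.emplace_back([]{ for (int j = 0; j < 1000; j++) post({1, false}); });
    for (auto& c : clients) c.join();
    post({0, true});                      // tell the monitor to finish
    owner.join();
}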
Definition: collision
In the context of 'thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block and wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term 'critical section'.
When the shared resource is NOT currently in use, there is no collision. The lock() and unlock() cost almost nothing (compared to a context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when the 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
As an unexpected example, operator new accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated thousands of calls to new at start-up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start-ups. Context switches!
b) Multiple threads can be more efficient when you have multiple cores and no or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be anywhere between (a) and (b) when you have multiple cores and some collisions.
For instance, my RAM-based "log" mechanism seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. When debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique, but sometimes adding several log entries worked well.
Threads have at least two advantages over purely serial code.
1) Convenience in separating logically independent sequences of instructions. This is true even on a single-core machine, and gives you logical concurrency without necessarily parallelism. Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
2) Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, which means that threads could write to the same memory block at the same time. For example, if you have a global variable int i; and two threads try to write different values at the same time, which value remains in i?
A mutex forces serialized access to memory: inside a mutex block (between mutex.lock and mutex.unlock) you guarantee exclusive memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for the mtx.unlock call.
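To make the "which value remains in i" point concrete, here is a small sketch: two threads update one shared integer, and the lock_guard is what keeps the final value predictable. The names are invented for the example.

#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;          // shared by both threads
std::mutex counterMutex;

void add(int n)
{
    for (int i = 0; i < n; i++) {
        std::lock_guard<std::mutex> guard(counterMutex);  // remove this line and
        ++counter;                                        // the final value becomes unpredictable
    }
}

int main()
{
    std::thread a(add, 100000);
    std::thread b(add, 100000);
    a.join();
    b.join();
    std::cout << counter << "\n";   // 200000 with the mutex; a data race without it
}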

multithreading in a for loop c++

I am trying to create a program that lasts 100 seconds. This program will create a thread every 2 milliseconds. Each thread will do a job that takes, say, 20 ms to complete.
So ideally there will be around 10 threads running at any point in time. How should I approach this problem?
#include <chrono>
#include <thread>

void runJob(); // takes ~20 ms to complete

int main() {
    for (int i = 0; i < 50000; i++) {
        // create thread
        std::thread threadObj1(runJob);
        threadObj1.detach(); // without detach/join the thread's destructor terminates the program
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
}
The problem with this approach is that with only 20ms worth of computation for each thread, you are very likely to spend considerably more CPU time spawning and shutting down threads than doing the actual work in runJob.
On most operating systems, spawning threads is quite an expensive operation and can easily take several dozen milliseconds on its own. So for relatively short lived jobs like you have, it is much more desirable to reuse the same thread for multiple jobs. This has the additional benefit that you don't create more threads than your system can handle, that is you avoid thread oversubscription.
So a better approach would be to create an appropriate number of threads upfront and then schedule the different runJob instances to those existing threads. Since this can be quite challenging to do by hand with just the barebones C++11 thread library facilities, you should consider using a higher level facility for this. OpenMP, C++17 parallel algorithms and task-based parallelization libraries like Intel Threading Building Blocks are all good options to get something like this off the ground quickly.
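To make the "reuse the same threads" idea concrete, here is a minimal hand-rolled sketch: a fixed pool of workers fed from a queue, with main posting a job every 2 ms. The runJob body is just a placeholder, and in practice the libraries mentioned above handle this (plus load balancing and error handling) for you.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Placeholder for the asker's job; the real one takes ~20 ms of work.
void runJob() { std::this_thread::sleep_for(std::chrono::milliseconds(20)); }

int main()
{
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Fixed pool of workers, created once and reused for every job.
    unsigned numWorkers = std::thread::hardware_concurrency();
    if (numWorkers == 0) numWorkers = 4;              // fallback if unknown
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < numWorkers; i++) {
        pool.emplace_back([&] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return done || !jobs.empty(); });
                if (jobs.empty()) return;             // done and no work left
                std::function<void()> job = std::move(jobs.front());
                jobs.pop();
                lock.unlock();
                job();                                // run the job outside the lock
            }
        });
    }

    // Post one job every 2 ms (as in the question) instead of spawning a thread per job.
    for (int i = 0; i < 50000; i++) {
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push(runJob);
        }
        cv.notify_one();
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }

    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (std::thread& t : pool) t.join();
}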

What is the best way to read a lot of files simultaneously in C++?

I have a multicore processor (for instance, 8 cores) and I want to read a lot of files with a function int read_file(...) and do it using all cores effectively. Also, after executing read_file the returned value should be placed somewhere (maybe in a vector or queue).
I'm thinking about using async (from C++11) and future (for getting the result from read_file) with launch policy launch::async in a for loop over all files. But it creates a lot of threads during the execution, and reading some files can fail. Maybe I should use some guard on the number of threads created during this execution?
Reading files isn't CPU intensive. So you're focusing on the wrong thing. It's like asking how to use all the power of your car's engine effectively if you're going across the street.
I've written code and done benchmark studies to do exactly that. Storage sub-system configurations vary: e.g., someone may have files spread out over multiple disks, or on the same RAID device consisting of multiple disks. The best solution, in my opinion, is a combination of a powerful thread pool together with async I/O, tailored to the system configuration. For instance, the number of threads in the thread pool can be equal to the number of hardware threads; the number of boost::asio::io_service objects can be equal to the number of disks.
Async IO is usually done through an event based solution. You can use boost::asio, libevent, libuv etc.
I'm tempted to argue a Boost.Asio solution might be ideal.
The basic idea involves creating a thread pool that waits for tasks to arrive and queuing all your file reads into that pool.
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

boost::asio::io_service service;

//The work_ptr object keeps the calls to io_service.run() from returning immediately.
//We could get rid of the object by queuing the tasks before we construct the threads.
//The method presented here is (probably) faster, however.
std::unique_ptr<boost::asio::io_service::work> work_ptr = std::make_unique<boost::asio::io_service::work>(service);

std::vector<YOUR_FILE_TYPE> files = /*...*/;

//Our Thread Pool
std::vector<std::thread> threads;

//std::thread::hardware_concurrency() gets us the number of logical CPU cores.
//May be twice the number of physical cores, due to Hyperthreading/similar tech
for(unsigned int thread = 0; thread < std::thread::hardware_concurrency(); thread++) {
    threads.emplace_back([&]{service.run();});
}

//The basic functionality: We "post" tasks to the io_service.
std::vector<int> ret_vals;
ret_vals.resize(files.size());
for(size_t index = 0; index < files.size(); index++) {
    service.post([&files, &ret_vals, index]{ret_vals[index] = read_file(files[index], /*...*/);});
}

work_ptr.reset();

for(auto & thread : threads) {
    thread.join();
}

//At this time, all ret_vals values have been filled.
/*...*/
One important caveat though: Reading from the disk is orders of magnitude slower than reading from memory. The solution I've provided will scale to virtually any number of threads, but there's very little reason to believe that multithreading will improve the performance of this task, since you'll almost certainly be I/O-bottlenecked, especially if your storage medium is a traditional hard disk, rather than a Solid State Drive.
That isn't to say this is automatically a bad idea; after all, if your read_file function involves a lot of processing of the data (not just reading it), then the performance gains could be quite real. But I do suspect that your use case is a "premature optimization" situation, which is the death knell of programming productivity.

Possible ways of implementing a dynamic barrier in multithreaded programs

I read this in a paper...
Consequently, our tool only checkpoints a thread when it is executing
at a known safe point: kernel entry, kernel exit, or certain
interruptible sleeps in the kernel that we have determined to be safe.
The thread that initiates a multithreaded fork creates a barrier on
which it waits until all other threads reach a safe point. Once all
threads reach the barrier, the original thread creates the checkpoint,
then lets the other threads continue execution.
Now my question is: can anyone guess what kind of barrier the authors are talking about? How does a thread create a barrier and insert it dynamically into the other threads as well? Any working example would be highly appreciated.
EDITED
Please don't say use pthread_barrier_wait, because that is not the question. Here apparently the authors have a thread that inserts barriers into other threads dynamically. I want to know how?
The paper you're asking about appears to be "Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism". The paper mentions:
We modified the Linux kernel to implement our techniques.
and
We therefore created a new Linux primitive, called a multithreaded fork, that creates a child process with the same number of threads as its parent.
So when the paper says that
Respec only checkpoints a thread when it is executing at a known safe point: kernel entry, kernel exit, or certain interruptible sleeps in the kernel that we have determined to be safe. The thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point. Once all threads reach the barrier, the original thread creates the checkpoint, then lets the other threads continue execution.
I'd assume that among the modifications they made to the Linux kernel was logic that threads in the process being logged will 'enter' the barrier when they reach one of those "safe points" (I'd also assume only if there's been a 'multithreaded fork' issued to create the barrier). Since this is occurring in the kernel, it would be easy enough to implement a barrier - there's not really anything dynamic going on. The modified kernel has the barriers implemented at those strategic safe points.
I haven't really read the paper (just skimmed a few bits). It's not entirely clear to me what might happen if one or more threads is performing work that doesn't require entering the kernel for a long period of time - it appears that the system depends on the threads getting to those explicit safe points. So threads shouldn't dawdle in a CPU intensive loop for too long (which is probably not an issue for the vast majority of programs):
Note that the actual execution time of an epoch may be longer than the epoch interval due to our barrier implementation; a checkpoint cannot be taken until all threads reach the barrier.
Well considering that your question is tagged with linux and pthreads, I can only imagine that it's referring to pthread barriers:
pthread_barrier_init
pthread_barrier_wait
pthread_barrier_destroy
Here's an example:
#include <pthread.h>
#include <stdio.h>

pthread_barrier_t bar;
pthread_t th;

void* function(void*)
{
    printf("Second thread before the barrier\n");
    pthread_barrier_wait(&bar);
    printf("Second thread after the barrier\n");
    return NULL;
}

int main()
{
    printf("Main thread is beginning\n");
    pthread_barrier_init(&bar, NULL, 2);
    pthread_create(&th, NULL, function, NULL);
    pthread_barrier_wait(&bar);
    printf("Main thread has passed the barrier\n");
    pthread_join(th, NULL);
    pthread_barrier_destroy(&bar);
    return 0;
}
A barrier is a fairly standard synchronization primitive.
In basic terms, upon entering a barrier each thread is blocked until all relevant threads have reached the barrier, and then all are released.
I know you're asking about C/C++, but take a look at Java's CyclicBarrier as the concept is explained pretty well there.
Since you're asking about pthreads, take a look at pthread_barrier_init et al.
edit
But in this case, a thread seemingly dynamically inserts barriers in the other threads. How?
It is hard to answer this without some kind of context (e.g. the paper that you're reading).
The excerpt that you quote gives an impression that this is a description of some low-level tool, that either inserts hooks that get executed on certain events (probably in the context of the threads in question), or indeed operates in kernel mode. Either way, it's little wonder it can do what it says it can.
It doesn't seem to me that anyone is talking about a user thread dynamically inserting barriers into another user thread.
Hope I'm not too far off in my guessing of the context.
Simple: use the pthread_barrier_wait pthread API call.
See the man page for the details: http://linux.die.net/man/3/pthread_barrier_wait
An OS thread barrier is nothing more than some state in memory. If you can share that state among threads (by properly initializing the threads) then the threads can use that barrier.
Essentially the main thread does:
CreateAllThreads(&barrier);
StartAllThreads();
EnterBarrier(&barrier);
All other threads do:
RuntimeInitialize();
EnterBarrier(&barrier);
The above is only very rough pseudocode, for illustrative purposes.
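To make the "some state in memory" point concrete, here is a minimal sketch of such a barrier in C++11, built from a mutex and a condition variable. This is only an illustration of the primitive (and of the pseudocode above), not the kernel mechanism from the paper; the class and variable names are invented.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Minimal barrier: `count` threads must call wait() before any of them returns.
class Barrier {
public:
    explicit Barrier(int count) : threshold(count), remaining(count), generation(0) {}

    void wait()
    {
        std::unique_lock<std::mutex> lk(m);
        int gen = generation;
        if (--remaining == 0) {
            // Last thread to arrive releases everyone and resets for reuse.
            generation++;
            remaining = threshold;
            cv.notify_all();
        } else {
            cv.wait(lk, [&]{ return gen != generation; });
        }
    }

private:
    std::mutex m;
    std::condition_variable cv;
    int threshold, remaining, generation;
};

int main()
{
    Barrier barrier(4);
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) {
        threads.emplace_back([&, i] {
            std::printf("thread %d before the barrier\n", i);
            barrier.wait();                 // nobody passes until all four arrive
            std::printf("thread %d after the barrier\n", i);
        });
    }
    for (auto& t : threads) t.join();
}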