When threads are created, do they run in parallel or concurrently? - c++

I’m learning about multi threading and I understand the difference between parallelism and concurrency. My question is, if I create a second thread and detach it from the parent thread, do these 2 run concurrently or in parallel?

Typically (for a modern OS) there's about 100 processes with maybe an average of 2 threads each for a total of 200 threads just for background services, GUI, software updaters, etc. You start your process (1 more thread) and it spawns its second thread, so now there's 202 threads.
Out of those 202 threads almost all of them will be blocked waiting for something to happen most of the time. If 30 threads are not blocked, and you have 8 CPUs, then 30 threads compete for those 8 CPUs.
If 30 threads compete for 8 CPUs, then maybe 4 threads are high priority and get a whole CPU for themselves and 10 threads are low priority and don't get any CPU time because there's more important work for CPUs to do; and maybe 12 threads are medium priority and share 4 CPUs by time multiplexing (frequently switching between threads). How this actually works depends on the OS and its scheduler (its very different for different operating systems).
Of course the number of threads and their priorities changes often (e.g. as threads are created and terminated), and threads block and unblock very often, so how many threads are competing for CPUs (not blocked) is constantly changing. Maybe there's 30 threads competing for 8 CPUs at one point in time, and 2 milliseconds later there's 5 threads and 3 CPUs are idle.
My question is, if I create a second thread and detach it from the parent thread, do these 2 run concurrently or in parallel?
Yes; your 2 threads may either run concurrently (share a CPU with time multiplexing) or run in parallel (on different CPUs); and can do both at different times (concurrently for a while, then parallel for a while, then..).

if I create a second thread and detach it...
Detach is not something your program can do to a thread. It's merely something the program can do to a std::thread object. The object is not the thread. The object is just a handle that your program can use to talk about the thread, and "detach" just grants permission for the program to destroy the handle, while the thread continues to run.
...from the parent thread
So, "detach" doesn't detach one thread from another (e.g., a "child" from its "parent,") It detaches a thread from its std::thread handle.
FYI: Most programming environments do not recognize any parent/child relationship between threads. If thread A creates thread B, that does not give thread A any special privileges or capabilities with respect to thread B that threads C, D, and E don't also have.
That's not to say that you can't recognize the relationship. It might mean something in your program. It just doesn't mean anything to the OS.
...Parallel or concurrently
That's not an either-or choice. Parallel is concurrent.
"Concurrency" does not mean "threads context switching." Rather, it's a statement about the order in which the threads access and update shared variables.
When two threads run concurrently on typical multiprocessing hardware, their actions will be serializable. That means that the outcome of the program will be the same as if some single, imaginary thread did all the same things that the real program threads did, and it did them one-by-one, in some particular order.
Two threads are concurrent with each other if the serialization is non-deterministic. That is to say, if the "particular order" in which the things all seem to happen is not entirely determined by the program. They are concurrent if different runs of the program can behave as if the imaginary single thread chose different serializations.
There's also a simpler test, which usually amounts to the same thing: Two threads probably are concurrent with each other if both threads are started before either one of them finishes.
Any way you look at it though, if two threads are truly running in parallel with each other, then they must also be concurrent with each other.
Do they run in parallel?
#Brendan already answered that one. TLDR: if the computer has more than one CPU, then they potentially can run in parallel. How much of the time they actually spend running in parallel depends on many things though, and many of those things are in the domain of the operating system.

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!
The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.
Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), this takes < 50 nanoseconds. 3 orders of magnitude! So context switch vx function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this can not happen on a single core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, and 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task signal. In that test, the thread we known as main starts the thread-sequencing with unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), Linux system monitor shows both cores are involved, and reports both cores at abut 60% utilization. I expected both cores at 100% .. don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank
you!
I suppose the most conventional use of std::mutex's is to serialize access to a memory structure (perhaps a shared-access storage or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes, both read and write access needs to be serialized. (See dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutex's on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin))
Devices, of various kinds, also might not function properly if you allow more than 1 thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke', before any additional action to the device should be applied. The device simply ignored my codes actions without the wait.
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the os provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC, and is the only thread to access the shared structure or device. When only 1 thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially out of the IPC.
Definition: collision
In the context of "thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block-and-wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
For unexpected example, the function 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated 1000's of new's at start up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start ups. Context switches!
b) Multiple threads can be more efficient, when you have multiple cores and no / or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be any where between a or b, when multiple cores and collisions.
For instance, my ram based "log" mechanisms seems to work well - one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.
Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
Threads theoretically run simultaneously, it means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at same time, which one value remains in i?
Mutex forces synchronous access to memory, inside a mutex block (mutex.lock & mutex.unlock) you warrant synchronous memory access and avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for mtx.unlock call.

How does incorporating multi-threading into C++ benefit the performance, and why?

I've learnt that multi-threading is when two or more parts of a program can run concurrently but what is the actual purpose of using multi-threading? Why and how does it benefit the performance of our program?
If I have a list of say 10000 items that require an operation on each. If I use a single thread for this operation on 10000 items, lets say it will take 8 seconds. If I use multithreading for this operation on a 4 core CPU, This means that I can divide the operation among 4 threads and the cost of the operation will now be the cost of running one 4th of the item which is 2500. The time taken now will be approximately 2 seconds. since each thread is running independently on 2500 of the items making 10000 items. In this way, multithreading can speed up your computation. I am not taking into consideration the cost of setting up the threads.
Another use of multithreading is to avoid thread blocking calls that take a long time to return. For example accept and connect in TCP socket programming. You can spin a new thread to handle that not for the sake of speed but to ensure that the main thread is not blocked.
The actual purpose of multithreading is nothing more than what you said: It is a way to let two or more parts of a program run concurrently.
There are many reasons why you might want two or more parts of the same program to run concurrently. One of those reasons is, it's a way to exploit the several CPUs of a multi-processor computer system: Different concurrent threads can execute in parallel with each other if they are assigned to different CPUs.

TBB thread pool unexpectedly increasing

We have a piece of code that utilises TBB to spawn tasks to perform some processing this is done using the following TBB code to initialise the TBB thread pool:
tbb::task_scheduler_init(8);
Then for each task we want to spawn we use the following code (where MainTask is derived from the tbb::task class):
task = new (tbb::task::allocate_root()) MainTask(theAction, theOutputData);
tbb::task::enqueue(*task);
When we run our code we start off with a thread pool that is the same as the number of cores (in our case 8 threads) as expected but as the program executes and spawns new TBB tasks, as described above, the number of threads at some random points suddenly increase. After 40 minutes of program execution the thread count increases from 8 to 15 between.
Why is this happening? Shouldn’t TBB keep the number of worker threads fixed to equal the number of cores?
As I said in another answer to you: Don't worry :-)
TBB does great job preventing actual over-subscription - only 8 threads will be active in your program at the same time. Though for various reasons, it needs more threads than hardware resources sometimes. One example is tbb::task_arena with no master slots reserved and another recent addition is tbb::global_control class which allows to change the number of active threads in the pool dynamically. Unfortunately, the way how TBB implements it leaves some space for the data race. It happens when some threads are on its way back to thread-pool to get some sleep while a new work arrives and requests all the 8 threads to start processing immediately; but that these threads in the intermediate state are not accounted in the thread-pool yet and new threads created instead.
TBB reduced the window for this data race as much as possible but to close it completely, a synchronization needed on the hot path which will affect general performance. Thus the decision was made to allow the data race and get less obstacles on the hot path.
But again, don't worry, there is no resource leak because TBB has hard limit for the maximum number of threads it can create this way. Depending on platform, this number varies somewhere from 2x to 4x (though it's internal implementation specifics which keep changing).
Though, I'm surprised that it goes that far with 15 threads created and I understand your concerns. TBB team will appreciate if you share a reproducer with them. You can contribute the reproducer through either TBB Forum or OSS site.

Parallel Thread Execution to achieve performance

I am little bit confused in multithreading. Actually we create multiple threads for breaking the main process to subprocess for achieving responsiveness and for removing waiting time.
But Here I got a situation where I have to execute the same task using multiple threads parallel.
And My processor can execute 4 threads parallel and so Will it improve the performance if I create more that 4 threads(10 or more). When I put this question to my colleague he is telling that nothing will happen we are already executing many threads in many other applications like browser threads, kernel threads, etc so he is telling to create multiple threads for the same task.
But if I create more than 4 threads that will execute parallel will not create more context switch and decrease the performance.
Or even though we create multiple thread for executing parallely the will execute one after the other so the performance will be the same.
So what to do in the above situations and are these correct?
edit
1 thread worked. time to process 120 seconds.
2 threads worked. time to process is about 60 seconds.
3 threads created. time to process is about 60 seconds.(not change to the time of 2 threads.)
Is it because, my hardware can only create 2 threads(for being dual)?
software thread=piece of code
Hardware thread=core(processor) for running software thread.
So my CPU support only 2 concurrent threads so if I purchase a AMD CPU which is having 8 cores or 12 cores can I achieve higher performance?
Multi-Tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only a part of the application can be worked in parallel (there is always a first part that splits up the work into multiple tasks). So the first question is: How much of the work can be done in parallel and how much of it needs to be synchronized (in some cases, you can stop here because so little can be done in parallel that the whole work isn't worth it).
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (read/write situation). Here we need to synchronize access to this data/resources. If all tasks needs write access to the same object during the WHOLE process, then we cannot work in parallel.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's states in a nutshell that the performance boost you receive from parallel execution is limited by your code that must run sequentially.
Without knowing your problem space here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context switch overhead by pinning threads to physical cores. This becomes more complicated when threads must wait for work (ie blocking on IO) but in general you want to keep your core as busy as possible running your program not switching out threads.
Unless you absolutely need to use threads and sync primitives try use a task scheduler or parallel algorithms library to parallelize your work. Examples would be Intel TBB, Thrust or Apple's libDispatch.

My threadspool just make 4~5threads. why?

I use QueueUserWorkItem() function to invoke threadpool.
And I tried lots of work with it. (about 30000)
but by the task manager my application only make 4~5 thread after I push the start button.
I read the MSDN which said that the default number of thread limitation is about 500.
why just a few of threads are made in my application?
I'm tyring to speed up my application and I dout this threadpool is the one of reason that slow down my application.
thanks
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine probably can run only two threads at the same time, dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two threads. The rest of them are in a queue, waiting for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that didn't complete. It makes the rough assumption that those threads are blocking and thus not making progress and allows another thread to activate. You've now got three running threads. Getting up the 500 threads, the default max number of threads, will take 249 seconds.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread. It should complete quickly and don't block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 of such threads is not possible, there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
I think you may have misunderstood the use of the threadpool. Spawning threads and killing threads involves the Windows Kernel and is an expensive operation. If you continuously need threads to perform an aynchronous operation and then you throw them away it would perform many system calls.
So the threadpool is actually a group of threads which are created once which instead of exiting when they complete their task actually enter a wait for another item for queueuserworkitem. The threadpool will then tune itself based on how many threads are required concurrently for your process. If you wish to test this write this code:
for(int i = 0; i < 30000; i++)
{
ThreadPool.QueueUserWorkItem(myMethod);
}
You will see this will create a whole bunch of threads. Maybe not 30000 as some of the threads that are created will be reused as the ThreadPool starts to work through your function calls.
The threadpool is there so you can avoid creating a thread for every asynchronous operation for the very reason that threads are expensive. If you want 30,000 threads you're going to use a lot of memory for the thread stacks plus waste a lot of CPU time doing context switches. Now creating that many threads would be justified if you had 30,000 CPU cores...