#include <cstdio>
#include <iostream>
#include <thread>
#include <unistd.h>
using namespace std;

void taskA() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskA: %d\n", i * i);
        fflush(stdout);
    }
}

void taskB() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskB: %d\n", i * i);
        fflush(stdout);
    }
}

void taskC() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskC: %d\n", i * i);
        fflush(stdout);
    }
}

void taskD() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskD: %d\n", i * i);
        fflush(stdout);
    }
}

void taskE() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskE: %d\n", i * i);
        fflush(stdout);
    }
}

void taskF() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskF: %d\n", i * i);
        fflush(stdout);
    }
}

void taskG() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskG: %d\n", i * i);
        fflush(stdout);
    }
}

void taskH() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskH: %d\n", i * i);
        fflush(stdout);
    }
}

int main(void) {
    thread t1(taskA);
    thread t2(taskB);
    thread t3(taskC);
    thread t4(taskD);
    thread t5(taskE);
    thread t6(taskF);
    thread t7(taskG);
    thread t8(taskH);

    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    t7.join();
    t8.join();

    return 0;
}
In this C++ program, I created 8 similar functions (taskA to taskH) and 8 threads, one for each. When I executed it, I got the output of all 8 functions in parallel. But my laptop has only 4 cores.
So the problem is: how is this happening? How can 4 cores run 8 threads in parallel? I don't understand it! Please explain what's happening inside.
Thanks for your explanation!
Each core can run 1 thread at a time, or 2 threads in parallel if hyperthreading is enabled. So, on a system with 4 cores installed, there can be 4 or 8 threads running in parallel, max.
However, even so, your app is not the only one running threads. Every running process has at least 1 thread, maybe more. And the OS itself has dozens, maybe hundreds, of threads running. So there are clearly far more threads in total than there are cores installed.
So, the OS has a built-in scheduler that is actively scheduling all of these running threads in such a way that cores will switch between threads at regular intervals, known as "time slices". This scheduling process is commonly known as "task switching".
This means that when a time slice on a core elapses, the core will temporarily pause the thread currently running on it, save that thread's state, and then resume an earlier paused thread for the next time slice, then pause and save that thread, switch to another thread, and so on, dozens/hundreds of times a second. Spread out over however many cores are installed.
Most systems are not real-time, so running more threads "in parallel" than there are cores is just an illusion: really just a lot of switching between threads as core time slices become available.
That is it, in a nutshell. Obviously, things are more complex in practical use. There are lengthy articles, research papers, even books, on this topic, if you really want to know the gritty implementation details.
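If you want to see this on your own machine, std::thread::hardware_concurrency() reports how many threads the hardware can truly run at once. Here is a minimal sketch (the thread count of 16 and the busy loop are arbitrary choices for illustration): even with far more threads than cores, every thread finishes, because the scheduler keeps rotating them through the available cores.
#include <cstdio>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // May return 0 if the value is not computable on this platform.
    unsigned cores = std::thread::hardware_concurrency();
    std::cout << "hardware threads: " << cores << "\n";

    // Deliberately start more threads than cores; the OS scheduler
    // time-slices them, so all of them still make progress and finish.
    std::vector<std::thread> pool;
    for (int id = 0; id < 16; ++id) {
        pool.emplace_back([id] {
            long long sum = 0;
            for (int i = 0; i < 1000000; ++i)
                sum += i;                          // CPU-bound busy work
            std::printf("thread %d done (sum = %lld)\n", id, sum);
        });
    }
    for (auto& t : pool)
        t.join();
}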
A bit about human physiology. Every person in the world has this thing called "reaction time": the time between an event actually occurring and your brain registering and reacting to it. It varies from person to person, but it is extremely rare for a human to have a reaction time below 100ms, and below 10ms is unheard of. This means that if two events happen one after another, but the time between them is less than 10ms, a human will think that these events happened simultaneously.
What does this have to do with threading and cores? A CPU can only run N threads in parallel, where N is the number of cores (well, technically it is more complicated due to features like hyperthreading, but let's forget about that for now). So when you fire up, say, 10*N threads, they cannot all run in parallel. It is technically impossible. What actually happens is that the operating system has an internal piece of code called the scheduler, which controls which thread runs on which core at any given moment. It jumps from one thread to another, so that every thread gets some CPU time and can actually make progress.
But you say "the outputs of all the functions come at the same time". No, they don't. The CPU processes billions of instructions per second, or equivalently one or more instructions every billionth of a second. The exact rate depends on what exactly the CPU is doing; printing to the screen, for example, takes much longer, but it can still print thousands or tens of thousands of characters in well under 10ms. Since that is below your reaction time, you only think it happened in parallel, while in reality it did not.
and I am also using a sleep of 1sec
Sleep is not an action; it is the lack of action, and as such it does not require the CPU. The operations of "falling asleep" and "waking up" need a little CPU time, but the waiting itself does not. And indeed, the waiting really does happen in parallel, regardless of how many threads you have.
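A small sketch of that point, assuming your system allows creating 100 threads (the count is arbitrary): all of the threads sleep for one second each, yet the whole program finishes in roughly one second of wall-clock time, because sleeping threads occupy no core.
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    auto start = std::chrono::steady_clock::now();

    // 100 threads that each "work" for one second by sleeping.
    std::vector<std::thread> pool;
    for (int i = 0; i < 100; ++i)
        pool.emplace_back([] {
            std::this_thread::sleep_for(std::chrono::seconds(1));
        });
    for (auto& t : pool)
        t.join();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    // Prints roughly 1000 ms even on a single-core machine, because a
    // sleeping thread does not occupy a core while it waits.
    std::cout << "elapsed: " << ms << " ms\n";
}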
Related
I met a very strange problem with a C++ multi-threaded program, which is as below.
#include <ctime>
#include <iostream>
#include <thread>
using namespace std;

int* counter = new int[1024];

void updateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        counter[position] = counter[position] + 8;
    }
}

int main() {
    clock_t begin, end;

    begin = clock();
    thread t1(updateCounter, 1);
    thread t2(updateCounter, 2);
    thread t3(updateCounter, 3);
    thread t4(updateCounter, 4);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    end = clock();
    cout << end - begin << endl;   // 1833

    begin = clock();
    thread t5(updateCounter, 16);
    thread t6(updateCounter, 32);
    thread t7(updateCounter, 48);
    thread t8(updateCounter, 64);
    t5.join();
    t6.join();
    t7.join();
    t8.join();
    end = clock();
    cout << end - begin << endl;   // 358
}
The first code block reports about 1833, but the second, which is almost the same as the first, reports about 358. Begging for an answer! Thank you!
Writing to nearby variables from multiple threads is slow due to "false sharing" which is described here: What is "false sharing"? How to reproduce / avoid it?
Your offsets of 16/32/48/64 are 64 bytes apart because the int values are (on most common platforms) 4 bytes each. And 64 bytes is a common cache line size, so this puts each target value on its own cache line.
The performance difference is not nearly as large if you compile with optimization. Which of course you should always do when measuring performance. But there's still a difference, and it may get worse the more threads you have.
Finally, your benchmark is unfair because you always run the "slow" code first. That means the code and data are "cold" for the first experiment and "hot" for the second experiment. This is a common mistake in benchmarking, and may even be the dominant factor in the performance difference you're seeing, depending on your system.
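For completeness, the usual cure is to give each hot per-thread variable its own cache line, for example via alignment and padding. Below is a rough sketch of the same experiment with padded counters; the 64-byte line size is an assumption, as discussed above (C++17 also offers std::hardware_destructive_interference_size as a portable hint).
#include <chrono>
#include <iostream>
#include <thread>

// Each counter gets its own (assumed 64-byte) cache line, so the four
// threads no longer invalidate each other's lines on every write.
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[4];

void update(int idx) {
    for (int i = 0; i < 100000000; ++i)
        counters[idx].value += 8;
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread t1(update, 0), t2(update, 1), t3(update, 2), t4(update, 3);
    t1.join(); t2.join(); t3.join(); t4.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << ms << " ms\n";   // compare against a version without alignas(64)
}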
I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through the lines of pixel data, then each pixel, and then the multiple images. It's a weird project, but anyway, I bind each thread to a method in the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding a thread for each as it goes, and when that loop is done, join_all takes effect once the threads finish. Then it moves on to the next pixel and starts over.
I've tested running 50 threads at the same time with this simple program:
#include <iostream>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

void run(int index) {
    for (int i = 0; i < 100; i++) {
        std::cout << "Index : " << index << " " << i << std::endl;
    }
}

int main() {
    boost::thread_group tGroup;
    for (int i = 0; i < 50; i++) {
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();

    int done;
    std::cin >> done;
    return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is: it takes about 4 seconds for one iteration of the sourceImageData (line) loop to complete. I'm new to boost threading, so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but one per scan-line, for example).
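To make the "few threads, many small tasks" idea concrete, here is a minimal sketch along those lines. sourceImageData, NUM_IMAGES and processPixel are dummy stand-ins for the poster's data and ClassX::ClassXFunction; the atomic counter acts as a trivial task queue that hands out one scan-line at a time to a fixed pool of workers.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-ins so the sketch compiles on its own.
std::vector<std::vector<int>> sourceImageData(100, std::vector<int>(640, 1));
const int NUM_IMAGES = 20;
void processPixel(int /*line*/, int /*pixel*/, int /*image*/) { /* real work goes here */ }

int main() {
    const unsigned numWorkers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<int> nextLine{0};                    // shared "task queue" index

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&] {
            // Each worker repeatedly claims the next unprocessed scan-line.
            for (int line = nextLine.fetch_add(1);
                 line < (int)sourceImageData.size();
                 line = nextLine.fetch_add(1)) {
                for (int pixel = 0; pixel < (int)sourceImageData[line].size(); ++pixel)
                    for (int im = 0; im < NUM_IMAGES; ++im)
                        processPixel(line, pixel, im);
            }
        });
    }
    for (auto& t : workers)
        t.join();                                    // join once, at the very end
    std::puts("done");
}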
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because you are basically pausing execution of any new work until ALL the threads that need to be synchronized (in this case, all the threads that are active) are done running.
If the iterations of the innermost loop (the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
#include <cmath>
#include <cstdio>
#include <ctime>
#include <pthread.h>

void* worker(void*)
{
    int clk = clock();
    float val = 0;
    for (int i = 0; i != 100000000; ++i)
    {
        val += sin(i);
    }
    printf("val: %f\n", val);
    printf("worker: %d ms\n", clock() - clk);
    return 0;
}

int main()
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);

    int clk = clock();
    float val = 0;
    for (int i = 0; i != 100000000; ++i)
    {
        val += sin(i);
    }
    printf("val: %f\n", val);
    printf("main: %d ms\n", clock() - clk);

    pthread_join(tid, 0);
    return 0;
}
Main thread and the worker thread are supposed to run equally fast, but the result is:
val: 0.782206
worker: 5017 ms
val: 0.782206
main: 8252 ms
The main thread is much slower; I don't know why...
Problem solved. It was the compiler's problem: GCC (MinGW) behaves weirdly on Windows.
I compiled the code in Visual Studio 2012 and there is no speed difference.
Main thread and the worker thread are supposed to run equally fast, but the result is:
I have never seen a threading system outside a real-time OS that provided such guarantees. With Windows threads and all the other threading systems I have used (POSIX threads, whatever the lightweight threading on Mac OS X is called, and C# threads) on desktop systems, it is my understanding that there are no performance guarantees in terms of how fast one thread will run in relation to another.
A possible explanation (speculation) could be that since you are using a modern quad-core, it could be raising the clock rate on the core running the main thread. When the workload is mostly single-threaded, modern i5/i7/AMD FX systems raise the clock rate of one core to a pre-rated level whose heat the stock cooling can dissipate. On more parallel workloads all the cores get a smaller bump in clock speed, again pre-rated based on heat dissipation, and when idle all of the cores are throttled down to minimize power usage. It is possible that the background work is mostly performed on a single core and the amount of time the second thread spends on the second core is not enough to justify switching to the mode where the speed of all the cores is boosted.
I would try again with 4 threads and 10x the workload. If you have a tool that monitors CPU load and clock-speeds I would check that. Using that information you can infer if I am right or wrong.
Another option might be profiling and seeing what part of the work is taking the time. It could be that the OS calls are taking more time than your workload.
You could also test your software on another machine with different performance characteristics such as steady clock-speed or single core. This would provide more information.
What could be happening is that the worker thread execution is being interleaved with main's execution, so that some of the worker thread's execution time is being counted against main's time. You could try putting a sleep(10) (some time larger than the run-time of the worker and of main) at the very beginning of the worker and run again.
I wrote a program that employs multithreading for parallel computing. I have verified that on my system (OS X) it maxes out both cores simultaneously. I just ported it to Ubuntu with no modifications needed, because I coded it with that platform in mind. In particular, I am running the Canonical HVM Oneiric image on an Amazon EC2 cluster compute 4x large instance. Those machines feature two Intel Xeon X5570 quad-core CPUs.
Unfortunately, my program does not accomplish multithreading on the EC2 machine. Running more than 1 thread actually slows the computing marginally for each additional thread. Running top while my program is running shows that when more than 1 thread is initialized, the system% of CPU consumption is roughly proportional to the number of threads. With only 1 thread, %sy is ~0.1. In either case user% never goes above ~9%.
The following are the threading-relevant sections of my code
const int NUM_THREADS = N; // where changing N is how I set the # of threads

void Threading::Setup_Threading()
{
    sem_unlink("producer_gate");
    sem_unlink("consumer_gate");
    producer_gate = sem_open("producer_gate", O_CREAT, 0700, 0);
    consumer_gate = sem_open("consumer_gate", O_CREAT, 0700, 0);

    completed = 0;
    queued = 0;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
}

void Threading::Init_Threads(vector<NetClass>* p_Pop)
{
    thread_list.assign(NUM_THREADS, pthread_t());
    for (int q = 0; q < NUM_THREADS; q++)
        pthread_create(&thread_list[q], &attr, Consumer, (void*) p_Pop);
}

void* Consumer(void* argument)
{
    std::vector<NetClass>* p_v_Pop = (std::vector<NetClass>*) argument;

    while (1)
    {
        sem_wait(consumer_gate);

        pthread_mutex_lock(&access_queued);
        int index = queued;
        queued--;
        pthread_mutex_unlock(&access_queued);

        Run_Gen((*p_v_Pop)[index - 1]);

        completed--;
        if (!completed)
            sem_post(producer_gate);
    }
}

main()
{
    ...
    t1 = time(NULL);

    threads.Init_Threads(p_Pop_m);

    for (int w = 0; w < MONTC_NUM_TRIALS; w++)
    {
        queued = MONTC_POP;
        completed = MONTC_POP;

        for (int q = MONTC_POP - 1; q > -1; q--)
            sem_post(consumer_gate);

        sem_wait(producer_gate);
    }

    threads.Close_Threads();

    t2 = time(NULL);
    cout << difftime(t2, t1);
    ...
}
OK, just a guess. There is a simple way to turn your parallel code into sequential code. For example:
thread_func:
    while (1) {
        pthread_mutex_lock(m1);
        // do something
        pthread_mutex_unlock(m1);
        ...
        pthread_mutex_lock(mN);
        pthread_mutex_unlock(mN);
    }
If you run such code in several threads, you will not see any speedup, because of the mutex usage. The code will run sequentially, not in parallel: only one thread will be doing work at any given moment.
The bad thing is that you may not use any mutex in your program explicitly and still end up in this situation. For example, a call to "malloc" may take a mutex somewhere in the C runtime, a call to "write" may take a mutex somewhere in the Linux kernel, and even a call to gettimeofday may lock and unlock a mutex (and on Linux/glibc it does).
You may have only one mutex, but if you spend a lot of time under it, it can cause this behaviour.
And because mutexes may be used somewhere in the kernel and in the C/C++ runtime, you can see different behaviour with different OSes.
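Here is a self-contained toy version of that effect, assuming nothing about the original program: both threads do all of their work while holding the same mutex, so the measured wall time is roughly the same as running the two calls back to back. Drop the lock_guard and the two threads can actually overlap.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
volatile long long sink = 0;

// All of the "work" happens while holding the shared mutex, so the two
// threads can only run one at a time, exactly as if the code were serial.
void serializedWork() {
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(m);
        for (int j = 0; j < 100000; ++j)
            sink = sink + j;
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::thread t1(serializedWork), t2(serializedWork);
    t1.join();
    t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << ms << " ms\n";   // roughly the same as running both calls sequentially
}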
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it seems it is not an issue of startup overhead.
NOTE: This code is quite buggy (indeed, it does not make much sense, as the output of the serial and parallel versions would differ). The intention was just to get a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

class Thread{
private:
    pthread_t thread;
    static void *thread_func(void *d){((Thread *)d)->run();}
public:
    Thread(){}
    virtual ~Thread(){}

    virtual void run(){}

    int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
    int wait(){return pthread_join(thread, NULL);}
};

#include <iostream>

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{
    class Thread_a:public Thread{
    public:
        Thread_a(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_b:public Thread{
    public:
        Thread_b(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_c:public Thread{
    public:
        Thread_c(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    {
        Thread *a=new Thread_a(arr);
        Thread *b=new Thread_b(arr);
        Thread *c=new Thread_c(arr);

        clock_t start = clock();

        if (a->start() != 0) {
            return 1;
        }
        if (b->start() != 0) {
            return 1;
        }
        if (c->start() != 0) {
            return 1;
        }

        if (a->wait() != 0) {
            return 1;
        }
        if (b->wait() != 0) {
            return 1;
        }
        if (c->wait() != 0) {
            return 1;
        }

        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;

        std::cout << duration << "seconds\n";

        delete a;
        delete b;
    }

    {
        clock_t start = clock();

        for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;

        std::cout << "serial: " << duration << "seconds\n";
    }

    return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
100000000*20/2.5s = 800MHz; the bus frequency is 1600 MHz, so I suspect, with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
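If you want to reproduce the two kinds of measurement side by side, here is a minimal sketch (Linux/POSIX; on older glibc you may need -lrt, as noted above). The busy loop is just a stand-in for the real work; in a single-threaded program the two numbers should be close, while with several busy threads the CPU time reported by clock() grows well beyond the wall time.
#include <stdio.h>
#include <time.h>

int main() {
    timespec wall_start, wall_end;
    clock_gettime(CLOCK_MONOTONIC, &wall_start);   // wall-clock time
    clock_t cpu_start = clock();                   // CPU time, summed over all threads

    volatile long long sink = 0;
    for (long long i = 0; i < 100000000; ++i)      // stand-in for the real work
        sink += i;

    clock_t cpu_end = clock();
    clock_gettime(CLOCK_MONOTONIC, &wall_end);

    double wall = (wall_end.tv_sec - wall_start.tv_sec) +
                  (wall_end.tv_nsec - wall_start.tv_nsec) / 1e9;
    double cpu = (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    printf("wall: %.3f s, cpu: %.3f s\n", wall, cpu);
}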
The only thing your threads do is add some elements, so your application should be memory-bound (the memory bus is the bottleneck). When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that of another thread in your cache. If your program were constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect that your slowdown is because a lot of memory references are having to be satisfied from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
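One straightforward way to avoid the out-of-bounds read (assuming a running sum over the array is the intent) is to start the inner loop at 1:
for(int n = 0; n<N; n++)
    for(int i=1; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}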
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section "All Sharing Is Bad -- Even of 'Unshared' Objects...".
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 800MB of virtual memory; the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches of the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core that's doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first. You need to be able to have redundancy and make sure that it's not eating into your performance, proper merging of data for processing the next batch of data, and so on...
But this is then only a theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.