Why don't threads seem to run in parallel in this code? - c++

This is the first time I am working with threads, so I am sorry if this is a bad question. Shouldn't the output consist of "randomized" mains and foos? What I get seems to be a column of foos and a column of mains.
#include <iostream>
#include <thread>
void foo() {
for (int i = 0; i < 20; ++i) {
std::cout << "foo" << std::endl;
}
}
int main(int argc, char** argv) {
std::thread first(foo);
for (int i = 0; i < 20; ++i) {
std::cout << "main" << std::endl;
}
first.join();
return 0;
}

There is an overhead to starting a thread, so in this simple example the output is completely unpredictable. Both for loops run for a very short time, so if the thread starts even a millisecond late, the two code segments execute one after the other instead of in parallel. But if the operating system schedules the thread first, the "foo" sequence shows up before the "main" sequence.
Insert some sleep calls into the thread and the main function to see whether they really run in parallel.
#include <iostream>
#include <thread>
#include <unistd.h>
void foo() {
for (int i = 0; i < 20; ++i) {
std::cout << "foo" << std::endl;
sleep(1);
}
}
int main(int argc, char** argv) {
std::thread first(foo);
for (int i = 0; i < 20; ++i) {
std::cout << "main" << std::endl;
sleep(1);
}
first.join();
return 0;
}
Using threads does not automatically guarantee parallel execution of code segments. If, for example, you have only one CPU in your system, execution is switched between all processes and threads, and code segments never run in parallel.
There is a good Wikipedia article about threads here. Read the section about "Multithreading" in particular.

After each cout, try to yield (std::this_thread::yield()). This may give any waiting thread a chance to run, although the behavior is implementation-dependent.

Related

Threads appear to run randomly.. Reliable only after slowing down the join after thread creation

I am observing strange behavior using pthreads. Note the following code -
#include <iostream>
#include <string>
#include <algorithm>
#include <vector>
#include <pthread.h>
#include <unistd.h>
typedef struct _FOO_{
int ii=0;
std::string x="DEFAULT";
}foo;
void *dump(void *x)
{
foo *X;
X = (foo *)x;
std::cout << X->x << std::endl;
X->ii+=1;
}
int main(int argc, char **argv)
{
foo X[2];
const char *U[2] = {"Hello", "World"};
pthread_t t_id[2];
int t_status[2];
/*initalize data structures*/
for(int ii=0; ii < 2; ii+=1){
X[ii].x=U[ii];
}
foo *p = X;
for(int ii=0; ii < 2; ii+=1){
t_status[ii] = pthread_create(&t_id[ii], NULL, dump, (void *)p);
std::cout << "Thread ID = " << t_id[ii] << " Status = " << t_status[ii] << std::endl;
p+=1;
}
//sleep(1); /*if this is left commented out, one of the threads do not execute*/
for(int ii=0; ii < 2; ii+=1){
std::cout << pthread_join(t_status[ii], NULL) << std::endl;
}
for(int ii=0; ii < 2; ii+=1){
std::cout << X[ii].ii << std::endl;
}
}
When I leave the sleep(1) call (between thread creation and join) commented out, I get erratic behavior: at random, only 1 of the 2 threads runs.
rajatkmitra#butterfly:~/mpi/tmp$ ./foo
Thread ID = 139646898239232 Status = 0
Hello
Thread ID = 139646889846528 Status = 0
3
3
1
0
When I uncomment sleep(1), both threads execute reliably.
rajatkmitra#butterfly:~/mpi/tmp$ ./foo
Thread ID = 140072074356480 Status = 0
Hello
Thread ID = 140072065963776 Status = 0
World
3
3
1
1
pthread_join() should hold up exit from the program until both threads complete, but in this example I am unable to get that to happen without the sleep() function. I really do not like the implementation with sleep(). Can someone tell me if I am missing something?
See Pete Becker's note:
pthread_join should be called with the thread id, not the status value that pthread_create returned. So: pthread_join(t_id[ii], NULL), not pthread_join(t_status[ii], NULL). Even better, since the question is tagged C++, use std::thread.
Pete Becker

using std::thread and CUDA together

I'm looking for a quick example of using std::thread and CUDA together. When using multiple host threads, does each host thread need to be assigned a certain number of GPU threads that don't overlap with those of the other host threads?
You can use std::thread and CUDA together.
There is no particular arrangement required for the association between threads and GPUs. You can have 1 thread manage all GPUs, one per GPU, 4 per GPU, all threads talk to all GPUs, or whatever you like. (There is no relationship whatsoever between GPU threads and host threads, assuming by GPU threads you mean GPU threads in device code.)
Libraries like CUFFT and CUBLAS may have certain expectations about handle usage, typically that you must not share a handle between threads, and handles are inherently device-specific.
Here's a worked example demonstrating 4 threads (one per GPU) followed by one thread dispatching work to all 4 GPUs:
$ cat t1457.cu
#include <thread>
#include <vector>
#include <iostream>
#include <cstdio>
__global__ void k(int n){
printf("hello from thread %d\n", n);
}
void thread_func(int n){
if (n >= 0){
cudaSetDevice(n);
k<<<1,1>>>(n);
cudaDeviceSynchronize();}
else{
cudaError_t err = cudaGetDeviceCount(&n);
for (int i = 0; i < n; i++){
cudaSetDevice(i);
k<<<1,1>>>(-1);}
for (int i = 0; i <n; i++){
cudaSetDevice(i);
cudaDeviceSynchronize();}}
}
int main(){
int n = 0;
cudaError_t err = cudaGetDeviceCount(&n);
if (err != cudaSuccess) {std::cout << "error " << (int)err << std::endl; return 0;}
std::vector<std::thread> t;
for (int i = 0; i < n; i++)
t.push_back(std::thread(thread_func, i));
std::cout << n << " threads started" << std::endl;
for (int i = 0; i < n; i++)
t[i].join();
std::cout << "join finished" << std::endl;
std::thread ta(thread_func, -1);
ta.join();
std::cout << "finished" << std::endl;
return 0;
}
$ nvcc -o t1457 t1457.cu -std=c++11
$ ./t1457
4 threads started
hello from thread 1
hello from thread 3
hello from thread 2
hello from thread 0
join finished
hello from thread -1
hello from thread -1
hello from thread -1
hello from thread -1
finished
$
Here's an example showing 4 threads issuing work to a single GPU:
$ cat t1459.cu
#include <thread>
#include <vector>
#include <iostream>
#include <cstdio>
__global__ void k(int n){
printf("hello from thread %d\n", n);
}
void thread_func(int n){
cudaSetDevice(0);
k<<<1,1>>>(n);
cudaDeviceSynchronize();
}
int main(){
const int n = 4;
std::vector<std::thread> t;
for (int i = 0; i < n; i++)
t.push_back(std::thread(thread_func, i));
std::cout << n << " threads started" << std::endl;
for (int i = 0; i < n; i++)
t[i].join();
std::cout << "join finished" << std::endl;
return 0;
}
$ nvcc t1459.cu -o t1459 -std=c++11
$ ./t1459
4 threads started
hello from thread 0
hello from thread 1
hello from thread 3
hello from thread 2
join finished
$

How to create a certain number of threads based on a value a variable contains?

I have an integer variable that contains the number of threads to execute. Let's call it myThreadVar. I want to execute myThreadVar threads, and cannot think of any way to do it without a ton of if statements. Is there any way I can create myThreadVar threads, no matter what myThreadVar is?
I was thinking:
for (int i = 0; i < myThreadVar; ++i) { std::thread t_i(myFunc); }, but that obviously won't work.
Thanks in advance!
Make an array or vector of threads, put the threads in, and then if you want to wait for them to finish have a second loop go over your collection and join them all:
std::vector<std::thread> myThreads;
myThreads.reserve(myThreadVar);
for (int i = 0; i < myThreadVar; ++i)
{
myThreads.push_back(std::thread(myFunc));
}
While other answers use vector::push_back(), I prefer vector::emplace_back(), which is possibly more efficient. Also use vector::reserve(). See it live here.
#include <thread>
#include <vector>
void func() {}
int main() {
int num = 3;
std::vector<std::thread> vec;
vec.reserve(num);
for (auto i = 0; i < num; ++i) {
vec.emplace_back(func);
}
for (auto& t : vec) t.join();
}
Obviously, the best solution is not to wait for the previous thread to finish; you need to run all of them in parallel.
In this case you can use the vector class to store all the instances and then join all of them.
Take a look at my example.
#include <thread>
#include <vector>
void myFunc() {
/* Some code */
}
int main()
{
int myThreadVar = 50;
std::vector<std::thread> threadsToJoin;
threadsToJoin.resize(myThreadVar);
for (int i = 0; i < myThreadVar; ++i) {
threadsToJoin[i] = std::thread(myFunc);
}
for (int i = 0; i < threadsToJoin.size(); i++) {
threadsToJoin[i].join();
}
}
#include <iostream>
#include <thread>
void myFunc(int n) {
std::cout << "myFunc " << n << std::endl;
}
int main(int argc, char *argv[]) {
int myThreadVar = 5;
for (int i = 0; i < myThreadVar; ++i) {
std::cout << "Launching " << i << std::endl;
std::thread t_i(myFunc,i);
t_i.detach();
}
}
g++ -std=c++11 -o 35106568 35106568.cpp
./35106568
Launching 0
myFunc 0
Launching 1
myFunc 1
Launching 2
myFunc 2
Launching 3
myFunc 3
Launching 4
myFunc 4
You need to store the thread so you can send it to join.
std::thread t[myThreadVar];
for (int i = 0; i < myThreadVar; ++i) { t[i] = std::thread(myFunc); }//Start all threads
for (int i = 0; i < myThreadVar; ++i) { t[i].join(); }//Wait for all threads to finish
I think this is valid syntax, but I'm more used to C, so I am unsure if I initialized the array correctly.

C++ Simple multithreading program memory leak

I wrote a simple program that should create 1000 threads, do some work, join them, and repeat everything 1000 times.
I have a memory leak with this piece of code and I don't understand why. I've been looking for a solution pretty much everywhere and can't find one.
#include <iostream>
#include <thread>
#include <string>
#include <windows.h>
#define NUM_THREADS 1000
std::thread t[NUM_THREADS];
using namespace std;
//This function will be called from the threads
void checkString(string str)
{
//some stuff to do
}
void START_THREADS(string text)
{
//Launch a group of threads
for (int i = 0; i < NUM_THREADS; i++)
{
t[i] = std::thread(checkString, text);
}
//Join the threads with the main thread
for (int i = 0; i < NUM_THREADS; i++) {
if (t[i].joinable())
{
t[i].join();
}
}
system("cls");
}
int main()
{
for(int i = 0; i < 1000; i++)
{
system("cls");
cout << i << "/1000" << endl;
START_THREADS("anything");
}
cout << "Launched from the main\n";
return 0;
}
I'm not sure about memory leaks, but you certainly have a memory error. You shouldn't be doing this:
delete &t[i];
t[i] was not allocated with new and it can't be deleted. You can safely remove that line.
As for memory consumption, you need to ask yourself whether you really need to spawn 1 million threads. Spawning threads isn't cheap, and it is unlikely that your platform will be able to run more than a handful of them concurrently.

Why hasn't the runtime halved when using multithreading in C++?

The code takes 866678 clock ticks when it is run with multiple threads. When I comment out the for loops in the threads (each for loop runs 10000 times) and instead run the whole for loop (20000 times) in main, the run time is the same as with threads. But ideally it should have been halved, right?
// thread example
#include <iostream> // std::cout
#include <thread> // std::thread
#include <time.h>
#include<cmath>
#include<unistd.h>
int K = 20000;
long int a[20000];
void makeArray(){
for(int i=0;i<K;i++){
a[i] = i;
}
}
void foo()
{
// do stuff...
std::cout << "FOOOO Running...\n";
for(int i=K/2;i<K;i++){
// a[i] = a[i]*a[i]*10;
// a[i] = exp(2/5);
int j = i*i;
usleep(2000);
}
}
void bar(int x)
{
// do stuff...
std::cout << "BARRRR Running...\n";
for(int i=0;i<K/2;i++){
//a[i] = a[i]*a[i];
int j = i*i;
usleep(2000);
}
}
void show(){
std::cout<<"The array is:"<<"\n";
for(int i=0; i <K;i++){
std::cout<<a[i]<<"\n";
}
}
int main()
{
clock_t t1,t2;
t1 = clock();
makeArray();
// show();
std::thread first (foo); // spawn new thread that calls foo()
std::thread second (bar,0); // spawn new thread that calls bar(0)
//show();
std::cout << "main, foo and bar now execute concurrently...\n";
// synchronize threads:
first.join(); // pauses until first finishes
second.join(); // pauses until second finishes
//show();
// for(int i=0;i<K;i++){
// int j = i*i;
// //a[i] = a[i]*a[i];
// usleep(2000);
// }
std::cout << "foo and bar completed.\n";
//show();
t2 = clock();
std::cout<<"Runtime:"<< (float)t2-(float)t1<<"\n";
return 0;
}
The problem is in your use of clock(). This function actually returns the total amount of CPU run time consumed by your program across all cores / CPUs.
What you are actually interested in is the wall clock time that it took for your program to complete.
Replace clock() with time(), gettimeofday() or something similar to get what you want.
EDIT - Here's the C++ way to do timers the way you want: http://www.cplusplus.com/reference/chrono/high_resolution_clock/now/