I'm building a NUMA-aware processor that binds to a given socket and accepts lambdas. Here is what I've done:
#include <numa.h>
#include <sched.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>
#include <vector>
using namespace std;

unsigned nodes = numa_num_configured_nodes();
unsigned cores = numa_num_configured_cpus();
unsigned cores_per_node = cores / nodes;
int main(int argc, char* argv[]) {
    putenv("OMP_PLACES=sockets(1)");
    cout << numa_available() << endl;  // returns 0
    numa_set_interleave_mask(numa_all_nodes_ptr);
    int size = 200000000;
    for (auto i = 0; i < nodes; ++i) {
        auto t = thread([&]() {
            // binding to given socket
            numa_bind(numa_parse_nodestring(to_string(i).c_str()));
            vector<int> v(size, 0);
            cout << "node #" << i << ": on CPU " << sched_getcpu() << endl;
#pragma omp parallel for num_threads(cores_per_node) proc_bind(master)
            for (auto i = 0; i < 200000000; ++i) {
                for (auto j = 0; j < 10; ++j) {
                    v[i]++;
                    v[i] *= v[i];
                    v[i] *= v[i];
                }
            }
        });
        t.join();
    }
}
However, all threads are running on socket 0. It seems numa_bind doesn't bind the current thread to the given socket. On the second iteration (NUMA node 1) the program prints node #1: on CPU 0, but it should be running on a CPU that belongs to node 1. So what's going wrong?
This works for me exactly as I expected:
#include <cassert>
#include <iostream>
#include <numa.h>
#include <omp.h>
#include <sched.h>
int main() {
    assert(numa_available() != -1);

    auto nodes = numa_num_configured_nodes();
    auto cores = numa_num_configured_cpus();
    auto cores_per_node = cores / nodes;

    omp_set_nested(1);
#pragma omp parallel num_threads(nodes)
    {
        auto outer_thread_id = omp_get_thread_num();
        numa_run_on_node(outer_thread_id);

#pragma omp parallel num_threads(cores_per_node)
        {
            auto inner_thread_id = omp_get_thread_num();
#pragma omp critical
            std::cout
                << "Thread " << outer_thread_id << ":" << inner_thread_id
                << " core: " << sched_getcpu() << std::endl;
            assert(outer_thread_id == numa_node_of_cpu(sched_getcpu()));
        }
    }
}
The program first creates 2 (outer) threads on my dual-socket server. Then it binds them to different sockets (NUMA nodes). Finally, it splits each outer thread into 20 (inner) threads, since each CPU has 10 physical cores and hyperthreading enabled.
All inner threads run on the same socket as their parent thread: on cores 0-9 and 20-29 for outer thread 0, and on cores 10-19 and 30-39 for outer thread 1. (sched_getcpu() returned the virtual core number from the range 0-39 in my case.)
Note that there is no C++11 threading, just pure OpenMP.
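For reference, I build and run it with something along these lines (the file name is mine; this assumes g++ with OpenMP support and libnuma installed):
$ g++ -std=c++11 -fopenmp numa_omp.cpp -o numa_omp -lnuma
$ ./numa_omp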
Related
I am observing strange behavior using pthreads. Note the following code -
#include <iostream>
#include <string>
#include <algorithm>
#include <vector>
#include <pthread.h>
#include <unistd.h>
typedef struct _FOO_ {
    int ii = 0;
    std::string x = "DEFAULT";
} foo;

void *dump(void *x)
{
    foo *X;
    X = (foo *)x;
    std::cout << X->x << std::endl;
    X->ii += 1;
}

int main(int argc, char **argv)
{
    foo X[2];
    const char *U[2] = {"Hello", "World"};
    pthread_t t_id[2];
    int t_status[2];

    /* initialize data structures */
    for (int ii = 0; ii < 2; ii += 1) {
        X[ii].x = U[ii];
    }

    foo *p = X;
    for (int ii = 0; ii < 2; ii += 1) {
        t_status[ii] = pthread_create(&t_id[ii], NULL, dump, (void *)p);
        std::cout << "Thread ID = " << t_id[ii] << " Status = " << t_status[ii] << std::endl;
        p += 1;
    }

    //sleep(1); /* if this is left commented out, one of the threads does not execute */

    for (int ii = 0; ii < 2; ii += 1) {
        std::cout << pthread_join(t_status[ii], NULL) << std::endl;
    }

    for (int ii = 0; ii < 2; ii += 1) {
        std::cout << X[ii].ii << std::endl;
    }
}
When I leave the sleep(1) call (between thread create and join) commented out, I get erratic behavior: seemingly at random, only one of the two threads runs.
rajatkmitra@butterfly:~/mpi/tmp$ ./foo
Thread ID = 139646898239232 Status = 0
Hello
Thread ID = 139646889846528 Status = 0
3
3
1
0
When I uncomment sleep(1), both threads execute reliably.
rajatkmitra@butterfly:~/mpi/tmp$ ./foo
Thread ID = 140072074356480 Status = 0
Hello
Thread ID = 140072065963776 Status = 0
World
3
3
1
1
pthread_join() should hold up exit from the program until both threads complete, but in this example I am unable to make that happen without the sleep() call. I really do not like the implementation with sleep(). Can someone tell me if I am missing something?
See Peter's note -
pthread_join should be called with the thread id, not the status value that pthread_create returned. So: pthread_join(t_id[ii], NULL), not pthread_join(t_status[ii], NULL). Even better, since the question is tagged C++, use std::thread. –
Pete Becker
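To illustrate Pete's suggestion, here is a minimal sketch of mine (not from the original post) that rewrites the program with std::thread; join() is called on the thread objects themselves, so no sleep() is needed:
#include <iostream>
#include <string>
#include <thread>

struct foo {
    int ii = 0;
    std::string x = "DEFAULT";
};

void dump(foo* X)
{
    std::cout << X->x << std::endl;
    X->ii += 1;
}

int main()
{
    foo X[2];
    const char* U[2] = {"Hello", "World"};
    for (int ii = 0; ii < 2; ii += 1)
        X[ii].x = U[ii];

    std::thread t[2];
    for (int ii = 0; ii < 2; ii += 1)
        t[ii] = std::thread(dump, &X[ii]);

    // join() blocks until the corresponding thread has finished, so no sleep() is needed.
    for (int ii = 0; ii < 2; ii += 1)
        t[ii].join();

    for (int ii = 0; ii < 2; ii += 1)
        std::cout << X[ii].ii << std::endl;  // prints 1 and 1
}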
I'm looking for a quick example of using std::thread and CUDA together. When using multiple host threads, does each host thread need to be assigned a set of GPU threads that doesn't overlap with the others?
You can use std::thread and CUDA together.
There is no particular arrangement required for the association between threads and GPUs. You can have one thread manage all GPUs, one thread per GPU, four threads per GPU, all threads talking to all GPUs, or whatever you like. (There is no relationship whatsoever between GPU threads and host threads, assuming that by GPU threads you mean GPU threads in device code.)
Libraries like CUFFT and CUBLAS may have certain expectations about handle usage, typically that you must not share a handle between threads, and handles are inherently device-specific.
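As an illustration of that point, here is a sketch of mine (not part of the original answer) of the common one-handle-per-thread pattern with cuBLAS: each host thread selects its device and creates its own handle (link with -lcublas):
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void worker(int dev)
{
    cudaSetDevice(dev);     // bind this host thread's CUDA context to one GPU
    cublasHandle_t handle;
    cublasCreate(&handle);  // handle is private to this thread and device
    // ... issue cuBLAS calls with `handle` here ...
    cublasDestroy(handle);
}

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> t;
    for (int i = 0; i < n; i++)
        t.emplace_back(worker, i);
    for (auto& th : t)
        th.join();
}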
Here's a worked example demonstrating 4 threads (one per GPU) followed by one thread dispatching work to all 4 GPUs:
$ cat t1457.cu
#include <thread>
#include <vector>
#include <iostream>
#include <cstdio>

__global__ void k(int n){
  printf("hello from thread %d\n", n);
}

void thread_func(int n){
  if (n >= 0){
    cudaSetDevice(n);
    k<<<1,1>>>(n);
    cudaDeviceSynchronize();
  }
  else {
    cudaError_t err = cudaGetDeviceCount(&n);
    for (int i = 0; i < n; i++){
      cudaSetDevice(i);
      k<<<1,1>>>(-1);
    }
    for (int i = 0; i < n; i++){
      cudaSetDevice(i);
      cudaDeviceSynchronize();
    }
  }
}

int main(){
  int n = 0;
  cudaError_t err = cudaGetDeviceCount(&n);
  if (err != cudaSuccess) {std::cout << "error " << (int)err << std::endl; return 0;}
  std::vector<std::thread> t;
  for (int i = 0; i < n; i++)
    t.push_back(std::thread(thread_func, i));
  std::cout << n << " threads started" << std::endl;
  for (int i = 0; i < n; i++)
    t[i].join();
  std::cout << "join finished" << std::endl;
  std::thread ta(thread_func, -1);
  ta.join();
  std::cout << "finished" << std::endl;
  return 0;
}
$ nvcc -o t1457 t1457.cu -std=c++11
$ ./t1457
4 threads started
hello from thread 1
hello from thread 3
hello from thread 2
hello from thread 0
join finished
hello from thread -1
hello from thread -1
hello from thread -1
hello from thread -1
finished
$
Here's an example showing 4 threads issuing work to a single GPU:
$ cat t1459.cu
#include <thread>
#include <vector>
#include <iostream>
#include <cstdio>

__global__ void k(int n){
  printf("hello from thread %d\n", n);
}

void thread_func(int n){
  cudaSetDevice(0);
  k<<<1,1>>>(n);
  cudaDeviceSynchronize();
}

int main(){
  const int n = 4;
  std::vector<std::thread> t;
  for (int i = 0; i < n; i++)
    t.push_back(std::thread(thread_func, i));
  std::cout << n << " threads started" << std::endl;
  for (int i = 0; i < n; i++)
    t[i].join();
  std::cout << "join finished" << std::endl;
  return 0;
}
$ nvcc t1459.cu -o t1459 -std=c++11
$ ./t1459
4 threads started
hello from thread 0
hello from thread 1
hello from thread 3
hello from thread 2
join finished
$
In the following example the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only about 5 seconds. Any ideas why? (I can assure you this still holds true if you do real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();

    for(int j=1; j<100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i=0; i<16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i=0; i<16; i++)
            {
                doNothing();
            }
        }
    }

    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);

    cout<<cppt<<endl;
    cout<<ompt<<endl;

    return 0;
}
OpenMP uses thread pools for its pragmas. Spinning up and tearing down threads is expensive; OpenMP avoids this overhead, so all it is doing is the actual work plus the minimal shared-memory shuttling of the execution state. In your std::thread code you are spinning up and tearing down a new set of 16 threads on every iteration.
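To make that concrete, here is a rough sketch of mine (not from the answer above) of what reusing threads buys you: create the 16 threads once and let each of them run all of the j iterations, instead of creating and joining 16 new threads on every iteration. It is not exactly equivalent to the OpenMP version (there is no per-iteration barrier), but it shows where the overhead goes:
#include <thread>
#include <vector>

void doNothing() {}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 16; i++)
        threads.emplace_back([] {
            for (int j = 1; j < 100000; ++j)
                doNothing();          // per-iteration work, no per-iteration thread creation
        });
    for (auto& t : threads)
        t.join();                     // 16 thread creations in total instead of 16 * 100000
}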
I tried the code from Choosing the right threading framework with a loop of 100 iterations, and it took 0.0727 ms with OpenMP, 0.6759 ms with Intel TBB, and 0.5962 ms with the C++ thread library.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}
This version appears to take less time than your original C++11 thread section.
This is the first time I am working with threads, so I am sorry if this is a bad question. Shouldn't the output consist of "randomly" interleaved mains and foos? What I get seems to be a column of foos and a column of mains.
#include <iostream>
#include <thread>

void foo() {
    for (int i = 0; i < 20; ++i) {
        std::cout << "foo" << std::endl;
    }
}

int main(int argc, char** argv) {
    std::thread first(foo);
    for (int i = 0; i < 20; ++i) {
        std::cout << "main" << std::endl;
    }
    first.join();
    return 0;
}
There is overhead in starting a thread, so in this simple example the output is completely unpredictable. Both for loops run for a very short time, so if the thread start is even a millisecond late, the two code segments execute sequentially instead of in parallel. But if the operating system schedules the thread first, the "foo" sequence shows up before the "main" sequence.
Insert some sleep calls into the thread and the main function to see whether they really run in parallel.
#include <iostream>
#include <thread>
#include <unistd.h>

void foo() {
    for (int i = 0; i < 20; ++i) {
        std::cout << "foo" << std::endl;
        sleep(1);
    }
}

int main(int argc, char** argv) {
    std::thread first(foo);
    for (int i = 0; i < 20; ++i) {
        std::cout << "main" << std::endl;
        sleep(1);
    }
    first.join();
    return 0;
}
Using threads does not automatically guarantee parallel execution of code segments: if you have, for example, only one CPU in your system, execution is switched between all processes and threads, and code segments never run in parallel.
There is a good Wikipedia article about threads; in particular, read the section about "Multithreading".
After each cout, try yielding; this may give any waiting thread a chance to run (although the effect is implementation dependent).
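For example, here is a small variation of the original program using std::this_thread::yield() (my sketch; whether the output actually interleaves more is still up to the scheduler):
#include <iostream>
#include <thread>

void foo()
{
    for (int i = 0; i < 20; ++i) {
        std::cout << "foo" << std::endl;
        std::this_thread::yield();   // hint to the scheduler to let another ready thread run
    }
}

int main()
{
    std::thread first(foo);
    for (int i = 0; i < 20; ++i) {
        std::cout << "main" << std::endl;
        std::this_thread::yield();
    }
    first.join();
    return 0;
}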
When I use OpenMP with reduction(+ : sum) and do the summation directly, without calling a function, the OpenMP version works fine.
#include <iostream>
#include <omp.h>
using namespace std;

int sum = 0;

void summation()
{
    sum = sum + 1;
}

int main()
{
    int i, sum;

    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();

    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();

    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();

    std::cerr << "Sum is=" << sum << std::endl;
}
But when I call the summation function, which updates a global variable, the OpenMP version takes even more time than the sequential version.
I would like to know the reason for this and what changes should be made.
The summation function doesn't use the variable you are reducing over: it updates the global sum, while the reduction operates on each thread's private copy of main's local sum. Fix it:
#include <iostream>
#include <omp.h>

void summation(int& sum) { sum++; }

int main()
{
    int sum = 0;

    #pragma omp parallel for reduction (+ : sum)
    for(int i = 0; i < 1000000000; ++i)
        summation(sum);

    std::cerr << "Sum is=" << sum << '\n';
}
The time taken to synchronize access to this one variable will far exceed anything you gain by using multiple cores: the threads will all be endlessly waiting on each other, because there is only one variable and only one core can access it at a time. This design is not capable of concurrency, and all the synchronization you are paying for just increases the run time.
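To see that contrast in practice, here is a small sketch of mine (not from the answer) that times a loop where every increment synchronizes on one shared variable against a reduction, where each thread accumulates privately and the results are combined once at the end:
#include <iostream>
#include <omp.h>

int main()
{
    const long long N = 100000000;
    long long shared_sum = 0;
    long long reduced_sum = 0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long long i = 0; i < N; ++i) {
        #pragma omp atomic           // every iteration contends for the same variable
        shared_sum += 1;
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel for reduction(+ : reduced_sum)
    for (long long i = 0; i < N; ++i)
        reduced_sum += 1;            // each thread updates its own private copy
    double t2 = omp_get_wtime();

    std::cout << "atomic:    " << shared_sum  << " in " << t1 - t0 << " s\n";
    std::cout << "reduction: " << reduced_sum << " in " << t2 - t1 << " s\n";
}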