pthread_create() fails (invalid argument) every 60 threads on Xeon Phi

I have a piece of pthread code, shown below as the function thread(). It creates a number of threads (usually 240 on Xeon Phi and 16 on CPU) and then joins them.
If I call thread() only once, it works perfectly on both the CPU and the Xeon Phi. If I call it a second time, it still works on the CPU, but on the Xeon Phi pthread_create() fails for every 60th thread with error 22 (EINVAL, "Invalid argument").
For example, threads 0, 60, 120, and so on of the second run of thread(), which are the 241st, 301st, 361st, and so on threads created in the process, fail with error 22, while threads 1-59, 61-119, 121-179, and so on are created without problems.
Note that this problem happens only on the Xeon Phi.
I have checked the stack sizes and the arguments themselves, and the arguments are correct, but I could not find the cause.
void thread()
{
...
int i, rv;
cpu_set_t set;
arg_t args[nthreads];
pthread_t tid[nthreads];
pthread_attr_t attr;
pthread_barrier_t barrier;
rv = pthread_barrier_init(&barrier, NULL, nthreads);
if(rv != 0)
{
printf("Couldn't create the barrier\n");
exit(EXIT_FAILURE);
}
pthread_attr_init(&attr);
for(i = 0; i < nthreads; i++)
{
int cpu_idx = get_cpu_id(i,nthreads);
DEBUGMSG(1, "Assigning thread-%d to CPU-%d\n", i, cpu_idx);
CPU_ZERO(&set);
CPU_SET(cpu_idx, &set);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);
args[i].tid = i;
args[i].ht = ht;
args[i].barrier = &barrier;
/* assign part of relR to the next thread */
args[i].relR.num_tuples = (i == (nthreads-1)) ? numR : numRthr;
args[i].relR.tuples = relR->tuples + numRthr * i;
numR -= numRthr;
/* assign part of relS to the next thread */
args[i].relS.num_tuples = (i == (nthreads-1)) ? numS : numSthr;
args[i].relS.tuples = relS->tuples + numSthr * i;
numS -= numSthr;
rv = pthread_create(&tid[i], &attr, npo_thread, (void*)&args[i]);
if (rv)
{
printf("ERROR; return code from pthread_create() is %d\n", rv);
printf ("%d %s\n", args[i].tid, strerror(rv));
//exit(-1);
}
}
for(i = 0; i < nthreads; i++)
{
pthread_join(tid[i], NULL);
/* sum up results */
result += args[i].num_results;
}
}

Here's a minimal example to reproduce your problem and show where your code most likely goes wrong:
#define _GNU_SOURCE
#include <pthread.h>
#include <err.h>
#include <stdio.h>
void *
foo(void *v)
{
printf("foo\n");
return NULL;
}
int
main(int argc, char **argv)
{
pthread_attr_t attr;
pthread_t thr;
cpu_set_t set;
void *v;
int e;
if (pthread_attr_init(&attr))
err(1, "pthread_attr_init");
CPU_ZERO(&set);
CPU_SET(255, &set);
if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set))
err(1, "pthread_attr_setaffinity_np");
if ((e = pthread_create(&thr, &attr, foo, NULL)))
errx(1, "pthread_create: %d", e);
if (pthread_join(thr, &v))
err(1, "pthread_join");
return 0;
}
As I speculated in the comments to your question, pthread_attr_setaffinity_np doesn't check whether the CPU set is sane. Instead, the error is caught in pthread_create, which returns EINVAL (22). Since the get_cpu_id functions in your code on GitHub are obviously broken, that's where I'd start looking for the problem.
Tested on Linux, but that's where pthread_attr_setaffinity_np comes from, so it's probably a safe assumption.
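One way to catch a bad CPU index before it reaches pthread_create is to check it against the affinity mask the process is actually allowed to use. The following is only a minimal sketch, assuming Linux with glibc. The names cpu_is_allowed(), create_pinned() and noop_thread() are hypothetical and do not come from the question's code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `cpu` is in the set of CPUs this
 * process may run on, 0 otherwise (including when the query fails). */
int cpu_is_allowed(int cpu)
{
    cpu_set_t allowed;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return 0;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return 0;
    return CPU_ISSET(cpu, &allowed) ? 1 : 0;
}

/* Hypothetical helper: create a thread pinned to `cpu`.
 * Returns 0 on success or an error number such as EINVAL. */
int create_pinned(pthread_t *tid, int cpu, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    cpu_set_t set;
    int rv;

    /* Reject the index here instead of letting pthread_create fail later. */
    if (!cpu_is_allowed(cpu))
        return EINVAL;

    rv = pthread_attr_init(&attr);
    if (rv != 0)
        return rv;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    rv = pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    if (rv == 0)
        rv = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rv;
}

/* Trivial thread body used only to exercise create_pinned(). */
void *noop_thread(void *arg)
{
    return arg;
}
```

In the question's loop, a check like this would report the offending value returned by get_cpu_id() for the exact thread index at which it goes wrong, rather than a generic error 22 from pthread_create.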

Related

Integrating pthread_create() and pthread_join() in the same loop

I am new to multi-threaded programming and I am following this tutorial. In the tutorial, there is a simple example showing how to use pthread_create() and pthread_join(). My question: why can we not put pthread_join() in the same loop as pthread_create()?
Code for reference:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#define NUM_THREADS 2
/* create thread argument struct for thr_func() */
typedef struct _thread_data_t {
int tid;
double stuff;
} thread_data_t;
/* thread function */
void *thr_func(void *arg) {
thread_data_t *data = (thread_data_t *)arg;
printf("hello from thr_func, thread id: %d\n", data->tid);
pthread_exit(NULL);
}
int main(int argc, char **argv) {
pthread_t thr[NUM_THREADS];
int i, rc;
/* create a thread_data_t argument array */
thread_data_t thr_data[NUM_THREADS];
/* create threads */
for (i = 0; i < NUM_THREADS; ++i) {
thr_data[i].tid = i;
if ((rc = pthread_create(&thr[i], NULL, thr_func, &thr_data[i]))) {
fprintf(stderr, "error: pthread_create, rc: %d\n", rc);
return EXIT_FAILURE;
}
}
/* block until all threads complete */
for (i = 0; i < NUM_THREADS; ++i) {
pthread_join(thr[i], NULL);
}
return EXIT_SUCCESS;
}

I figured it out. I am writing the answer below for other users with the same question.
If we put pthread_join() in the same loop as pthread_create(), the calling thread (main()) waits for thread 0 to finish its work before creating thread 1. This forces the threads to execute sequentially rather than in parallel, which defeats the purpose of multi-threading.

Call join child pthread in main function

I have the test code:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
pthread_t th_worker, th_worker2;
void * worker2(void *data) {
for(int i = 0; i< 1000000; i++){
printf("thread for worker2----%d\n", i);
usleep(500);
}
}
void * worker(void *data){
pthread_create(&th_worker2, NULL, worker2, data);
for(int i = 0; i< 100; i++){
printf("thread for worker-----%d\n", i);
usleep(500);
}
}
void join(pthread_t _th){
pthread_join(_th, NULL);
}
In main(), if I call join(th_worker2):
int main() {
char* str = "hello thread";
pthread_create(&th_worker, NULL, worker, (void*) str);
/* problem in here */
join(th_worker2);
return 1;
}
--> Segmentation fault
Otherwise, if I call:
join(th_worker);
join(th_worker2);
--> OK
Why does the first case cause a segmentation fault?
Thanks for the help!

If you posted all your code, you have a race condition.
main is synchronized with the start of worker but not worker2.
That is, main is trying to join th_worker2 before worker has had a chance to invoke pthread_create and set up th_worker2 with a valid [non-null] value.
So, th_worker2 will be invalid until the second pthread_create completes, but that's already too late for main. It has already fetched th_worker2, which has a NULL value and main will segfault.
When you add the join for th_worker, it works because it guarantees synchronization and no race condition.
To achieve this guarantee without the join, have main do:
int
main()
{
char *str = "hello thread";
pthread_create(&th_worker, NULL, worker, (void *) str);
// give worker enough time to properly start worker2
while (! th_worker2)
usleep(100);
/* problem in here */
join(th_worker2);
return 1;
}
An even better way to do this is to add an extra variable. With this, the first loop is not needed [but I've left it in]:
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
int worker_running;
pthread_t th_worker;
int worker2_running;
pthread_t th_worker2;
void *
worker2(void *data)
{
// tell main we're fully functional
worker2_running = 1;
for (int i = 0; i < 1000000; i++) {
printf("thread for worker2----%d\n", i);
usleep(500);
}
return NULL;
}
void *
worker(void *data)
{
// tell main we're fully functional
worker_running = 1;
pthread_create(&th_worker2, NULL, worker2, data);
for (int i = 0; i < 100; i++) {
printf("thread for worker-----%d\n", i);
usleep(500);
}
return NULL;
}
void
join(pthread_t _th)
{
pthread_join(_th, NULL);
}
int
main()
{
char *str = "hello thread";
pthread_create(&th_worker, NULL, worker, (void *) str);
// give worker enough time to properly start worker2
// NOTE: this not necessarily needed as loop below is better
while (! th_worker2)
usleep(100);
// give worker2 enough time to completely start
while (! worker2_running)
usleep(100);
/* problem in here (not anymore!) */
join(th_worker2);
return 1;
}
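Polling plain int flags (or the pthread_t itself) from another thread is itself a data race, so a real synchronization primitive is safer than the loops above. The following is a minimal sketch, assuming POSIX unnamed semaphores are available. The semaphore worker2_created, the variable worker2_create_rc and the function run_workers() are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_t th_worker, th_worker2;
static sem_t worker2_created;   /* posted once th_worker2 has been written */
static int worker2_create_rc;   /* result of the inner pthread_create */

static void *worker2(void *data)
{
    (void)data;
    puts("worker2 running");
    return NULL;
}

static void *worker(void *data)
{
    /* Store the new thread ID first, then announce it to main. */
    worker2_create_rc = pthread_create(&th_worker2, NULL, worker2, data);
    sem_post(&worker2_created);
    puts("worker running");
    return NULL;
}

/* Hypothetical driver standing in for main(); returns 0 on success. */
int run_workers(void)
{
    int rc;

    if (sem_init(&worker2_created, 0, 0) != 0)
        return -1;
    if (pthread_create(&th_worker, NULL, worker, NULL) != 0) {
        sem_destroy(&worker2_created);
        return -1;
    }

    /* Sleep until worker has stored th_worker2, retrying on EINTR. */
    while (sem_wait(&worker2_created) != 0 && errno == EINTR)
        ;

    rc = worker2_create_rc;
    if (rc == 0)
        pthread_join(th_worker2, NULL);   /* th_worker2 is valid here */
    pthread_join(th_worker, NULL);
    sem_destroy(&worker2_created);
    return rc;
}
```

Because sem_post and sem_wait also synchronize memory, main is guaranteed to see the value that worker stored in th_worker2, with no sleeping loop.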

Pthread affinity before create threads

I need to set the affinity (thread to core, eg: 1st thread to 1st core) before creating a thread. Something like KMP_AFFINITY in OpenMP. Is it possible?
Edit:
I tried it this way, but it doesn't work:
void* DoWork(void* args)
{
int nr = (int)args;
printf("Thread: %d, ID: %lu, CPU: %d\n", nr, (unsigned long)pthread_self(), sched_getcpu());
}
int main()
{
int count = 8;
pthread_t threads[count];
pthread_attr_t attr;
cpu_set_t mask;
CPU_ZERO(&mask);
pthread_attr_init(&attr);
for (int i = 0; i < count ; i++)
CPU_SET(i, &mask);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &mask);
for(int i=0; i<count ; i++)
{
pthread_create(&threads[i], &attr, DoWork, (void*)i);
}
for(int i=0; i<count ; i++)
{
pthread_join(threads[i], NULL);
}
}

As mentioned in the other answers, you should use pthread_attr_setaffinity_np to bind a thread to a specific core. The number of CPU cores available in your system can be retrieved with sysconf(_SC_NPROCESSORS_ONLN) (see the code below).
Each time you create a thread with pthread_create, you have to pass a pthread_attr_t that has been set up with the appropriate cpu_set_t. Before adding the next CPU core to the set, you have to either clear the cpu_set_t or remove the previously added core (I chose the former). To determine exactly which CPU the thread runs on, the set must contain exactly one CPU when the thread is created (see the code below).
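#define _GNU_SOURCE /* needed for cpu_set_t, sched_getcpu() and pthread_attr_setaffinity_np() */
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>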
void* DoWork(void* args) {
printf("ID: %lu, CPU: %d\n", pthread_self(), sched_getcpu());
return 0;
}
int main() {
int numberOfProcessors = sysconf(_SC_NPROCESSORS_ONLN);
printf("Number of processors: %d\n", numberOfProcessors);
pthread_t threads[numberOfProcessors];
pthread_attr_t attr;
cpu_set_t cpus;
pthread_attr_init(&attr);
for (int i = 0; i < numberOfProcessors; i++) {
CPU_ZERO(&cpus);
CPU_SET(i, &cpus);
pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
pthread_create(&threads[i], &attr, DoWork, NULL);
}
for (int i = 0; i < numberOfProcessors; i++) {
pthread_join(threads[i], NULL);
}
return 0;
}

You can call pthread_self() to get thread id for your main thread and use that in pthread_setaffinity_np.
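For a thread that is already running (including the main thread), the affinity can be changed after the fact. This is a minimal sketch, assuming Linux with glibc. The helper name pin_self_to_cpu() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: pin the calling thread to exactly one CPU.
 * Returns 0 on success or an error number (for example EINVAL). */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return EINVAL;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);   /* exactly one CPU in the set */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Calling pin_self_to_cpu(0) at the top of main() would pin the main thread to CPU 0, provided that CPU is available to the process.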

You can use pthread_attr_setaffinity_np for setting affinity attributes for pthread_create function.

Idea Behind Recursive Mutex Lock

I'm working on a school lab and we are instructed to create a recursive mutex lock for a counting program. I've written some code (which doesn't work), but I think that this is mostly because I do not understand the real idea behind using a recursive mutex lock. Could anyone elaborate what a recursive mutex lock should do/look like?
General Note: I'm not asking for an answer, just some clarification as to what recursive mutex lock should do.
Also, if anyone is curious, here is the code required for this. The code that I am editing/implementing is the recmutex.c.
recmutex.h
#include <pthread.h>
/*
* The recursive_mutex structure.
*/
struct recursive_mutex {
pthread_cond_t cond;
pthread_mutex_t mutex; //a non-recursive pthread mutex
pthread_t owner;
unsigned int count;
unsigned int wait_count;
};
typedef struct recursive_mutex recursive_mutex_t;
/* Initialize the recursive mutex object.
*Return a non-zero integer if errors occur.
*/
int recursive_mutex_init (recursive_mutex_t *mu);
/* Destroy the recursive mutex object.
*Return a non-zero integer if errors occur.
*/
int recursive_mutex_destroy (recursive_mutex_t *mu);
/* The recursive mutex object referenced by mu shall be
locked by calling pthread_mutex_lock(). When a thread
successfully acquires a mutex for the first time,
the lock count shall be set to one and successfully return.
Every time a thread relocks this mutex, the lock count
shall be incremented by one and return success immediately.
And any other calling thread can only wait on the conditional
variable until being waked up. Return a non-zero integer if errors occur.
*/
int recursive_mutex_lock (recursive_mutex_t *mu);
/* The recursive_mutex_unlock() function shall release the
recursive mutex object referenced by mu. Each time the owner
thread unlocks the mutex, the lock count shall be decremented by one.
When the lock count reaches zero, the mutex shall become available
for other threads to acquire. If a thread attempts to unlock a
mutex that it has not locked or a mutex which is unlocked,
an error shall be returned. Return a non-zero integer if errors occur.
*/
int recursive_mutex_unlock (recursive_mutex_t *mu);
recmutex.c: contains the functions for the recursive mutex
#include <stdio.h>
#include <pthread.h>
#include <errno.h>
#include "recmutex.h"
int recursive_mutex_init (recursive_mutex_t *mu){
int err;
err = pthread_mutex_init(&mu->mutex, NULL);
if(err != 0){
perror("pthread_mutex_init");
return -1;
}else{
return 0;
}
return 0;
}
int recursive_mutex_destroy (recursive_mutex_t *mu){
int err;
err = pthread_mutex_destroy(&mu->mutex);
if(err != 0){
perror("pthread_mutex_destroy");
return -1;
}else{
return 1;
}
return 0;
}
int recursive_mutex_lock (recursive_mutex_t *mu){
if(mutex_lock_count == 0){
pthread_mutex_lock(&mu->mutex);
mu->count++;
mu->owner = pthread_self();
printf("%s", mu->owner);
return 0;
}else if(mutex_lock_count > 0){
pthread_mutex_lock(&mu->mutex);
mu->count++;
mu->owner = pthread_self();
return 0;
}else{
perror("Counter decremented incorrectly");
return -1;
}
}
int recursive_mutex_unlock (recursive_mutex_t *mu){
if(mutex_lock_count <= 0){
printf("Nothing to unlock");
return -1;
}else{
mutex_lock_count--;
pthread_mutex_unlock(&mu->mutex);
return 0;
}
}
count_recursive.cc: The counting program mentioned above. Uses the recmutex functions.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <unistd.h>
#include <assert.h>
#include <string.h>
#include "recmutex.h"
//argument structure for the thread
typedef struct _arg_{
int n1;
int n2;
int ntimes;
}Arg;
int count; //global counter
recursive_mutex_t mutex; //the recursive mutex
void do_inc(int n){
int ret;
if(n == 0){
return;
}else{
int c;
ret = recursive_mutex_lock(&mutex);
assert(ret == 0);
c = count;
c = c + 1;
count = c;
do_inc(n - 1);
ret = recursive_mutex_unlock(&mutex);
assert(ret == 0);
}
}
/* Counter increment function. It will increase the counter by n1 * n2 * ntimes. */
void inc(void *arg){
Arg * a = (Arg *)arg;
for(int i = 0; i < a->n1; i++){
for(int j = 0; j < a->n2; j++){
do_inc(a->ntimes);
}
}
}
int isPositiveInteger (const char * s)
{
if (s == NULL || *s == '\0' || isspace(*s))
return 0;
char * p;
int ret = strtol (s, &p, 10);
if(*p == '\0' && ret > 0)
return 1;
else
return 0;
}
int test1(char **argv){
printf("==========================Test 1===========================\n");
int ret;
//Get the arguments from the command line.
int num_threads = atoi(argv[1]); //The number of threads to be created.
int n1 = atoi(argv[2]); //The outer loop count of the inc function.
int n2 = atoi(argv[3]); //The inner loop count of the inc function.
int ntimes = atoi(argv[4]); //The number of increments to be performed in the do_inc function.
pthread_t *th_pool = new pthread_t[num_threads];
pthread_attr_t attr;
pthread_attr_init( &attr );
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
ret = recursive_mutex_init(&mutex);
assert(ret == 0);
printf("Start Test. Final count should be %d\n", num_threads * n1 * n2 * ntimes );
// Create threads
for(int i = 0; i < num_threads; i++){
Arg *arg = (Arg *)malloc(sizeof(Arg));
arg->n1 = n1;
arg->n2 = n2;
arg->ntimes = ntimes;
ret = pthread_create(&(th_pool[i]), &attr, (void * (*)(void *)) inc, (void *)arg);
assert(ret == 0);
}
// Wait until threads are done
for(int i = 0; i < num_threads; i++){
ret = pthread_join(th_pool[i], NULL);
assert(ret == 0);
}
if ( count != num_threads * n1 * n2 * ntimes) {
printf("\n****** Error. Final count is %d\n", count );
printf("****** It should be %d\n", num_threads * n1 * n2 * ntimes );
}
else {
printf("\n>>>>>> O.K. Final count is %d\n", count );
}
ret = recursive_mutex_destroy(&mutex);
assert(ret == 0);
delete [] th_pool;
return 0;
}
int foo(){
int ret;
printf("Function foo\n");
ret = recursive_mutex_unlock(&mutex);
assert(ret != 0);
return ret;
}
//test a thread call unlock without actually holding it.
int test2(){
int ret;
printf("\n==========================Test 2==========================\n");
pthread_t th;
pthread_attr_t attr;
pthread_attr_init( &attr );
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
ret = recursive_mutex_init(&mutex);
ret = pthread_create(&th, &attr, (void * (*)(void *))foo, NULL);
printf("Waiting for thread to finish\n");
ret = pthread_join(th, NULL);
assert(ret == 0);
return 0;
}
int main( int argc, char ** argv )
{
int ret;
count = 0;
if( argc != 5 ) {
printf("You must enter 4 arguments. \nUsage: ./count_recursive num_threads n1 n2 ntimes\n");
return -1;
}
if(isPositiveInteger(argv[1]) != 1 || isPositiveInteger(argv[2]) != 1 || isPositiveInteger(argv[3]) != 1 || isPositiveInteger(argv[4]) != 1 ){
printf("All the 4 arguments must be positive integers\n");
return -1;
}
test1(argv);
test2();
return 0;
}

The idea of a recursive mutex is that it can be successfully relocked by the thread that is currently holding the lock. For example:
if I had some mutexes like this (this is pseudocode):
mutex l;
recursive_mutex r;
In a single thread if I did this:
l.lock();
l.lock(); // this would hang the thread.
but
r.lock();
r.lock();
r.lock(); // these would all pass through with no issue.
When implementing a recursive mutex, you need to record which thread ID has locked it. If the mutex is already locked and the owner matches the current thread ID, increment the lock count and return success.

The point of a recursive mutex is to let you write this:
recursive_mutex_t rmutex;
void foo(...) {
recursive_mutex_lock(&rmutex);
...
recursive_mutex_unlock(&rmutex);
}
void bar(...) {
recursive_mutex_lock(&rmutex);
...
foo(...);
...
recursive_mutex_unlock(&rmutex);
}
void baz(...) {
...
foo(...);
...
}
The function foo() needs the mutex to be locked, but you want to be able to call it either from bar(), where the same mutex is already locked, or from baz(), where the mutex is not locked. If you used an ordinary mutex, the thread would self-deadlock when foo() is called from bar(), because the ordinary lock function will not return until the mutex is unlocked, and there is no other thread that will unlock it.
Your recursive_mutex_lock() needs to distinguish three cases: (1) the mutex is not locked, (2) the mutex is already locked, but the calling thread is the owner, and (3) the mutex is already locked by some other thread.
Case (3) needs to block the calling thread until the owner completely unlocks the mutex. At that point, it converts to case (1). Here's a hint: handle case (3) with a condition variable. That is to say, when the calling thread is not the owner, it should call pthread_cond_wait(...).
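The three cases can be sketched roughly as follows. This is one possible reading of the lab's recmutex.h, not the expected solution. It assumes the inner pthread mutex is only held briefly to protect the bookkeeping fields, and it returns EPERM for an invalid unlock, which is a choice made here rather than something the header prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Same layout as the struct in recmutex.h, repeated to keep this self-contained. */
typedef struct recursive_mutex {
    pthread_cond_t cond;
    pthread_mutex_t mutex;      /* non-recursive, guards the fields below */
    pthread_t owner;
    unsigned int count;         /* 0 means unlocked */
    unsigned int wait_count;    /* threads blocked in case (3) */
} recursive_mutex_t;

int recursive_mutex_init(recursive_mutex_t *mu)
{
    int err = pthread_mutex_init(&mu->mutex, NULL);
    if (err != 0)
        return err;
    err = pthread_cond_init(&mu->cond, NULL);
    if (err != 0) {
        pthread_mutex_destroy(&mu->mutex);
        return err;
    }
    mu->count = 0;
    mu->wait_count = 0;
    return 0;
}

int recursive_mutex_lock(recursive_mutex_t *mu)
{
    pthread_t self = pthread_self();
    int err = pthread_mutex_lock(&mu->mutex);
    if (err != 0)
        return err;

    if (mu->count > 0 && pthread_equal(mu->owner, self)) {
        mu->count++;                        /* case (2): relock by the owner */
    } else {
        mu->wait_count++;
        while (mu->count > 0)               /* case (3): held by another thread */
            pthread_cond_wait(&mu->cond, &mu->mutex);
        mu->wait_count--;
        mu->owner = self;                   /* case (1): take ownership */
        mu->count = 1;
    }
    return pthread_mutex_unlock(&mu->mutex);
}

int recursive_mutex_unlock(recursive_mutex_t *mu)
{
    int err = pthread_mutex_lock(&mu->mutex);
    if (err != 0)
        return err;

    /* Unlocking an unlocked mutex, or one owned by another thread, is an error. */
    if (mu->count == 0 || !pthread_equal(mu->owner, pthread_self())) {
        pthread_mutex_unlock(&mu->mutex);
        return EPERM;
    }
    if (--mu->count == 0 && mu->wait_count > 0)
        pthread_cond_signal(&mu->cond);     /* wake one waiter */
    return pthread_mutex_unlock(&mu->mutex);
}
```

The while loop around pthread_cond_wait matters: a woken thread must re-check count, because another thread may have taken the lock in between, and condition variables can wake spuriously.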

C++ Using semaphores instead of busy waiting

I am attempting to learn about semaphores and multi-threading. The example I am working with creates 1 to t threads with each thread pointing to the next and the last thread pointing to the first thread. This program allows each thread to sequentially take a turn until all threads have taken n turns. That is when the program ends. The only problem is in the tFunc function, I am busy waiting until it is a specific thread's turn. I want to know how to use semaphores in order to make all the threads go to sleep and waking up a thread only when it is its turn to execute to improve efficiency.
int turn = 1;
int counter = 0;
int t, n;
struct tData {
int me;
int next;
};
void *tFunc(void *arg) {
struct tData *data;
data = (struct tData *) arg;
for (int i = 0; i < n; i++) {
while (turn != data->me) {
}
counter++;
turn = data->next;
}
}
int main (int argc, char *argv[]) {
t = atoi(argv[1]);
n = atoi(argv[2]);
struct tData td[t];
pthread_t threads[t];
int rc;
for (int i = 1; i <= t; i++) {
if (i == t) {
td[i].me = i;
td[i].next = 1;
}
else {
td[i].me = i;
td[i].next = i + 1;
}
rc = pthread_create(&threads[i], NULL, tFunc, (void *)&td[i]);
if (rc) {
cout << "Error: Unable to create thread, " << rc << endl;
exit(-1);
}
}
for (int i = 1; i <= t; i++) {
pthread_join(threads[i], NULL);
}
pthread_exit(NULL);
}

Use a mutex and a condition variable. Here's a working example:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
int turn = 1;
int counter = 0;
int t, n;
struct tData {
int me;
int next;
};
pthread_mutex_t mutex;
pthread_cond_t cond;
void *tFunc(void *arg)
{
struct tData *data;
data = (struct tData *) arg;
pthread_mutex_lock(&mutex);
for (int i = 0; i < n; i++)
{
while (turn != data->me)
pthread_cond_wait(&cond, &mutex);
counter++;
turn = data->next;
printf("%d goes (turn %d of %d), %d next\n", data->me, i+1, n, turn);
pthread_cond_broadcast(&cond);
}
pthread_mutex_unlock(&mutex);
return NULL;
}
int main (int argc, char *argv[]) {
t = atoi(argv[1]);
n = atoi(argv[2]);
struct tData td[t + 1];
pthread_t threads[t + 1];
int rc;
pthread_mutex_init(&mutex, NULL);
pthread_cond_init(&cond, NULL);
for (int i = 1; i <= t; i++)
{
td[i].me = i;
if (i == t)
td[i].next = 1;
else
td[i].next = i + 1;
rc = pthread_create(&threads[i], NULL, tFunc, (void *)&td[i]);
if (rc)
{
printf("Error: Unable to create thread: %d\n", rc);
exit(-1);
}
}
void *ret;
for (int i = 1; i <= t; i++)
pthread_join(threads[i], &ret);
}

Use N+1 semaphores. On startup, thread i waits on semaphore i. When woken up, it "takes a turn" and signals semaphore i + 1.
The main thread spawns the N threads, signals semaphore 0, and waits on semaphore N.
Pseudo code:
sem s[N+1];
thread_proc (i):
repeat N:
wait (s [i])
do_work ()
signal (s [i+1])
main():
for i in 0 .. N:
spawn (thread_proc, i)
repeat N:
signal (s [0]);
wait (s [N]);

Have one semaphore per thread. Have each thread wait on its semaphore, retrying if sem_wait fails with EINTR. Once it's done with its work, have it post to the next thread's semaphore. This avoids the "thundering herd" behaviour of David's condition-variable solution, whose pthread_cond_broadcast wakes every waiting thread, by waking only one thread at a time.
Also notice that, since your semaphores will never have a value larger than one, you can use a pthread_mutex_t for this.
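The one-semaphore-per-thread idea can be sketched like this. It is a minimal illustration, assuming POSIX unnamed semaphores, 0-based thread indices and a fixed upper bound MAX_THREADS. The names run_ring(), ring_thread() and wait_turn() are hypothetical. Semaphores are used here because each one is posted by a different thread than the one that waits on it.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_THREADS 64

static sem_t sems[MAX_THREADS];  /* sems[i] is posted when it is thread i's turn */
static int counter;
static int n_threads;
static int n_turns;

/* Sleep until the semaphore is posted, retrying if interrupted by a signal. */
static void wait_turn(sem_t *s)
{
    while (sem_wait(s) != 0 && errno == EINTR)
        ;
}

static void *ring_thread(void *arg)
{
    int me = (int)(intptr_t)arg;
    int next = (me + 1) % n_threads;

    for (int i = 0; i < n_turns; i++) {
        wait_turn(&sems[me]);    /* no busy waiting */
        counter++;               /* only the thread holding the turn runs this */
        sem_post(&sems[next]);   /* wake exactly one thread */
    }
    return NULL;
}

/* Hypothetical driver: t threads take n turns each; returns the final counter. */
int run_ring(int t, int n)
{
    pthread_t threads[MAX_THREADS];

    if (t < 1 || t > MAX_THREADS || n < 0)
        return -1;
    n_threads = t;
    n_turns = n;
    counter = 0;

    /* Thread 0 starts with the turn; every other thread starts blocked. */
    for (int i = 0; i < t; i++)
        sem_init(&sems[i], 0, i == 0 ? 1 : 0);

    for (int i = 0; i < t; i++) {
        /* Error handling is abbreviated: give up if a thread cannot be created. */
        if (pthread_create(&threads[i], NULL, ring_thread, (void *)(intptr_t)i) != 0)
            exit(EXIT_FAILURE);
    }
    for (int i = 0; i < t; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < t; i++)
        sem_destroy(&sems[i]);

    return counter;
}
```

With this layout, run_ring(t, n) plays the role of the question's main(), and the returned counter should equal t * n once every thread has taken its n turns.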
