I'm kinda new to pthreads and I'm trying to create a program that sorts 1 million randomly generated integers. I seem to have lost a bit of control over the threads. When run the first time, the code produces only a single thread, but on subsequent runs the threads multiply out of control. Since I don't know precisely where the problem lies, I've provided the code below.
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <iostream>

#define N 8             /* # of threads */
#define NUM_INTS 10000  //ideally should be able to sort 1,000,000

int int_list[NUM_INTS];

/* structure for array index
 * used to keep low/high end of sub arrays
 */
typedef struct Arr {
    int low;
    int high;
} ArrayIndex;

void merge(int low, int high) {
    int mid = (low+high)/2;
    int left = low;
    int right = mid+1;
    int list_b[high-low+1];
    volatile int i, cur = 0;

    while((left <= mid) && (right <= high)) {
        if (int_list[left] > int_list[right])
            list_b[cur++] = int_list[right++];
        else
            list_b[cur++] = int_list[right++];
    }
    while(left <= mid)
        list_b[cur++] = int_list[left++];
    while(right <= high)
        list_b[cur++] = int_list[left++];
    for (i = 0; i < (high-low+1) ; i++)
        int_list[low+i] = list_b[i];
}

void * mergesort(void *a){
    ArrayIndex *pa = (ArrayIndex *)int_list;
    int mid = (pa->low + pa->high)/2;
    ArrayIndex aIndex[N];
    pthread_t thread[N];

    aIndex[0].low = pa->low;
    aIndex[0].high = mid;
    aIndex[1].low = mid+1;
    aIndex[1].high = pa->high;

    if (pa->low >= pa->high)
        return 0;

    volatile int i;
    for(i = 0; i < N; i++)
        pthread_create(&thread[i], NULL, mergesort, &aIndex[i]);
    for(i = 0; i < N; i++)
        pthread_join(thread[i], NULL);

    merge(pa->low, pa->high);
    pthread_exit(NULL);
}

int main(){
    volatile int i;
    struct timeval start_time, end_time;

    srand(getpid());
    for(i=0; i<NUM_INTS; i++)
        int_list[i] = rand();

    ArrayIndex ai;
    ai.low = 0;
    ai.high = NUM_INTS/sizeof(int_list[0])-1;

    pthread_t thread;
    pthread_create(&thread, NULL, mergesort, &ai);
    pthread_join(thread, NULL);

    return 0;
}
gdb output:
(gdb) run
Starting program: /.../sort.o
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6fd5700 (LWP 25801)]
[Thread 0x7ffff6fd5700 (LWP 25801) exited]
Computation Time: 38006 micro-seconds.
[Inferior 1 (process 25797) exited normally]
(gdb) run
Starting program: /.../sort.o
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6fd5700 (LWP 25804)]
[New Thread 0x7ffff67d4700 (LWP 25805)]
[New Thread 0x7ffff5fd3700 (LWP 25806)]
[New Thread 0x7ffff57d2700 (LWP 25807)]
[New Thread 0x7ffff4fd1700 (LWP 25808)]
[New Thread 0x7fffef7fe700 (LWP 25811)]
[New Thread 0x7fffeeffd700 (LWP 25810)]
...
[New Thread 0x7ffeca6ec700 (LWP 26148)]
Program received signal SIGINT, Interrupt.
[Switching to Thread 0x7ffee8728700 (LWP 26088)]
__GI___nptl_create_event () at events.c:25
25 events.c: No such file or directory.
The problem is that you try to implement recursive divide-and-conquer parallelism by starting a new thread for each sub-problem, down to the point where a thread is given a single array item to "sort". This approach is just plain wrong for multiple reasons. To give you just one: sorting an array of 1 million items would require a million threads at the leaf calls of the recursion, and roughly another million across all the recursion levels above. Even if you introduce some grain size - a threshold after which the recursion becomes serial - the total number of threads would still likely be very big, unless your threshold is something like NUM_INTS/N.
Even setting that aside, your implementation has some bugs:
At each level of the recursion you start N threads, even though the work is split into just two halves; aIndex[i] is uninitialized for i > 1, so the corresponding threads receive garbage in their input parameter.
You cast int_list, which is an array of int, to a pointer to ArrayIndex; pa should be initialized from the function's a argument instead.
There are a few ways you might fix the design:
The simple one is to introduce a proper threshold after which the recursion becomes serial, as I said above (see the sketch after this list).
The more complex one - but also more generic and flexible - is to implement a pool of threads and a queue of tasks processed by those threads: when you split the given array into halves, you create two tasks, one for processing each half, and submit them to the work queue that the threads take work from. Note that for good performance you would still need some grain size so that each task has a sufficient amount of work, but that threshold can be much smaller than the one necessary to limit the number of threads.
The right one, especially for production code, is to use a library or parallel technology that has proper primitives for recursive parallelism, such as Intel's Threading Building Blocks (TBB) or Microsoft's Parallel Patterns Library (PPL).
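To make the first option concrete, here is a minimal sketch using std::thread rather than the question's pthread code; kGrainSize, the function names, and the threshold value are invented for the example and would need tuning:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <thread>
#include <vector>

const std::size_t kGrainSize = 10000;   // assumed cut-off; tune for your machine

// Sorts a[low, high). tmp must be the same size as a.
// Creates at most roughly NUM_INTS / kGrainSize extra threads in total.
void merge_sort(std::vector<int>& a, std::vector<int>& tmp,
                std::size_t low, std::size_t high) {
    if (high - low < 2) return;
    std::size_t mid = low + (high - low) / 2;
    if (high - low >= kGrainSize) {
        // Large sub-array: sort the left half in a new thread, the right half here.
        std::thread left(merge_sort, std::ref(a), std::ref(tmp), low, mid);
        merge_sort(a, tmp, mid, high);
        left.join();
    } else {
        // Small sub-array: plain serial recursion, no more threads.
        merge_sort(a, tmp, low, mid);
        merge_sort(a, tmp, mid, high);
    }
    // Merge the two sorted halves into tmp, then copy back.
    std::merge(a.begin() + low, a.begin() + mid,
               a.begin() + mid, a.begin() + high,
               tmp.begin() + low);
    std::copy(tmp.begin() + low, tmp.begin() + high, a.begin() + low);
}

int main() {
    std::vector<int> data(1000000);
    for (int& x : data) x = rand();
    std::vector<int> tmp(data.size());
    merge_sort(data, tmp, 0, data.size());
    return 0;
}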
See also these links (and, more generally, search for "parallel merge sort C++"):
performance problems in parallel mergesort C++
https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp
http://www.drdobbs.com/parallel/parallel-merge-sort/229400239
Related Question
I want to know whether it is possible to wait in the main thread without a while(1) loop.
I launch a few threads via std::async() and do numeric calculations on each thread. After I start the threads I want to collect the results, which I do with std::future<>::get().
My problem
To receive a result I call std::future::get(), which blocks the main thread until the calculation on that thread is done. This slows things down when one thread needs considerably more time than the following ones: I could already be doing some calculation with the finished results, and then do any further calculation once the slowest thread is done.
Is there a way to idle the main thread until ANY of the threads has finished? I have thought of a callback function that wakes the main thread up, but I still don't know how to idle main without either making it unresponsive for, say, a second or running a while(true) loop instead.
Current code
#include <iostream>
#include <future>

uint64_t calc_factorial(int start, int number);

int main()
{
    uint64_t n = 1;
    //The user entered number
    uint64_t number = 0;

    // get the user input
    printf("Enter number (uint64_t): ");
    scanf("%lu", &number);

    std::future<uint64_t> results[4];
    for (int i = 0; i < 4; i++)
    {
        // push to different cores
        results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
    }

    for (int i = 0; i < 4; i++)
    {
        //retrieve result...I don't want to wait here if one thread needs more time than usual
        n *= results[i].get();
    }

    // print n or the time needed
    return 0;
}

uint64_t calc_factorial(int start, int number)
{
    uint64_t n = 1;
    for (int i = start; i <= number; i += 4) n *= i;
    return n;
}
I prepared a code snippet that runs fine. I am using the GMP library for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have already compiled the GMP library on your PC for whatever reason, you can replace every uint64_t with mpz_class.
I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's going to generate the data. Then have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) to handle each result as it's produced, rather than getting stuck waiting on one future while another has already produced its result.
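For illustration, a minimal sketch of that approach; the BlockingQueue class and the worker bodies are invented for this example (real code would push the factorial partial products instead of i + 2):

#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }
    T pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    BlockingQueue<uint64_t> results;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&results, i] {
            // stand-in for calc_factorial(i + 2, number)
            results.push(uint64_t(i + 2));
        });
    }
    uint64_t n = 1;
    for (int i = 0; i < 4; ++i)
        n *= results.pop();   // handled in completion order, not launch order
    for (auto& t : workers) t.join();
    std::cout << n << '\n';
    return 0;
}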
I'm learning OpenMP in C++ using gcc 8.1.0 and MinGW64 (latest version as of this month), and I'm running into a weird debug error when my program encounters a segmentation fault.
I know the cause of the crash, attempting to create too many OpenMP threads (50,000), but it's the error itself that has me puzzled. I didn't compile gcc or MinGW64 from source, I just used the installers, and I'm on Windows.
Why is it looking for cygwin.S, and why use that file structure on Windows? My code and the error message from gdb are below.
I'm learning OpenMP in the process of programming a path tracer, and I think I have a workaround for the thread limit (using while (threads < runs) and letting OpenMP set the thread count automatically), but I am stumped as to the error. Is there a workaround or solution for this?
It works fine with ~10,000 threads. I know it's not actually creating 10,000 threads simultaneously, but it's what I was doing before I thought of the workaround.
Thank you for the heads up about rand() and thread safety. I ended up replacing my RNG code with some that appears to be working fine in OpenMP, and it's literally a night and day difference visually. I will try the other changes and report back. Thanks!
WOW! It runs so much faster and the image is artifact-free! Thank you!
Jadan Bliss
Final code:
#pragma omp parellel
for (j = options.height - 1; j >= 0; j--){
    for (i=0; i < options.width; i++) {

        #pragma omp parallel for reduction(Vector3Add:col)
        for (int s=0; s < options.samples; s++)
        {
            float u = (float(i) + scene_drand()) / float(options.width);
            float v = (float(j) + scene_drand()) / float(options.height);
            Ray r = cam.get_ray(u, v); // was: origin, lower_left_corner + u*horizontal + v*vertical);
            col += color(r, world, 0);
        }
        col /= real(options.samples);
        render.set(i,j, col);
        col = Vector3(0.0);
    }
}
Error:
Starting program: C:\Users\Jadan\Documents\CBProjects\learnOMP\bin\Debug\learnOMP.exe
[New Thread 22136.0x6620]
[New Thread 22136.0x80a8]
[New Thread 22136.0x8008]
[New Thread 22136.0x5428]

Thread 1 received signal SIGSEGV, Segmentation fault.
___chkstk_ms () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:126
126     ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
Here are some remarks on your code.
Using a huge number of threads will not bring you any gain and is the probable reason for your problems. Thread creation has a cost in both time and resources. The time cost means that thread creation will probably dominate your program's run time, and the parallel program will end up far slower than its sequential version. As for the resource cost, each thread has its own stack segment; its size is system dependent, but typical values are measured in MB. I do not know the characteristics of your system, but with 100000 threads this is probably why your code is crashing. I have no explanation for the message about cygwin.S, but after a stack overflow the behavior can be weird.
Threads are a means to parallelize code, and for data parallelism it is usually pointless to have more threads than the number of logical processors on your system. Let OpenMP set the thread count, but you can experiment later to tune this number (see the note after the code below).
Besides that, there are other problems.
rand() is not thread safe, as it uses a global state that will be modified concurrently by the threads. rand_r() is thread safe, as the state of the random generator is not global and can be stored in every thread.
You should not modify a shared variable like result without atomic access, as concurrent accesses by several threads can lead to unexpected results. Using an atomic modification for every value would be safe, but it is not very efficient: atomic accesses are expensive, and it is better to use a reduction that accumulates locally in every thread and performs a single atomic access at the end, as in the corrected version below.
#include <omp.h>
#include <iostream>
#include <random>
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        // per thread initialisation of rand_r seed.
        unsigned int rand_state=omp_get_thread_num()*time(NULL);
        // or whatever thread dependent seed

        #pragma omp for reduction(+:result)
        for(int i=0; i<runs; i++)
        {
            double d = double(rand_r(&rand_state))/double(RAND_MAX);
            result += d;
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}
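If you later want to experiment with the thread count, as mentioned above, a small sketch of the two usual ways to set it; the value 8 is an arbitrary example, not a recommendation, since the OpenMP default (one thread per logical processor) is usually a good starting point:

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(8);        // or run with: OMP_NUM_THREADS=8 ./a.out
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}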
So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>

using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;

    // Generate random value n
    srand(time(NULL));
    double n=rand()%6+1;

    // initialize the thread info with n value
    info->randomN=n;
    info->maxOfVals=n;
    info->minOfVals=n;

    // the depth of recursion should not be more than 3
    if(info->level > 3)
    {
        info->run = false;
    }

    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for(int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]) ;
        tt.detach();
    }

    // checks if any child threads are working
    bool anyRun = true;
    while(anyRun)
    {
        anyRun = false;
        for(int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }

    // once all child threads are done, we find their max and min value
    double maximum=1, minimum=6;
    for( int i=0;i<n;i++)
    {
        // cout<<childInfo[i].maxOfVals<<endl;
        if(childInfo[i].maxOfVals>=maximum)
            maximum=childInfo[i].maxOfVals;
        if(childInfo[i].minOfVals< minimum)
            minimum=childInfo[i].minOfVals;
    }
    info->maxOfVals=maximum;
    info->minOfVals=minimum;

    // we set the info->run value to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;

    srand(time(NULL));
    double n=rand()%6+1;
    cout<<"n is: "<<n<<endl;

    // initializing thread info
    info.randomN=n;
    info.maxOfVals=n;
    info.minOfVals=n;
    info.level = 1;
    info.run = true;

    std::thread t(ChildWork, &info) ;
    t.join();
    while(info.run);

    info.maxOfVals= max<unsigned long>(info.randomN,info.maxOfVals);
    info.minOfVals= min<unsigned long>(info.randomN,info.minOfVals);
    cout << "Max is: " << info.maxOfVals <<" and Min is: "<<info.minOfVals;
}
The code compiles with no errors, but when I execute it, it gives me this:
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error:
thread constructor failed: Resource temporarily unavailable
Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the level in the thread's info and then go on to create more threads regardless of the result of that check.
Within the for loop that spawns your new threads, you increment the level right before you spawn the actual thread. However, you increment it on a freshly created instance of ThreadInfo from ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0, so basically the level of each thread you spawn is 1.
In general, avoid using threads to achieve concurrency for I/O-bound operations (*). Use threads only to achieve concurrency for independent CPU-bound operations. As a rule of thumb, you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance (a small sketch of sizing by core count follows the notes below).
(*) You should always use direct function calls and an event-based system to run pseudo-concurrent I/O operations. You do not need any threading to do so. For example, a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice, your software is composed of multiple parts, developed by independent developers and maintained in different ways, so it is OK to have some threads that could in theory be avoided.
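As a small illustration of that rule of thumb (a sketch only, not the min/max logic of the question): size the worker pool from std::thread::hardware_concurrency() and join the workers instead of detaching them and spinning on flags.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 2;  // hardware_concurrency() may return 0 if unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([] {
            // each worker would process its share of the data here
        });
    }
    for (auto& t : workers) t.join();  // join instead of detach, so no busy-wait on run flags
    std::cout << "ran " << cores << " workers\n";
    return 0;
}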
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.
I am trying to use the Threading Building Blocks task_arena. There is a simple array full of '0'. The arena's threads put '1' in the array at the odd places, and the main thread puts '2' in the array at the even places.
/* Odd-even arenas tbb test */
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>

using namespace std;

const int SIZE = 100;

int main()
{
    tbb::task_arena limited(1); // no more than 1 thread in this arena
    tbb::task_group tg;
    int myArray[SIZE] = {0};

    //! Main thread create another thread, then immediately returns
    limited.enqueue([&]{
        //! Created thread continues here
        tg.run([&]{
            tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
                [&](const tbb::blocked_range<int> &r)
                {
                    for(int i = 0; i != SIZE; i++)
                        if(i % 2 == 0)
                            myArray[i] = 1;
                }
            );
        });
    });

    //! Main thread do this work
    tbb::parallel_for(tbb::blocked_range<int>(0, SIZE),
        [&](const tbb::blocked_range<int> &r)
        {
            for(int i = 0; i != SIZE; i++)
                if(i % 2 != 0)
                    myArray[i] = 2;
        }
    );

    //! Main thread waiting for 'tg' group
    //** it does not create any threads here (doesn't it?) */
    limited.execute([&]{
        tg.wait();
    });

    for(int i = 0; i < SIZE; i++) {
        cout << myArray[i] << " ";
    }
    cout << endl;

    return 0;
}
The output is:
0 2 0 2 ... 0 2
So the limited.enqueue([&]{ tg.run(...); }) block doesn't work.
What's the problem? Any ideas? Thank you.
You have created the limited arena for one thread only, and by default this slot is reserved for the master thread. Though enqueuing into such a serializing arena will temporarily boost its concurrency level to 2 (in order to satisfy the 'fire-and-forget' promise of enqueue), enqueue() does not guarantee synchronous execution of the submitted task. So tg.wait() can start before tg.run() executes, and thus the program will not wait while the worker thread is created, joins the limited arena, and fills the array with '1' (BTW, the whole array is filled in each of the parallel_for invocations, since the lambdas ignore the blocked_range).
So, in order to wait for tg.run() to complete, use limited.execute instead. But that will prevent the automatic boosting of the limited concurrency level, and the task will be deferred until tg.wait() is executed by the master thread.
If you want to see asynchronous execution, set the arena's concurrency to 2 manually: tbb::task_arena limited(2);
or disable slot reservation for the master thread: tbb::task_arena limited(1,0) (but note that this implies additional overhead for dynamic balancing of the number of threads in the arena).
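A minimal sketch of the first suggestion, with the parallel_for bodies replaced by plain loops to keep it short (illustrative only, not the original code):

#include <tbb/task_arena.h>
#include <tbb/task_group.h>
#include <iostream>

int main()
{
    const int SIZE = 100;
    int myArray[SIZE] = {0};

    tbb::task_arena limited(1);
    tbb::task_group tg;

    limited.execute([&]{                      // synchronous: tg.run() has happened on return
        tg.run([&]{
            for (int i = 0; i < SIZE; i += 2) // fill even slots with 1
                myArray[i] = 1;
        });
    });

    for (int i = 1; i < SIZE; i += 2)         // main thread fills odd slots with 2
        myArray[i] = 2;

    limited.execute([&]{ tg.wait(); });       // the deferred task runs here, inside the arena

    for (int i = 0; i < SIZE; i++) std::cout << myArray[i] << " ";
    std::cout << std::endl;
    return 0;
}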
P.S. TBB has no points where worker threads are guaranteed to arrive (unlike OpenMP). Only the enqueue methods guarantee the creation of at least one worker thread, but they say nothing about when it will arrive. See the local observer feature to get a notification when threads actually join arenas.
My program gets a segmentation fault when I run it normally. However, it works just fine if I run it under gdb. Moreover, the rate of segmentation faults increases when I increase the sleep time in the philo function. I am using Ubuntu 12.04. Any help or pointers are appreciated. Here is my code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <semaphore.h>
#include <errno.h>

#define STACKSIZE 10000
#define NUMPROCS 5
#define ROUNDS 10

int ph[NUMPROCS];

//cs[i] is the chopstick between philosopher i and i+1
sem_t cs[NUMPROCS], dead;

int philo() {
    int i = 0;
    int cpid = getpid();
    int phno;

    for (i=0; i<NUMPROCS; i++)
        if(ph[i] == cpid) phno = i;

    for (i=0; i < ROUNDS ; i++){
        // Add your entry protocol here
        if (sem_wait(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0){
            perror(NULL);
            return 1;
        }

        // Start of critical section -- simulation of slow n++
        int sleeptime = 20000 + rand()%50000;
        printf("philosopher %d is eating by chopsticks %d and %d\n",phno,phno,(phno-1+NUMPROCS)%NUMPROCS);
        usleep(sleeptime) ;
        // End of critical section

        // Add your exit protocol here
        if (sem_post(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0){
            perror(NULL);
            return 1;
        }
    }
    return 0;
}

int main( int argc, char ** argv){
    int i;
    void* stack[NUMPROCS];

    srand(time(NULL));

    //initialize semaphores
    for (i=0; i<NUMPROCS; i++) {
        if (sem_init(&cs[i],1,1) != 0){
            perror(NULL);
            return 1;
        }
    }
    if (sem_init(&dead,1,4) != 0){
        perror(NULL);
        return 1;
    }

    for (i = 0; i < NUMPROCS; i++){
        stack[i] = malloc(STACKSIZE) ;
        if ( stack[i] == NULL ) {
            printf("Error allocating memory\n") ;
            exit(1) ;
        }
        // create a child that shares the data segment
        ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
        if (ph[i] < 0) {
            perror(NULL) ;
            return 1;
        }
    }

    for (i=0; i < NUMPROCS; i++) wait(NULL);
    for (i=0; i < NUMPROCS; i++) free(stack[i]);

    return 0 ;
}
A typical Heisenbug: if you look at it, it disappears. In my experience, getting a segfault only outside gdb or vice versa is a sign of using uninitialized memory or depending on actual pointer addresses. Normally valgrind is ruthlessly accurate at detecting those. Unfortunately (my) valgrind cannot handle your clone outside the pthread context.
Visual inspection suggests it is not a memory problem. Only the stacks are allocated on the heap and their use looks OK, except that you treat them through a void * pointer and then add something to it, which is not allowed in standard C (it is a GNU extension). Proper would be to use a char *, but the GNU extension does what you want.
Subtracting one from the top address of the stack is probably not necessary and might cause alignment errors on simple implementations of clone, but again I don't think that is the problem, as clone most likely will align the stack top again. And admittedly the manual page of clone is not very clear about the exact location of the address: "topmost address of the memory space".
Just waiting for a state change of a child and assuming it died is a bit sloppy, and taking away its stack afterwards might lead to segmentation faults, but again I don't think that is the problem, because you are probably not frantically sending signals to your philosophers.
If I run your application, the philosophers can finish their dinner undisturbed both inside and outside gdb, so the following is a guess. Let's call the parent process that clones the philosophers "the table". Once a philosopher is cloned, the table stores the returned pid in ph, i.e. it assigns that number to a chair. The first thing a philosopher does is look for his chair. If he doesn't find his chair, he ends up with an uninitialized phno, which is then used to index his semaphores. That may very well lead to segmentation faults.
The implementation assumes that control returns to the table before the philosophers start. I can't find such a guarantee in the manual page, and I would actually expect it not to be true. Also, the clone interface has the possibility to place process ids in memory shared between the child and the parent, suggesting this is a recognized problem (see the ptid and ctid parameters). If those are used, the pid will be written before either the table or the just-cloned philosopher gets control.
It is quite possible that this explains the difference between running inside and outside gdb, because gdb is well aware of the processes spawned under its supervision and may treat them differently than the operating system does.
Alternatively, you could assign a semaphore to the table, so that nobody sits down until the table says so, obviously after it has assigned all the chairs. That would be a much better use for the semaphore dead.
BTW. You are of course fully aware that the setup of your solution does allow for the situation where all philosophers end up each having one fork (eh chopstick) and starve to death waiting for the other. Luckily chances of that happening are very slim.
ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
This creates a thread of execution, which glibc knows nothing about. As such, glibc does not create any thread-specific internal structures that it needs for e.g. dynamic symbol resolution.
With such setup, calling into any glibc function from your philo function invokes undefined behavior, and you sometimes crash (because the dynamic loader will use main thread's private data to perform symbol resolution, and because the loader assumes that each thread has its own private area, but you've violated this assumption by creating clones which share the single private area "behind glibc's back").
If you look at a core dump, there is a high chance that the actual crash happens in ld.so, which would confirm my guess.
Don't ever use clone directly (unless you know what you are doing). Use pthread_create instead; a rough sketch follows the backtrace below.
Here is what I see in the core that I just got (which is exactly the problem I described):
Program terminated with signal 4, Illegal instruction.
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239 vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
(gdb) bt
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1 0x00007fb694e1dc45 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:127
#2 0x00007fb694e0dee5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:42
#3 0x00000000004009ec in philo ()
#4 0x00007fb69486669d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
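As a rough illustration of the pthread_create advice above (a sketch only: the entry/exit protocol from the question is elided, and the philosopher's index is passed directly as the thread argument, which also removes the getpid()-based chair lookup):

/* Sketch of the pthread_create setup. Note pshared=0 in sem_init now, since
 * these are threads of one process rather than cloned processes. */
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

#define NUMPROCS 5

sem_t cs[NUMPROCS], dead;

void *philo(void *arg) {
    int phno = (int)(intptr_t)arg;      /* chair number handed over at creation */
    /* ... same entry protocol, eating, and exit protocol as in the question ... */
    printf("philosopher %d left the table\n", phno);
    return NULL;
}

int main(void) {
    pthread_t ph[NUMPROCS];
    int i;

    for (i = 0; i < NUMPROCS; i++)
        if (sem_init(&cs[i], 0, 1) != 0) { perror("sem_init"); return 1; }
    if (sem_init(&dead, 0, 4) != 0) { perror("sem_init"); return 1; }

    for (i = 0; i < NUMPROCS; i++)
        if (pthread_create(&ph[i], NULL, philo, (void *)(intptr_t)i) != 0) {
            perror("pthread_create");
            return 1;
        }

    /* no manual stacks, no wait(), no clone(): just join the threads */
    for (i = 0; i < NUMPROCS; i++)
        pthread_join(ph[i], NULL);

    return 0;
}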