Random crashes with short c openmp program on RHEL5 - c++

I have a real puzzler for you folks.
Below is a small, self-contained, simple 40-line program that calculates partial sums of a bunch of numbers, and it routinely (but stochastically) crashes nodes on a distributed-memory cluster that I'm using. If I spawn 50 PBS jobs that run this code, between 0 and 4 of them will crash their nodes. It happens on a different repeat of the main loop each time and on different nodes each time; there is no discernible pattern. The nodes just go "down" on the ganglia report and I can't ssh to them ("no route to host"). If instead of submitting jobs I ssh onto one of the nodes and run the program there, then when I'm unlucky and it crashes I simply stop seeing output and later see on ganglia that the node is dead.
The program is threaded with OpenMP, and the crashes only happen when a large number of threads is spawned (like 12).
The cluster it's killing is a RHEL 5 cluster whose nodes have two 6-core X5650 processors:
[jamelang#hooke ~]$ tail /etc/redhat-release
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
I have tried enabling core dumps (ulimit -c unlimited) but no files show up. Here is the code, with comments:
#include <cstdlib>
#include <cstdio>
#include <omp.h>
int main() {
    const unsigned int numberOfThreads = 12;
    const unsigned int numberOfPartialSums = 30000;
    const unsigned int numbersPerPartialSum = 40;

    // make some numbers
    srand(0); // every instance of program should get same results
    const unsigned int totalNumbersToSum = numbersPerPartialSum * numberOfPartialSums;
    double * inputData = new double[totalNumbersToSum];
    for (unsigned int index = 0; index < totalNumbersToSum; ++index) {
        inputData[index] = rand()/double(RAND_MAX);
    }

    omp_set_num_threads(numberOfThreads);

    // prepare a place to dump output
    double * partialSums = new double[numberOfPartialSums];

    // do the following algorithm many times to induce a problem
    for (unsigned int repeatIndex = 0; repeatIndex < 100000; ++repeatIndex) {
        if (repeatIndex % 1000 == 0) {
            printf("Absurd testing is on repeat %06u\n", repeatIndex);
        }

#pragma omp parallel for
        for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
             ++partialSumIndex) {
            // get this partial sum's limits
            const unsigned int beginIndex = numbersPerPartialSum * partialSumIndex;
            const unsigned int endIndex = numbersPerPartialSum * (partialSumIndex + 1);

            // we just sum the 40 numbers, can't get much simpler
            double sumOfNumbers = 0;
            for (unsigned int index = beginIndex; index < endIndex; ++index) {
                // only reading, thread-safe
                sumOfNumbers += inputData[index];
            }

            // writing to non-overlapping indices (guaranteed by omp),
            // should be thread-safe.
            // at worst we would have false sharing, but that would just affect
            // performance, not throw sigabrts.
            partialSums[partialSumIndex] = sumOfNumbers;
        }
    }

    delete[] inputData;
    delete[] partialSums;
    return 0;
}
I compile it with the following:
/home/jamelang/gcc-4.8.1/bin/g++ -O3 -Wall -fopenmp Killer.cc -o Killer
It seems to be linking against the right shared objects:
[jamelang#hooke Killer]$ ldd Killer
linux-vdso.so.1 => (0x00007fffc0599000)
libstdc++.so.6 => /home/jamelang/gcc-4.8.1/lib64/libstdc++.so.6 (0x00002b155b636000)
libm.so.6 => /lib64/libm.so.6 (0x0000003293600000)
libgomp.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgomp.so.1 (0x00002b155b983000)
libgcc_s.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgcc_s.so.1 (0x00002b155bb92000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003293a00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003292e00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003292a00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003298600000)
Some Notes:
1. On OS X Lion with gcc 4.7, this code throws a SIGABRT, similar to this question: Why is this code giving SIGABRT with openMP?. Using gcc 4.8 seems to fix the issue on OS X. However, using gcc 4.8 on the RHEL5 machine does not fix it. The RHEL5 machine has glibc version 2.5, and since yum doesn't seem to provide a later one, the admins are sticking with 2.5.
2. If I define a SIGABRT signal handler, it doesn't catch the problem on the RHEL5 machine, but it does catch it on OS X with gcc 4.7.
3. I believe no variables need to be declared shared in the omp clause because they can all have private copies, but adding them as shared does not change the behavior.
4. The killing of nodes occurs regardless of the optimization level used.
5. The killing of nodes occurs even if I run the program from within gdb (i.e. put "gdb -batch -x gdbCommands Killer" in the PBS file), where "gdbCommands" is a file with one line: "run".
6. This example spawns threads on every repeat. One strategy would be to make a parallel block that contains the repeats loop (sketched below) in order to prevent this. However, that does not help me: this example is only representative of a much larger research code in which I cannot use that strategy.
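For reference, the alternative mentioned in note 6 would look roughly like the sketch below (my reading of it, not code from the original program): the thread team is created once, the progress message is printed by a single thread, and the implicit barrier of the work-sharing loop keeps the repeats in lockstep.
#pragma omp parallel
for (unsigned int repeatIndex = 0; repeatIndex < 100000; ++repeatIndex) {
    #pragma omp master
    if (repeatIndex % 1000 == 0) {
        printf("Absurd testing is on repeat %06u\n", repeatIndex);
    }

    #pragma omp for
    for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
         ++partialSumIndex) {
        double sumOfNumbers = 0;
        for (unsigned int index = numbersPerPartialSum * partialSumIndex;
             index < numbersPerPartialSum * (partialSumIndex + 1); ++index) {
            sumOfNumbers += inputData[index];
        }
        partialSums[partialSumIndex] = sumOfNumbers;
    } // implicit barrier here keeps all threads on the same repeat
}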
I'm all out of ideas, at my last straw, at my wit's end, ready to pull my hair out, etc with this. Does anyone have suggestions or ideas?

You are trying to parallelize nested for loops; in this case you need to make the variables used in the inner loop private, so that each thread has its own copy. This can be done with the private clause, as in the example below.
int i, j;                             // loop counters declared outside the loops
#pragma omp parallel for private(j)   // i is private automatically as the parallel loop variable; j must be made private explicitly
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        c[i][j] = 2;
In your case, index and sumOfNumbers need to be private.
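For concreteness, here is a hedged sketch of that suggestion applied to the question's loop (variable names taken from the question). Note that in the posted code index and sumOfNumbers are declared inside the parallel region, so they are already private; hoisting them out and listing them in a private clause mainly makes the intent explicit.
double sumOfNumbers;
unsigned int index;
#pragma omp parallel for private(index, sumOfNumbers)
for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
     ++partialSumIndex) {
    sumOfNumbers = 0;
    for (index = numbersPerPartialSum * partialSumIndex;
         index < numbersPerPartialSum * (partialSumIndex + 1); ++index) {
        sumOfNumbers += inputData[index];
    }
    partialSums[partialSumIndex] = sumOfNumbers;
}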

Related

How can I realize data local spawning or scheduling of tasks in OpenMP on NUMA CPUs?

I have this simple self-contained example of a very rudimentary 2-dimensional stencil application using OpenMP tasks on dynamic arrays. It represents an issue that I am having in a problem that is less of a toy problem.
There are 2 update steps in which, for each point in the array, 3 values are added from another array: the value at the corresponding location plus the upper and lower neighbour locations. The program is executed on a NUMA CPU with 8 cores and 2 hardware threads on each NUMA node. The array initializations are parallelized, and by using the environment variables OMP_PLACES=threads and OMP_PROC_BIND=spread the data is evenly distributed among the nodes' memories. To avoid data races I have set up dependencies so that, for every section in the second update step, a task can only be scheduled if the relevant tasks for the sections from the first update step have executed. The computation is correct but not NUMA aware. The affinity clause seems not to be enough to change the scheduling, as it is just a hint. I am also not sure whether using single for task creation is efficient, but as far as I know it is the only way to make all tasks sibling tasks and thus make the dependencies applicable.
Is there a way in OpenMP to parallelize the task creation under these constraints, or to guide the runtime system towards a more NUMA-aware task scheduling? If not, that is also okay; I am just trying to see whether there are options that use OpenMP the way it is intended rather than trying to break it. I already have a version that only uses worksharing loops. This is for research.
NUMA NODE 0 pus {0-7,16-23}
NUMA NODE 1 pus {8-15,24-31}
Environment Variables
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
#define _GNU_SOURCE // sched_getcpu(3) is glibc-specific (see the man page)
#include <sched.h>
#include <iostream>
#include <omp.h>
#include <math.h>
#include <stdlib.h>   // malloc
#include <vector>
#include <string>

typedef double value_type;

int main(int argc, char * argv[]){
    std::size_t rows = 8192;
    std::size_t cols = 8192;
    std::size_t part_rows = 32;
    std::size_t num_threads = 16;
    std::size_t parts = ceil(float(rows)/part_rows);

    value_type * A = (value_type *) malloc(sizeof(value_type)*rows*cols);
    value_type * B = (value_type *) malloc(sizeof(value_type)*rows*cols);
    value_type * C = (value_type *) malloc(sizeof(value_type)*rows*cols);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rows; ++i)
        for(int j = 0; j < cols; ++j){
            A[i*cols+j] = 1;
            B[i*cols+j] = 1;
            C[i*cols+j] = 0;
        }

    std::vector<std::vector<std::size_t>> putasks(32, std::vector<std::size_t>(2,0));
    std::cout << std::endl;

    #pragma omp parallel num_threads(num_threads)
    #pragma omp single
    {
        for(int part=0; part<parts; part++){
            std::size_t row = part * part_rows;
            std::size_t current_first_loc = row * cols;
            //first index of the upper part in the array
            std::size_t upper_part_first_loc = part != 0 ? (part-1)*part_rows*cols : current_first_loc;
            //first index of the lower part in the array
            std::size_t lower_part_first_loc = part != parts-1 ? (part+1)*part_rows*cols : current_first_loc;
            std::size_t start = row;
            std::size_t end = part == parts-1 ? rows-1 : start+part_rows;
            if(part==0) start = 1;

            #pragma omp task depend(in: A[current_first_loc], A[upper_part_first_loc], A[lower_part_first_loc])\
                depend(out: B[current_first_loc]) affinity(A[current_first_loc], B[current_first_loc])
            {
                if(end <= ceil(rows/2.0))
                    putasks[sched_getcpu()][0]++;
                else putasks[sched_getcpu()][1]++;
                for(std::size_t i=start; i<end; ++i){
                    for(std::size_t j = 0; j < cols; ++j)
                        B[i*cols+j] += A[i*cols+j] + A[(i-1)*cols+j] + A[(i+1)*cols+j];
                }
            }
        }

        for(int part=0; part<parts; part++){
            std::size_t row = part * part_rows;
            std::size_t current_first_loc = row * cols;
            std::size_t upper_part_first_loc = part != 0 ? (part-1)*part_rows*cols : current_first_loc;
            std::size_t lower_part_first_loc = part != parts-1 ? (part+1)*part_rows*cols : current_first_loc;
            std::size_t start = row;
            std::size_t end = part == parts-1 ? rows-1 : start+part_rows;
            if(part==0) start = 1;

            #pragma omp task depend(in: B[current_first_loc], B[upper_part_first_loc], B[lower_part_first_loc])\
                depend(out: C[current_first_loc]) affinity(B[current_first_loc], C[current_first_loc])
            {
                if(end <= ceil(rows/2.0))
                    putasks[sched_getcpu()][0]++;
                else putasks[sched_getcpu()][1]++;
                for(std::size_t i=start; i<end; ++i){
                    for(std::size_t j = 0; j < cols; ++j)
                        C[i*cols+j] += B[i*cols+j] + B[(i-1)*cols+j] + B[(i+1)*cols+j];
                }
            }
        }
    }

    if(rows <= 16 && cols <= 16)
        for(std::size_t i = 0; i < rows; ++i){
            for(std::size_t j = 0; j < cols; ++j){
                std::cout << C[i*cols+j] << " ";
            }
            std::cout << std::endl;
        }

    for(std::size_t i = 0; i < putasks.size(); ++i){
        if(putasks[i][0]!=0 && putasks[i][1]!=0){
            for(std::size_t node = 0; node < putasks[i].size(); ++node){
                std::cout << "pu: " << i << " worked on ";
                std::cout << putasks[i][node] << " NODE " << node << " tasks" << std::endl;
            }
            std::cout << std::endl;
        }
    }
    return 0;
}
Task Distribution Output Excerpt
pu: 1 worked on 26 NODE 0 tasks
pu: 1 worked on 12 NODE 1 tasks
pu: 2 worked on 27 NODE 0 tasks
pu: 2 worked on 13 NODE 1 tasks
...
pu: 7 worked on 26 NODE 0 tasks
pu: 7 worked on 13 NODE 1 tasks
pu: 8 worked on 10 NODE 0 tasks
pu: 8 worked on 11 NODE 1 tasks
pu: 9 worked on 8 NODE 0 tasks
pu: 9 worked on 14 NODE 1 tasks
...
pu: 15 worked on 8 NODE 0 tasks
pu: 15 worked on 12 NODE 1 tasks
First of all, the state of OpenMP task scheduling on NUMA systems is far from great in practice. It has been the subject of many research projects in the past, and there are still ongoing projects working on it. Some research runtimes honour the affinity hint properly and schedule tasks according to the NUMA node of the in/out/inout dependencies. However, AFAIK mainstream runtimes do not do much to schedule tasks well on NUMA systems, especially if you create all the tasks from a single NUMA node. Indeed, AFAIK GOMP (GCC) simply ignores the hint and actually exhibits behaviour that makes it inefficient on NUMA systems (e.g. task creation is temporarily paused when there are too many pending tasks, and tasks are executed on all NUMA nodes regardless of the source/target NUMA node). IOMP (Clang/ICC) takes locality into account, but AFAIK in your case the scheduling should still not be great. The affinity hint for tasks is not available upstream yet. Thus, GOMP and IOMP will clearly not behave well in your case, as tasks of different steps will often be distributed in a way that produces many remote NUMA-node accesses, which are known to be inefficient. In fact, this is critical in your case, as stencils are generally memory bound.
If you work with IOMP, be aware that its task scheduler tends to execute tasks on the NUMA node where they were created. Thus, a good solution is to create the tasks in parallel: tasks can be created by many threads bound to the NUMA nodes. The scheduler will first try to execute a task on the thread that created it. Workers will first try to steal tasks from threads on the same NUMA node and, if there are not enough tasks, from any thread. While this work-stealing strategy works relatively well in practice, there is a huge catch: tasks of different parent tasks cannot share dependencies. This limitation of the current OpenMP specification is a big issue for stencil codes (at least those that create tasks working on different time steps). An alternative solution is to create tasks with dependencies from one thread and create smaller tasks from these tasks, but due to the often bad scheduling of the big tasks, this approach is generally inefficient in practice on NUMA systems. In practice, on mainstream runtimes, basic statically-scheduled loops behave relatively well on NUMA systems for stencils, although they are clearly sub-optimal for large stencils. This is sad and I hope this situation will improve in the current decade.
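As a rough illustration of creating the tasks in parallel, here is a minimal structural sketch (an assumption on my part, not tested on your machine): each thread of the team creates the tasks for the parts it owns, and the implicit barrier of the work-sharing loop separates the two update steps, since cross-parent task dependencies are not allowed.
// sketch: create the tasks of each update step from all threads of the team,
// instead of from a single thread
void run_steps(int parts /* e.g. rows / part_rows */)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int part = 0; part < parts; part++) {
            #pragma omp task firstprivate(part)
            {
                // ... update B from A for this part (first task body of the question) ...
            }
        }
        // implicit barrier of the for construct: all step-1 tasks have completed
        // here, which stands in for the depend clauses between the two steps

        #pragma omp for schedule(static)
        for (int part = 0; part < parts; part++) {
            #pragma omp task firstprivate(part)
            {
                // ... update C from B for this part ...
            }
        }
    }
}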
Be aware that data initialization matters a lot on NUMA systems, as many platforms actually allocate pages on the NUMA node performing the first touch. Thus the initialization has to be parallel (otherwise all the pages could end up on the same NUMA node, saturating that node during the stencil steps). The default policy is not the same on all platforms, and some can move pages between NUMA nodes depending on how they are used. You can tweak the behaviour with numactl. You can also fetch very useful information from the hwloc tools. I strongly advise you to manually set the location of all OpenMP threads using OMP_PROC_BIND=True and OMP_PLACES="{0},{1},...,{n}", where the OMP_PLACES string can be generated with hwloc for the actual platform.
For more information you can read this research paper (disclaimer: I am one of the authors). You can certainly find other similar research papers from the IWOMP and Supercomputing conferences too. You could also try a research runtime, though most of them are not designed for production use (e.g. KOMP, which is no longer actively developed; StarPU, which mainly focuses on GPUs and optimizing the critical path; OmpSs, which is not fully compatible with OpenMP but tries to extend it; PaRSEC, which is mainly designed for linear algebra applications).

Calling into Haskell from multiple C/C++ threads

I have a small function written in Haskell with the following type:
foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)
I am calling this from multiple C++ threads running concurrently (via
TBB). During this part of the execution of my program I can barely get a
load average above 1.4 even though I'm running on a six-core CPU (12
logical cores). I therefore suspect that either the calls into Haskell all
get funnelled through a single thread, or there is some significant
synchronization going on.
I am not doing any such thing explicitly; all the function does is operate
on the incoming data (after storing it in a Data.Vector.Storable) and
return the result as a newly allocated array (from Data.Marshal.Array).
Is there anything I need to do to fully enable concurrent calls like this?
I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.
Looking forward to reading some advice,
Sebastian
Using the simple example at the end of this answer, if I compile with:
$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test
then running it with ./test occupies only one core at 100%. I need to run it with ./test +RTS -N, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.
So, the RTS -N flag affects the number of parallel threads that can simultaneously run an exported Haskell function, and no special action is required (other than compiling with -threaded and running with +RTS -N) to fully utilize all available cores.
So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg or limiting the number of cores involved with -qn2 or something. To enable these options, you need to call hs_init_with_rtsopts() in place of the usual hs_init() as in my example.
If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.
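If passing the flag on the command line is inconvenient, one option (a sketch under the assumption that you keep the hs_init_with_rtsopts() setup used in the example below) is to bake the RTS options into the caller by handing it a synthetic argv; the option strings here are only illustrative:
#include "HsFFI.h"
#include "Rts.h"

int main(void)
{
    /* hard-wire "+RTS -N -qg" instead of parsing the real command line */
    char arg0[] = "test", arg1[] = "+RTS", arg2[] = "-N", arg3[] = "-qg", arg4[] = "-RTS";
    char *rts_argv[] = { arg0, arg1, arg2, arg3, arg4, NULL };
    int rts_argc = 5;
    char **rts_argv_ptr = rts_argv;

    hs_init_with_rtsopts(&rts_argc, &rts_argv_ptr);
    /* ... call exported Haskell functions here, as in the example below ... */
    hs_exit();
    return 0;
}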
My example:
caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>
#define NUM_THREAD 4
void*
work(void* arg)
{
    for (;;) {
        fibIO(30);
    }
}

int
main(int argc, char **argv)
{
    hs_init_with_rtsopts(&argc, &argv);

    pthread_t threads[NUM_THREAD];
    for (int i = 0; i < NUM_THREAD; ++i) {
        int rc = pthread_create(&threads[i], NULL, work, NULL);
    }
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_join(threads[i], NULL);
    }

    hs_exit();
    return 0;
}
Worker.hs
module Worker where

import Foreign

fibIO :: Int -> IO Int
fibIO = return . fib

fib :: Int -> Int
fib n | n > 1     = fib (n-1) + fib (n-2)
      | otherwise = 1

foreign export ccall fibIO :: Int -> IO Int

GCC 8.1.0/MinGW64-compiled OpenMP program crashes looking for cygwin.s?

I'm learning OpenMP in C++ using gcc 8.1.0 and MinGW64 (latest version as of this month), and I'm running into a weird debug error when my program encounters a segmentation fault.
I know the cause of the crash, attempting to create too many OpenMP threads (50,000), but it's the error itself that has me puzzled. I didn't compile gcc or MinGW64 from source, I just used the installers, and I'm on Windows.
Why is it looking for cygwin.s, and why use that file structure on Windows? My code and the error message from gdb are below the closing.
I'm learning OpenMP in the process of programming a path tracer, and I think I have a workaround for the thread limit (using while (threads < runs) and letting OpenMP set the thread count automatically), but I am stumped as to the error. Is there a workaround or solution for this?
It works fine with ~10,000 threads. I know it's not actually creating 10,000 threads simultaneously, but it's what I was doing before I thought of the workaround.
Thank you for the heads up about rand() and thread safety. I ended up replacing my RNG code with some that appears to be working fine in OpenMP, and it's literally a night and day difference visually. I will try the other changes and report back. Thanks!
WOW! It runs so much faster and the image is artifact-free! Thank you!
Jadan Bliss
Final code:
#pragma omp parellel
for (j = options.height - 1; j >= 0; j--) {
    for (i = 0; i < options.width; i++) {
        #pragma omp parallel for reduction(Vector3Add:col)
        for (int s = 0; s < options.samples; s++)
        {
            float u = (float(i) + scene_drand()) / float(options.width);
            float v = (float(j) + scene_drand()) / float(options.height);
            Ray r = cam.get_ray(u, v); // was: origin, lower_left_corner + u*horizontal + v*vertical);
            col += color(r, world, 0);
        }
        col /= real(options.samples);
        render.set(i, j, col);
        col = Vector3(0.0);
    }
}
Error:
Starting program:
C:\Users\Jadan\Documents\CBProjects\learnOMP\bin\Debug\learnOMP.exe
[New Thread 22136.0x6620] [New Thread 22136.0x80a8] [New Thread
22136.0x8008] [New Thread 22136.0x5428]
Thread 1 received signal SIGSEGV, Segmentation fault.
___chkstk_ms () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:126 126
../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file
or directory.
Here are some remarks on your code.
Using a huge number of threads will not bring you any gain and is the probable reason for your problems. Thread creation has a time and resource cost. The time cost means that thread creation will probably dominate the run time of your program, and the parallel version will end up far slower than the sequential one. Concerning the resource cost, each thread has its own stack segment. Its size is system dependent, but typical values are measured in MB. I do not know the characteristics of your system, but with 100000 threads this is probably the reason why your code is crashing. I have no explanation for the message about cygwin.S, but after a stack overflow the behavior can be weird.
Threads are a means to parallelize code and, for data parallelism, it is most of the time useless to have more threads than the number of logical processors on your system. Let OpenMP set it, but you can experiment later to tune this number.
Besides that, there are other problems.
rand() is not thread safe, as it uses a global state that will be modified concurrently by the threads. rand_r() is thread safe, since the state of the random generator is not global and can be stored per thread.
You should not modify a shared variable like result without atomic access, as concurrent accesses from several threads can lead to unexpected results. While safe, using an atomic modification for every value is not a very efficient solution, though: atomic accesses are expensive, and it is better to use a reduction that accumulates locally in every thread and performs a single atomic access at the end.
#include <omp.h>
#include <iostream>
#include <stdlib.h>   // rand_r(), RAND_MAX
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        // per-thread initialisation of the rand_r seed
        unsigned int rand_state = omp_get_thread_num() * time(NULL);
        // or whatever thread-dependent seed

        #pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
        {
            double d = double(rand_r(&rand_state)) / double(RAND_MAX);
            result += d;
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}
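For contrast, here is a minimal sketch (an illustration of the point above, not code from the question) of the atomic alternative: it is also correct, but every iteration serializes on the shared result, so it is usually much slower than the reduction version.
#include <omp.h>
#include <iostream>
#include <stdlib.h>   // rand_r(), RAND_MAX
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        unsigned int rand_state = omp_get_thread_num() * time(NULL);

        #pragma omp for
        for (int i = 0; i < runs; i++)
        {
            double d = double(rand_r(&rand_state)) / double(RAND_MAX);
            #pragma omp atomic
            result += d;   // every thread contends on the same memory location
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}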

slow serial "for" with openmp enabled

I am trying to use OpenMP and I am finding strange results.
The parallel "for" runs faster with OpenMP, as expected. But the serial "for" runs much faster when OpenMP is disabled (i.e. without the /openmp option; VS 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector<double> a(n, 0);

double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
    double StartVal = i;
    for (int j = 0; j < m; ++j)
    {
        a[i] = (StartVal + log(exp(exp((double)i))));
    }
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;

start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
    double StartVal = i;
    for (int j = 0; j < m; ++j)
    {
        a[i] = (StartVal + log(exp(exp((double)i))));
    }
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Are these results correct? Or am I doing something wrong?
I believe part of the answer lies in the architecture of the computer you run on. I tried running the same code on another machine (GCC 4.8 on GNU/Linux, quad-core Core2 CPU), and over many runs I found a slightly odd thing: while the times for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate the dependency between the loops by allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It looks more like operating-system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, a specific cpu set), but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into the caches of all cores, causing a slight dilemma of whether to migrate the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which probably would hurt multithreaded runtime, or one cache line each, which would hurt later single-threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with the order swapped, with the two loops operating on different arrays, and with the loops placed in different compilation units, and of course verifying the written results. It is possible you've found a flaw in your compiler.
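To make that suggestion concrete, here is a minimal sketch of such a re-test (my assumptions: order swapped, two separate vectors, same loop body as the question; splitting the loops into different compilation units is left out):
#include <cmath>
#include <iostream>
#include <vector>
#include <omp.h>

int main()
{
    const int n = 5000;
    const int m = 2000000;
    std::vector<double> a(n, 0), b(n, 0);   // separate arrays for the two loops

    double start = omp_get_wtime();
    for (int i = 0; i < n; i++)             // serial loop runs first this time
    {
        double StartVal = i;
        for (int j = 0; j < m; ++j)
            a[i] = StartVal + std::log(std::exp(std::exp((double)i)));
    }
    std::cout << "serial Time: " << (omp_get_wtime() - start) << std::endl;

    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        double StartVal = i;
        for (int j = 0; j < m; ++j)
            b[i] = StartVal + std::log(std::exp(std::exp((double)i)));
    }
    std::cout << "omp Time: " << (omp_get_wtime() - start) << std::endl;
    return 0;
}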

can run in gdb, segmentation fault when run directly

My program gets a segmentation fault when I run it normally. However, it works just fine if I run it under gdb. Moreover, the rate of segmentation faults increases when I increase the sleep time in the philo function. I am using Ubuntu 12.04. Any help or pointers would be appreciated. Here is my code:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <semaphore.h>
#include <errno.h>
#define STACKSIZE 10000
#define NUMPROCS 5
#define ROUNDS 10
int ph[NUMPROCS];
//cs[i] is the chopstick between philosopher i and i+1
sem_t cs[NUMPROCS], dead;
int philo() {
    int i = 0;
    int cpid = getpid();
    int phno;

    for (i = 0; i < NUMPROCS; i++)
        if (ph[i] == cpid) phno = i;

    for (i = 0; i < ROUNDS; i++) {
        // Add your entry protocol here
        if (sem_wait(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }

        // Start of critical section -- simulation of slow n++
        int sleeptime = 20000 + rand() % 50000;
        printf("philosopher %d is eating by chopsticks %d and %d\n", phno, phno, (phno-1+NUMPROCS) % NUMPROCS);
        usleep(sleeptime);
        // End of critical section

        // Add your exit protocol here
        if (sem_post(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }
    }
    return 0;
}

int main(int argc, char **argv) {
    int i;
    void *stack[NUMPROCS];
    srand(time(NULL));

    // initialize semaphores
    for (i = 0; i < NUMPROCS; i++) {
        if (sem_init(&cs[i], 1, 1) != 0) {
            perror(NULL);
            return 1;
        }
    }
    if (sem_init(&dead, 1, 4) != 0) {
        perror(NULL);
        return 1;
    }

    for (i = 0; i < NUMPROCS; i++) {
        stack[i] = malloc(STACKSIZE);
        if (stack[i] == NULL) {
            printf("Error allocating memory\n");
            exit(1);
        }
        // create a child that shares the data segment
        ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL);
        if (ph[i] < 0) {
            perror(NULL);
            return 1;
        }
    }

    for (i = 0; i < NUMPROCS; i++) wait(NULL);
    for (i = 0; i < NUMPROCS; i++) free(stack[i]);
    return 0;
}
A typical Heisenbug: if you look at it, it disappears. In my experience, getting a segv only outside gdb (or vice versa) is a sign of using uninitialized memory or depending on actual pointer addresses. Normally running valgrind is ruthlessly accurate in detecting those. Unfortunately (my) valgrind cannot handle your clone outside the pthread context.
Visual inspection suggests it is not a memory problem. Only the stacks are allocated on the heap, and their use looks OK. Except that you treat them via a void * pointer and then add something to it, which is not allowed in standard C (it is a GNU extension). Proper would be to use a char *, but the GNU extension does what you want.
Subtracting one from the top address of the stack is probably not necessary and might cause alignment errors on simple implementations of clone, but again I don't think that is the problem, as clone most likely aligns the stack top again. And admittedly the manual page of clone is not very clear about the exact location of the address: "topmost address of the memory space".
Just waiting for a state change of a child and assuming it died is a bit sloppy, and then taking away its stack might lead to segmentation faults, but again I don't think that is the problem, because you are probably not frantically sending signals to your philosophers.
If I run your application the philosophers can finish their dinner undisturbed, both inside and outside gdb, so the following is a guess. Let's call the parent process that clones the philosophers "the table". Once a philosopher is cloned, the table stores the returned pid in ph; say it assigns that number to a chair. The first thing a philosopher does is look for his chair. If he doesn't find his chair, he will have an uninitialized phno, which is then used to access his semaphores. Now this may very well lead to segmentation faults.
The implementation assumes that control is returned to the table before the philosophers start. I can't find such a guarantee in the manual page, and I would actually expect this not to be true. Also the clone interface has a possibility to place process ids in memory shared between the child and the parent, suggesting this is a recognized problem (see the parameters pid and ctid). If those are used, the pid will be written before either the table or the just-cloned philosopher gets control.
It is highly possible that this error explains the difference between inside and outside gdb, because gdb is well aware of the processes that are spawned under its supervision and may treat them differently than the operating system does.
Alternatively you could assign a semaphore to the table, so nobody sits down until the table says so, obviously after it has assigned all chairs. This would make a much better use of the semaphore dead.
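To make that concrete, here is a minimal, hedged sketch of the table-semaphore idea (my interpretation, with assumed names like seated; note that the point made in the answer below about calling glibc functions from raw clone() children still applies to this sketch):
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <semaphore.h>

#define STACKSIZE 10000
#define NUMPROCS 5

int ph[NUMPROCS];
sem_t seated;                           /* the "table" gate, initially closed */

static int philo(void *arg) {
    (void)arg;
    sem_wait(&seated);                  /* block until all chairs are assigned */
    int cpid = getpid(), phno = -1;
    for (int i = 0; i < NUMPROCS; i++)
        if (ph[i] == cpid) phno = i;
    printf("philosopher %d found his chair\n", phno);
    /* ... entry protocol, eating, exit protocol as in the question ... */
    return 0;
}

int main(void) {
    void *stack[NUMPROCS];
    sem_init(&seated, 1, 0);            /* shared with the cloned children via CLONE_VM */
    for (int i = 0; i < NUMPROCS; i++) {
        stack[i] = malloc(STACKSIZE);
        ph[i] = clone(philo, (char *)stack[i] + STACKSIZE, CLONE_VM | SIGCHLD, NULL);
    }
    for (int i = 0; i < NUMPROCS; i++)  /* every chair is assigned: open the table */
        sem_post(&seated);
    for (int i = 0; i < NUMPROCS; i++) wait(NULL);
    for (int i = 0; i < NUMPROCS; i++) free(stack[i]);
    return 0;
}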
BTW, you are of course fully aware that the setup of your solution allows for the situation where the philosophers each end up holding one fork (er, chopstick) and starve to death waiting for the other. Luckily the chances of that happening are very slim.
ph[i] = clone(philo, stack[i]+STACKSIZE-1, CLONE_VM|SIGCHLD, NULL) ;
This creates a thread of execution that glibc knows nothing about. As such, glibc does not set up the thread-specific internal structures it needs for e.g. dynamic symbol resolution.
With such a setup, calling any glibc function from your philo function invokes undefined behavior, and you sometimes crash (because the dynamic loader will use the main thread's private data to perform symbol resolution, and because the loader assumes that each thread has its own private area, but you've violated this assumption by creating clones which share the single private area "behind glibc's back").
If you look at a core dump, there is a high chance that the actual crash happens in ld.so, which would confirm my guess.
Don't ever use clone directly (unless you know what you are doing). Use pthread_create instead.
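A minimal sketch of that replacement (my assumption of how it would look, not the answer's code; the seat number is passed as the thread argument instead of being looked up via getpid()):
#include <pthread.h>
#include <stdio.h>

#define NUMPROCS 5

static void *philo(void *arg) {
    int phno = (int)(long)arg;   /* seat number handed in by the "table" */
    /* entry protocol, eating and exit protocol with sem_wait/sem_post as before */
    printf("philosopher %d is seated\n", phno);
    return NULL;
}

int main(void) {
    pthread_t tids[NUMPROCS];
    for (int i = 0; i < NUMPROCS; i++)
        pthread_create(&tids[i], NULL, philo, (void *)(long)i);
    for (int i = 0; i < NUMPROCS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}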
Here is what I see in the core that I just got (which is exactly the problem I described):
Program terminated with signal 4, Illegal instruction.
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239 vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
(gdb) bt
#0 _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1 0x00007fb694e1dc45 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:127
#2 0x00007fb694e0dee5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:42
#3 0x00000000004009ec in philo ()
#4 0x00007fb69486669d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112