MPI_Scatter segmentation fault dependent on node number - c++

I get a strange behavior when running a test code for MPI_Scatter. The program seems to work fine, but it returns a segmentation fault if the number of nodes is larger than 4. I compile with mpicxx and run with mpirun -n N ./a.o.
#include <mpi.h>
#include <vector>
#include <stdio.h>

using std::vector;

int main(void){
    MPI_Init(NULL,NULL);

    int num_PE;
    MPI_Comm_size(MPI_COMM_WORLD, &num_PE);
    int my_PE;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE);

    int data_per_PE=2;
    int remainder=0; //conceptually should be less than data_per_PE but shouldn't matter from code perspective
    vector<int> elem_count(num_PE,data_per_PE); //number of elements to scatter
    elem_count[num_PE-1]=data_per_PE+remainder; //let last PE take extra load
    vector<int> start_send(num_PE); //the offset to send from main buffer
    vector<double> small_vec(data_per_PE+remainder); //small place to store values
    vector<double> bigVec; //the big list to distribute to processes

    if (my_PE==0){
        bigVec.reserve(data_per_PE*num_PE+remainder); //make room
        for(int i=0; i<data_per_PE*num_PE+remainder; i++){
            bigVec.push_back(static_cast<double>(i)+1.0); //1,2,3...
            start_send[i]=i*data_per_PE; //the stride
        }
    }

    // MPI_Scatterv(&bigVec[0],&elem_count[0],&start_send[0],MPI_DOUBLE,&small_vec[0],data_per_PE+remainder,MPI_DOUBLE,0,MPI_COMM_WORLD);
    MPI_Scatter(&bigVec[0],data_per_PE,MPI_DOUBLE,&small_vec[0],data_per_PE,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter

    if (my_PE==0){
        printf("Proc \t elems \n");
    }
    MPI_Barrier(MPI_COMM_WORLD); //let everything catch up before printing
    for (int i=0;i<data_per_PE+remainder;i++){
        printf("%d \t %f \n", my_PE, small_vec[i]); //print the values scattered to each processor
    }
    MPI_Barrier(MPI_COMM_WORLD); //don't think this is necessary but won't hurt

    MPI_Finalize(); //finish
    return 0;
}

The issue has nothing to do with the scatter, but rather this line:
start_send[i]=i*data_per_PE;
Since i can go beyond num_PE, you write outside of the bounds of start_send - overwriting some memory that probably belongs to small_vec.
This could have easily been found by creating a truly minimal example.
You have another issue in your code: &bigVec[0] is a problem for my_PE!=0. While that parameter to MPI_Scatter is ignored by non-root ranks, the expression still calls std::vector::operator[] on the first element of an empty vector, which is undefined behavior on its own and can create subtle problems in practice. Use bigVec.data() instead.

You are writing past the end of start_send's internal storage, thus corrupting the heap and any other objects contained in it:
if (my_PE==0){
    bigVec.reserve(data_per_PE*num_PE+remainder); //make room
    for(int i=0; i<data_per_PE*num_PE+remainder; i++){
        bigVec.push_back(static_cast<double>(i)+1.0); //1,2,3...
        start_send[i]=i*data_per_PE; //the stride <--- HERE
    }
}
i runs until data_per_PE*num_PE+remainder - 1, but start_send has storage for num_PE elements only. Writing past the end corrupts the linked list of heap objects and the program likely segfaults when a destructor tries to free a corrupted heap block or when some other heap object is accessed.
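As a rough illustration of both fixes (the start_send loop bounded by num_PE, and data() instead of &vec[0]), here is a minimal sketch of how the root-only block and the scatter call could look; it assumes the rest of the posted program stays unchanged:
if (my_PE==0){
    bigVec.reserve(data_per_PE*num_PE+remainder); //make room
    for(int i=0; i<data_per_PE*num_PE+remainder; i++){
        bigVec.push_back(static_cast<double>(i)+1.0); //1,2,3...
    }
    for(int i=0; i<num_PE; i++){
        start_send[i]=i*data_per_PE; //one stride per rank, so no out-of-bounds write
    }
}
// data() is well-defined even for the empty vectors on non-root ranks
MPI_Scatter(bigVec.data(), data_per_PE, MPI_DOUBLE, small_vec.data(), data_per_PE,
            MPI_DOUBLE, 0, MPI_COMM_WORLD);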

Related

Segmentation fault(core dumped) c++

The following program gives the correct output, but ends with a segmentation fault (core dumped).
#include <iostream>
using namespace std;

int main()
{
    int a[50],n,i,c[50],b[50];
    cin>>n;
    for(i=1;i<=n;i++)
        cin>>a[i];
    for(i=1;i<=100;i++)
        b[i]=0;
    for(i=1;i<=n;i++)
    {
        b[a[i]]++;
    }
    for(i=2;i<=100;i++)
    {
        b[i]=b[i]+b[i-1];
    }
    for(i=1;i<=n;i++)
    {
        c[b[a[i]]]=a[i];
        b[a[i]]--;
    }
    for(i=1;i<=n;i++)
        cout<<c[i]<<endl;
    return 0;
}
Apart from other problems in your program, at least this loop causes a buffer overflow and hence invokes undefined behavior:
for(i=1;i<=100;i++)
    b[i]=0;
That is because b has only 50 elements, and you're accessing memory way beyond that range. Also, C arrays are zero-indexed.
Note that you likely cause UB in other statements as well, but this one was the first I noticed while skimming over your code.
You are going out of bounds here:
for(i=2;i<=100;i++)
{
    b[i]=b[i]+b[i-1];
}
since b is an array of size 50 and you allow i to reach the value of 100 and index the array. This causes the segmentation fault.
PS: Other loops may also index your arrays out of bounds, depending on the input you receive, since you don't check the input.
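For illustration, here is a sketch of the same counting sort with array sizes that match the loop bounds; it assumes n is at most 50 and every input value lies between 1 and 100, which is the apparent intent but is not stated explicitly in the question:
#include <iostream>
using namespace std;

int main()
{
    int a[51], c[51];     // positions 1..n, with n <= 50 assumed
    int b[101] = {0};     // counts indexed by value 1..100
    int n, i;
    cin >> n;
    for (i = 1; i <= n; i++)
        cin >> a[i];      // each value assumed to be in 1..100
    for (i = 1; i <= n; i++)
        b[a[i]]++;        // count occurrences of each value
    for (i = 2; i <= 100; i++)
        b[i] += b[i-1];   // prefix sums give the final positions
    for (i = 1; i <= n; i++)
    {
        c[b[a[i]]] = a[i];
        b[a[i]]--;
    }
    for (i = 1; i <= n; i++)
        cout << c[i] << endl;
    return 0;
}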

C++ calling one function repeatedly system hang

I am getting a very serious issue with one function in C++. This is my function:
double** Fun1(unsigned l, unsigned n, vector<int>& list,
              vector<string>& DataArray)
{
    double** array2D = 0;
    array2D = new double*[l];
    string alphabet="ACGT";
    for (int i = 0; i < l; i++)
    {
        array2D[i] = new double [4];
        vector<double> count(4, 0.0);
        for(int j=0;j<n;++j)
        {
            for(int k=0;k<4;k++)
            {
                if (toupper(DataArray[list[j]][i])==alphabet[k])
                    count[k]=count[k]+1;
            }
        }
        for(int k=0;k<4;k++)
            array2D[i][k]=count[k];
        count.clear();
    }
    return array2D;
}
The value of l is around 100 and n = 1; DataArray has size 50000 x l, and list will contain any one number between 0 and 49999.
Now I am calling this function from my main program many times (maybe more than 50 million times). Up to a certain point it runs very smoothly, but after around 2-3 minutes my system hangs. I am unable to find what the problem with this code is. I guess memory is running short, but I don't know why.
You are missing the corresponding delete[] from your code.
Note the [], which means you are deleting an array. If you forget to add these, you will be venturing into undefined territory (3.7.4.2 in N3797).
You may wish to try using std::array to mitigate having to new and delete[] so much. Also, if this is called as often as you say and the loop is this small, I would be concerned about the coherency of the data.
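As a hedged sketch of that suggestion (the function name and return type here are mine, not the asker's): returning a std::vector of std::array rows removes the need for new and delete[] entirely.
#include <array>
#include <cctype>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Same counting logic, but the container owns and releases its own storage.
vector<std::array<double,4> > Fun1Counts(unsigned l, unsigned n,
                                         vector<int>& list,
                                         vector<string>& DataArray)
{
    const string alphabet = "ACGT";
    vector<std::array<double,4> > counts(l); // value-initialized to zeros
    for (unsigned i = 0; i < l; i++)
    {
        for (unsigned j = 0; j < n; j++)
        {
            for (int k = 0; k < 4; k++)
            {
                if (toupper(DataArray[list[j]][i]) == alphabet[k])
                    counts[i][k] += 1.0;
            }
        }
    }
    return counts; // no delete[] needed by the caller
}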
Upto certain number of times it going very smooth but after 2/3 minute
around that my system hangs
It's not a hang. If you are working on a Linux or Unix machine, check top (system performance): the virtual memory of your system has filled up. Do delete or delete[] appropriately for every new.
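If the original double** interface has to stay, here is a minimal sketch of the matching cleanup (the helper name is mine, for illustration only). Calling something like this after each use is what keeps the 50-million-call loop from filling up virtual memory.
// Frees a matrix allocated the way Fun1 allocates it: l rows of new double[4].
void FreeFun1Result(double** array2D, unsigned l)
{
    for (unsigned i = 0; i < l; i++)
        delete[] array2D[i]; // matches array2D[i] = new double[4];
    delete[] array2D;        // matches array2D = new double*[l];
}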

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;

    // manipulate linked list here

    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all of the suggestions with no results, and finally decided to write a simple memory allocation stress test - one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;
    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;

        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;

            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);

            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);

            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);

            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);

            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }
    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);

    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);

        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);

        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }
    return 0;
}

High number causes seg fault

This bit of code is from a program I am writing that takes in x columns and x rows to run a matrix multiplication on CUDA, with parallel processing. The larger the sample size, the better.
I have a function that auto-generates x amount of random numbers.
I know the answer is probably simple, but I just wanted to know exactly why. When I run it with, say, 625,000,000 elements in the array, it segfaults. I think it is because I have gone over the size allowed in memory for an int.
What data type should I use in place of int for a larger number?
This is how the data is being allocated, then passed into the function.
a.elements = (float*) malloc(mem_size_A);
where
int mem_size_A = sizeof(float) * size_A; //for the example let size_A be 625,000,000
Passed:
randomInit(a.elements, a.rowSize,a.colSize, oRowA, oColA);
What randomInit does: say I enter a 2x2 matrix, but I am padding it up to a multiple of 16. So it takes the 2x2 and pads the matrix out to a 16x16 of zeros, with the original 2x2 still in place.
void randomInit(float* data, int newRowSize, int newColSize, int oldRowSize, int oldColSize)
{
    printf("Initializing random function. The new sized row is %d\n", newRowSize);
    for (int i = 0; i < newRowSize; i++) //go per row of new sized row.
    {
        for(int j=0;j<newColSize;j++)
        {
            printf("This loop\n");
            if(i<oldRowSize&&j<oldColSize)
            {
                data[newRowSize*i+j]=rand() / (float)RAND_MAX; //brandom();
            }
            else
                data[newRowSize*i+j]=0;
        }
    }
}
I've even run it with the printf in the loop. This is the result I get:
Creating the random numbers now
Initializing random function. The new sized row is 25000
This loop
Segmentation fault
Your memory allocation for data is probably failing.
Fortunately, you almost certainly don't need to store a large collection of random numbers.
Instead of storing:
data[n]=rand() / (float)RAND_MAX
for some huge collection of n, you can run:
srand(n);
value = rand() / (float)RAND_MAX;
when you need a particular number and you'll get the same value every time, as if they were all calculated in advance.
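Here is a tiny self-contained sketch of that idea (the function name is mine): re-seed with the index and regenerate the value on demand, so no giant buffer has to be allocated at all.
#include <cstdio>
#include <cstdlib>

// Deterministically regenerates "element n"; the same n always yields the same value.
float valueAt(unsigned int n)
{
    srand(n);
    return rand() / (float)RAND_MAX;
}

int main()
{
    printf("%f %f\n", valueAt(123456), valueAt(123456)); // prints the same value twice
    return 0;
}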
I think you're going past the end of what you allocated for data. When your newRowSize is too large, you're accessing unallocated memory.
Remember, data isn't infinitely big.
Well, the real problem is that if the issue really is the integer size used for your array access, you will not be able to fix it. I think you probably just don't have enough memory to store that huge amount of data.
If you want to go beyond that, define a custom structure or class if you are in C++, but you will lose the O(1) access time that comes with an array.

Segmentation fault of an MPI program

I am writing a program in C++ that uses MPI. A simplified version of my code is:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <mpi.h>

#define RNumber 3000000 //Number of loops to go

using namespace std;

class LObject {
    /*Something here*/
public:
    void FillArray(long * RawT){
        /*Does something*/
        for (int i = 0; i < RNumber; i++){
            RawT[i] = i;
        }
    }
};

int main() {
    int my_rank;
    int comm_sz;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    LObject System;

    long rawT[RNumber];
    long * Times = NULL;

    if (my_rank == 0) Times = (long*) malloc(comm_sz*RNumber*sizeof(long));

    System.FillArray(rawT);

    if (my_rank == 0) {
        MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber,
                   MPI_LONG, 0, MPI_COMM_WORLD);
    }
    else {
        MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber,
                   MPI_LONG, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
The program compiles fine, but gives a Segmentation fault error on execution. The message is
=================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
When I reduce the RNumber the program works fine. Maybe somebody could explain what precisely goes wrong? Am I trying to allocate too much space for an array? If that's the case, will this problem be solved by storing the results in a file instead of an array?
If it is possible, could you please also give broad comments on the things I am doing wrong?
Thank you for your time and effort!
A couple of possible issues:
long rawT[RNumber];
That's rather a large array to be putting on the stack. There is usually a limit to stack size (especially in a multithreaded program), and a typical size is one or two megabytes. You'd be better off with a std::vector<long> here.
Times = (long*) malloc(comm_sz*RNumber*sizeof(long));
You should check that the memory allocation succeeded. Or better still, use std::vector<long> here as well (which will also fix your memory leak).
if (my_rank == 0) {
    // do stuff
} else {
    // do exactly the same stuff
}
I'm guessing the else block should do something different; in particular, something that doesn't involve Times, since that is null unless my_rank == 0.
UPDATE: to use a vector instead of a raw array, just initialise it with the size you want, and then use a pointer to the first element where you would use a (pointer to) the array:
std::vector<long> rawT(RNumber);
System.FillArray(&rawT[0]);
std::vector<long> Times(comm_sz*RNumber);
MPI_Gather(&rawT[0], RNumber, MPI_LONG, &Times[0], RNumber,
MPI_LONG, 0, MPI_COMM_WORLD);
Beware that the pointer will be invalidated if you resize the vector (although you won't need to do that if you're simply using it as a replacement for an array).
You may want to check what comes back from
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
e.g. comm_sz==0 would cause this issue.
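For example, a small sketch of that sanity check, placed right after MPI_Init inside main (aborting on an unexpected size is my addition, not part of the posted code):
int my_rank = -1, comm_sz = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
if (comm_sz < 1) {
    std::cerr << "Unexpected communicator size: " << comm_sz << std::endl;
    MPI_Abort(MPI_COMM_WORLD, 1); // bail out before using comm_sz in an allocation
}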
You are not checking the return value from malloc. Considering that you are attempting to allocate over three million longs, it is quite plausible that malloc would fail.
This might not be what is causing your problem though.
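A minimal sketch of the kind of check this answer means, applied to the Times allocation from the question (the error handling and the size_t cast are one possible choice, not part of the original code):
long * Times = NULL;
if (my_rank == 0) {
    Times = (long*) malloc((size_t)comm_sz * RNumber * sizeof(long));
    if (Times == NULL) {
        std::cerr << "malloc of " << comm_sz << " x " << RNumber
                  << " longs failed" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1); // don't continue with a null receive buffer
    }
}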