Segmentation fault of an MPI program - C++

I am writing a program in C++ that uses MPI. A simplified version of my code is:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <mpi.h>

#define RNumber 3000000 //Number of loops to go

using namespace std;

class LObject {
    /*Something here*/
public:
    void FillArray(long * RawT){
        /*Does something*/
        for (int i = 0; i < RNumber; i++){
            RawT[i] = i;
        }
    }
};

int main() {
    int my_rank;
    int comm_sz;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    LObject System;

    long rawT[RNumber];
    long * Times = NULL;

    if (my_rank == 0) Times = (long*) malloc(comm_sz*RNumber*sizeof(long));

    System.FillArray(rawT);

    if (my_rank == 0) {
        MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber,
                   MPI_LONG, 0, MPI_COMM_WORLD);
    }
    else {
        MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber,
                   MPI_LONG, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
The program compiles fine, but gives a Segmentation fault error on execution. The message is
=================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
When I reduce RNumber the program works fine. Could somebody explain what precisely goes wrong? Am I trying to allocate too much space for an array? If so, would storing the results in a file instead of an array solve the problem?
If possible, I would also appreciate broad comments on the things I am doing wrong.
Thank you for your time and effort!

A couple of possible issues:
long rawT[RNumber];
That's rather a large array to be putting on the stack - with 8-byte longs it comes to roughly 24 MB. There is usually a limit on stack size (especially in a multithreaded program), and a typical limit is only a few megabytes. You'd be better off with a std::vector<long> here.
Times = (long*) malloc(comm_sz*RNumber*sizeof(long));
You should check that the memory allocation succeeded. Or better still, use std::vector<long> here as well (which will also fix your memory leak).
if (my_rank == 0) {
    // do stuff
} else {
    // do exactly the same stuff
}
I'm guessing the else block should do something different; in particular, something that doesn't involve Times, since that is null unless my_rank == 0.
UPDATE: to use a vector instead of a raw array, just initialise it with the size you want, and then use a pointer to the first element where you would use (a pointer to) the array:
std::vector<long> rawT(RNumber);
System.FillArray(&rawT[0]);
std::vector<long> Times(comm_sz*RNumber);
MPI_Gather(&rawT[0], RNumber, MPI_LONG, &Times[0], RNumber,
           MPI_LONG, 0, MPI_COMM_WORLD);
Beware that the pointer will be invalidated if you resize the vector (although you won't need to do that if you're simply using it as a replacement for an array).
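Putting those pieces together, a corrected main() could look roughly like this (a sketch, assuming every rank should simply take part in the same gather and that <vector> is included):
int main() {
    int my_rank;
    int comm_sz;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    LObject System;

    std::vector<long> rawT(RNumber);   // heap storage instead of a large stack array
    std::vector<long> Times;           // stays empty on non-root ranks
    if (my_rank == 0) Times.resize((size_t)comm_sz * RNumber);

    System.FillArray(&rawT[0]);

    long * recvbuf = NULL;             // only touch Times on the root
    if (my_rank == 0) recvbuf = &Times[0];

    // One call on every rank; the receive arguments are ignored on non-root ranks.
    MPI_Gather(&rawT[0], RNumber, MPI_LONG, recvbuf, RNumber,
               MPI_LONG, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}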

You may want to check what comes back from
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
e.g. comm_sz==0 would cause this issue.
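If you want to make that check explicit, a small sketch (note: with the default error handler MPI aborts on failure, so the return code only becomes meaningful after switching to MPI_ERRORS_RETURN):
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN); // otherwise MPI aborts on error
int err = MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
if (err != MPI_SUCCESS || comm_sz <= 0) {
    std::cerr << "MPI_Comm_size failed: err=" << err
              << ", comm_sz=" << comm_sz << std::endl;
    MPI_Abort(MPI_COMM_WORLD, 1);
}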

You are not checking the return value from malloc. Considering that rank 0 is attempting to allocate comm_sz times three million longs, it is quite plausible that malloc would fail.
This might not be what is causing your problem though.
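For example, a guarded allocation could look like this (a sketch; MPI_Abort is used so the other ranks do not hang inside MPI_Gather when rank 0 bails out):
long * Times = NULL;
if (my_rank == 0) {
    Times = (long*) malloc((size_t)comm_sz * RNumber * sizeof(long));
    if (Times == NULL) {
        std::cerr << "malloc of " << (size_t)comm_sz * RNumber * sizeof(long)
                  << " bytes failed" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}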

Related

Time efficient design model for sending to and receiving from all mpi processes: MPI all 2 all communication

I am trying to send a message from one process to all other MPI processes and also to receive a message from all of those processes. It is basically an all-to-all communication, where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. The problem with MPI_Send is its behavior: for small message sizes it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves.
As a workaround, I replaced the code below with a combined blocking send and receive, MPI_Sendrecv, for example: MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE). I make the above call for all processes of MPI_COMM_WORLD inside a loop over every rank, and this approach gives me what I am trying to achieve (all-to-all communication). However, this call takes a lot of time, which I want to cut down with some more time-efficient approach.
I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that, in the actual implementation, the buffer size (16400) may differ between iterations of the MPI_all_to_all function call. I am using MPI_TAG to differentiate calls in different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400

void MPI_all_to_all(int MPI_TAG)
{
    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* intReceivePack = new int[BUFFER_SIZE]();

    for (int prId = 0; prId < size; prId++) {
        if (prId != rank) {
            MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
                     MPI_COMM_WORLD);
        }
    }

    for (int sId = 0; sId < size; sId++) {
        if (sId != rank) {
            MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    delete[] intSendPack;    // avoid leaking the buffers on every call
    delete[] intReceivePack;
}
I want to know if there is a way I can perform all to all communication using any efficient communication model. I am not sticking to MPI_Send, if there is some other way which provides me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that compares the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>

#define BUFFER_SIZE 16384

void point2point(int*, int*, int, int);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i=0; i<com_sz; i++){
        if (i != rank_id){
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, and result should end up identical on every rank. result is flat: each rank owns BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point communication, result is reset to its initial state before the collective run.
Time is measured around an MPI_Barrier, so it reflects the maximum time across all ranks.
I ran the benchmark on one node of NERSC Cori KNL using Slurm. I ran each case a few times just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, but it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much performance difference between the recommended Slurm settings on that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs in less than a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms used for communication arrangement. A 2-5 times speedup is considerable, since communication often contributes the most to the overall time.
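One more note on the concern about varying buffer sizes: collectives are matched by the order in which they are called on the communicator, so no tag is needed, and if the count genuinely differs per rank within an iteration you can switch to MPI_Allgatherv. A sketch, assuming a per-rank count my_count computed each iteration and intSendPack holding that many ints:
std::vector<int> counts(com_sz), displs(com_sz);
MPI_Allgather(&my_count, 1, MPI_INT, counts.data(), 1, MPI_INT, MPI_COMM_WORLD); // share counts
displs[0] = 0;
for (int i = 1; i < com_sz; ++i)
    displs[i] = displs[i-1] + counts[i-1];
std::vector<int> recv(displs.back() + counts.back());
MPI_Allgatherv(intSendPack, my_count, MPI_INT,
               recv.data(), counts.data(), displs.data(), MPI_INT, MPI_COMM_WORLD);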

MPI - How to create partial arrays for workers when array initialization value must be constant?

I don't have much experience with C++ or MPI currently, so I assume this will be an easy question to answer.
I want to be able to change the number of processes that can work on my array sort for experimentation purposes, but when I try to declare a partial array for my worker to work on, I receive an error stating that the array size variable, PART, needs to be constant.
Is this from how I calculated or parsed it, or from an MPI mechanic?
const int arraySize = 10000;

int main(int argc, char ** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    int size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int PART = floor(arraySize / size);
    auto start = std::chrono::high_resolution_clock::now(); //start timer

    //================================ WORKER PROCESSES ===============================
    if (rank != 0)
    {
        int tmpArray[PART]; //HERE IS MY PROBLEM
        MPI_Recv(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); //receive data into locally initialized array
        qsort(&tmpArray[0], PART, sizeof(int), compare); // quick sort
        MPI_Send(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD); //send sorted array back to rank 0
    }
auto tmpArray = std::make_unique<int[]>(PART);
If the size of an array is determined at runtime, as in your case, this would give a variable length array, which is supported in C, but not in standard C++.
So in C++, the size of an array needs to be a (compile time) constant.
To overcome this, you'll have to use dynamic memory allocation. This can be achieved either through the "classic C" functions malloc and free (which are rarely used in C++), through their C++ counterparts new and delete (or new[] and delete[]), or - preferably - through container objects such as std::vector<int> that encapsulate these memory allocation issues for you.
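For example, the worker branch from the question could be written with std::vector (a sketch that keeps the question's PART, compare and rank-0 protocol unchanged; it needs #include <vector>):
if (rank != 0)
{
    std::vector<int> tmpArray(PART);   // heap allocation; a runtime size is fine here
    MPI_Recv(tmpArray.data(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    qsort(tmpArray.data(), PART, sizeof(int), compare);   // quick sort, as before
    MPI_Send(tmpArray.data(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD);
}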

MPI_Scatter segmentation fault dependent on node number

I get a strange behavior when running a test code for MPI_Scatter. The program seems to work fine, but it returns a segmentation fault if the number of nodes is larger than 4. I compile with mpicxx and run with mpirun -n N ./a.o.
#include <mpi.h>
#include <vector>
#include <stdio.h>

using std::vector;

int main(void){
    MPI_Init(NULL,NULL);

    int num_PE;
    MPI_Comm_size(MPI_COMM_WORLD, &num_PE);
    int my_PE;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_PE);

    int data_per_PE=2;
    int remainder=0; //conceptually should be less than data_per_PE but shouldn't matter from code perspective
    vector<int> elem_count(num_PE,data_per_PE); //number of elements to scatter
    elem_count[num_PE-1]=data_per_PE+remainder; //let last PE take extra load
    vector<int> start_send(num_PE); //the offset to send from main buffer
    vector<double> small_vec(data_per_PE+remainder); //small place to store values
    vector<double> bigVec; //the big list to distribute to processes

    if (my_PE==0){
        bigVec.reserve(data_per_PE*num_PE+remainder); //make room
        for(int i=0; i<data_per_PE*num_PE+remainder; i++){
            bigVec.push_back(static_cast<double>(i)+1.0); //1,2,3...
            start_send[i]=i*data_per_PE; //the stride
        }
    }

    // MPI_Scatterv(&bigVec[0],&elem_count[0],&start_send[0],MPI_DOUBLE,&small_vec[0],data_per_PE+remainder,MPI_DOUBLE,0,MPI_COMM_WORLD);
    MPI_Scatter(&bigVec[0],data_per_PE,MPI_DOUBLE,&small_vec[0],data_per_PE,MPI_DOUBLE,0,MPI_COMM_WORLD); //scatter

    if (my_PE==0){
        printf("Proc \t elems \n");
    }
    MPI_Barrier(MPI_COMM_WORLD); //let everything catch up before printing
    for (int i=0;i<data_per_PE+remainder;i++){
        printf("%d \t %f \n", my_PE, small_vec[i]); //print the values scattered to each processor
    }
    MPI_Barrier(MPI_COMM_WORLD); //don't think this is necessary but won't hurt
    MPI_Finalize(); //finish
    return 0;
}
The issue has nothing to do with the scatter, but rather this line:
start_send[i]=i*data_per_PE;
Since i can go beyond num_PE, you write outside of the bounds of start_send - overwriting some memory that probably belongs to small_vec.
This could have easily been found by creating a truly minimal example.
You have another issue in your code: &bigVec[0] is a problem when my_PE!=0. While that parameter to MPI_Scatter is ignored by non-root ranks, the expression still calls std::vector::operator[] on the first element of an empty vector, which is undefined behavior on its own. Here is an explanation as to why that can create subtle problems. Use bigVec.data() instead.
You are writing past the end of start_send's internal storage, thus corrupting the heap and any other objects contained in it:
if (my_PE==0){
    bigVec.reserve(data_per_PE*num_PE+remainder); //make room
    for(int i=0; i<data_per_PE*num_PE+remainder; i++){
        bigVec.push_back(static_cast<double>(i)+1.0); //1,2,3...
        start_send[i]=i*data_per_PE; //the stride <--- HERE
    }
}
i runs until data_per_PE*num_PE+remainder - 1, but start_send has storage for num_PE elements only. Writing past the end corrupts the linked list of heap objects and the program likely segfaults when a destructor tries to free a corrupted heap block or when some other heap object is accessed.
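For reference, a corrected initialization might look like this (a sketch: start_send gets one offset per rank, and it is only needed for the commented-out MPI_Scatterv call; bigVec.data()/small_vec.data() avoid the empty-vector dereference noted above):
if (my_PE==0){
    bigVec.reserve(data_per_PE*num_PE+remainder);        //make room
    for (int i=0; i<data_per_PE*num_PE+remainder; i++)
        bigVec.push_back(static_cast<double>(i)+1.0);    //1,2,3...
    for (int i=0; i<num_PE; i++)
        start_send[i] = i*data_per_PE;                   //one offset per rank, not per element
}

MPI_Scatter(bigVec.data(), data_per_PE, MPI_DOUBLE,
            small_vec.data(), data_per_PE, MPI_DOUBLE, 0, MPI_COMM_WORLD);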

Share memory across MPI nodes to prevent unnecessary copying

I have an algorithm where in each iteration each node has to calculate a segment of an array, where each element of x_ depends on all the elements of x.
x_[i] = some_func(x) // each x_[i] depends on the entire x
That is, each iteration takes x and calculates x_, which will be the new x for the next iteration.
A way of parallelizing this in MPI would be to split x_ between the nodes and issue an Allgather call after the calculation of x_, so that each processor sends its x_ to the appropriate location in x on all the other processors, then repeat. This is very inefficient, since it requires an expensive Allgather call every iteration, not to mention as many copies of x as there are nodes.
I've thought of an alternative way that doesn't require copying. If the program is running on a single machine, with shared RAM, would it be possible to just share the x_ between the nodes (without copying)? That is, after calculating x_ each processor would make it visible to the other nodes, which could then use it as their x for the next iteration without needing to make several copies. I can design the algorithm so that no processor accesses the same x_ at the same time, which is why making a private copy for each node is overkill.
I guess what I'm asking is: can I share memory in MPI simply by tagging an array as shared-between-nodes, as opposed to manually making a copy for each node? (for simplicity assume I'm running on one CPU)
You can share memory within a node using MPI_Win_allocate_shared from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).
MPI functions
The following are taken from the MPI 3.1 standard.
Allocating shared memory
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
  IN  size       size of local window in bytes (non-negative integer)
  IN  disp_unit  local unit size for displacements, in bytes (positive integer)
  IN  info       info argument (handle)
  IN  comm       intra-communicator (handle)
  OUT baseptr    address of local allocated window segment (choice)
  OUT win        window object returned by the call (handle)

int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)
(if you want the Fortran declaration, click the link)
You deallocate memory using MPI_Win_free. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX, but makes the interface much simpler for the user.
Querying the node allocations
In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
  IN  win        shared memory window object (handle)
  IN  rank       rank in the group of window win (non-negative integer) or MPI_PROC_NULL
  OUT size       size of the window segment (non-negative integer)
  OUT disp_unit  local unit size for displacements, in bytes (positive integer)
  OUT baseptr    address for load/store access to window segment (choice)

int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)
(if you want the Fortran declaration, click the link above)
Synchronizing shared memory
MPI_WIN_SYNC(win)
  IN  win        window object (handle)

int MPI_Win_sync(MPI_Win win)
This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.
You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize) to attain a consistent view of data.
Synchronization
If you already understand interprocess shared-memory semantics, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize memory and control flow correctly. There is MPI_Win_sync for the former, while existing MPI synchronization functions like MPI_Barrier and MPI_Send+MPI_Recv will work for the latter. Or you can use MPI-3 atomics to build counters and locks.
Example program
The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.
This program demonstrates unidirectional pair-wise synchronization through shared-memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler.
#include <stdio.h>
#include <mpi.h>

/* This function synchronizes process rank i with process rank j
 * in such a way that this function returns on process rank j
 * only after it has been called on process rank i.
 *
 * No additional semantic guarantees are provided.
 *
 * The process ranks are with respect to the input communicator (comm). */
int p2p_xsync(int i, int j, MPI_Comm comm)
{
    /* Avoid deadlock. */
    if (i==j) {
        return MPI_SUCCESS;
    }

    int rank;
    MPI_Comm_rank(comm, &rank);

    int tag = 666; /* The number of the beast. */

    if (rank==i) {
        MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
    } else if (rank==j) {
        MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
    }

    return MPI_SUCCESS;
}

/* If val is the same at all MPI processes in comm,
 * this function returns 1, else 0. */
int coll_check_equal(int val, MPI_Comm comm)
{
    int minmax[2] = {-val,val};
    MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
    return ((-minmax[0])==minmax[1] ? 1 : 0);
}

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int * shptr = NULL;
    MPI_Win shwin;
    MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0, sizeof(int),
                            MPI_INFO_NULL, MPI_COMM_WORLD,
                            &shptr, &shwin);

    /* l=local r=remote */
    MPI_Aint rsize = 0;
    int rdisp;
    int * rptr = NULL;
    int lint = -999;
    MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
    if (rptr==NULL || rsize!=sizeof(int)) {
        printf("rptr=%p rsize=%zu \n", rptr, (size_t)rsize);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /*******************************************************/

    MPI_Win_lock_all(0 /* assertion */, shwin);

    if (rank==0) {
        *shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
        MPI_Win_sync(shwin);
    }
    for (int j=1; j<size; j++) {
        p2p_xsync(0, j, MPI_COMM_WORLD);
    }
    if (rank!=0) {
        MPI_Win_sync(shwin);
    }
    lint = *rptr;

    MPI_Win_unlock_all(shwin);

    /*******************************************************/

    if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
        if (rank==0) {
            printf("SUCCESS!\n");
        }
    } else {
        printf("rank %d: lint = %d \n", rank, lint);
    }

    MPI_Win_free(&shwin);

    MPI_Finalize();

    return 0;
}
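For the WORM (write-once, read-many) case mentioned above, a minimal sketch could look like this (assumptions: node_comm is a communicator containing only the ranks of one node, N is the slab length, and the slab is written only by rank 0 before the barrier):
double * slab = NULL;
MPI_Win win;
int rank;
MPI_Comm_rank(node_comm, &rank);
MPI_Win_allocate_shared(rank==0 ? N*sizeof(double) : 0, sizeof(double),
                        MPI_INFO_NULL, node_comm, &slab, &win);

MPI_Aint sz;
int du;
MPI_Win_shared_query(win, 0, &sz, &du, &slab);   // every rank now points at rank 0's slab

MPI_Win_lock_all(0, win);
if (rank==0) {
    for (int i=0; i<N; i++) slab[i] = 1.0*i;     // write once
    MPI_Win_sync(win);                           // memory barrier on the writer
}
MPI_Barrier(node_comm);                          // control-flow synchronization
if (rank!=0) MPI_Win_sync(win);                  // memory barrier on the readers
/* read-only use of slab[0..N-1] on every rank */
MPI_Win_unlock_all(win);

MPI_Win_free(&win);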

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;
    // manipulate linked list here
    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with fewer threads seems to make the corruptions less common, but they do not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue.
The memory bounds checker never catches actual out-of-bounds writes, but glibc still occasionally fails with a corrupted heap error (can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others.
This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all suggestions with no results, until I finally decided to write a simple memory allocation stress test - one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or at most a few hours.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>

#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS

// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
    int* alive_flag = (int*)arg;
    int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
    int cnt = 0;
    timeval t_pre, t_post;
    gettimeofday(&t_pre, NULL);

    const int ALLOCATE=1, FREE=0;
    const unsigned int MINSIZE=500, MAXSIZE=1000;
    const int MAX_ALLOC=10000;
    char* membufs[MAXSIZE];
    unsigned int membufs_size = 0;

    int num_allocs = 0, num_frees = 0;

    while(1)
    {
        int action;

        // Decide whether to allocate or free a memory block.
        // if we have less than MINSIZE buffers, allocate.
        if (membufs_size < MINSIZE) action = ALLOCATE;
        // if we have MAXSIZE, free.
        else if (membufs_size >= MAXSIZE) action = FREE;
        // else, decide randomly.
        else {
            action = ((rand() & 0x1)? ALLOCATE : FREE);
        }

        if (action == ALLOCATE) {
            // choose size to allocate, from 1 to MAX_ALLOC bytes
            size_t size = (rand() % MAX_ALLOC) + 1;

            // allocate and fill memory
            char* buf = (char*)malloc(size);
            memset(buf, 0x77, size);

            // add buffer to list
            membufs[membufs_size] = buf;
            membufs_size++;
            assert(membufs_size <= MAXSIZE);

            num_allocs++;
        }
        else { // action == FREE
            // choose a random buffer to free
            size_t pos = rand() % membufs_size;
            assert (pos < membufs_size);

            // free and remove from list by replacing entry with last member
            free(membufs[pos]);
            membufs[pos] = membufs[membufs_size-1];
            membufs_size--;
            assert(membufs_size >= 0);

            num_frees++;
        }

        // once in 10 seconds print a status update
        gettimeofday(&t_post, NULL);
        if (t_post.tv_sec - t_pre.tv_sec >= 10) {
            printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
            gettimeofday(&t_pre, NULL);
        }

        // indicate alive to main thread
        *alive_flag = ALIVE_INDICATOR;
    }

    return NULL;
}

int main()
{
    int alive_flag[NUM_THREADS];
    printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);

    // start a thread for each core
    for (int i=0; i<NUM_THREADS; i++) {
        alive_flag[i] = i; // tell each thread its ID.
        pthread_t th;
        int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
        assert(ret == 0);
    }

    while(1) {
        sleep(10);

        // check that all threads are alive
        bool ok = true;
        for (int i=0; i<NUM_THREADS; i++) {
            if (alive_flag[i] != ALIVE_INDICATOR)
            {
                printf("Thread %d is not responding\n", i);
                ok = false;
            }
        }
        assert(ok);

        for (int i=0; i<NUM_THREADS; i++)
            alive_flag[i] = 0;
    }

    return 0;
}