While profiling my CUDA application with the NVIDIA Visual Profiler I noticed that any operation after cudaStreamSynchronize blocks until all streams are finished. This is very odd behavior, because if cudaStreamSynchronize returns, that means the stream is finished, right? Here is my pseudo-code:
std::list<std::thread> waitingThreads;
void startKernelsAsync() {
    for (int i = 0; i < 200; ++i) {
        void* cpuPinnedMemory;
        void* gpuMemory;
        cudaStream_t stream;
        cudaHostAlloc(&cpuPinnedMemory, size, cudaHostAllocDefault);
        memcpy(cpuPinnedMemory, data, size);
        cudaMalloc(&gpuMemory, size);
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(gpuMemory, cpuPinnedMemory, size, cudaMemcpyHostToDevice, stream);
        runKernel<<<32, 32, 0, stream>>>(gpuMemory);
        cudaMemcpyAsync(cpuPinnedMemory, gpuMemory, size, cudaMemcpyDeviceToHost, stream);
        waitingThreads.emplace_back(waitForFinish, cpuPinnedMemory, gpuMemory, stream);
    }
    while (!waitingThreads.empty()) {
        waitingThreads.front().join();
        waitingThreads.pop_front();
    }
}

void waitForFinish(void* cpuPinnedMemory, void* gpuMemory, cudaStream_t stream) {
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream); // <== This blocks until all streams are finished.
    memcpy(data, cpuPinnedMemory, size);
    cudaFreeHost(cpuPinnedMemory);
    cudaFree(gpuMemory);
}
If I put cudaFreeHost before cudaStreamDestroy, then cudaFreeHost becomes the blocking operation instead.
Is there anything conceptually wrong here?
EDIT: I found another weird behavior: sometimes it un-blocks in the middle of processing the streams and then processes the rest of them.
Normal behavior: (profiler timeline image)
Strange behavior (happens quite often): (profiler timeline image)
EDIT2: I am testing on a Tesla K40c card with compute capability 3.5 on CUDA 6.0.
As suggested in the comments, it may be viable to reduce the number of streams; however, in my application the memory transfers are quite fast and I want to use streams mainly to dynamically schedule work to the GPU. The problem is that after a stream finishes I need to download the data from pinned memory and free the allocated memory for further streams, which seems to be a blocking operation.
I am using one stream per data-set because every data-set has different size and processing takes unpredictably long time.
Any ideas how to solve this?
I haven't found out why the operations block, but I concluded that I cannot do anything about it, so I decided to implement memory and stream pooling (as suggested in the comments) to re-use GPU memory, pinned CPU memory, and streams, and thus avoid any kind of deletion.
In case anybody is interested, here is my solution. startKernelAsync behaves as an asynchronous operation that schedules a kernel; a callback is invoked after the kernel finishes.
std::vector<Instance*> m_idleInstances;
std::vector<Instance*> m_workingInstances;
void startKernelAsync(...) {
// Search for finished stream.
while (m_idleInstances.size() == 0) {
findFinishedInstance();
if (m_idleInstances.size() == 0) {
std::chrono::milliseconds dur(10);
std::this_thread::sleep_for(dur);
}
}
Instance* instance = m_idleInstances.back();
m_idleInstances.pop_back();
// Fill CPU pinned memory
cudaMemcpyAsync(..., instance->stream);
runKernel<<<32, 32, 0, instance->stream>>>(gpuMemory);
cudaMemcpyAsync(..., instance->stream);
m_workingInstances.push_back(instance);
}
void findFinishedInstance() {
for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
Instance* inst = *it;
cudaError_t status = cudaStreamQuery(inst->stream);
if (status == cudaSuccess) {
it = m_workingInstances.erase(it);
m_callback(inst->clusterGroup);
m_idleInstances.push_back(inst);
}
else {
++it;
}
}
}
And at the end, just wait for everything to finish:
virtual void waitForFinish() {
while (m_workingInstances.size() > 0) {
Instance* instance = m_workingInstances.back();
m_workingInstances.pop_back();
m_idleInstances.push_back(instance);
cudaStreamSynchronize(instance->stream);
finalizeInstance(instance);
}
}
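For completeness, the Instance objects above are assumed to bundle everything that gets pooled. A minimal sketch of what such a struct might look like (only stream and clusterGroup actually appear in the code above; the other fields and the ClusterGroup type are illustrative guesses):
struct Instance {
    cudaStream_t  stream;           // re-used stream (inst->stream above)
    void*         gpuMemory;        // assumed: pooled device buffer
    void*         cpuPinnedMemory;  // assumed: pooled pinned host buffer
    ClusterGroup* clusterGroup;     // passed to m_callback above (placeholder type)
};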
And here is a graph from the profiler; it works like a charm!
Check out the list of "Implicit Synchronization" rules in the CUDA C Programming Guide PDF that comes with the toolkit. (Section 3.2.5.5.4 in my copy, but you might have a different version.)
If your GPU is "compute capability 3.0 or lower", there are some special rules that apply. My guess would be that cudaStreamDestroy() is hitting one of those limitations.
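If you want to double-check which compute capability the runtime reports for your device (and therefore which set of implicit-synchronization rules applies), a quick query, assuming device 0:
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}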
Related
I am running a thread in the Windows kernel that communicates with an application over shared memory. Everything is working fine, except that the communication is slow due to a Sleep loop. I have been investigating spin locks, mutexes, and interlocked operations but can't really figure this one out. I have also considered Windows events but don't know about their performance. Please advise on what would be a faster solution that keeps the communication over shared memory, possibly suggesting Windows events.
KERNEL CODE
typedef struct _SHARED_MEMORY
{
BOOLEAN mutex;
CHAR data[BUFFER_SIZE];
} SHARED_MEMORY, *PSHARED_MEMORY;
ZwCreateSection(...)
ZwMapViewOfSection(...)
while (TRUE) {
if (((PSHARED_MEMORY)SharedSection)->mutex == TRUE) {
//... do work...
((PSHARED_MEMORY)SharedSection)->mutex = FALSE;
}
KeDelayExecutionThread(KernelMode, FALSE, &PollingInterval);
}
APPLICATION CODE
OpenFileMapping(...)
MapViewOfFile(...)
...
RtlCopyMemory(&SM->data, WriteData, Size);
SM->mutex = TRUE;
while (SM->mutex != FALSE) {
Sleep(1); // Slow and removing it will cause an infinite loop
}
RtlCopyMemory(ReadData, &SM->data, Size);
UPDATE 1
Currently this is the fastest solution I have come up with:
while(InterlockedCompareExchange(&SM->mutex, FALSE, FALSE));
However, I find it funny that you need to do an exchange and that there is no function for just a compare.
You don't want to use InterlockedCompareExchange. It burns the CPU, saturates core resources that might be needed by another thread sharing that physical core, and can saturate inter-core buses.
You do need to do two things:
1) Write an InterlockedGet function and use it.
2) Prevent the loop from burning CPU resources and from taking the mother of all mispredicted branches when it finally gets unblocked.
For 1, this is known to work on all compilers that support InterlockedCompareExchange, at least last time I checked:
__inline static int InterlockedGet(int *val)
{
return *((volatile int *)val);
}
For 2, put this as the body of the wait loop:
__asm
{
rep nop
}
For x86 CPUs, this is specified to solve the resource saturation and branch prediction problems.
Putting it together:
while ((*(volatile int *) &SM->mutex) != FALSE) {
__asm
{
rep nop
}
}
Change int as needed if it's not appropriate.
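One caveat from my side: 64-bit MSVC builds do not accept inline __asm, so as an alternative you can use the YieldProcessor() macro from the Windows headers, which emits the same pause/"rep nop" hint. A minimal sketch, assuming the flag has been widened to a LONG:
#include <windows.h>

static void WaitUntilClear(volatile LONG *flag)
{
    while (*flag != FALSE) {
        YieldProcessor();   // pause / "rep nop" hint on both x86 and x64
    }
}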
Is there a way to signal (success/failure) to the host at the end of kernel execution?
I am looking at an iterative process where calculations are made in device and after each iteration, a boolean variable is passed to host that tells if the process has converged. Based on the variable, host decides to either stop iterating or go through another round of iteration.
Copying a single boolean variable at the end of every iteration nullifies the time gain obtained through parallelization. Hence, I would like to find a way to let the host know of the convergence status (success/failure) without having to cudaMemcpy every time.
Note: The time issue exists after using pinned memory to transfer data.
Alternatives that I have looked at:
asm("trap;"); and assert();
These trigger an Unknown error and cudaErrorAssert, respectively, on the host. Unfortunately, they are "sticky" in that the error cannot be reset using cudaGetLastError; the only way is to reset the device using cudaDeviceReset().
Using cudaHostAllocMapped to avoid cudaMemcpy: this is of no use as it does not offer any time-based advantage over standard pinned memory allocation + cudaMemcpy (p. 460, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann, 2014).
Will appreciate other ways to overcome this issue.
I suspect the real issue here is that your iteration kernel run time is very short (on the order of 100us or less), meaning the work per iteration is very small. The best solution might be to try to increase the work per iteration (refactor your code/algorithm, tackle a larger problem, etc.)
However, here are some possibilities:
Use mapped/pinned memory. Your claim in item 2 of your question is unsupported, IMO, without a lot more context than a page reference to a book that many of us probably don't have available to look at. (See the sketch after this list.)
Use dynamic parallelism. Move your kernel launch process to a CUDA parent kernel that is issuing child kernels. Whatever boolean is set by the child kernel will be immediately discoverable in the parent kernel, without any need for a cudaMemcpy operation or mapped/pinned memory.
Use a pipelined algorithm, and overlap a speculative kernel launch with the device->host copy of the boolean, for each pipeline stage.
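For item 1, here is a minimal sketch of the mapped-memory flag idea (my illustration, not taken from the question; error checking omitted):
int *h_flag, *d_flag;
cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_flag, h_flag, 0);
*h_flag = 0;
// In the iteration kernel:   if (converged) *d_flag = 1;
// On the host, after the iteration's kernel has completed:
//     if (*h_flag) { /* stop iterating */ }
// No cudaMemcpy of the boolean is required.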
I consider the first two items above fairly obvious, so I'll provide a worked example for item 3. The basic idea is that we will ping-pong between two streams, launching the kernel alternately into one stream and then the other. We will have a 3rd stream so that we can overlap the device->host copy operations with the execution of the next launch. Due to the overlap of the D->H copy with kernel execution, there is effectively no "cost" for the copy operation; it is hidden by kernel execution work.
Here's a fully worked example, plus an nvvp timeline:
$ cat t267.cu
#include <stdio.h>
const int stop_count = 5;
const long long tdelay = 1000000LL;
__global__ void test_kernel(int *icounter, bool *istop, int *ocounter, bool *ostop){
if (*istop) return;
long long start = clock64();
while (clock64() < tdelay+start);
int my_count = *icounter;
my_count++;
if (my_count >= stop_count) *ostop = true;
*ocounter = my_count;
}
int main(){
volatile bool *v_stop;
volatile int *v_counter;
bool *h_stop, *d_stop1, *d_stop2, *d_s1, *d_s2, *d_ss;
int *h_counter, *d_counter1, *d_counter2, *d_c1, *d_c2, *d_cs;
cudaStream_t s1, s2, s3, *sp1, *sp2, *sps;
cudaEvent_t e1, e2, *ep1, *ep2, *eps;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaStreamCreate(&s3);
cudaEventCreate(&e1);
cudaEventCreate(&e2);
cudaMalloc(&d_counter1, sizeof(int));
cudaMalloc(&d_stop1, sizeof(bool));
cudaMalloc(&d_counter2, sizeof(int));
cudaMalloc(&d_stop2, sizeof(bool));
cudaHostAlloc(&h_stop, sizeof(bool), cudaHostAllocDefault);
cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocDefault);
v_stop = h_stop;
v_counter = h_counter;
int n_counter = 1;
h_stop[0] = false;
h_counter[0] = 0;
cudaMemcpy(d_stop1, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_stop2, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter1, h_counter, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter2, h_counter, sizeof(int), cudaMemcpyHostToDevice);
sp1 = &s1;
sp2 = &s2;
ep1 = &e1;
ep2 = &e2;
d_c1 = d_counter1;
d_c2 = d_counter2;
d_s1 = d_stop1;
d_s2 = d_stop2;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
while (v_stop[0] == false){
cudaStreamWaitEvent(*sp2, *ep1, 0);
sps = sp1; // ping-pong
sp1 = sp2;
sp2 = sps;
eps = ep1;
ep1 = ep2;
ep2 = eps;
d_cs = d_c1;
d_c1 = d_c2;
d_c2 = d_cs;
d_ss = d_s1;
d_s1 = d_s2;
d_s2 = d_ss;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
while (n_counter > v_counter[0]);
n_counter++;
if(v_stop[0] == false){
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
}
}
cudaDeviceSynchronize(); // optional
printf("terminated at counter = %d\n", v_counter[0]);
}
$ nvcc -arch=sm_52 -o t267 t267.cu
$ ./t267
terminated at counter = 5
$
In the above diagram, we see that 5 kernel launches are evident (actually 6) and they are bouncing back and forth between two streams. (The 6th kernel launch, which we would expect from the code organization and pipelining, is a very short line at the end of stream15 above. This kernel launches but immediately sees that stop is true, so it exits.) The device -> host copies are in a 3rd stream. If we zoom in closely at the handoff from one kernel iteration to the next:
we see that even these very short D->H memcpy operations are essentially overlapped with the next kernel execution. For reference, the gap between kernel executions above is about 5us.
Note that this was entirely done on linux. If you attempt this on windows WDDM, it may be difficult to achieve anything similar, due to WDDM command batching. Windows TCC should approximately duplicate linux behavior, however.
I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation the code was revised to use CUDA. However because of the large number of statistics I need calculated from the system state (goes as N^2) this part needs to be done on the CPU where there is more memory. Currently the algorithm looks like this:
loop
CUDA kernel making 10s of Monte Carlo steps
Copy system state back to CPU
Calculate statistics
This is inefficient and I would like to have the kernel run persistently while the CPU occasionally queries the state of the system and calculates the statistics while the kernel continues to run.
Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.
How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?
Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).
Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in a non-UVA setup.
The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.
On the host side, the host simply waits for the mailbox to indicate "full", then copies the buffer from device to host, resets the mailbox, and performs the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks to see that the proper iteration number was received.
I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.
The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.
Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function in a grid-striding loop, so that arbitrary size buffers can be handled by the given grid-width.
Here's a fully worked example:
$ cat t942.cu
#include <stdio.h>
#define ITERS 1000
#define DSIZE 65536
#define nTPB 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;
__device__ void my_compute_function(int *buf, int idx, int data){
buf[idx] = data; // put your work code here
}
__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
// assumption of persistent block-limited kernel launch
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int iter_count = 0;
while (iter_count < iterations ){ // persistent until iterations complete
int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
int my_idx = idx;
while (iter_count - itercnt > 1); // don't overrun buffers on device
while (*bufrdy == 2); // wait for buffer to be consumed
while (my_idx < buffersize){ // perform the "work"
my_compute_function(buf, my_idx, iter_count);
my_idx += gridDim.x*blockDim.x; // grid-striding loop
}
__syncthreads(); // wait for my block to finish
__threadfence(); // make sure global buffer writes are "visible"
if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
if (!idx){ // am I the master block/thread?
while (*blkcnt < gridDim.x); // wait for all blocks to finish
*blkcnt = 0;
*bufrdy = 2; // indicate that buffer is ready
__threadfence_system(); // push it out to mapped memory
itercnt++;
}
iter_count++;
}
}
int validate(const int *data, const int dsize, const int val){
for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
return 1;
}
int main(){
int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
volatile int *m_bufrdy1, *m_bufrdy2;
// buffer and "mailbox" setup
cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
cudaCheckErrors("cudaHostAlloc fail");
cudaMalloc(&d_buf1, DSIZE*sizeof(int));
cudaMalloc(&d_buf2, DSIZE*sizeof(int));
cudaCheckErrors("cudaMalloc fail");
cudaStream_t streamk, streamc;
cudaStreamCreate(&streamk);
cudaStreamCreate(&streamc);
cudaCheckErrors("cudaStreamCreate fail");
*m_bufrdy1 = 0;
*m_bufrdy2 = 0;
cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
cudaCheckErrors("cudaMemset fail");
// inefficient crutch for choosing number of blocks
int nblock = 0;
cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
cudaCheckErrors("get multiprocessor count fail");
testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
cudaCheckErrors("kernel launch fail");
volatile int *bufrdy;
int *hbuf, *dbuf;
for (int i = 0; i < ITERS; i++){
if (i & 1){ // ping pong on the host side
bufrdy = m_bufrdy2;
hbuf = h_buf2;
dbuf = d_buf2;}
else {
bufrdy = m_bufrdy1;
hbuf = h_buf1;
dbuf = d_buf1;}
// int qq = 0; // add for failsafe - otherwise a machine failure can hang
while ((*bufrdy)!= 2); // use this for a failsafe: if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
cudaStreamSynchronize(streamc);
cudaCheckErrors("cudaMemcpyAsync fail");
*bufrdy = 0; // release buffer back to device
if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
}
printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$
I've tested the above code and it seems to work well on linux. I believe it should be OK on a windows TCC setup. On windows WDDM, however, I think there are issues that I am still investigating.
Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA now (9.0 and newer) has cooperative groups, and that is the recommended approach, rather than the above methodology, to create a grid-wide sync.
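For reference, a minimal sketch of the cooperative-groups alternative (CUDA 9.0+); the kernel body is illustrative only, and it must be launched via cudaLaunchCooperativeKernel with a grid that fits co-resident on the device:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistent_producer(int *buf, size_t n, int iterations)
{
    cg::grid_group grid = cg::this_grid();
    for (int iter = 0; iter < iterations; iter++) {
        for (size_t i = grid.thread_rank(); i < n; i += grid.size())
            buf[i] = iter;          // the "work" (illustrative)
        grid.sync();                // grid-wide sync replaces the block-counting atomics
        // ... buffer ping-pong and host signaling as in the example above ...
    }
}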
This isn't a direct answer to your question but it may be of help.
I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:
Launch kernel
Copy data
Loop:
    Launch kernel
    CPU work
    Copy data
    CPU work
This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. I am guessing it could get better if the GPU/CPU work can be balanced so they take roughly the same amount of time.
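A rough sketch of that loop (my own illustration; launch_kernel, cpu_work, and the buffer names are placeholders, and h_buf is assumed to be pinned so the async copy can overlap):
launch_kernel(d_buf, stream);                                          // iteration 0
cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);  // copy 0
for (int i = 1; i < iters; ++i) {
    cudaStreamSynchronize(stream);  // results of iteration i-1 are now in h_buf
    launch_kernel(d_buf, stream);   // GPU starts producing iteration i ...
    cpu_work(h_buf);                // ... while the CPU consumes iteration i-1
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);
cpu_work(h_buf);                    // consume the final iteration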
I am still launching the same kernel thousands of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to do what I have accomplished with a single launch would be worth it. Otherwise this is probably the best (simplest) solution.
I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.
void malloc_wrapper() {
// ...
pthread_mutex_lock(&alloc_mutex);
if (boolmutex) {
printf("mutex misbehaving\n");
__THROW_ERROR__; // this happens!
}
boolmutex = true;
// manipulate linked list here
boolmutex = false;
pthread_mutex_unlock(&alloc_mutex);
// ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with less threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all of the suggestions with no results, until I finally decided to write a simple memory allocation stress test - one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
int* alive_flag = (int*)arg;
int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
int cnt = 0;
timeval t_pre, t_post;
gettimeofday(&t_pre, NULL);
const int ALLOCATE=1, FREE=0;
const unsigned int MINSIZE=500, MAXSIZE=1000;
const int MAX_ALLOC=10000;
char* membufs[MAXSIZE];
unsigned int membufs_size = 0;
int num_allocs = 0, num_frees = 0;
while(1)
{
int action;
// Decide whether to allocate or free a memory block.
// if we have less than MINSIZE buffers, allocate.
if (membufs_size < MINSIZE) action = ALLOCATE;
// if we have MAXSIZE, free.
else if (membufs_size >= MAXSIZE) action = FREE;
// else, decide randomly.
else {
action = ((rand() & 0x1)? ALLOCATE : FREE);
}
if (action == ALLOCATE) {
// choose size to allocate, from 1 to MAX_ALLOC bytes
size_t size = (rand() % MAX_ALLOC) + 1;
// allocate and fill memory
char* buf = (char*)malloc(size);
memset(buf, 0x77, size);
// add buffer to list
membufs[membufs_size] = buf;
membufs_size++;
assert(membufs_size <= MAXSIZE);
num_allocs++;
}
else { // action == FREE
// choose a random buffer to free
size_t pos = rand() % membufs_size;
assert (pos < membufs_size);
// free and remove from list by replacing entry with last member
free(membufs[pos]);
membufs[pos] = membufs[membufs_size-1];
membufs_size--;
assert(membufs_size >= 0);
num_frees++;
}
// once in 10 seconds print a status update
gettimeofday(&t_post, NULL);
if (t_post.tv_sec - t_pre.tv_sec >= 10) {
printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
gettimeofday(&t_pre, NULL);
}
// indicate alive to main thread
*alive_flag = ALIVE_INDICATOR;
}
return NULL;
}
int main()
{
int alive_flag[NUM_THREADS];
printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
// start a thread for each core
for (int i=0; i<NUM_THREADS; i++) {
alive_flag[i] = i; // tell each thread its ID.
pthread_t th;
int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
assert(ret == 0);
}
while(1) {
sleep(10);
// check that all threads are alive
bool ok = true;
for (int i=0; i<NUM_THREADS; i++) {
if (alive_flag[i] != ALIVE_INDICATOR)
{
printf("Thread %d is not responding\n", i);
ok = false;
}
}
assert(ok);
for (int i=0; i<NUM_THREADS; i++)
alive_flag[i] = 0;
}
return 0;
}
I'm working on a program consisting of two concurrent threads. One (here "Clock") is performing some computation on a regular basis (10 Hz) and is quite memory-intensive. The other one (here "hugeList") uses even more RAM but is not as time critical as the first one. So I decided to reduce its priority to THREAD_PRIORITY_LOWEST. Yet, when the thread frees most of the memory it has used the critical one doesn't manage to keep its timing.
I was able to condense down the problem to this bit of code (make sure optimizations are turned off!):
While Clock tries to keep a 10 Hz timing, the hugeList thread allocates and frees more and more memory, not organized in any sort of chunks.
#include "stdafx.h"
#include <stdio.h>
#include <forward_list>
#include <time.h>
#include <windows.h>
#include <vector>
void wait_ms(double _ms)
{
clock_t endwait;
endwait = clock () + _ms * CLOCKS_PER_SEC/1000;
while (clock () < endwait) {} // active wait
}
void hugeList(void)
{
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST);
unsigned int loglimit = 3;
unsigned int limit = 1000;
while(true)
{
for(signed int cnt=loglimit; cnt>0; cnt--)
{
printf(" Countdown %d...\n", cnt);
wait_ms(1000.0);
}
printf(" Filling list...\n");
std::forward_list<double> list;
for(unsigned int cnt=0; cnt<limit; cnt++)
list.push_front(42.0);
loglimit++;
limit *= 10;
printf(" Clearing list...\n");
while(!list.empty())
list.pop_front();
}
}
void Clock()
{
clock_t start = clock()-CLOCKS_PER_SEC*100/1000;
while(true)
{
std::vector<double> dummyData(100000, 42.0); // just get some memory
printf("delta: %d ms\n", (clock()-start)*1000/CLOCKS_PER_SEC);
start = clock();
wait_ms(100.0);
}
}
int main()
{
DWORD dwThreadId;
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&Clock, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&hugeList, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
while(true) {;}
return 0;
}
First of all I noticed that allocating memory for the linked list is way faster than freeing it.
On my machine (Windows7) at around the 4th iteration of the "hugeList"-method the Clock-Thread gets significantly disturbed (up to 200ms). The effect disappears without the dummyData-vector "asking" for some memory in the Clock-Thread.
So,
Is there any way of increasing the priority of memory allocation for the Clock-Thread in Win7?
Or do I have to split both operations onto two contexts (processes)?
Note that my original code uses some communication via shared variables which would require for some kind of IPC if I chose the second option.
Note that my original code gets stuck for about 1 sec when the equivalent of the "hugeList" method clears a boost::unordered_map and enters ntdll.dll!RtlInitializeCriticalSection many, many times.
(observed with Sysinternals Process Explorer)
Note that the effects observed are not due to swapping, I'm using 1.4GB of my 16GB (64bit win7).
edit:
just wanted to let you know that up to now I haven't been able to solve my issue. Splitting both parts of the code onto two processes does not seem to be an option since my time is rather limited and I've never worked with processes so far. I'm afraid I won't be able to get to a running version in time.
However, I managed to reduce the effects by reducing the number of memory deallocations made by the non-critical thread. This was achieved by using a fast pooling memory allocator (like the one provided in the Boost library).
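For illustration, one way to do this with Boost.Pool is to give the container a pool-backed allocator, so that tearing the list down returns nodes to the pool's free list instead of hitting the process heap for every node (a sketch, not the code I actually used):
#include <boost/pool/pool_alloc.hpp>
#include <forward_list>

typedef std::forward_list<double, boost::fast_pool_allocator<double> > pooled_list;

void fill_and_clear(unsigned int limit)
{
    pooled_list list;
    for (unsigned int cnt = 0; cnt < limit; ++cnt)
        list.push_front(42.0);
    list.clear();   // nodes go back to the pool, not the process heap
}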
There does not seem to be a way of explicitly creating certain objects (like, e.g., the huge forward list in my example) on some sort of thread-private heap that would not require synchronisation.
For further reading:
http://bmagic.sourceforge.net/memalloc.html
Do threads have a distinct heap?
Memory Allocation/Deallocation Bottleneck?
http://software.intel.com/en-us/articles/avoiding-heap-contention-among-threads
http://www.boost.org/doc/libs/1_55_0/libs/pool/doc/html/boost_pool/pool/introduction.html
Replacing std::forward_list with std::list, I ran your code on a Core i7 machine with 4 GB of RAM until 2 GB were consumed. No disturbances at all (in a debug build).
P.S.
Yes, the release build recreates the issue. I replaced the forward list with an array:
double* p = new double[limit];
for(unsigned int cnt=0; cnt<limit; cnt++)
p[cnt] = 42.0;
and
for(unsigned int cnt=0; cnt<limit; cnt++)
p[cnt] = -1;
delete [] p;
It does not recreate the issue then.
It seems the thread scheduler penalizes the thread for asking for lots of small memory chunks.