I created a C++ DLL function that uses several arrays to process what is eventually image data. I'm attempting to pass these arrays by reference, do the computation, and pass the output back by reference in a pre-allocated array. Within the function I use the Intel Performance Primitives including ippsMalloc and ippsFree:
Process.dll
int __stdcall ProcessImage(const float *Ref, const float *Source, float *Dest, const float *x, const float *xi, const int row, const int col, const int DFTlen, const int IMGlen)
{
    int k, l;
    IppStatus status;
    IppsDFTSpec_R_32f *spec;
    Ipp32f *y = ippsMalloc_32f(row),
           *yi = ippsMalloc_32f(DFTlen),
           *X = ippsMalloc_32f(DFTlen),
           *R = ippsMalloc_32f(DFTlen);

    for (int i = 0; i < col; i++)
    {
        for (int j = 0; j < row; j++)
            y[j] = Source[j + (row * i)];

        status = ippsSub_32f_I(Ref, y, row);
        // Some interpolation calculations here
        status = ippsDFTInitAlloc_R_32f(&spec, DFTlen, IPP_FFT_DIV_INV_BY_N, ippAlgHintNone);
        status = ippsDFTFwd_RToCCS_32f(yi, X, spec, NULL);
        status = ippsMagnitude_32fc( (Ipp32fc*)X, R, DFTlen);

        for (int m = 0; m < IMGlen; m++)
            Dest[m + (IMGlen * i)] = 10 * log10(R[m]);
    }

    _CrtDumpMemoryLeaks();
    ippsDFTFree_R_32f(spec);
    ippsFree(y);
    ippsFree(yi);
    ippsFree(X);
    ippsFree(R);
    return(status);
}
The function call looks like this:
for (int i = 0; i < Frames; i++)
    ProcessImage(&ref[i * FrameSize], &source[i * FrameSize], &dest[i * FrameSize], mX, mXi, NumPixels, Alines, DFTLength, IMGLength);
The function does not fail and produces the desired output for up to 6 images, more than that and it dies with:
First-chance exception at 0x022930e0 in DLL_test.exe: 0xC0000005: Access violation reading location 0x1cdda000.
I've attempted to debug the program; unfortunately, VS reports that the call stack location is in an IPP DLL with "No Source Available". It consistently fails when calling ippsMagnitude_32fc( (Ipp32fc*)X, R, DFTlen).
Which leads me to my questions: Is this a memory leak? If so, can anybody see where the leak is located? If not, can somebody suggest how to go about debugging this problem?
To answer your first question: no, that is not a memory leak; it is memory corruption.
A memory leak is when you never free memory you have used, so memory usage keeps growing. That alone does not make the program misbehave; it only ends up consuming too much memory, which makes the computer very slow (swapping) and ultimately makes some program crash with a 'Not enough memory' error.
What you have is a basic pointer error, as happens all the time in C++.
Explaining how to debug it is hard. I suggest you set a breakpoint just before the crash and try to see what's wrong.
Related
I have always been confused and never understood how the alloc map-type of the map clause of the target (or target data) construct works.
My application is this: I would like to have a temporary array on a device that is used only on the device: initialized on the device, read on the device, everything on the device. The host does not touch the contents of the array at all. For the sake of simplicity, I have the following code, which copies an array to another array via a temporary array (using just a single team and thread, but that does not matter):
#include <cstdio>

int main()
{
    const int count = 10;
    int * src = new int[count];
    int * tmp = new int[count];
    int * dst = new int[count];

    for(int i = 0; i < count; i++) src[i] = i;
    for(int i = 0; i < count; i++) printf(" %3d", src[i]); printf("\n");

    #pragma omp target map(to:src[0:count]) map(from:dst[0:count]) map(alloc:tmp[0:count])
    {
        for(int i = 0; i < count; i++) tmp[i] = src[i];
        for(int i = 0; i < count; i++) dst[i] = tmp[i];
    }

    for(int i = 0; i < count; i++) printf(" %3d", dst[i]); printf("\n");

    delete[] src;
    delete[] tmp;
    delete[] dst;
    return 0;
}
This code works on an Nvidia GPU using pgc++ -mp=gpu, and on an Intel GPU using icpx -fiopenmp -fopenmp-targets=spir64.
But the thing is, I don't want to allocate the tmp array on the host. If I just use int * tmp = nullptr, the code fails on Nvidia (on Intel it still works). If I leave tmp uninitialized (just int * tmp;, removing the delete), execution fails on Intel too. If I do not declare the tmp variable at all, compilation fails (which kinda makes sense). I made sure it really runs on the device (actually offloads the code, no CPU fallback) using OMP_TARGET_OFFLOAD=MANDATORY.
This was weird to me, since I don't use the tmp array on the host at all. As I understand it, the tmp array is allocated on the device and then in the kernel the device array is used. Is that right? Why do I have to allocate and/or initialize the pointer on the host if I don't use it on the host?
So my question is: what are the exact requirements to use map(alloc) in OpenMP offloading? How does it work? How should I use it? I would appreciate an example and references from tutorials/documentation.
I wasn't able to find any useful information regarding this. The standard was not helpful at all, and the tutorials I attended and watched did not go into such depth.
I understand that the code should work even without OpenMP enabled (as if the pragmas were just ignored), so let's assume there is an #ifdef to actually allocate the tmp array if OpenMP is disabled.
I am also aware of manual memory management via omp_target_alloc(), omp_target_memcpy() and omp_target_free(), but I wanted to use the target map(alloc).
I am reading the standard 5.2, using pgc++ 22.2-0 and icpx 2022.0.0.20211123.
I'm experiencing strange memory access performance problem, any ideas?
int* pixel_ptr = somewhereFromHeap;
int local_ptr[307200]; // local

// this is very slow
for(int i = 0; i < 307200; i++){
    pixel_ptr[i] = someCalculatedVal;
}

// this is very slow
for(int i = 0; i < 307200; i++){
    pixel_ptr[i] = 1; // constant
}

// this is fast
for(int i = 0; i < 307200; i++){
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

// this is fast
for(int i = 0; i < 307200; i++){
    local_ptr[i] = someCalculatedVal;
}
I tried consolidating the values into a local scanline:
int scanline[640]; // local

// this is very slow
for(int i = xMin; i < xMax; i++){
    int screen_pos = sy*screen_width + i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = val;
}

// this is fast
for(int i = xMin; i < xMax; i++){
    int screen_pos = sy*screen_width + i;
    int val = scanline[i];
    pixel_ptr[screen_pos] = 1; // constant
}

// this is fast
for(int i = xMin; i < xMax; i++){
    int screen_pos = sy*screen_width + i;
    int val = i; // or a constant
    pixel_ptr[screen_pos] = val;
}

// this is slow
for(int i = xMin; i < xMax; i++){
    int screen_pos = sy*screen_width + i;
    int val = scanline[0];
    pixel_ptr[screen_pos] = val;
}
Any ideas? I'm using MinGW with cflags -O1 -std=c++11 -fpermissive.
Update 4:
I should say that these are snippets from my program, and there is heavy code running before and after them. The scanline block ran at the end of the function, just before exit.
Now with a proper test program, thanks to @Iwillnotexist:
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define SIZE 307200
#define SAMPLES 1000

double local_test(){
    int local_array[SIZE];
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i = 0; i < SIZE; i++){
        local_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
    return cpu_time_used;
}

double heap_test(){
    int* heap_array = new int[SIZE];
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i = 0; i < SIZE; i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;

    delete[] heap_array;
    return cpu_time_used;
}

double heap_test2(){
    static int* heap_array = NULL;
    if(heap_array == NULL){
        heap_array = new int[SIZE];
    }
    timeval start, end;
    long cpu_time_used_sec, cpu_time_used_usec;
    double cpu_time_used;

    gettimeofday(&start, NULL);
    for(int i = 0; i < SIZE; i++){
        heap_array[i] = i;
    }
    gettimeofday(&end, NULL);

    cpu_time_used_sec = end.tv_sec - start.tv_sec;
    cpu_time_used_usec = end.tv_usec - start.tv_usec;
    cpu_time_used = cpu_time_used_sec*1000 + cpu_time_used_usec/1000.0;
    return cpu_time_used;
}

int main(int argc, char** argv){
    double cpu_time_used = 0;
    for(int i = 0; i < SAMPLES; i++)
        cpu_time_used += local_test();
    printf("local: %f ms\n", cpu_time_used);

    cpu_time_used = 0;
    for(int i = 0; i < SAMPLES; i++)
        cpu_time_used += heap_test();
    printf("heap_: %f ms\n", cpu_time_used);

    cpu_time_used = 0;
    for(int i = 0; i < SAMPLES; i++)
        cpu_time_used += heap_test2();
    printf("heap2: %f ms\n", cpu_time_used);
    return 0;
}
Compiled with no optimization.
local: 577.201000 ms
heap_: 826.802000 ms
heap2: 686.401000 ms
The first heap test, with new and delete, is 2x slower (paging, as suggested?).
The second heap test, with a reused heap array, is still 1.2x slower.
But I guess the second test is not that realistic, since other code tends to run before and after, at least in my case. And my pixel_ptr is of course only allocated once, during program initialization.
If anyone has ideas for speeding things up, please reply!
I'm still perplexed as to why heap writes are so much slower than writes to the stack segment.
Surely there must be some trick to make the heap more CPU/cache-friendly.
Final update?:
I revisited the disassemblies, and this time I suddenly saw why some of my breakpoints don't activate. The program looks suspiciously short, so I suspect the compiler removed the redundant dummy code I put in, which explains why the local array is magically many times faster.
I was a bit curious so I did the test, and indeed I could measure a difference between stack and heap access.
The first guess would be that the generated assembly is different, but after taking a look, it is actually identical for heap and stack (which makes sense, memory shouldn't be discriminated).
If the assembly is the same, then the difference must come from the paging mechanism. The guess is that on the stack the pages are already allocated, but on the heap the first access causes a page fault and a page allocation (invisible; it all happens at kernel level).

To verify this, I did the same test, but accessed the heap once before measuring. That test gave identical times for stack and heap. To be sure, I also did a test in which I first touched the heap only every 4096 bytes (every 1024th int), then every 8192, since a page is usually 4096 bytes long. The result: touching every 4096 bytes beforehand also gives the same time for heap and stack, but touching only every 8192 still shows a difference, though not as large as with no previous access at all, because only half of the pages were accessed and allocated beforehand.

So the answer is that on the stack, memory pages are already allocated, but on the heap, pages are allocated on the fly. This depends on the OS paging policy, but all major PC OSes probably behave similarly.
For all the tests I used Windows, with MS compiler targeting x64.
EDIT: For the test, I measured a single, larger loop, so there was only one access to each memory location. Deleting the array and measuring the same loop multiple times should give similar times for stack and heap, because deleting the memory probably doesn't de-allocate the pages, and they are already allocated for the next loop (if the next new allocates in the same space).
The following two code examples should not differ in runtime with a good compiler setting. Probably your compiler will generate the same code:
// this is fast
for(int i = 0; i < 307200; i++){
    int val = pixel_ptr[i];
    local_ptr[i] = val;
}

// this is fast
for(int i = 0; i < 307200; i++){
    local_ptr[i] = pixel_ptr[i];
}
Please try increasing the optimization level.
I am doing some audio processing and am therefore mixing some C and Objective-C. I have set up a class that handles my OpenAL interface and my audio processing. I have changed the file extension to .mm, as described in the Core Audio book and in many examples online.
I have a C-style function declared in the .h file and implemented in the .mm file:
static void granularizeWithData(float *inBuffer, unsigned long int total) {
    // create grains of audio data from a buffer read in using ExtAudioFileRead().
    // total value is: 235377
    float tmpArr[total];
    // now I try to zero pad a new buffer:
    for (int j = 1; j <= 100; j++) {
        tmpArr[j] = 0;
        // CRASH on first iteration EXC_BAD_ACCESS (code=1, address= ...blahblah)
    }
}
Strange? Yes, I am totally out of ideas as to why that doesn't work, while the following works:
float tmpArr[235377];
for (int j = 1; j <= 100; j++) {
    tmpArr[j] = 0;
    // This works and index 0 - 99 are filled with zeros
}
Does anyone have any clue as to why I can't declare an array of size 'total', which holds an int value? My project uses ARC, but I don't see why that would cause a problem. When I print the value of 'total' while debugging, it is in fact the correct value. If anyone has any ideas, please help; it is driving me nuts!
The problem is that the array gets allocated on the stack, not on the heap. Stack size is limited, so you can't allocate an array of 235377*sizeof(float) bytes on it; it's too large. Use the heap instead:
float *tmpArray = NULL;
tmpArray = (float *) calloc(total, sizeof(float)); // allocate it

// test that you actually got the memory you asked for
if (tmpArray)
{
    // use it
    free(tmpArray); // release it
}
Mind that you are always responsible for freeing memory allocated on the heap, or you will create a leak.
In your second example, since the size is known a priori, the compiler can reserve that space up front, which lets it work. But in your first example the space must be created on the fly, which triggers the error. In any case, before concluding that your second example really works, you should try accessing all the elements of the array, not just the first 100.
I have previously used SIMD operators to improve the efficiency of my code, however I am now facing a new error which I cannot resolve. For this task, speed is paramount.
The size of the array will not be known until the data is imported, and it may be very small (100 values) or enormous (10 million values). For the latter case the code works fine; however, I encounter an error when I use fewer than 130036 array values.
Does anyone know what is causing this issue and how to resolve it?
I have attached the (tested) code involved, which will be used later in a more complicated function. The error occurs at "arg1List[i] = ..."
#include <iostream>
#include <xmmintrin.h>
#include <emmintrin.h>

int main()
{
    int j;
    const int loop = 130036;
    const int SIMDloop = (int)(loop/4);

    __m128 *arg1List = new __m128[SIMDloop];
    printf("sizeof(arg1List) = %zu, alignof(arg1List) = %zu, pointer = %p",
           sizeof(arg1List), __alignof(arg1List), (void*)arg1List);
    std::cout << std::endl;

    for (int i = 0; i < SIMDloop; i++)
    {
        j = 4*i;
        arg1List[i] = _mm_set_ps((j+1)/100.0f, (j+2)/100.0f, (j+3)/100.0f, (j+4)/100.0f);
    }
    return 0;
}
Alignment is the reason.
MOVAPS--Move Aligned Packed Single-Precision Floating-Point Values
[...] The operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated.
You can see the issue is gone as soon as you align your pointer:
__m128 *arg1List = new __m128[SIMDloop + 1];
arg1List = (__m128*) (((uintptr_t) arg1List + 15) & ~(uintptr_t) 15);
I've got a problem of execution with a C++ program. First of all, I'm working on a MacBook Pro, using native g++ to compile.
My program builds an array of Record*. Each record has a multidimensional key. Then it iterates over each record to find its unidimensional float key.
In the end, given an interval of two multidimensional keys, it determines if a given float corresponds to a multidimensional key in this interval. The algorithm is taken from a research paper, and it is quite simple in implementation.
Up to 100,000 computed values there is no problem: the program does its job. But when I go to 1,000,000 values, execution crashes. Here is the error reported in gdb:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00007fff5f08dcd0
0x00000001000021ab in TestPyramid () at include/indextree_test.cc:444
Here is the full backtrace given by gdb :
(gdb) backtrace full
#0 0x00000001000021ab in TestPyramid () at include/indextree_test.cc:444
test_records =
#1 0x00000001000027be in main (argc=<value temporarily unavailable, due to optimizations>, argv=0x7fff5fbff8f8) at include/indextree_test.cc:83
rc = <value temporarily unavailable, due to optimizations>
progName = 0x7fff5fbff9f8 "/Users/Max/Documents/indextree_test"
testNum = 4
The lines given are the calls to the function.
Here is a sample of code :
Record* test_records[1000000];
float values[1000000];
int base = 0;

for (int i(0); i < 1000000; i++)
{
    test_records[i] = CreateRecordBasic(i%30+10, i+i%100, "ab", "Generic Payload");
    if (i%30+10 > base)
        base = i%30+10;
    if (i+10*i > base)
        base = i+10*i;
    if (i > base)
        base = i;
}

for (int i(0); i < 1000000; i++)
    values[i] = floatValueFromKey(test_records[i]->key, base, num_char);
And in the end, I put the relevant float keys in a list.
Is the problem a limitation of my computer? Did I allocate the memory in a bad way?
Thanks for your help,
Max.
Edit :
Here is the code of CreateRecordBasic :
Record *CreateRecordBasic(int32_t attribute_1, int64_t attribute_2, const char* attribute_3, const char* payload){
    Attribute** a = new Attribute*[3];
    a[0] = ShortAttribute(attribute_1);
    a[1] = IntAttribute(attribute_2);
    a[2] = VarcharAttribute(attribute_3);

    Record *record = new Record;
    record->key.value = a;
    record->key.attribute_count = 3;
    SetValue(record->payload, payload);
    return record;
}
Record* test_records[1000000];
float values[1000000];
IMHO, these variables are too big to be stored in the stack whose size is defined by your environment. values takes up 4 megabytes and test_records may take 4-8 megabytes, this is pretty big amount of stack-space. Compiler does not exactly know the size of the system-allocated stack (this may change from system to system) , so you get the error at run-time. Try to allocate them on the heap...