I intend to use ALL the available on-GPU memory for my algorithm, so I retrieve its amount with:
clGetDeviceInfo( ..., CL_DEVICE_GLOBAL_MEM_SIZE, ... );
which returns 536543232 bytes, and then allocate that much on the GPU with:
clCreateBuffer( gpuContext, CL_MEM_READ_WRITE, 536543232, NULL, & errcode_ret );
I wondered why this worked and whether it would fail if I tried to allocate more memory. I tried 100 gigabytes and it still worked!
clCreateBuffer( gpuContext, CL_MEM_READ_WRITE, 100000000000, NULL, & errcode_ret );
So the question is: why does it work with whatever amount of memory is specified?
It may happen if the OpenCL platform uses lazy memory allocation (almost every platform does). I guess some OpenCL platforms check at clCreateBuffer whether the requested size can actually be allocated, and apparently yours doesn't. You will probably get an error from the first OpenCL function that actually uses your buffer, such as clEnqueueWriteBuffer(). What is your OpenCL platform?
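For what it's worth, here is a minimal sketch (the queue variable is assumed; error handling abbreviated) that forces a lazy platform to actually materialize the buffer; the failure typically surfaces here rather than at clCreateBuffer():

cl_int err;
cl_mem buf = clCreateBuffer(gpuContext, CL_MEM_READ_WRITE,
                            100000000000ULL, NULL, &err);
// On a lazy platform err can still be CL_SUCCESS at this point.
char byte = 0;
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE,
                           100000000000ULL - 1, 1, &byte,
                           0, NULL, NULL);
// Touching the last byte forces the allocation; expect an error such as
// CL_MEM_OBJECT_ALLOCATION_FAILURE if the size was never realistic.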
Related
I am having some odd behavior when using VirtualAlloc. I'm in C++, Visual Studio 2010.
I have two things I want to allocate, and I'm using VirtualAlloc (I have my reasons, irrelevant to the question)
1 - Space to hold a buffer of x86 assembly code
2 - Space to hold the data structure that the x86 code wants
In my code I am doing:
thread_data_t * p_data = (thread_data_t*)VirtualAlloc(NULL, sizeof(thread_data_t), MEM_COMMIT, PAGE_READWRITE);
//set up all the values in the structure
unsigned char* p_function = (unsigned char*)VirtualAlloc(NULL, sizeof(buffer), MEM_COMMIT, PAGE_EXECUTE_READWRITE);
memcpy(p_function, buffer, sizeof(buffer));
CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)p_function, p_data, 0, NULL);
In DEBUG mode: works fine.
In RELEASE mode: the spun-up thread receives a null as its input data. I verified through debugging that the pointer is correct when I call CreateThread.
If I switch the VirtualAlloc calls around, so that I allocate the function space before the data space, then both DEBUG and RELEASE mode work fine.
Any ideas why? I've verified all my VS build settings are the same between DEBUG/RELEASE
After copying assembly code into a memory buffer, you can't just jump straight into that buffer. You need to flush CPU caches and the like or it will not work. You can use FlushInstructionCache to do this.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms679350%28v=vs.85%29.aspx
It's hard to say exactly why reordering the allocations would fix the issue, but if you copied the instructions into their buffer and then did a lot of work before jumping into the buffer, that would likely improve the odds of "getting away with it," as the CPU caches would have more of an opportunity to get flushed out by other means.
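A minimal sketch of where the call belongs in the question's code (variable names taken from the question; error handling omitted):

unsigned char* p_function = (unsigned char*)VirtualAlloc(
    NULL, sizeof(buffer), MEM_COMMIT, PAGE_EXECUTE_READWRITE);
memcpy(p_function, buffer, sizeof(buffer));
// Make the freshly copied bytes visible to the CPU's instruction fetch
// path before any thread jumps into them.
FlushInstructionCache(GetCurrentProcess(), p_function, sizeof(buffer));
CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)p_function, p_data, 0, NULL);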
I am currently evaluating embOS from SEGGER running on a Cortex-M4F. It has 128 kilobytes of internal RAM and 2 megabytes of external RAM, so I know I have plenty of memory.
My program uses some dynamic allocations (yes, I am aware that is not recommended on embedded systems).
When starting my task, I try to call malloc/OS_malloc, where OS_malloc is the thread-safe version provided by embOS. In both cases malloc fails and returns a NULL pointer.
When doing the same malloc/OS_malloc before the OS starts, it works correctly:
//Malloc here does not fail
OS_IncDI();    /* Initially disable interrupts */
//Malloc here does not fail
OS_InitKern(); /* Initialize OS */
//Malloc here does fail!!
OS_InitHW();   /* Initialize Hardware for OS */
OS_CREATETASK(&TCBHP, "My Task", HPTask, 50, StackHP); // <-- And of course malloc fails inside the task also
OS_Start();
I went and tried using uCOS from MICRIUM, and I see the same behavior. Any ideas why this happens?
I think I am on my way to fixing the problem.
It seems that setting, in the linker script:
_Min_Heap_Size = 0x19000; /* required amount of heap */
_Min_Stack_Size = 0x200; /* required amount of stack */
instead of:
_Min_Heap_Size = 0x00; /* required amount of heap */
_Min_Stack_Size = 0x200; /* required amount of stack */
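That fits the failure mode: with _Min_Heap_Size = 0x00 the linker reserves no heap region at all, so allocations have nowhere to go. As a rough, assumption-laden sketch, a typical newlib-style _sbrk on a Cortex-M target looks something like this (the symbol names end and _estack are assumed from a standard GNU linker script; your port may differ):

#include <stddef.h>

extern char end;      /* linker symbol: start of the heap region */
extern char _estack;  /* assumed linker symbol: top of RAM / stack base */
static char *heap_ptr = &end;

void *_sbrk(ptrdiff_t incr)
{
    char *prev = heap_ptr;
    /* With _Min_Heap_Size = 0 there is no room between 'end' and the
       stack reservation, so this check fails and malloc returns NULL. */
    if (heap_ptr + incr > &_estack - 0x200 /* _Min_Stack_Size */)
        return (void *)-1;
    heap_ptr += incr;
    return prev;
}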
malloc may fail in the following conditions:
1) Running out of memory; but as you said, you have plenty of memory, so this is not the case.
2) malloc is not able to allocate contiguous memory of the requested size.
I guess option 2 is responsible in your case.
Having read this interesting article outlining a technique for debugging heap corruption, I started wondering how I could tweak it for my own needs. The basic idea is to provide a custom malloc() that allocates whole pages of memory, then enables memory protection bits for those pages, so that the program crashes when they get written to and the offending write instruction can be caught in the act. The sample code is C under Linux (mprotect() is used to enable the protection), and I'm curious how to apply this to native C++ and Windows. VirtualAlloc() and/or VirtualProtect() look promising, but I'm not sure what a usage scenario would look like.
Fred *p = new Fred[100];
ProtectBuffer(p);
p[10] = Fred(); // like this to crash please
I am aware of the existence of specialized tools for debugging memory corruption in Windows, but I'm still curious if it would be possible to do it "manually" using this approach.
EDIT: Also, is this even a good idea under Windows, or just an entertaining intellectual exercise?
Yes, you can use VirtualAlloc and VirtualProtect to set up sections of memory that are protected from read/write operations.
You would have to re-implement operator new and operator delete (and their [] relatives), such that your memory allocations are controlled by your code.
And bear in mind that it would only be on a per-page basis, and you would be using (at least) three pages worth of virtual memory per allocation - not a huge problem on a 64-bit system, but may cause problems if you have many allocations in a 32-bit system.
Roughly what you need to do (you should actually query the page size with GetSystemInfo rather than hard-coding it - I'm too lazy, so I'll use 4096 and 4095 to represent pagesize and pagesize-1 - you will also need to do more error checking than this code does!!!):
void *operator new(size_t size)
{
// Round size up to a whole number of pages, plus 2 extra guard pages.
size_t bigsize = (size + 2*4096 + 4095) & ~4095;
// Reserve the whole region with no access rights yet; note the argument
// order: the allocation type comes before the protection flags.
void *addr = VirtualAlloc(NULL, bigsize, MEM_RESERVE, PAGE_NOACCESS);
addr = reinterpret_cast<void *>(reinterpret_cast<char *>(addr) + 4096);
void *new_addr = VirtualAlloc(addr, size, MEM_COMMIT, PAGE_READWRITE);
return new_addr;
}
void operator delete(void *ptr)
{
char *tmp = reinterpret_cast<char *>(ptr) - 4096;
VirtualFree(reinterpret_cast<void*>(tmp), 0, MEM_RELEASE);
}
Something along those lines, as I said - I haven't tried compiling this code, as I only have a Windows VM, and I can't be bothered to download a compiler and see if it actually compiles. [I know the principle works, as we did something similar where I worked a few years back].
This is what Guard Pages are for (see this MSDN tutorial): they raise a special exception when the page is accessed the first time, allowing you to do more than crash on the first invalid page access (and catch bad reads/writes as opposed to just NULL pointers etc.).
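A minimal sketch of the guard-page mechanism (assumes a 4096-byte page size and omits error checking; a demonstration, not a full allocator):

#include <windows.h>
#include <cstdio>

static LONG CALLBACK GuardHandler(EXCEPTION_POINTERS* info)
{
    if (info->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION)
    {
        // ExceptionInformation[1] holds the faulting address.
        std::printf("guard page touched at %p\n",
                    (void*)info->ExceptionRecord->ExceptionInformation[1]);
        // Windows clears the guard bit after the first trap, so the
        // faulting access is retried and succeeds.
        return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

int main()
{
    AddVectoredExceptionHandler(1, GuardHandler);
    char* p = (char*)VirtualAlloc(NULL, 4096, MEM_COMMIT,
                                  PAGE_READWRITE | PAGE_GUARD);
    p[0] = 42; // the first touch trips the guard-page exception
    return 0;
}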
I'm new to OpenCL and have some problems with array addition.
I use the code provided in the link below
http://code.google.com/p/opencl-book-samples/source/browse/#svn%2Ftrunk%2Fsrc%2FChapter_2%2FHelloWorld%253Fstate%253Dclosed
and I added some parts to measure the performance of the GPU:
clFinish(commandQueue);
// Queue the kernel up for execution across the array
cl_ulong start, end;
cl_event k_events;
errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                globalWorkSize, localWorkSize,
                                0, NULL, &k_events);
// Wait for the kernel to finish, then read the profiling timestamps.
// Note: the queue must be created with CL_QUEUE_PROFILING_ENABLE for
// clGetEventProfilingInfo to return valid data.
clWaitForEvents(1, &k_events);
clGetEventProfilingInfo(k_events, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(k_events, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);
float GPUTime = (float)(end - start); // elapsed time in nanoseconds
And this to measure the CPU time:
LARGE_INTEGER CPUstart, finish, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&CPUstart);
for (int i=0;i<ARRAY_SIZE;i++){
result[i]=a[i]+b[i];
}
QueryPerformanceCounter(&finish);
double timeCPU = (finish.QuadPart - CPUstart.QuadPart) * 1000000000.0 / (double)freq.QuadPart; // nanoseconds
The first problem I encountered is the array size: it can't go beyond 10000; if it does, the program just crashes. How do I fix that?
The second problem is the performance: the GPU/CPU ratio ranges too widely, from 13% to 210%(ish). Why does this happen, and can you suggest a fix?
Edit: I figured out the second one; the lag was caused by the power-saving mode, which set the core/memory clocks much lower than default. Locking them with a utility keeps the performance rock stable at ~150-300% (GPU/CPU).
A good case:
GPU time: 632667 nanosecs.
CPU time: 990023 nanosecs.
GPU/CPU ratio: 156.484 percent.
And a bad one:
GPU time: 6.83267e+006 nanosecs.
CPU time: 1.00756e+006 nanosecs.
GPU/CPU ratio: 14.7462 percent.
Any ideas will be appreciated. Thank you :D
PS: The CPU is a Core i3-370M; the GPU an HD 5470. I use VS2008 on Windows 7.
One possible (and most probable) reason that your program crashes with bigger array sizes is due to the following code in main.cpp (lines 274-276 in the original code):
float result[ARRAY_SIZE];
float a[ARRAY_SIZE];
float b[ARRAY_SIZE];
These are automatic arrays and space for them is allocated on the stack of the main function. The total space required is 3*ARRAY_SIZE*sizeof(float), which equals 12*ARRAY_SIZE bytes. The default stack size on Windows is 1 MiB, which means ARRAY_SIZE could be up to about 87380. That is the upper limit given the default stack size, and since the stack is also used for other things, the real value will be somewhat lower.
You can increase the stack size on the Linker -> System page of your VS project properties. Or, better, allocate those arrays on the heap using malloc() or new[].
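For instance, a sketch of the heap-based alternative (ARRAY_SIZE as in the sample code):

float* result = new float[ARRAY_SIZE];
float* a = new float[ARRAY_SIZE];
float* b = new float[ARRAY_SIZE];
// ... fill a and b, run the kernel, read back result ...
delete[] result;
delete[] a;
delete[] b;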
Here's a good answer that explains why you are hitting your limit:
Is there a max array length limit in C++?
If you can figure out a way to manage memory more carefully in your code, that might help alleviate some of your problems.
Btw, I'd look at other OSes, like a Linux environment, which might help run your code. Windows is full of memory-hogging services, and that might be a factor in your problem. Or you can just get better hardware.
A few things:
If your local work size does not evenly divide your global work size, you may end up with a small leftover fraction. E.g.: local size 100 and global size 1050 -> 50 extra. IIRC this leftover bit still gets processed. A fix for that problem is to a) make sure the global size is a multiple of the local size, or b) check a guard variable in the kernel and return early if it is outside the range (see the kernel sketch after this list).
Secondly, I noticed some strangeness with clGetEventProfilingInfo: sometimes it would be quite accurate and sometimes quite inaccurate. I ended up using clFinish and QueryPerformanceCounter to benchmark my CL code.
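A minimal sketch of the guard-check approach in OpenCL C (the kernel shape is assumed from the HelloWorld sample, with an element count added as an extra kernel argument):

__kernel void hello_kernel(__global const float* a,
                           __global const float* b,
                           __global float* result,
                           const int n)
{
    int gid = get_global_id(0);
    if (gid >= n)   // guard: leftover work-items past the array do nothing
        return;
    result[gid] = a[gid] + b[gid];
}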
You can use the clGetDeviceInfo API call to determine two key parameters of your OpenCL device:
CL_DEVICE_MAX_MEM_ALLOC_SIZE and CL_DEVICE_GLOBAL_MEM_SIZE
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html
These determine how much global memory you are able to use and how much of it you can allocate in a single buffer.
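A minimal sketch of the query (assumes a valid device ID; error handling omitted):

cl_ulong max_alloc = 0, global_mem = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem), &global_mem, NULL);
printf("max single allocation: %llu of %llu bytes total\n",
       (unsigned long long)max_alloc, (unsigned long long)global_mem);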
I'm getting some strange, intermittent data aborts (< 5% of the time) in some of my code when calling memset(). The problem is that it usually doesn't happen unless the code has been running for a couple of days, so it's hard to catch it in the act.
I'm using the following code:
char *msg = (char*)malloc(sizeof(char)*2048);
char *temp = (char*)malloc(sizeof(char)*1024);
memset(msg, 0, 2048);
memset(temp, 0, 1024);
char *tempstr = (char*)malloc(sizeof(char)*128);
sprintf(temp, "%s %s/%s %s%s", EZMPPOST, EZMPTAG, EZMPVER, TYPETXT, EOL);
strcat(msg, temp);
//Add Data
memset(tempstr, '\0', 128);
wcstombs(tempstr, gdevID, wcslen(gdevID));
sprintf(temp, "%s: %s%s", "DeviceID", tempstr, EOL);
strcat(msg, temp);
As you can see, I'm not trying to use memset with a size larger than what was originally allocated with malloc().
Anyone see what might be wrong with this?
malloc can return NULL if no memory is available. You're not checking for that.
There's a couple of things. You're using sprintf which is inherently unsafe; unless you're 100% positive that you're not going to exceed the size of the buffer, you should almost always prefer snprintf. The same applies to strcat; prefer the safer alternative strncat.
Obviously this may not fix anything, but it goes a long way toward helping you spot what might otherwise be very annoying bugs.
"malloc can return NULL if no memory is available. You're not checking for that."
Right you are... I didn't think about that, as I was monitoring the memory and there was enough free. Is there any way for there to be available memory on the system but for malloc to fail?
Yes, if memory is fragmented. Also, when you say "monitoring memory," there may be something on the system which occasionally consumes a lot of memory and then releases it before you notice. If your call to malloc occurs then, there won't be any memory available. -- Joel
Either way...I will add that check :)
wcstombs doesn't get the size of the destination, so it can, in theory, buffer overflow.
And why are you using sprintf with what I assume are constants? Just use:
EZMPPOST" " EZMPTAG "/" EZMPVER " " TYPETXT EOL
C and C++ combine adjacent string literals into a single string.
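For example (macro values invented for illustration):

#define EZMPPOST "POST"
#define EZMPTAG  "EZMP"
#define EZMPVER  "1.0"
// Adjacent string literals merge at compile time:
const char header[] = EZMPPOST " " EZMPTAG "/" EZMPVER; // "POST EZMP/1.0"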
Have you tried using Valgrind? That is usually the fastest and easiest way to debug these sorts of errors. If you are reading or writing outside the bounds of allocated memory, it will flag it for you.
"You're using sprintf which is inherently unsafe; unless you're 100% positive that you're not going to exceed the size of the buffer, you should almost always prefer snprintf. The same applies to strcat; prefer the safer alternative strncat."
Yeah..... I mostly do .NET lately and old habits die hard. I likely pulled that code out of something else that was written before my time...
But I'll try not to use those in the future ;)
You know it might not even be your code... Are there any other programs running that could have a memory leak?
It could be your processor. Some CPUs can't address single bytes, and require you to work in words or chunk sizes, or have instructions that can only be used on word or chunk aligned data.
Usually the compiler is made aware of these and works around them, but sometimes you can malloc a region as bytes, and then try to address it as a structure or wider-than-a-byte field, and the compiler won't catch it, but the processor will throw a data exception later.
It wouldn't happen unless you're using an unusual CPU. An ARM9 will do that, for example, but an i686 won't. I see the question is tagged Windows Mobile, so maybe you do have this CPU issue.
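A sketch of the kind of access that bites on strict-alignment CPUs (illustrative only):

char *raw = (char*)malloc(16);
// raw + 1 is not 4-byte aligned; reading it as an int is undefined
// behaviour in C. On strict-alignment cores (e.g. ARM9) it raises a
// data abort, while on x86 it silently works.
int *misaligned = (int*)(raw + 1);
int value = *misaligned;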
Instead of doing malloc followed by memset, you should be using calloc which will clear the newly allocated memory for you. Other than that, do what Joel said.
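For the question's buffers that would look like (sizes taken from the question):

char *msg  = (char*)calloc(2048, sizeof(char)); // allocated and zeroed in one step
char *temp = (char*)calloc(1024, sizeof(char));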
NB borrowed some comments from other answers and integrated into a whole. The code is all mine...
Check your error codes. E.g. malloc can return NULL if no memory is available. This could be causing your data abort.
sizeof(char) is 1 by definition
Use snprintf not sprintf to avoid buffer overruns
If EZMPPOST etc. are constants, then you don't need a format string; you can just combine several string literals as STRING1 " " STRING2 " " STRING3 and skip the sprintf/strcat entirely.
You are using much more memory than you need to.
With one minor change, you don't need to call memset in the first place. Nothing really requires zero initialisation here.
This code does the same thing, safely, runs faster, and uses less memory.
// sizeof(char) is 1 by definition. This memory does not require zero
// initialisation. If it did, I'd use calloc.
const int max_msg = 2048;
char *msg = (char*)malloc(max_msg);
if(!msg)
{
// Allocaton failure
return;
}
// Use snprintf instead of sprintf to avoid buffer overruns
// we write directly to msg, instead of using a temporary buffer and then calling
// strcat. This saves CPU time, saves the temporary buffer, and removes the need
// to zero initialise msg.
snprintf(msg, max_msg, "%s %s/%s %s%s", EZMPPOST, EZMPTAG, EZMPVER, TYPETXT, EOL);
//Add Data
size_t len = wcslen(gdevID);
// No need to zero init this
char* temp = (char*)malloc(len + 1); // +1 leaves room for the null terminator
if(!temp)
{
free(msg);
return;
}
wcstombs(temp, gdevID, len + 1); // converts the string plus its terminating null
// No need to use a temporary buffer - just append directly to the msg, protecting
// against buffer overruns.
snprintf(msg + strlen(msg),
max_msg - strlen(msg), "%s: %s%s", "DeviceID", temp, EOL);
free(temp);