I was experimenting with OpenCL (C++ interface) and, without noticing it, I created a buffer for 10 integers with a buffer size of 10 instead of 10 * sizeof(int), but the code apparently ran without issues.
Now, I believe this could be possible because I created the buffer with the CL_MEM_USE_HOST_PTR flag, and this made it possible to access out-of-bounds memory (though I am not sure of this).
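For reference, this is roughly what I did (simplified; context is a cl::Buffer/cl::Context setup I create elsewhere, and the variable names here are made up):

#include <CL/cl.hpp>
#include <vector>

// 'context' is assumed to be a valid cl::Context created earlier.
std::vector<int> hostData(10);
cl::Buffer wrong(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                 10, hostData.data());                 // 10 *bytes*: the mistake
cl::Buffer right(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                 10 * sizeof(int), hostData.data());   // 10 ints: what was intended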
So, my question is: is it possible to enforce out-of-bounds error checking in OpenCL, so that any access outside a given area is reported?
As other posters have already indicated, out-of-bounds checking is currently not supported by OpenCL drivers. While tools like the WebCL Validator are promising in this area, I would like to mention another path, based on existing tools, which has helped me in the past. By using the FreeOCL CPU driver, which relies on a standard C++ compiler to compile your kernels (after a source-to-source translation step), you can run a tool like valgrind on your final program and get the typical valgrind error messages, like this:
==5863== Thread 6:
==5863== Invalid write of size 1
==5863== at 0xD61FA5D: __FCL_kernel_krnl_route_pkt (filehFymmN:27)
You can then directly refer to the C++ version of the kernel (/tmp/filehFymmN line 27 in the example) to find where the offending operation happened.
You might want to try the WebCL Validator: https://github.com/KhronosGroup/webcl-validator
It's a command-line tool that instruments your OpenCL kernel source code with run-time checks for out-of-bounds memory accesses. It's still work in progress, so any feedback would be much appreciated.
Short answer: no.
Long answer: ATI and NVIDIA are very forgiving about accessing memory out of bounds, but Intel will crash (I haven't tested AMD CPUs).
For something like an anisotropic filter where you are accessing n, n + 1, and n - 1, you should either use global offsets to avoid accessing memory out of bounds or check in the kernel with if statements. Global offsets are nice, but NVIDIA doesn't support them, so there is that.
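Something along these lines (a rough sketch with made-up kernel and argument names) is what I mean by checking in the kernel:

__kernel void aniso1d(__global const float* in, __global float* out, const int count)
{
    const int n = get_global_id(0);
    if (n < 1 || n > count - 2)                 // keeps n - 1 and n + 1 inside the buffer
        return;
    out[n] = 0.5f * in[n] + 0.25f * (in[n - 1] + in[n + 1]);
}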
Unfortunately, using try / catch on the host code doesn't seem to work either since there is magic being done in the OpenCL.dll you are linking to.
Note: This is on SDK and code about 4 months old, dunno if it has changed since then.
We are running into issues with an old closed-source game engine failing to compile shaders when memory usage nears 2GB.
The issue is usually with D3DXCreateEffect. Usually it returns an HResult "out of memory"; sometimes d3dx9_25.dll prints random errors in a popup, or it just outright segfaults.
I believe the issue is a lack of Large Address Awareness: I noticed one of the d3dx9_25.dll crashes doing something that hints at this. It took a valid pointer that looked like 0x8xxxxxx3, checked that the bits 0x80000003 were set, and if so, bit-inverted the pointer and dereferenced it. The resulting pointer pointed to unallocated memory. Forcing the engine to malloc 2GB before compilation makes the shaders fail to compile every time.
Unfortunately our knowledge of DX9 is very limited. I've seen that DX9 has a flag D3DXCONSTTABLE_LARGEADDRESSAWARE, but I'm not sure where exactly it's supposed to go. The only API call the game uses that I can find relies on it is D3DXGetShaderConstantTable, but the issues happen before it is ever called. Injecting the flag (1 << 17) = 0x20000 into D3DXCreateEffect makes the shader fail compilation in another way.
Is D3DXCreateEffect supposed to accept the Large Address Aware flag? I found a Wine test using it, but digging into the DX9 assembly, the error it throws is caused by an internal function returning HResult "invalid call" when any bit within 0xFFFFF800 of the flags is set, which leads me to believe CreateEffect is not supposed to accept this flag.
Is there anywhere else I should be injecting the Large Address Aware flag before this? I understand that the call to D3DXGetShaderConstantTable will need to be changed to D3DXGetShaderConstantTableEx, but it's not even reached yet.
LargeAddressAware is a bit of a hack, so it may or may not help your case. It really only helps if your application needs a little more room close to 2GB of VA, not if it needs a lot more.
A key problem with the legacy DirectX SDK's Direct3D 9 era effects system is that it assumed the high bit of the effect "handle" was free so it could use it; without that bit set, the handle was treated as the address of a string. This assumption is not true with LargeAddressAware.
To enable this, you define D3DXFX_LARGEADDRESS_HANDLE before including the d3dx9.h header. You then must use the D3DXFX_LARGEADDRESSAWARE flag when creating all effects. You must also not use the alias trick where you can pass a "string name" instead of a "handle" to the effect methods; instead you have to use GetParameterByName to get the handle and use that.
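Roughly, the setup looks like this (a sketch only; the device, effect source, and parameter name are placeholders, not from the question):

#define D3DXFX_LARGEADDRESS_HANDLE          // must appear before the header
#include <d3dx9.h>

ID3DXEffect* effect = NULL;
ID3DXBuffer* errors = NULL;
HRESULT hr = D3DXCreateEffect(device, fxSource, fxSourceSize,
                              NULL, NULL,
                              D3DXFX_LARGEADDRESSAWARE,   // required on every effect
                              NULL, &effect, &errors);
// check hr / errors as usual ...

// With LAA handles, the "string name as handle" alias no longer works:
D3DXHANDLE hWorldView = effect->GetParameterByName(NULL, "WorldView");
effect->SetMatrix(hWorldView, &worldViewMatrix);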
What I can't remember is when the LAA flag was added to Effects for Direct3D 9.
If you are using d3dx9_25.dll, then that's the April 2005 release of the DirectX SDK. If you are using "Pixel Shader Model 1.x", then you can't use any version newer than d3dx9_31.dll (October 2006); later versions of the DirectX SDK let you use D3DXSHADER_USE_LEGACY_D3DX9_31_DLL, which just passes shader compilation through to the older version for this scenario.
A key reason that many 32-bit games would fail and then work with LAA enabled was virtual memory fragmentation. Improving your VA memory layout by making your allocations more uniform can help too.
The issue we were having with CreateEffect not accepting the LargeAddressAware flag is pretty obvious in hindsight: the D3DX version the engine is using (d3dx9_25.dll) simply did not have this feature yet.
Our options, other than optimizing our memory usage, are:
Convert all our 1.x pixel shaders to 2.0 and force the engine to load a newer version of d3dx9, hope the engine is not relying on bugs in d3dx9_25.dll or on the alias trick, then inject the LargeAddressAware flag bit there.
Wrap malloc, either avoiding giving handles large addresses (I am unsure whether this is also required inside the DLL) or sticking enough other data in large addresses so that DX9-related mallocs don't reach them.
I wrote a C++ command-line program with MS VC++ 2010 and GCC 4.2.1 (for Mac OS X 10.6 64-bit, in Eclipse).
The program works well under GCC + OS X and most of the time under Windows, but sometimes it silently freezes. The command-line cursor keeps blinking, but the program refuses to continue working.
The following configurations work well:
GCC with 'Release' and 'Debug' configuration.
VC++ with 'Debug' configuration
The error only occurs in the 'VC++ with Release configuration' setup under Win 7 32-bit and 64-bit. Unfortunately, this is the configuration my customer wants to work with ;-(
I already checked my program high and low and fixed all memory leaks, but this error still occurs. Do you have any ideas about how I can find the error?
Use logging to narrow down which part of the code the program is executing when it crashes. Keep adding logging until you narrow it down enough to see the issue.
Enable debug information in the release build (both compiler and linker); many variables won't show up correctly, but it should at least give you a sensible backtrace (unless the freeze is due to stack smashing or stack overflow), which is usually enough if you keep functions short and doing just one thing.
Memory leaks can't cause freezes. Other forms of memory misuse, however, are quite likely to. In my experience, overrunning a buffer often causes a freeze when that buffer is freed, as the free function follows the corrupted block chains. Also watch out for any other kind of undefined behaviour; there is a lot of it in C/C++, and it usually behaves as you expect in debug builds and completely randomly when optimized.
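A minimal illustration of the kind of overrun I mean (nothing to do with your actual code):

int* buf = new int[10];
for (int i = 0; i <= 10; ++i)    // off-by-one: writes one element past the end
    buf[i] = 0;
delete[] buf;                    // delete[] walks the (now corrupted) heap block
                                 // headers here; in an optimized build this is
                                 // typically where the hang or crash shows up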
Try building and running the program under the DUMA library to check for buffer overruns. Be warned though that:
It requires a lot of memory, easily a thousand times more, so you can only test simple cases.
Microsoft headers tend to abuse their internal allocation functions and mismatch, e.g., regular malloc with the internal __debug_free (or the other way round). So you might get a few cases that you'll have to carefully work around by including those system headers ahead of the DUMA one, before it redefines the functions.
Try building the program for Linux and running it under Valgrind. That will check for more problems in addition to buffer overruns and won't use that much memory (only about twice the normal amount), but it is slower, by roughly a factor of 20.
Debug builds usually initialize all allocated memory (MSVC fills it with 0xCD in the debug configuration). Maybe you have some uninitialized values in your classes; with the GCC configurations and the MSVC Debug configuration they get a "lucky" value, but in MSVC Release they don't.
Here are the rest of the magic numbers used by MSVC.
So look for uninitialized variables, attributes and allocated memory blocks.
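Something like this made-up class is the pattern to look for:

struct Job
{
    int retries;                 // never set in the constructor
    Job() { /* forgot: retries = 0; */ }
};

// A debug build fills heap allocations with the 0xCD pattern, so the bad value
// is at least the same on every run; in the Release build 'retries' is whatever
// happened to be in memory, so a loop such as
//     while (job->retries < MAX_RETRIES) { ... }
// can behave differently from run to run and machine to machine.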
Thank you all, especially Cody Gray and MikMik, I found it!
As some of you recommended, I told VS to generate debug information and disabled optimizations in the release configuration. Then I started the program and paused it; alternatively, I remotely attached to the running process. This helped me find the region where the error was.
The reasons were infinite loops, caused by reads beyond the boundaries of an array and by a missing exclusion of an invalid case. Both led to stopping conditions that were unreachable at runtime. The esoteric part came from the fact that my program uses some randomized values.
That's life...
I'm debugging my CUDA 4.0/Thrust-based image reconstruction code on my Ubuntu 10.10 64-bit system, and I've been trying to figure out how to debug a run-time error in which my output images appear to contain random "noise." There is no random number generator output in my code, so I expect the output to be consistent between runs, even if it's wrong. However, it's not...
I was just wondering if anyone has a general procedure for debugging CUDA runtime errors such as these. I'm not using any shared memory in my CUDA kernels. I've taken pains to avoid any race conditions involving global memory, but I could have missed something.
I've tried using GPU Ocelot, but it has problems recognizing some of my CUDA and CUSPARSE function calls.
Also, my code generally works. It's just when I change this one setting that I get these non-deterministic results. I've checked all code associated with that setting, but I can't figure out what I'm doing wrong. If I can distill it to something that I can post here, I might do that, but at this point it's too complicated to post here.
Are you sure all of your kernels have proper block size/remainder handling? The one place where we have seen non-deterministic results was when data elements at the end of the array were not being processed.
Our kernels were originally intended for data that was known to be an integer multiple of 256 elements, so we used a block size of 256 and did a simple division to get the number of blocks. When the data was later changed to be any length, the leftover 255 or fewer elements never got processed, and those spots in the output then had random data.
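In other words, a minimal sketch of the fix, with a made-up kernel (only the i < n guard and the rounded-up block count matter):

__global__ void scale(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // the last block may cover fewer than 256 elements
        data[i] *= k;
}

// Host side: round the block count up instead of truncating.
const int threads = 256;
const int blocks  = (n + threads - 1) / threads;
scale<<<blocks, threads>>>(d_data, n, 2.0f);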
For starters, I'm a fairly seasoned graphics programmer, but as we all know, everyone makes mistakes. Unfortunately the codebase is a bit too large to start throwing sensible snippets here, and re-creating the whole situation in an isolated CPP/codebase is too tall an order; for that I am sorry, I do not have the time. I'll do my best to explain.
By the way, I will of course supply specific pieces of code if someone wonders how I'm handling this or that!
As with all resources in the D3DPOOL_DEFAULT pool, when the device context is taken away from you, you'll sooner or later have to reset your resources. I've built a mechanism to handle this for all relevant resources that's been working for years; but that fact notwithstanding, I've of course checked, asserted and doubted every assumption since this bug came to light.
What happens is as follows: I have a rather large dynamic vertex buffer, exact size 18874368 bytes. This buffer is locked (and fully discarded using the D3DLOCK_DISCARD flag) each frame prior to generating dynamic geometry (isosurface-related, FYI) into it. This works fine until, of course, I start to reset. It might take 1, 2 or 5 resets to set off a bug that causes either an access violation on the pointer returned by the Lock() operation on the renewed resource, or a plain crash inside the D3D9 DLL's Lock() call, at a somewhat similar address but without the offset tacked on in the first case (because there we're somewhere halfway through writing).
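For reference, the per-frame lock is just the usual discard pattern, roughly like this (simplified; the buffer name is made up):

// dynamicVB is the IDirect3DVertexBuffer9* created in D3DPOOL_DEFAULT.
void* dst = NULL;
HRESULT hr = dynamicVB->Lock(0, 0, &dst, D3DLOCK_DISCARD);   // 0, 0 locks the whole buffer
if (SUCCEEDED(hr))
{
    // ... write this frame's isosurface geometry straight into dst ...
    dynamicVB->Unlock();
}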
I've tested this on other hardware and upgraded my GMA X3100 drivers (using a MacBook with Boot Camp) to the latest ones, but I can't reproduce it on any other machine, and I'm at a loss as to what's wrong here. I have tried to reproduce a similar situation with a similar buffer (I've got a large scratch pad of the same type that I filled with quads), and beyond a certain number of bytes it started to behave likewise.
I'm not asking for a solution here, but I'm very interested whether there are other developers here who have battled the same foe, or maybe some who can point me in an insightful direction or ask questions that might shed light on what I may or may not be overlooking.
Thanks and any corrections are more than welcome.
Niels
P.S. - A friend of mine raised the valid point that it is a huge buffer for onboard video RAM, and that it is being at least double or triple buffered internally due to its dynamic nature. On the other hand, the debug output (D3D9 debug DLL + max. warning output) remains silent.
P.S. 2 - Had it tested on more machines and it still works; it's probably a matter of circumstance: the huge dynamic, internally double/triple-buffered buffer, not a lot of memory, and drivers that don't complain when they should.
P.S. 3 - Just got told (should've known) that the Lock and Unlock do a full copy of the 18MB; that's not too smart either, but still :) (I do use the saner strategy for the renderer's generic dynamic VB/IBs).
Unless someone has a better suggestion; I'd still love to hear it :)
I'm currently working on a C++ program in Windows XP that processes large sets of data. Our largest input file causes the program to terminate unexpectedly with no sort of error message. Interestingly, when the program is run from our IDE (Code::Blocks), the file is processed without any such issues.
As the data is being processed, it's placed into a tree structure. After we finish our computations, the data is moved into a C++ STL vector before being sent off to be rendered in OpenGL.
I was hoping to gain some insight into what might be causing this crash. I've already checked out another post, which I can't link to since I'm a new user. The issue in that post was quite similar to mine and resulted from an out-of-bounds index into an array. However, I'm quite sure no such out-of-bounds error is occurring here.
I'm wondering if, perhaps, the size of the data set is leading to issues when allocating space for the vector. The systems I've been testing the program on should, in theory, have adequate memory to handle the data (2GB of RAM with the data set taking up approx. 1GB). Of course, if memory serves, the STL vectors simply double their allocated space when their capacity is reached.
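(For what it's worth, this is roughly what I had in mind to rule that out; Node and totalNodeCount are placeholders for our real types and counts:)

std::vector<Node> points;          // Node stands in for our real element type
points.reserve(totalNodeCount);    // one allocation up front, so no doubling
                                   // reallocations, and no moment where the old
                                   // and new blocks both exist in memory
// ... then move the tree's contents into 'points' as before ...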
Thanks, Eric
The fact that the code works within the IDE (presumably running within a debugger?), but not standalone suggests to me that it might be an initialisation issue.
Compile with the warning level set to max.
Then check all your warnings. I would guess it is an uninitialized variable (one that in debug mode is being initialized to NULL/0).
Personally I have set my templates so that warnings are always at max and that warnings are flagged as errors so that compilation will fail.
You'd probably find it helpful to configure the O/S to create a crash dump (maybe, I don't know, still by using some Windows system software called "Dr Watson"), to which you can then attach a debugger after the program has crashed (assuming that it is crashing).
You should also trap the various ways in which a program might exit semi-gracefully without a crash dump: atexit, set_unexpected, set_terminate and maybe others.
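For example, a rough sketch of trapping those exits (the handler bodies are just placeholders):

#include <cstdlib>
#include <exception>
#include <iostream>

void onTerminate() { std::cerr << "terminate() called\n"; std::abort(); }
void onExit()      { std::cerr << "normal or semi-graceful exit\n"; }

int main()
{
    std::set_terminate(onTerminate);   // uncaught exceptions end up here
    std::atexit(onExit);               // called on exit() or normal return
    // ... run the real program ...
    return 0;
}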
What does your memory model look like? Are you banging up against an index limit (i.e. the maximum value of an int)?
As it turns out, our hardware is reaching its limit. The program was hitting the system's memory limit and failing miserably. We couldn't even see the error statements being produced until I redirected cerr into a file from the command line (thanks starko). Thanks for all the helpful suggestions!
It sounds like your program is throwing an exception that you are not catching. The Boost test framework has some exception handlers that could be a quick way to localise the exception location.
Are there indices in the tree structure that could overflow? Are you using indices into the vector that are beyond its current size? For example:
std::vector<std::string> v;    // new, empty vector
v.push_back("first");
v.push_back("second");
v[0] = "xyz";
v[1] = "abc";
v[2] = "slsk";                 // uh-oh, index 2 is outside the vector (size is 2)
                               // v.at(2) would at least throw std::out_of_range
How large is your largest input set? Do you end up allocating size*size elements? If so, is your largest input set larger than 65536 elements (65536 * 65536 already exceeds what a 32-bit int can hold)?
I agree that the most likely reason it works in the IDE but not standalone is that the debugger is wiping memory to 0 or placing memory guards around allocated memory.
Failing anything else, is it possible to reduce the size of your data set until you find exactly the size that works and a slightly larger example that fails? That might be informative.
I'd recommend trying to determine approximately which line of your code causes the crash.
Since this only happens outside your IDE, you can use OutputDebugString to output the current position and watch it with DebugView.
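Something as simple as this works (the helper name and message are just an example):

#include <windows.h>

// Visible in DebugView even when the program is started outside the IDE.
void checkpoint(const char* msg)
{
    OutputDebugStringA(msg);
}

// Usage: sprinkle checkpoint("entering BuildTree\n"); along the suspect code path.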
Really, the behavior of a program compiled for debug can be completely different inside and outside the IDE; a different set of runtime libraries may be used when the program is loaded from the IDE.
Recently I was bitten by a timing bug in my code: somehow, when debugging from the IDE, the timing was always good and the bug was not observed, but in release mode, bam, the bug was there. This kind of bug is really a PITA to debug.