CUDA ptxas Error "function uses too much shared data" - c++

I have never used CUDA or C++ before, but I am trying to get RAMSES-GPU from http://www.maisondelasimulation.fr/projects/RAMSES-GPU/html/download.html running.
Due to an error in autogen.sh I used ./configure instead and got that step working.
The generated Makefile contains the following NVCC flags:
NVCCFLAGS = -gencode=arch=compute_10,code=sm_10 -gencode=arch=compute_11,code=sm_11 -gencode=arch=compute_13,code=sm_13 -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20 -use_fast_math -O3
But when I try to compile the program using make, I get several ptxas errors:
Entry function '_Z30kernel_viscosity_forces_3d_oldPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z26kernel_viscosity_forces_3dPfS_S_S_iiiiiffff' uses too much shared data (0x70d0 bytes + 0x10 bytes system, 0x4000 max)
Entry function '_Z32kernel_viscosity_forces_3d_zslabPfS_S_S_iiiiiffff9ZslabInfo' uses too much shared data (0x70e0 bytes + 0x10 bytes system, 0x4000 max)
I'm trying to compile this code on Linux with kernel 2.6 and CUDA 4.2 (I'm doing this at my university, and they do not upgrade their software regularly) for two NVIDIA Tesla C1060 cards. I tried replacing sm_10, sm_11 and sm_13 with sm_20 (I saw this fix here: Entry function uses too much shared data (0x8020 bytes + 0x10 bytes system, 0x4000 max) - CUDA error), but that didn't fix my problem.
Do you have any suggestions? I can upload the Makefile as well as everything else, if you need it.
Thank you for your help!

The code you are compiling requires a static allocation of 28880 bytes (0x70d0) of shared memory per block. For compute capability 2.x and newer GPUs this is no problem, because they support up to 48 KB of shared memory per block. Compute capability 1.x devices, however, are limited to 16 KB of shared memory (and up to 256 bytes of that can be consumed by kernel arguments). The code therefore cannot be compiled for compute 1.x devices, and the compiler emits an error telling you so. The error comes from passing the sm_10/sm_11/sm_13 (compute 1.x) targets to the compiler; remove them and the build should work.
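For illustration, a sketch of what the NVCCFLAGS line from the question might look like with the compute 1.x targets removed (assuming nothing else in the Makefile depends on them):
NVCCFLAGS = -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20 -use_fast_math -O3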
However, it gets worse. The Tesla C1060 is a compute capability 1.3 device, so you will not be able to compile and run those kernels on your GPUs at all. There is no solution short of omitting those kernels from the build (if you don't need them), back-porting the code to the compute 1.x architecture (I have no idea whether that is feasible), or finding more modern hardware to run the code on.
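As a quick sanity check, here is a minimal sketch (standard CUDA runtime API; not part of the original answer) that prints the shared memory per block of each installed device; a C1060 should report 16384 bytes and a compute 2.x card 49152:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // sharedMemPerBlock is 16 KB on compute 1.x parts such as the C1060,
        // 48 KB on compute 2.x and newer.
        printf("Device %d: %s, compute %d.%d, %zu bytes shared memory per block\n",
               dev, prop.name, prop.major, prop.minor, prop.sharedMemPerBlock);
    }
    return 0;
}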

Related

avcodec_encode_video2 crash on MinGW32 (but not on MinGW64)

For years I have been testing a piece of software that I am developing, compiled only for 64-bit systems through Qt-MinGW64, without experiencing any kind of issue with video encoding (which is one of the features of the application). Recently, I have been attempting to build the corresponding x86 version of my software by compiling it with Qt-MinGW32.
However, after building the same ffmpeg and x264 library versions to create the 32-bit builds and successfully linking them to my project, the application keeps crashing after a few frames are encoded, due to a segmentation fault. This is very strange because, as I indicated before, it works flawlessly when compiled for 64-bit systems.
I have also lost a considerable number of hours trying to combine many different versions of both the ffmpeg and x264 libraries, with no luck either. It doesn't work when threads are disabled for both the x264 and ffmpeg libraries either, so it does not seem to be a win32 threading issue. I have therefore concluded that the error is most likely in my code, which by some chance beyond my comprehension tells ffmpeg to allocate the right amount of memory in the x64 version, but not in the x86 version.
It should also be pointed out that before the avcodec_encode_video2 call I make the following calls, among others, to allocate the memory associated with the corresponding items (AVFrame, AVCodec, etc.), such as
avcodec_open2(my_codec_context, my_codec, &opt);
my_av_frame = av_frame_alloc();
More precisely, the details involving the code structure that I am using can be found here.
Therefore, the error appears to be more subtle than just issues regarding uninitialized memory.
Many thanks in advance.
UPDATE:
I have discovered the focus of the problem. For some reason, the FFmpeg/x264 libraries behave abnormally in Win32 GUI applications compiled with Qt-MinGW32, while they run correctly in Win32 console applications, also compiled with Qt-MinGW32. I have verified this claim by performing two dummy tests in which the exact same piece of code is run from a console application and from a GUI application, succeeding in the first case and failing in the latter. The code for these tests can be found below, along with the x264 and FFmpeg libraries used in my projects, together with instructions to build them in msys2 with MinGW32:
https://takeafile.com/?f=hisusinivi
I have no idea whether it can be solved by simply tweaking the code or it is a serious bug involving an incompatibility issue. If it is the latter, should it be reported to the Qt/FFmpeg/x264 staff as a major bug?
It looks like you are running out of memory (the virtual address space available to a 32-bit app); at least that is what happens with your Qt GUI test app. Your settings for encoding YUV 4:4:4 Full HD video need around 1.3 GB of memory, and this should be available to a 32-bit app on a 64-bit OS by default (and it is for your console test). But for some reason your Qt GUI test starts to fail after allocating only 1 GB of memory. I don't know whether this 1 GB limit is imposed by Qt or by Windows for any GUI app. If you make your video resolution 960x540 instead of 1920x1080, it should work (as it needs less than 1 GB of memory). Otherwise you should set the LARGE_ADDRESS_AWARE flag in the PE header by passing -Wl,--large-address-aware to the linker, and then 4 GB of memory should be available to a 32-bit app on a 64-bit OS.
UPDATE
It looks like the Qt GUI test has less memory available than the console test because it also links against Qt5Guid.dll and Qt5Widgetsd.dll, which take up an additional 450 MB of address space on top of the other libraries that the console app links as well, so only 1 GB of free address space out of the 2 GB available remains for the heap.
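For reference, a minimal sketch of how that linker flag could be passed from the qmake project file (the win32-g++ scope is an assumption about the kit in use; the flag only matters for 32-bit builds on a 64-bit OS):

# .pro file of the 32-bit build
win32-g++ {
    QMAKE_LFLAGS += -Wl,--large-address-aware
}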

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API specification, along with a simple OpenCL C kernel which simply squares the input integers.
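The kernel source is not reproduced in the question; a minimal sketch of what such a squaring kernel might look like (the buffer name is an assumption):

__kernel void square(__global int* data) {
    size_t gid = get_global_id(0);
    data[gid] = data[gid] * data[gid];
}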
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the scheduler to execute the maximum number of kernels and work items as efficiently as possible, making use of all the available cores / processors.
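For context, a minimal host-side sketch (standard OpenCL 1.2 API, first GPU of the first platform only) of how those device limits can be queried:

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint compute_units = 0;
    size_t max_wg_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_wg_size), &max_wg_size, NULL);

    // amount_of_work in the question is a multiple of the product of these two values.
    printf("compute units: %u, max work-group size: %zu\n",
           (unsigned)compute_units, max_wg_size);
    return 0;
}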
For a CPU, you can use cpuid, sched_getcpu, or GetProcessorNumber to determine which core / processor the current thread is executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C built-in function, or perhaps some form of assembly language understood by the vendor's compilers, which I could use to obtain this information?
Is there an equivalent of cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core-usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations and leave only the kernel executions intact. Then put those in a tight loop and watch the heat rise. If it heats up like FurMark, it is using the cores. If it is not heating up, you can also disable serial operations in the kernels (gid == 0) and try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it against the temperature limits of a known benchmark.
A similar thing exists for the CPU: using float4 heats the chip more than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone all the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop; more active cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you do not have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you are using 1% of the card's true potential.
Simply look at the temperature difference of the computer case between its starting state and its equilibrium state. If the delta-T is around 50 with OpenCL and 5 without it, OpenCL is parallelising something, although you cannot tell how much.
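A minimal sketch of the "kernel-only tight loop" idea described above (assuming a kernel, command queue, and buffers have already been set up as in the question; the names are placeholders):

#include <CL/cl.h>

// Repeatedly enqueue the kernel with no host<->device copies in between.
// If the GPU cores are really being used, the temperature should climb.
void heat_test(cl_command_queue queue, cl_kernel kernel, size_t global_size) {
    for (int i = 0; i < 100000; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
    }
    clFinish(queue);  // block until all enqueued launches have completed
}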

Compilation hitting virtual memory limitation in g++ 4.7.1?

I'm compiling some code that makes heavy use of templates (it's based on the boost::msm framework). When compiled with g++ 4.7.1, the cc1plus process reaches about 2.4 GB of RAM and then fails with a "virtual memory exhausted: Cannot allocate memory" error.
I'm using a 32-bit compiler (switching to 64-bit is not an option at the moment). The machine itself is a 64-bit Ubuntu box with 16 GB of RAM, and the compilation is performed in a 64-bit chroot of the Debian wheezy distribution. At the time of compilation there is plenty of RAM available, so if the compilation were failing for lack of physically available RAM, it should reach 4 GB first. I tried playing with "ulimit -m", setting it to different values; setting it to smaller sizes causes the compiler to fail earlier, but when it is left at "unlimited" it fails at the above-mentioned 2.4 GB or so.
So I guess something else must be limiting me. Maybe someone has encountered a similar issue and knows a way to change the limitation?
In a 32-bit application (including compilers), you typically get somewhere between 2 and 3 GB of virtual address space available for usermode. This is caused by a combination of address space being reserved, address space fragmentation (there is virtual memory available, just not a big enough contiguous chunk to hold whatever block size new or malloc is requesting), and "memory reservation", where the process has allocated a fairly large chunk of memory but is not actually using all of it, so it is not "populated".
Is there any particular reason you can't use a 64-bit GCC to generate 32-bit code using -m32? That would be my solution.
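For example (assuming a 64-bit GCC with 32-bit multilib support installed), something along these lines; the compiler itself then runs as a 64-bit process and is not bound by the ~3 GB limit, while still emitting 32-bit code:

g++ -m32 -o myprog myprog.cpp   # 64-bit compiler process, 32-bit output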

valgrind returning an unhandled instruction on Raspberry Pi

I've been trying to debug a segmentation fault recently using valgrind on my Raspberry Pi (Model B), running Debian GNU/Linux 7.0 (wheezy). Every time I run valgrind on a compiled C++ program, I get something like the following:
disInstr(arm): unhandled instruction: 0xF1010200
cond=15(0xF) 27:20=16(0x10) 4:4=0 3:0=0(0x0)
valgrind: Unrecognized instruction at address 0x4843638.
at 0x4843638: ??? (in /usr/lib/arm-linux-gnueabihf/libconfi_rpi.so)
Then the usual valgrind output follows, ending in a SIGILL that terminates my program. At first I assumed there was some memory leak in my program that was causing it to execute a piece of non-instruction memory as an instruction, but then I ran the following hello-world code and got the same result.
#include <iostream>
using namespace std;

int main() {
    cout << "Hello World" << endl;
    return 0;
}
There can't possibly be a memory leak/segfault in that, so why is it giving me this error?
I'm pretty new with valgrind, but I ran it with the most basic valgrind ./a.out.
Your code (a simple hello world) makes valgrind complain about an Unrecognized instruction at address 0x4843638. My guess is:
Valgrind needs to intercept your malloc calls into the C standard library. This allows valgrind to check how many resources you allocated and freed, which is what memory-leak detection (for example) is based on. If valgrind does not recognize your standard library environment (or the instruction set of your processor), it may not behave as expected, which would be the cause of your crash. You should check your valgrind version and download the one built for your platform.
EDIT :
http://valgrind.org/docs/manual/faq.html
3.3. My program dies, printing a message like this along the way:
vex x86->IR: unhandled instruction bytes: 0x66 0xF 0x2E 0x5
One possibility is that your program has a bug and erroneously jumps
to a non-code address, in which case you'll get a SIGILL signal.
Memcheck may issue a warning just before this happens, but it might
not if the jump happens to land in addressable memory.
Another possibility is that Valgrind does not handle the instruction.
If you are using an older Valgrind, a newer version might handle the
instruction. However, all instruction sets have some obscure, rarely
used instructions. Also, on amd64 there are an almost limitless number
of combinations of redundant instruction prefixes, many of them
undocumented but accepted by CPUs. So Valgrind will still have
decoding failures from time to time. If this happens, please file a
bug report.
EDIT2 :
From wikipedia, the Raspberry Pi CPU:
700 MHz ARM1176JZF-S core (ARM11 family, ARMv6 instruction set)[3]
2.11. Limitations
On ARM, essentially the entire ARMv7-A instruction set is supported,
in both ARM and Thumb mode. ThumbEE and Jazelle are not supported.
NEON, VFPv3 and ARMv6 media support is fairly complete.
Your program/library just happens to contain an instruction which is not supported yet.
On Raspberry Pi 3 with NOOBS install of Raspian, implement ayke's answer by doing the following in a terminal window:
cd /etc
sudo nano ld.so.preload
Remove the line that includes "libarmmem.so" (/usr/lib/arm-linux-gnueabihf/libarmmem.so)
Save and exit ld.so.preload
Run valgrind
Place the line back into ld.so.preload after valgrind testing is complete
The pre-loaded "libarmmem.so" contains the instruction "setend" in the "memcmp" function which causes the unhandled instruction error. The standard library (used when the pre-loaded "libarmmem.so" library is not loaded) does not include the "setend" instruction in "memcmp".
TL;DR: remove the package raspi-copies-and-fills if you're using Raspbian. It may also work in some other Linux variations like NOOBS.
As Phong already noted, this instruction is not supported by Valgrind.
There is a bug report which explains the issue:
This keeps cropping up, for example most recently in bug 366464.
Maybe I should explain more why this isn't supported. It's because we
don't have a feasible way to do it. Valgrind's JIT instruments code
blocks as they are first visited, and the endianness of the current
blocks are "baked in" to the instrumentation. So there are two
options:
(1) when a SETEND instruction is executed, throw away all the JITted code
that Valgrind has created, and JIT new code blocks with the new endianness.
(2) JIT code blocks in an endian-agnostic way and have a runtime test
for each memory access, to decide on whether to call a big or little
endian instrumentation helper function.
(1) gives zero performance overhead for code that doesn't use SETEND
but a gigantic (completely infeasible) hit for code that does.
(2) makes endian changes free, but penalises all memory traffic
regardless of whether SETEND is actually used.
So I don't find either of those acceptable. And I can't think of any
other way to implement it.
In other words, it is hard to implement this instruction in valgrind.
Summarizing the thread: the most common source of this instruction is the set of faster memory-management functions that the Raspberry Pi ships (memcmp, memset, etc.).
I solved it by (temporarily) removing raspi-copies-and-fills from my Raspbian install.
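On Raspbian, that amounts to something like the following (reinstalling the package afterwards restores the faster routines):

sudo apt-get remove raspi-copies-and-fills
# ... run valgrind ...
sudo apt-get install raspi-copies-and-fills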
Valgrind is apparently problematic on Raspberry Pi:
https://web.archive.org/web/20131003042418/http://www.raspberrypisoft.com/tag/valgrind/
I suggest using other tools to find the seg fault.

Qt Creator - calloc fails with large memory

I have a problem with Qt Creator, or one of its components.
I have a program which needs a lot of memory (about 4 GB), and I use calloc to allocate it. If I compile the C code with MinGW/gcc (without using the Qt framework) it works, but if I compile it within Qt Creator (with the C code embedded in the Qt framework using C++), using the MinGW/gcc toolchain, calloc returns a null pointer.
I already searched and found the qmake project option QMAKE_LFLAGS += -Wl,--large-address-aware, which worked for some cases (around 3.5 GB), but if I go above 4 GB it only works with the C code compiled with gcc, not with Qt.
How can I allocate the needed amount of memory using calloc when compiling with Qt Creator?
So your Cygwin toolchain builds 64-bit applications for you. The amount of memory that can be allocated by a 64-bit application is up to 2^64 bytes, which far exceeds 4 GB. But Qt Creator (if you installed it from the QtSDK and have not reconfigured it manually) uses Qt's own toolchain, which builds 32-bit applications. You can theoretically allocate 4 GB of memory in a 32-bit application, but do not forget that all libraries will also be loaded into this address space. In practice you will be able to allocate about 3 GB of memory, and not in one contiguous chunk.
You have 3 ways to solve your problem:
Reconsider your algorithm. Do not allocate 4 GB of RAM; use smarter data structures, a disk cache, etc. I believe that if your problem actually required more than 4 GB of memory to solve, you wouldn't be asking this question.
Separate your Qt code from your C program. You can then still use a 64-bit-target compiler for the C program and a 32-bit-target compiler for the Qt/C++ part, and communicate with your C program through any interprocess communication mechanism (standard input/output streams are often enough; see the sketch after this list).
Move to 64 bit, i.e. use a 64-bit-target compiler for both the C and the C++ code. This is not as simple as one might think: you will need to rebuild Qt in 64-bit mode. That is possible with some modules turned off and some code fix-ups (I've tried it once), but 64-bit Windows is not officially supported.
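As an illustration of option 2, a minimal sketch (the helper executable name and its output format are assumptions, not part of the original answer) of driving a separate 64-bit C helper from the 32-bit Qt application through its standard streams:

#include <QProcess>
#include <QByteArray>
#include <QString>
#include <QStringList>

// Launches a hypothetical 64-bit helper that does the memory-heavy work
// and writes its result to stdout.
QByteArray runBigJob(const QString &inputFile)
{
    QProcess helper;
    helper.start("bigjob64.exe", QStringList() << inputFile);
    if (!helper.waitForStarted())
        return QByteArray();               // helper not found or failed to launch
    helper.waitForFinished(-1);            // block until the helper exits
    return helper.readAllStandardOutput(); // result produced by the 64-bit process
}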