FFT 2D kernel runtime = 0 in OpenCL - C++

I'm working on a homework project comparing the performance of the Fast Fourier Transform on CPU vs. GPU. I'm done with the CPU part, but I have a problem with the GPU part.
The trouble is that the kernel runtime is zero and the output image is the same as the input. I use VS2010 on Win7 with the AMD APP SDK. Here are the host code, the kernel, and an additional header to handle the image; they can be found in The OpenCL Programming Book (Ryoji Tsuchiyama…)
My guess is that the error is in the phase where the image pixel values are passed to the cl_float2 *xm (lines 169-174 in the host code). I can't access the vector components to check it either: the compiler doesn't accept .sX or .xy and throws an error about it. The other parts (kernel, header, ...) look fine to me.
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        ((float*)xm)[(2*n*j) + 2*i + 0] = (float)ipgm.buf[n*j + i]; // real
        ((float*)xm)[(2*n*j) + 2*i + 1] = (float)0;                 // imag
    }
}
I hope you can help me out. Any ideas will be appreciated.
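One way to sanity-check the packing step on the host, independent of the kernel, is sketched below. It is only an illustration: Float2 is a hypothetical stand-in for cl_float2 so that the snippet builds without the OpenCL headers, packImage is an invented helper name, and the pixel buffer is assumed to hold 8-bit values in row-major order. As far as I know, the host-side cl_float2 is a union whose array member s is always available (so xm[k].s[0] and xm[k].s[1] should compile), while named members such as .x and .y depend on compiler support for anonymous structs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cl_float2, used only so this sketch builds
// without the OpenCL headers. Components are reached through the array `s`.
union Float2 {
    float s[2];
};

// Packs an n-by-n 8-bit grayscale image (row-major, assumed layout) into
// complex samples: real part = pixel value, imaginary part = 0.
std::vector<Float2> packImage(const std::vector<unsigned char>& pixels, std::size_t n) {
    std::vector<Float2> xm(n * n);
    for (std::size_t j = 0; j < n; ++j) {        // row
        for (std::size_t i = 0; i < n; ++i) {    // column
            xm[n * j + i].s[0] = static_cast<float>(pixels[n * j + i]); // real
            xm[n * j + i].s[1] = 0.0f;                                   // imag
        }
    }
    return xm;
}
```

Printing a few elements of the packed buffer before the kernel runs, and the same elements after reading the result back, shows whether the data was changed by the device at all.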

OpenCL provides a lot of different error codes.
You already retrieve one by doing ret = clInstruction(); on each call, but you are not analysing it.
Please check after each call whether this value is equal to CL_SUCCESS.
It can always happen that memory is insufficient, that the hardware is already in use, or that there is a simple error in your source code. The return value will tell you.
Also: Please check your cl_context, cl_program, etc. for NULL values.
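A minimal sketch of such checks is shown below. The helper names checkCl and checkHandle are made up for illustration, and the constant kClSuccess only mirrors the value of CL_SUCCESS (0) so that the snippet compiles without <CL/cl.h>; in real host code you would compare against CL_SUCCESS directly.

```cpp
#include <stdexcept>
#include <string>

// Mirrors the value of CL_SUCCESS; redefined here only so that the sketch
// compiles without the OpenCL headers.
const int kClSuccess = 0;

// Throws if an OpenCL call reported anything other than success.
// `what` names the call so the message points at the failing step.
void checkCl(int ret, const std::string& what) {
    if (ret != kClSuccess) {
        throw std::runtime_error(what + " failed with OpenCL error " + std::to_string(ret));
    }
}

// Throws if a handle such as a cl_context or cl_program is NULL.
void checkHandle(const void* handle, const std::string& what) {
    if (handle == nullptr) {
        throw std::runtime_error(what + " is NULL");
    }
}
```

In the host code this would be used right after every call, for example checkCl(ret, "clEnqueueNDRangeKernel"), so that the first failing step is reported instead of silently producing an unchanged image.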


Linking Cuda (cudart.lib) makes DXGI DuplicateOutput1() fail

For an obscure reason, my call to IDXGIOutput5::DuplicateOutput1() fails with error 0x887a0004 (DXGI_ERROR_UNSUPPORTED) after I added cudart.lib in my project.
I work with Visual Studio 2019, and my code for monitor duplication is the classic one:
hr = output5->DuplicateOutput1(this->dxgiDevice, 0, sizeof(supportedFormats) / sizeof(DXGI_FORMAT), supportedFormats, &this->dxgiOutputDuplication);
The only thing I do with CUDA at the moment is list the CUDA devices:
int nDevices = 0;
cudaError_t error = cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    LOG_FUNC_DEBUG("Graphic adapter : Description: %s, Memory Clock Rate : %d kHz, Memory Bus Width : %u bits",
        prop.name, prop.memoryClockRate, prop.memoryBusWidth);
}
Moreover, this piece of code is called long after I try to start monitor duplication with DXGI.
Everything seems correct in my application: I call SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2), and I'm not running on a discrete GPU (see [https://support.microsoft.com/en-us/help/3019314/error-generated-when-desktop-duplication-api-capable-application-is-ru][1])
And by the way, it used to work, and it works again if I just remove the "so simple" CUDA call and cudart.lib from the linker input!
I really don't understand what can cause this strange behavior. Any ideas?
"...after I added cudart.lib in my project"
When you link the CUDA library, you force your application to run on the discrete GPU. You already know this should be avoided, but you still force it through this link.
"...and I'm not running on a discrete GPU..."
You are: a static link to CUDA is a specific case that hints the system to use the dGPU.
There are systems where Desktop Duplication does not work against the dGPU, and yours seems to be one of them. Although it is not obvious, the behavior you are seeing is by [NVIDIA] design.
(There are, however, also other systems where Desktop Duplication works against the dGPU and does not work against the iGPU.)
Your potential solution is along these lines:
The application is not linked directly against cuda.lib or cudart.lib; instead it uses LoadLibrary to dynamically load nvcuda.dll or cudart*.dll and GetProcAddress to retrieve function addresses from nvcuda.dll or cudart*.dll.
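A rough sketch of that approach, using the CUDA driver API entry points cuInit and cuDeviceGetCount, could look like the code below. The helper name queryCudaDeviceCount is hypothetical, the function pointer types are written by hand with int standing in for the CUresult enum (an assumption about the ABI), and on non-Windows builds the function is only a stub that reports failure. I have not verified that this avoids the dGPU hint on every Optimus system.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Tries to read the CUDA device count without linking against cuda.lib or
// cudart.lib. Returns false if the library or an entry point is unavailable.
// `dllName` would be "nvcuda.dll" in practice.
bool queryCudaDeviceCount(const char* dllName, int* count) {
#ifdef _WIN32
    // Hand-written signatures; int stands in for CUresult (0 means success).
    typedef int (__stdcall *CuInitFn)(unsigned int);
    typedef int (__stdcall *CuDeviceGetCountFn)(int*);

    HMODULE lib = LoadLibraryA(dllName);
    if (lib == NULL) return false;

    CuInitFn cuInit =
        reinterpret_cast<CuInitFn>(GetProcAddress(lib, "cuInit"));
    CuDeviceGetCountFn cuDeviceGetCount =
        reinterpret_cast<CuDeviceGetCountFn>(GetProcAddress(lib, "cuDeviceGetCount"));

    // Call the driver only if both entry points were resolved.
    bool ok = cuInit != NULL && cuDeviceGetCount != NULL
              && cuInit(0) == 0 && cuDeviceGetCount(count) == 0;
    FreeLibrary(lib);
    return ok;
#else
    // Stub for other platforms: nothing is loaded, `count` is left untouched.
    (void)dllName;
    (void)count;
    return false;
#endif
}
```

Calling this only after DuplicateOutput1() has succeeded keeps the CUDA library out of the process until desktop duplication is already running.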

Malloc Error: OpenCV/C++ while push_back Vector

I am trying to create a descriptor using FAST for point detection and SIFT for building the descriptor, and I use OpenCV for that purpose. I use OpenCV's FAST as it is but only parts of the SIFT code, because I only need the descriptor. Now I have a really nasty malloc error and I don't know how to solve it. I posted my code on GitHub because it is big and I don't really know where the error comes from. I only know that it is raised at the end of the do-while loop:
}while(candidates.size() > 100);
As you can see in the code on GitHub, I already tried to release the application's memory. Xcode Analysis says that my application uses 9 MB of memory. I tried to debug the error, but it was very complicated and I haven't found any clue as to where it comes from.
I wondered whether this error could occur because I access the image pixel values passed to calcOrientationHist(...) with img.at<sift_wt>(...), where typedef float sift_wt, at lines 56 and 57 in my code, although the patch I pass normally reports type 0, which means it is CV_8UC1. But I copied this part from sift.cpp at lines 330 and 331. Shouldn't the SIFT descriptor normally work on a grayscale image as well?
Changing the type in the img.at<sift_wt>(...) calls changed nothing. So I googled for solutions and landed on the GuardMalloc feature of Xcode. Enabling it showed me a new error, which is probably the reason I get the malloc error. It occurs at line 77 of my code, and the error at this line is EXC_BAD_ACCESS (Code=1, address=....). These are the lines in question:
for( k = 0; k < len; k++ ){
    int bin = cvRound((n/360.f)+Ori[k]);
    if( bin >= n )
        bin -= n;
    if( bin < 0 )
        bin += n;
    temphist[bin] += W[k]*Mag[k];
}
The values of the mentioned variables are the following:
bin = 52, len = 169, n = 36, k = 0; W, Mag, Ori and temphist are not shown.
Here is the GuardMalloc output (sorry, but I don't really understand what exactly it is telling me):
GuardMalloc[Test-1935]: Allocations will be placed on 16 byte boundaries.
GuardMalloc[Test-1935]: - Some buffer overruns may not be noticed.
GuardMalloc[Test-1935]: - Applications using vector instructions (e.g., SSE) should work.
GuardMalloc[Test-1935]: version 108
Test(1935,0x102524000) malloc: protecting edges
Test(1935,0x102524000) malloc: enabling scribbling to detect mods to free blocks
The answer is simpler than I thought...
The problem was that the calculation of bin in the for loop produced the wrong value. Instead of adding Ori[k], it should multiply by Ori[k].
That mistake resulted in a bin value of 52, but the length of the array that temphist points to is 38.
For everyone with similar errors, I really recommend using GuardMalloc or Valgrind to debug malloc errors.
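The corrected bin calculation can be sketched in isolation as follows. orientationBin is a hypothetical helper name, and std::lround is used as a stand-in for cvRound so that the snippet builds without OpenCV (the two may treat exact .5 ties differently).

```cpp
#include <cmath>

// Maps an orientation in degrees to one of n histogram bins, using the
// multiplication described above instead of the addition.
// std::lround stands in for OpenCV's cvRound in this sketch.
int orientationBin(float oriDegrees, int n) {
    int bin = static_cast<int>(std::lround((n / 360.f) * oriDegrees));
    if (bin >= n) bin -= n;   // wrap values that round up to n
    if (bin < 0) bin += n;    // wrap negative orientations
    return bin;
}
```

With n = 36, an orientation of 52 degrees lands in bin 5. With the addition, the same input rounds to 52, which matches the out-of-range bin value reported in the question.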

Eigen code fails in release mode but works in debug mode

Hi to everyone who uses Eigen, I have encountered a strange problem.
I implemented an Unscented Kalman Filter with Eigen.
It works very well on my PC, but the same piece of code causes a segmentation fault on my embedded system, an Odroid XU (ARMv7 architecture).
After hours of debugging, I found that the problem is in this part:
m_r = qrSolver.matrixQR().triangularView<Upper>();
S_pre = m_r.block(0,0,n,n).transpose();
if (w_c0 < 0)
sqrt(-w_c0)*(sigmaPoints.col(0) - state_pre),
sqrt(w_c0)*(sigmaPoints.col(0) - state_pre),
Here I first compute the QR decomposition of the matrix OS (dimension n-by-3n) and then perform a rank update of its R component (dimension n-by-n). internal::llt_inplace::rankUpdate is an undocumented function in the Eigen library; it just performs a rank-1 update on its first argument. It can be found in ~/path_to_Eigen/Cholesky/LLT.h
The strangest thing about this piece of code is that it works perfectly with -DCMAKE_BUILD_TYPE=Debug but fails when compiled with -DCMAKE_BUILD_TYPE=Release.
Can anyone explain this, or has anyone had a similar issue before? Please help, thanks a lot.

VST on XCode 4.6 - Plugin gives high output directly when loaded

I'm programming a Steinberg VST-Plugin in XCode 4.6.
I've already implemented a highpass filter which works correctly. Now I'm trying to add some nonlinear distortion with a quadratic function. After I implemented the few lines below and loaded the plugin into the host, I immediately get output from the plugin: you can hear nothing, but the meter is up high.
I really can't imagine why. The processReplacing function, where the math takes place, should only be called when sound is playing, not when the plugin is loaded. When I remove the few lines of code below, everything is okay and sounds right, so I assume it has nothing to do with the rest of the plugin code.
The problem occurs in two hosts, so it's probably not a VST bug.
Has anybody ever experienced a similar problem?
Many Thanks,
void Exciter::processReplacing(float** inputs, float** outputs, VstInt32 sampleFrames){
    for(int i = 0; i < sampleFrames; i++) {
        tempsample = inputs[0][i];
        //Exciter - Transformation in positive region, quadratic distortion and backscaling
        tempsample = tempsample + 1.0f;
        tempsample = powf(tempsample, 2.0f);
        tempsample = tempsample / 2.0f;
        tempsample -= 1.0f;
        //Mix-Knob: Dry/Wet ------------------------------------------------
        outputs[0][i] = mix*(tempsample) + (1-mix)*inputs[0][i];
    }
}
EDIT: I added log output to each function, and it turns out that the processReplacing function is called continuously, not only when playback is turned on... But why?
You pretty much answered the question yourself with your edit. processReplacing is called repeatedly. This is part of the VST specification.
VST plug-ins are targeted for real time effects processing. Don't confuse or misinterpret this as lookahead. By real time, I mean inserting the plug-in into a channel and playing an instrument while the DAW is recording. So you can see that in order to mitigate latency, the host is always sending the plug-in an audio buffer (whether it's silence or not).
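To see why the meter jumps even though nothing is audible, it may help to evaluate the distortion stage from the question for a silent input. The sketch below repeats that arithmetic for a single sample; exciteSample is a hypothetical helper name, and mix is passed as a parameter here instead of being read from the plug-in.

```cpp
#include <cmath>

// Same arithmetic as the distortion stage in the question, for one sample.
float exciteSample(float in, float mix) {
    float tempsample = in + 1.0f;             // shift into the positive region
    tempsample = std::pow(tempsample, 2.0f);  // quadratic distortion
    tempsample = tempsample / 2.0f;           // scale back
    tempsample -= 1.0f;                       // shift back
    return mix * tempsample + (1.0f - mix) * in;  // dry/wet mix
}
```

For a silent buffer (in = 0) with mix = 1 this returns -0.5, and with mix = 0.5 it returns -0.25. Because the host keeps sending silent buffers, the plug-in keeps emitting that constant offset, which a meter displays but which cannot be heard.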

cvHaarDetectObjects(): "Stack around the variable 'seq_thread' was corrupted."

I have been looking into writing my own implementation of Haar cascade face detection for some time now, and have begun by diving into the OpenCV 2.0 implementation.
Right out of the box, running in debug mode, Visual Studio breaks on cvhaar.cpp:1518, informing me:
Run-Time Check Failure #2 - Stack around the variable 'seq_thread' was corrupted.
It seems odd to me that OpenCV ships with a simple array out-of-bounds problem. Running the release works without any problems, but I suspect that it is merely not performing the check and the array is exceeding the bounds.
Why am I receiving this error message? Is it a bug in OpenCV?
A little debugging revealed the culprit, I believe. I "fixed" it, but this all still seems odd to me.
An array of size CV_MAX_THREADS is created on cvhaar.cpp:868:
CvSeq* seq_thread[CV_MAX_THREADS] = {0};
On line 918 it proceeds to specify max_threads:
max_threads = cvGetNumThreads();
In various places, seq_thread is looped using the following for statement:
for( i = 0; i < max_threads; i++ ) {
    CvSeq* s = seq_thread[i];
    // ...
}
However, cxmisc.h:108 declares CV_MAX_THREADS:
#define CV_MAX_THREADS 1
Hence, seq_thread is declared with exactly one element and must never be indexed beyond 0, yet cvGetNumThreads() returns 2 (I assume this reflects the number of cores in my machine).
I resolved the problem by adding the following simple little statement:
if (max_threads > CV_MAX_THREADS) max_threads = CV_MAX_THREADS;
Does any of this make sense?
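The clamp described above can be sketched in isolation as follows. This is not OpenCV code: kMaxThreads plays the role of CV_MAX_THREADS, and clampThreadCount and visitSlots are hypothetical names used only to show that the loop bound never exceeds the array size.

```cpp
#include <algorithm>

const int kMaxThreads = 1;  // plays the role of CV_MAX_THREADS in this sketch

// Limits a runtime thread count to the compile-time array capacity.
int clampThreadCount(int reportedThreads) {
    return std::min(reportedThreads, kMaxThreads);
}

// Walks an array of kMaxThreads slots without reading past the end and
// returns how many slots were visited.
int visitSlots(int reportedThreads) {
    int slots[kMaxThreads] = {0};
    int visited = 0;
    const int maxThreads = clampThreadCount(reportedThreads);
    for (int i = 0; i < maxThreads; i++) {
        slots[i] += 1;
        ++visited;
    }
    return visited;
}
```

With a reported thread count of 2 the loop still runs only once, so the single-element array is never indexed out of bounds.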