newComputePipelineStateWithFunction failed - c++

I am trying to run a neural net on Metal.
The basic idea is data duplication: each GPU thread runs one version of the net for random data points.
I have written other shaders that work fine.
I also tried my code in a C++ command line app. No errors there.
There is also no compile error.
I used the Apple documentation to convert the code to Metal's C++-based shading language, since not everything from C++11 is supported.
It crashes after it loads the kernel function, when it tries to create the pipeline state with newComputePipelineStateWithFunction on the Metal device. This means there is a problem with the code that isn't caught at compile time.
MCVE:
kernel void net(const device float *inputsVector [[ buffer(0) ]], // layout of net
                uint id [[ thread_position_in_grid ]]) {

    uint floatSize = sizeof(float);
    uint inputsVectorSize = sizeof(inputsVector) / floatSize; // sizeof of a device pointer, not of the buffer contents
    float newArray[inputsVectorSize];                         // array length is not a compile-time constant

    float test = inputsVector[id];
    newArray[id] = test;
}
Update
It has everything to do with dynamic arrays.
Since it fails while creating the pipeline state, and doesn't crash while running the actual shader, it must be a coding issue, not an input issue.
Assigning values from a dynamic array to a buffer makes it fail.

The real problem:
It is a memory issue!
To all the people saying that it was a memory issue, you were right!
Here is some pseudo code to illustrate it. Sorry that it is in "Swift", but that is easier to read. Metal shaders have a funky way of coming to life: they are first initialised without values, to reserve the memory. It was this step that failed, because it relied on a later step: setting the buffer.
It all comes down to which values are available when. My understanding of newComputePipelineStateWithFunction was wrong. It is not simply getting the shader function; it is also a tiny step in the initialising process.
class MetalShader {

    // buffers
    var aBuffer : [Float]
    var aBufferCount : Int

    // step One : newComputePipelineStateWithFunction
    memory init() {
        // assign shader memory

        // create memory for one int
        let aStaticValue : Int
        // create memory for one int
        var aNotSoStaticValue : Int // this will succeed, assigns memory for one int

        // create memory for 10 floats
        var aStaticArray : [Float] = [Float](count: aStaticValue, repeatedValue: y) // this will succeed

        // create memory for x floats
        var aDynamicArray : [Float] = [Float](count: aBuffer.count, repeatedValue: y) // this will fail
        var aDynamicArray : [Float] = [Float](count: aBufferCount, repeatedValue: y) // this will fail

        let tempValue : Float // one float from a loop
    }

    // step Two : commandEncoder.setBuffer()
    assign buffers (buffers) {
        aBuffer = cpuMemoryBuffer
    }

    // step Three : commandEncoder.endEncoding()
    actual init() {
        // set shader values
        let aStaticValue : Int = 0
        var aNotSoStaticValue : Int = aBuffer.count
        var aDynamicArray : [Float] = [Float](count: aBuffer.count, repeatedValue: 1) // this could work, but the app already crashed before getting to this point
    }

    // step Four : commandBuffer.commit()
    func shaderFunction() {
        // do stuff
        for i in 0..<aBuffer.count {
            let tempValue = aBuffer[i]
        }
    }
}
Fix:
I finally realised that buffers are technically dynamic arrays, so instead of creating arrays inside the shader, I could also just add more buffers. This obviously works.
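For illustration, a minimal sketch of how the MCVE above might look with that approach; the second buffer and its index are my assumptions, not the original code:

kernel void net(const device float *inputsVector [[ buffer(0) ]],
                device float *newArray           [[ buffer(1) ]], // output buffer allocated by the host, replaces the shader-local array
                uint id [[ thread_position_in_grid ]]) {
    newArray[id] = inputsVector[id];
}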

I think your problem is with this line:
uint inputsVectorSize = sizeof(inputsVector) / floatSize;
Here inputsVector is dynamic, so as in classic C++ you cannot use sizeof on a dynamically sized array to get its number of elements. sizeof would only work on arrays you had defined locally/statically in the Metal shader code.
Just imagine how it works internally: at compile time, the Metal compiler is supposed to transform the sizeof call into a constant... but it can't, since inputsVector is a parameter of your shader and can therefore have any size...
So for me the solution would be to compute inputsVectorSize in the C++/Objective-C/Swift part of your code, and pass it as a parameter to the shader (as a uniform, in OpenGL ES terminology...).
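A minimal sketch of what that could look like for the MCVE, assuming the host binds the element count at buffer index 1 (for example with setBytes:length:atIndex: on the compute command encoder); the names and buffer indices are illustrative:

kernel void net(const device float *inputsVector      [[ buffer(0) ]],
                constant uint      &inputsVectorCount [[ buffer(1) ]], // count computed on the host side
                device float       *outputVector      [[ buffer(2) ]],
                uint id [[ thread_position_in_grid ]]) {
    if (id >= inputsVectorCount) {
        return; // guard against threads beyond the buffer length
    }
    outputVector[id] = inputsVector[id];
}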

glfwSwapBuffers slow (>3s)

Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop to NOT lock while a compute shader is processing larger amounts of data. I could try to separate the data into smaller snippets, but if the computations were done on the CPU, I would simply start a thread and everything would run nicely and smoothly. Although I would of course have to wait until the calculation thread delivers new data to update the screen, the GUI (ImGui) would not lock up...
I have written a program that does some calculations on a compute shader and the returned data is then being displayed. This works perfectly, except that the program execution is blocked while the shader is running (see code below) and depending on the parameters, this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}

int main()
{
    // Initialization stuff
    // ...

    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here

        CalculateSomething(Result);
        Render(Result);

        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class, and it would get a little messy and complicated if I had to post all of that here (if more code is needed, tell me). So on every loop iteration where the class is told to do the computation, it first checks whether the GPU is busy. If it's done, the result is copied to CPU memory (or a calculation is started); otherwise it returns to main without doing anything else. Anyway, this approach works in that it produces the right result. But my main loop is still blocked.
Doing some timing revealed that CalculateSomething, Render (and everything else) run fast (as I would expect them to). But now glfwSwapBuffers takes >3000 ms (depending on how long the calculations of the compute shader take).
Shouldn't it be possible to switch buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result should get rendered). Or am I missing something here (queued OpenGL calls get processed before glfwSwapBuffers does something?)?
Edit:
I'm not sure why this question got closed and what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": well, I'd like the glfwSwapBuffers call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it - no latency here...
I'm sure I can't be the only one who ever ran into this problem. Maybe someone could try and reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine would be an integrated Intel GPU). But nowhere have I read about latencies above 200 ms...
I solved a similar issue with a GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence using glGetSynciv to read GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation of why this is necessary can be found in Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled   = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );

        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );

        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); // copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );

        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to equal GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
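For reference, a rough desktop C++/OpenGL sketch of that sequence, under the assumptions of the question (X, Y, Result) plus a placeholder ResultBuffer handle for the SSBO; treat it as an outline rather than a drop-in fix:

GLsync GPU_sync = nullptr;

void StartCalculation()
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush(); // make sure the dispatch and the fence are actually submitted to the GPU
}

// Called once per frame from the main loop; never blocks.
bool TryFetchResult(GLfloat* Result)
{
    if (GPU_sync == nullptr)
        return false;                          // nothing in flight

    GLint status = GL_UNSIGNALED;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false;                          // still running, keep rendering the old result

    // The compute shader is done: map the SSBO and copy the data out.
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ResultBuffer);
    GLfloat* mapped = (GLfloat*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                                 sizeof(GLfloat) * X * Y, GL_MAP_READ_BIT);
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

    glDeleteSync(GPU_sync);
    GPU_sync = nullptr;
    return true;
}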

Problem testing DTid.x Direct3D ComputeShader HLSL

I'm attempting to write a fairly simple compute shader that does a simple moving average.
It is my first shader where I had to test DTid.x against certain conditions in the logic.
The shader works, and the moving average is calculated as expected, except (ugh) for the case of DTid.x = 0, where I get a bad result.
It seems my testing of the value DTid.x is somehow corrupted or not possible for the case DTid.x = 0.
I may be missing some fundamental understanding of how compute shaders work, as this piece of code seems super simple, but it doesn't work as I'd expect it to.
Hopefully someone can tell me why this code doesn't work for the case DTid.x = 0.
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
    // I added below trying to limit the logic?
    // I initially had it check for a range like >50 and <100 and this did work as expected.
    // But I saw that my value at DTid.x = 0 was corrupted and I started to work on solving why. But no luck.
    // It is just the case of DTid.x = 0 where this shader does not work.
    if (DTid.x > 0)
    {
        return;
    }

    nAvgCnt = 1;
    ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at actual DTid.x location

    if (DTid.x > 0) // to avoid loading a second value for averaging
    {
        // somehow this code is still being called for case DTid.x = 0 ?
        nAvgCnt = nAvgCnt + 1;
        ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load data value at previous DTid.x location
    }

    if (nAvgCnt > 1) // If DTid.x was larger than 0, then we should have loaded ft1 and we can average ft0 and ft1
    {
        result = ((ft0 + ft1) / ((float)nAvgCnt));
    }
    else
    {
        result = ft0;
    }

    // And when I add code below, which should override above code, the result is still corrupted? //
    if (DTid.x < 2)
        result = ft0;

    llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
    BufferOut0.Store(llByteOffsetLS, asuint(result)); // store result, where all good except for case DTid.x = 0
}
I am compiling the shader with FXC. My shader was slightly more involved than the one above; when I added the /Od option, the code behaved as expected. Without the /Od option I tried to refactor the code over and over with no luck, but eventually I changed variable names for every possible section to make sure the compiler would treat them separately, and that finally worked. So the lesson I learned is: never reuse a variable in any way. Another solution, worst case, would be to decompile the compiled shader to understand how it was optimized. If attempting a large shader with several conditions/branches, I'd start with /Od and only remove it later, and I would not reuse variables, else you may start chasing problems that are not truly problems.
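If you compile at runtime with d3dcompiler instead of offline with FXC, the equivalent of /Od is the D3DCOMPILE_SKIP_OPTIMIZATION flag. A hedged C++ sketch; the file name is a placeholder, while the entry point and profile match the shader above:

#include <windows.h>
#include <d3dcompiler.h>
#include <wrl/client.h>
#pragma comment(lib, "d3dcompiler.lib")

Microsoft::WRL::ComPtr<ID3DBlob> CompileMovingAvgCS()
{
    UINT flags = D3DCOMPILE_ENABLE_STRICTNESS;
#if defined(_DEBUG)
    flags |= D3DCOMPILE_DEBUG | D3DCOMPILE_SKIP_OPTIMIZATION; // same effect as FXC /Zi /Od
#endif

    Microsoft::WRL::ComPtr<ID3DBlob> csBlob, errBlob;
    HRESULT hr = D3DCompileFromFile(L"MovingAverage.hlsl",            // placeholder file name
                                    nullptr, D3D_COMPILE_STANDARD_FILE_INCLUDE,
                                    "CSSimpleMovingAvgDX", "cs_5_0",  // entry point and profile from the shader above
                                    flags, 0, &csBlob, &errBlob);
    if (FAILED(hr) && errBlob)
        OutputDebugStringA((const char*)errBlob->GetBufferPointer()); // compiler diagnostics
    return csBlob;
}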

Communicating an array of bvec2 between host->shader and/or shader->shader

I need to communicate two boolean values per entry of an array in a compute shader. For now I'm getting them from the CPU, but later I will want to generate these values from another compute shader that runs before this one. I got this working as follows:
Using glm::bvec2 I can place the booleans relatively packed into memory (the bvec stores one bool per byte; could be nicer, but it will do for now, and I can always manually pack this). Then I use vkMapMemory to place the data into a Vulkan buffer (I then copy it to a device-local buffer, but that's probably irrelevant here).
GLSL's bvec2 is not equivalent to that, unfortunately (or at least it won't give me the expected values if I use it; maybe I'm doing it wrong? Using bvec2 changes[] yields wrong results in the following code; I suspect an alignment mismatch). Because of that, the compute shader accesses this array as follows:
layout (binding = 2, scalar) buffer Changes
{
    uint changes[];
};

void main() {
    // uint is 4 byte, glm::bvec2 is 2 byte
    uint changeIndex = gl_GlobalInvocationID.x / 2;
    // order seems to be reversed in memory:
    // v1(x1, y1) followed by v2(x2, y2) is stored as: 0000000(y2) 0000000(x2) 0000000(y1) 0000000(x1)
    uint changeOffset = (gl_GlobalInvocationID.x % 2) * 16;
    uint maskx = 1 << (changeOffset + 0);
    uint masky = 1 << (changeOffset + 8);
    uint uchange = changes[changeIndex];
    bvec2 change = bvec2(uchange & maskx, uchange & masky);
}
This works. Took a bit of trial and error but there we go. I have two questions now:
Is there a more elegant way to do this?
When generating the values via compute shaders, I would not be using glm::bvec2. Should I perhaps just manually pack the booleans - one per bit - into uints, or is there a better way?
Performance is pretty important to me in this application, as I'm trying to benchmark things. Memory usage optimizations are secondary, but also worth considering. Being relatively inexperienced with optimizing GLSL, I'm happy about any advice you can give me.
Since glm::bvec2 stores its booleans as one byte each (two bytes in total), perhaps the explicitly 8-bit unsigned integer vector type u8vec2 provided by the GL_EXT_shader_8bit_storage extension would be more convenient here? I don't know if the Vulkan driver you're using supports the necessary feature (I assume it's storageBuffer8BitAccess), though.
The comment by Andrea mentions a useful extension: GL_EXT_shader_8bit_storage.
As the writing access to my booleans is done in parallel, the only option for tight packing is atomics. I've chosen to trade memory efficiency for performance by storing two booleans in one byte, "wasting" 6 bits. The code is as follows:
#extension GL_EXT_shader_8bit_storage : enable

void getBools(in uint data, out bool split, out bool merge) {
    split = (data & 1) > 0;
    merge = (data & 2) > 0;
}

uint setBools(in bool split, in bool merge) {
    uint result = 0;
    if (split) result = result | 1;
    if (merge) result = result | 2;
    return result;
}

//usage:
layout (binding = 4, scalar) buffer ChangesBuffer
{
    uint8_t changes[];
};

//[...]
bool split, merge;
getBools(uint(changes[invocationIdx]), split, merge);
//[...]
changes[idx] = uint8_t(setBools(split, merge));
Note the constructors: the data types provided by the extension do not support any arithmetic operations and must be converted before use.
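For the host-generated case, a small C++ sketch of packing that matches the bit layout used above (one byte per entry, bit 0 = split, bit 1 = merge); packChanges and its input type are illustrative names, not part of the original code:

#include <cstdint>
#include <utility>
#include <vector>

std::vector<uint8_t> packChanges(const std::vector<std::pair<bool, bool>>& flags) // pair = (split, merge)
{
    std::vector<uint8_t> packed(flags.size(), 0);
    for (size_t i = 0; i < flags.size(); ++i)
    {
        if (flags[i].first)  packed[i] |= 1u; // split -> bit 0
        if (flags[i].second) packed[i] |= 2u; // merge -> bit 1
    }
    return packed; // upload via vkMapMemory / staging copy as before
}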

Animations producing intel_drm errors

I'm working on implementing animations within my model loader, which uses Assimp and C++/OpenGL for rendering. I've been following this tutorial extensively: http://ogldev.atspace.co.uk/www/tutorial38/tutorial38.html. Suffice it to say that I did not follow the tutorial completely, as there were some bits that I disagreed with code-wise, so I adapted it. Mind you, I don't use any of the maths components the author uses there; I use glm instead. At any rate, the problem is that sometimes my program runs, and other times it doesn't. Sometimes it runs and then crashes instantly, and at other times it simply runs as normal.
A few things to take into account:
Before animations/bone loading were added, the model loader worked completely fine and models were loaded without any crash whatsoever;
Models with NO bones still load just fine; it only becomes a problem when models with bones are being loaded.
Please note that NOTHING from the bones is being rendered. I haven't even started allocating the bones to vertex attributes; not even the shaders are modified for this.
Everything is being run on a single thread; there is no multi-threading... yet.
So, naturally I took to this bit of code which actually loaded the bones. I've debugged the application and found that the problems lie mostly around here:
Mesh* processMesh(uint meshIndex, aiMesh *mesh)
{
    vector<VertexBoneData> bones;
    bones.resize(mesh->mNumVertices);

    // .. getting other mesh data

    if (pAnimate)
    {
        for (uint i = 0; i < mesh->mNumBones; i++)
        {
            uint boneIndex = 0;
            string boneName(mesh->mBones[i]->mName.data);
            auto it = pBoneMap.find(boneName);

            if (it == pBoneMap.end())
            {
                boneIndex = pNumBones;
                ++pNumBones;
                BoneInfo bi;
                pBoneInfo.push_back(bi);

                auto tempMat = mesh->mBones[i]->mOffsetMatrix;
                pBoneInfo[boneIndex].boneOffset = to_glm_mat4(tempMat);
                pBoneMap[boneName] = boneIndex;
            }
            else boneIndex = pBoneMap[boneName];

            for (uint j = 0; j < mesh->mBones[i]->mNumWeights; j++)
            {
                uint vertexID = mesh->mBones[i]->mWeights[j].mVertexId;
                float weit = mesh->mBones[i]->mWeights[j].mWeight;
                bones.at(vertexID).addBoneData(boneIndex, weit);
            }
        }
    }
}
In the last line the author used the [] operator to access elements, but I decided to use .at() for range checking. The function to_glm_mat4 is defined thus:
glm::mat4 to_glm_mat4(const aiMatrix4x4 &m)
{
    glm::mat4 to;
    to[0][0] = m.a1; to[1][0] = m.a2;
    to[2][0] = m.a3; to[3][0] = m.a4;
    to[0][1] = m.b1; to[1][1] = m.b2;
    to[2][1] = m.b3; to[3][1] = m.b4;
    to[0][2] = m.c1; to[1][2] = m.c2;
    to[2][2] = m.c3; to[3][2] = m.c4;
    to[0][3] = m.d1; to[1][3] = m.d2;
    to[2][3] = m.d3; to[3][3] = m.d4;
    return to;
}
I also had to change VertexBoneData, since it used raw arrays, which I thought was flawed:
struct VertexBoneData
{
    vector<uint> boneIDs;
    vector<float> weights;

    VertexBoneData()
    {
        reset();
        boneIDs.resize(NUM_BONES_PER_VERTEX);
        weights.resize(NUM_BONES_PER_VERTEX);
    }

    void reset()
    {
        boneIDs.clear();
        weights.clear();
    }

    void addBoneData(unsigned int boneID, float weight)
    {
        for (uint i = 0; i < boneIDs.size(); i++)
        {
            if (weights.at(i) == 0.0) // SEG FAULT HERE
            {
                boneIDs.at(i) = boneID;
                weights.at(i) = weight;
                return;
            }
        }
        assert(0);
    }
};
Now, I'm not entirely sure what is causing the crash, but what baffles me most is that sometimes the program runs (implying that the code isn't necessarily the culprit). So I decided to do a debug-smashdown which involved me inspecting each bone (I skipped some; there are loads of bones!) and found that AFTER all the bones have been loaded I would get this very strange error:
No source available for "drm_intel_bo_unreference() at 0x7fffec369ed9"
and sometimes I would get this error:
Error in '/home/.../: corrupted double-linked list (not small): 0x00000 etc ***
and sometimes I would get a seg fault from glm regarding a vec4 instantiation;
and sometimes... my program runs without ever crashing!
To be fair, implementing animations may just be too harsh for my laptop, so maybe it's a CPU/GPU problem, as in it's unable to process so much data in one gulp, which is resulting in this crash. My theory is that since it's unable to process that much data, that data is never allocated to the vectors.
I'm not using any multi-threading whatsoever, but it has crossed my mind. I figure that it may be the CPU being unable to process so much data, hence the chance-run. I could implement threading, such that the bone loading is done on another thread; or better, use a mutex, because what I found is that by debugging the application slowly, the program runs. That makes sense, because each task is being broken down into chunks, which is what a mutex technically does, per se.
For the sake of the argument, and no mockery avowed, my technical specs:
Ubuntu 15.04 64-bit
Intel i5 dual-core
Intel HD 5500
Mesa 10.5.9 (OpenGL 3.3)
Programming on Eclipse Mars
I thus ask, what the hell is causing these intel_drm errors?
I've reproduced this issue and found it may have been a problem with the lack of multi-threading when it comes to loading bones. I decided to move the bone-loading code into its own function, as prescribed in the aforesaid tutorial. What I later did was:
if (pAnimate)
{
    std::thread t1([&] {
        loadBones(meshIndex, mesh, bones);
    });
    t1.join();
}
The lambda function above has the [&] to indicate we're capturing everything by reference, ensuring no copies are created. To prevent any external forces from 'touching' the data within the loadBones(..) function, I've installed a mutex within the function like so:
void ModelLoader::loadBones(uint meshIndex, const aiMesh *mesh, std::vector<VertexBoneData> &bones)
{
    std::mutex mut;
    std::lock_guard<std::mutex> lock(mut);

    // load bones
}
This is only a quick and dirty fix. It might not work for everyone, and there's no guarantee the program will run crash-less.
Here are some testing results:
Sans threading & mutex: program runs 0 out of 3 times in a row
With threading; sans mutex: program runs 2 out of 3 times in a row
With threading & mutex: program runs 3 out of 3 times in a row
If you're using Linux, remember to link pthread as well as including <thread> and <mutex>. Suggestions on thread-optimisation are welcome!

OpenCL struct values correct on CPU but not on GPU

I have a struct in a file which is included by both the host code and the kernel:
typedef struct {
    float x, y, z,
          dir_x, dir_y, dir_z;
    int radius;
} WorklistStruct;
I'm building this struct in my C++ host code and passing it via a buffer to the OpenCL kernel.
If I choose a CPU device for computation, I get the following result:
printf ( "item:[%f,%f,%f][%f,%f,%f]%d,%d\n", item.x, item.y, item.z, item.dir_x, item.dir_y,
item.dir_z , item.radius ,sizeof(float));
Host:
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
Device (CPU):
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
And if I choose a GPU device (AMD) for computation weird things are happening:
Host:
item:[58.406261,57.786015,58.137501][2.000000,2.000000,2.000000]2,4
Device (GPU):
item:[58.406261,2.000000,0.000000][0.000000,0.000000,0.000000]0,0
Notably, the sizeof(float) is garbage on the GPU.
I assume there is a problem with the layouts of floats on different devices.
Note: the struct is contained in an array of structs of this type, and every struct in this array is garbage on the GPU.
Does anyone have an idea why this is the case and how I can predict this?
EDIT: I added a %d at the end and replaced it with a 1; the result is: 1065353216
EDIT: here are the two structs which I'm using:
typedef struct {
    float x, y, z,              // base coordinates
          dir_x, dir_y, dir_z;  // direction
    int radius;                 // radius
} WorklistStruct;

typedef struct {
    float base_x, base_y, base_z; // base point
    float radius;                 // radius
    float dir_x, dir_y, dir_z;    // initial direction
} ReturnStruct;
I tested some other things; it looks like a problem with printf. The values seem to be right: I passed the arguments to the return struct, read them back, and those values were correct.
I don't want to post all of the related code, that would be a few hundred lines.
If no one has an idea I will compress this a bit.
Ah, and for printing I'm using #pragma OPENCL EXTENSION cl_amd_printf : enable.
Edit:
It really looks like a problem with printf. I simply don't use it anymore.
There is a simple method to check what happens:
1 - Create host-side data & initialize it:
int num_points = 128;

std::vector<WorklistStruct> works(num_points);
std::vector<ReturnStruct> returns(num_points);

for(WorklistStruct &work : works){
    work = InitializeItSomehow();
    std::cout << work.x << " " << work.y << " " << work.z << std::endl;
    std::cout << work.radius << std::endl;
}

// Same stuff with returns
...
2 - Create Device-side buffers using COPY_HOST_PTR flag, map it & check data consistency:
cl::Buffer dev_works(..., COPY_HOST_PTR, (void*)&works[0]);
cl::Buffer dev_rets(..., COPY_HOST_PTR, (void*)&returns[0]);

// Then map it to check data
WorklistStruct *mapped_works = dev_works.Map(...);
ReturnStruct *mapped_rets = dev_rets.Map(...);

// Output values & unmap buffers
...
3 - Check data consistency on Device side as you did previously.
Also, make sure that the code (presumably a header) which is included by both the kernel and the host-side code is pure OpenCL C (the AMD compiler can sometimes "swallow" errors), and that you've added the directory for include searching when building the OpenCL kernel (the "-I" flag at the clBuildProgram stage).
Edited:
At every step, please collect return codes (or catch exceptions). Besides that, the "-Werror" flag at the clBuildProgram stage can also be helpful.
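A compressed sketch of steps 1-2 with the C++ wrapper (cl.hpp), assuming a context and queue already exist and works is the vector from step 1; error handling is reduced to the err parameter for brevity:

#include <CL/cl.hpp>
#include <cassert>
#include <vector>

void checkLayout(cl::Context& context, cl::CommandQueue& queue,
                 std::vector<WorklistStruct>& works)
{
    cl_int err = CL_SUCCESS;

    // Step 2: device-side buffer created from the host data (COPY_HOST_PTR).
    cl::Buffer dev_works(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                         sizeof(WorklistStruct) * works.size(), works.data(), &err);

    // Map the buffer back and compare it with the host vector.
    WorklistStruct* mapped = (WorklistStruct*)queue.enqueueMapBuffer(
        dev_works, CL_TRUE /*blocking*/, CL_MAP_READ,
        0, sizeof(WorklistStruct) * works.size(), nullptr, nullptr, &err);

    for (size_t i = 0; i < works.size(); ++i)
        assert(mapped[i].radius == works[i].radius); // spot-check one field of the layout

    queue.enqueueUnmapMemObject(dev_works, mapped);
}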
It looks like I used the wrong OpenCL headers for compiling. If I try the code on the Intel platform (OpenCL 1.2), everything is fine. But on my AMD platform (OpenCL 1.1) I get weird values.
I will try other headers.