Unexpected value upon accessing an SSBO float - c++

I am trying to calculate a morph offset for a GPU-driven animation.
To that end I have the following function (and SSBOs):
layout(std140, binding = 7) buffer morph_buffer
{
vec4 morph_targets[];
};
layout(std140, binding = 8) buffer morph_weight_buffer
{
float morph_weights[];
};
vec3 GetMorphOffset()
{
vec3 offset = vec3(0);
for(int target_index=0; target_index < target_count; target_index++)
{
float w1 = morph_weights[1];
offset += w1 * morph_targets[target_index * vertex_count + gl_VertexIndex].xyz;
}
return offset;
}
I am seeing strange behaviour, so I opened RenderDoc to trace the state:
As you can see, index 1 of the morph_weights SSBO is 0. However, if I step over that line in RenderDoc's built-in debugger I obtain:
Or in short, the value I get back is 1, not 0.
So I did a little experiment and changed one of the values, and now the SSBO looks like this:
And now I get this:
So it seems my SSBO of floats is being treated like an SSBO of vec4s. I am aware of alignment issues with vec3s, but IIRC floats are fair game. What is happening?

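After asking around, I found the cause.
The SSBO is declared as std140, which rounds the stride of every array element up to the size of a vec4 (16 bytes), so morph_weights[1] is read from byte 16 instead of byte 4. The correct layout for a tightly packed float array is std430.
For the Vulkan GLSL dialect, an alternative is to use the scalar layout qualifier (GL_EXT_scalar_block_layout).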
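The effect of the two layouts on a float array can be sketched on the CPU side. The following C++ is a minimal illustration, assuming the application uploads tightly packed floats; the helper names are hypothetical and only the stride rule for arrays of scalars is modeled.

```cpp
#include <cassert>
#include <cstddef>

// Round v up to the next multiple of a.
constexpr std::size_t round_up(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

// Array stride for an array of scalars with the given size in bytes.
// std140 rounds the element stride up to 16 bytes (the size of a vec4).
constexpr std::size_t std140_scalar_array_stride(std::size_t scalar_size) {
    return round_up(scalar_size, 16);
}

// std430 drops that rounding, so scalars stay tightly packed.
constexpr std::size_t std430_scalar_array_stride(std::size_t scalar_size) {
    return scalar_size;
}

// Index of the tightly packed CPU-side float that the shader actually
// reads when it evaluates morph_weights[shader_index].
constexpr std::size_t cpu_float_read(std::size_t shader_index, std::size_t stride) {
    return shader_index * stride / sizeof(float);
}

static_assert(cpu_float_read(1, std140_scalar_array_stride(sizeof(float))) == 4,
              "std140: morph_weights[1] lands on the fifth uploaded float");
static_assert(cpu_float_read(1, std430_scalar_array_stride(sizeof(float))) == 1,
              "std430: morph_weights[1] is the second uploaded float");
```

Under std140 the shader therefore sees every fourth uploaded float, which is consistent with the buffer appearing to be an array of vec4s in the debugger.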

Related

GLSL Compute Shader Setting buffer with lookup table results in no data written, setting the same buffer with other data works

I am attempting to implement a slightly modified version of this standard marching cubes algorithm in a compute shader.
I have reached the stage at which triTable is used to insert the correct vertex indices into a buffer and have modified the table to be 1 dimensional (const int triTable[4096]={-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,8,3...})
The following code shows the error that I am experiencing (this does not implement the algorithm however it demonstrates the current issue fully):
layout(binding=1) buffer Grid
{
float GridData[]; //contains 512*512*512 data volume previously generated, unused in this test case
};
uniform uint marchableCount;
uniform uint pointCount;
layout(std430, binding = 4) buffer X {uvec4 marchableList[];}; //format is x,y,z,cubeIndex
layout(std430, binding = 5) buffer v {vec4 vertices[];};
layout(std430,binding = 6) buffer n {vec4 normals[];};
layout(binding = 7) uniform atomic_uint triCount;
void main()
{
uvec3 gid = marchableList[gl_GlobalInvocationID.x].xyz; //xyz of grid cell
int E = int(edgeTable[marchableList[gl_GlobalInvocationID.x].w]);
if (E != 0)
{
uint cubeIndex = marchableList[gl_GlobalInvocationID.x].w;
uint index = atomicCounterIncrement(triCount);
int tCount = 0;//unused in this test, used for iteration in actual algorithm
int tGet = tCount+16*int(cubeIndex); //correction from converting 2d array to 1d array
vertices[index] = vec4(tGet);
}
}
This code produces the expected values: the vertices buffer is filled with data and the atomic counter increments.
Changing this line:
vertices[index] = vec4(tGet);
to
vertices[index] = vec4(triTable[tGet]);
or
vertices[index] = vec4(triTable[tGet]+1);
(demonstrating that triTable is not coincidentally returning zeros)
results in what appears to be a complete failure of the shader: the buffer is filled with zeros and the atomic counter does not increment. No error messages are output when the shader is compiled. tGet is less than 4096.
The following test cases also produce the correct output:
vertices[index] = vec4(triTable[3]); //-1
vertices[index] = vec4(triTable[4095]); //also -1
showing that triTable is in fact implemented correctly
What causes the shader to have issues in these very specific cases?
I'm more surprised that const int triTable[4096] = {...}; compiles at all. That array, if it is actually needed, is 16KB in size. That's a lot for a shader, even if the array lives in shared memory.
What is most likely happening is that whenever the compiler sees a use of this array that it cannot reduce to a simple value (triTable[3] will always be -1, so the compiler doesn't need to store the whole table for that case), compilation either fails or results in a non-functional shader.
It would be best to make this table a uniform buffer. An SSBO might work too, but some hardware implements uniform blocks through specialized memory rather than with a global memory fetch.
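If the table is moved into a uniform block, its std140 footprint is worth checking first. The sketch below is a hypothetical C++ helper, not part of the original code. It assumes a std140 uniform block and shows that an int[4096] member would occupy 64 KB, because every array element is padded to 16 bytes, while the same data packed as ivec4[1024] stays at 16 KB, which is the minimum GL_MAX_UNIFORM_BLOCK_SIZE an implementation has to support.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kTableEntries = 4096;  // size of triTable in the question

// std140: each element of an int array occupies a full 16-byte slot.
constexpr std::size_t std140_int_array_bytes(std::size_t count) {
    return count * 16;
}

// Packing four ints per ivec4 uses every byte of each 16-byte slot.
constexpr std::size_t std140_ivec4_array_bytes(std::size_t int_count) {
    return (int_count + 3) / 4 * 16;
}

// Where flat index i ends up in the packed ivec4 array.
struct PackedIndex {
    std::size_t element;    // which ivec4
    std::size_t component;  // 0..3 inside that ivec4
};

constexpr PackedIndex packed_index(std::size_t i) {
    return PackedIndex{i / 4, i % 4};
}

static_assert(std140_int_array_bytes(kTableEntries) == 65536, "64 KB as int[]");
static_assert(std140_ivec4_array_bytes(kTableEntries) == 16384, "16 KB as ivec4[]");
```

On the shader side the lookup would then read something like table[i >> 2][i & 3], where table is the hypothetical ivec4 array.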

OpenGL 4.5 - Shader storage buffer objects layout

I'm trying my hand at shader storage buffer objects (aka Buffer Blocks) and there are a couple of things I don't fully grasp. What I'm trying to do is to store the (simplified) data of an indeterminate number of lights n in them, so my shader can iterate through them and perform calculations.
Let me start by saying that I get the correct results, and no errors from OpenGL. However, it bothers me not to know why it is working.
So, in my shader, I got the following:
struct PointLight {
vec3 pos;
float intensity;
};
layout (std430, binding = 0) buffer PointLights {
PointLight pointLights[];
};
void main() {
PointLight light;
for (int i = 0; i < pointLights.length(); i++) {
light = pointLights[i];
// etc
}
}
and in my application:
struct PointLightData {
glm::vec3 pos;
float intensity;
};
class PointLight {
// ...
PointLightData data;
// ...
};
std::vector<PointLight*> pointLights;
glGenBuffers(1, &BBO);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, BBO);
glNamedBufferStorage(BBO, n * sizeof(PointLightData), NULL, GL_DYNAMIC_STORAGE_BIT);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, BBO);
...
for (unsigned int i = 0; i < pointLights.size(); i++) {
glNamedBufferSubData(BBO, i * sizeof(PointLightData), sizeof(PointLightData), &(pointLights[i]->data));
}
In this last loop I'm storing a PointLightData struct with an offset equal to its size times the number of them I've already stored (so offset 0 for the first one).
So, as I said, everything seems correct. Binding points are correctly set to the zeroth, I have enough memory allocated for my objects, etc. The graphical results are OK.
Now to my questions. I am using std430 as the layout; in fact, if I change it to std140, as I originally had it, it breaks. Why is that? My hypothesis is that the layout generated by std430 for the shader's PointLights buffer block happily matches the one generated by the compiler for my application's PointLightData struct (as you can see in that loop, I'm blindly storing one after the other). Do you think that's the case?
Now, assuming I'm correct in that assumption, the obvious solution would be to do the mapping for the sizes and offsets myself, querying opengl with glGetUniformIndices and glGetActiveUniformsiv (the latter called with GL_UNIFORM_SIZE and GL_UNIFORM_OFFSET), but I got the sneaking suspicion that these two guys only work with Uniform Blocks and not Buffer Blocks like I'm trying to do. At least, when I do the following OpenGL throws a tantrum, gives me back a 1281 error and returns a very weird number as the indices (something like 3432898282 or whatever):
const char * names[2] = {
"pos", "intensity"
};
GLuint indices[2];
GLint size[2];
GLint offset[2];
glGetUniformIndices(shaderProgram->id, 2, names, indices);
glGetActiveUniformsiv(shaderProgram->id, 2, indices, GL_UNIFORM_SIZE, size);
glGetActiveUniformsiv(shaderProgram->id, 2, indices, GL_UNIFORM_OFFSET, offset);
Am I correct in saying that glGetUniformIndices and glGetActiveUniformsiv do not apply to buffer blocks?
If they do not, or if the fact that it's working is, as I imagine, just a coincidence, how could I do the mapping manually? I checked appendix H of the programming guide and the wording for arrays of structures is somewhat confusing. If I can't query OpenGL for sizes/offsets for what I'm trying to do, I guess I could compute them manually (cumbersome as it is), but I'd appreciate some help there, too.
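For this particular struct the std430 offsets can be worked out by hand. The C++ below is a sketch under the std430 rules (a vec3 has a base alignment of 16 bytes and a size of 12, a float has alignment and size 4, and the array stride is the struct size rounded up to its largest member alignment). Vec3 is a stand-in for glm::vec3 and the constant names are hypothetical. For querying buffer variables at run time, OpenGL 4.3 and later offer the program interface query (glGetProgramResourceIndex and glGetProgramResourceiv with GL_BUFFER_VARIABLE) rather than the glGetUniform* family.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };  // stand-in for glm::vec3 (three packed floats)

struct PointLightData {
    Vec3 pos;
    float intensity;
};

constexpr std::size_t align_up(std::size_t v, std::size_t a) {
    return (v + a - 1) / a * a;
}

// std430 layout of the GLSL struct PointLight, computed by hand.
constexpr std::size_t kPosOffset = 0;                                     // vec3: align 16
constexpr std::size_t kIntensityOffset = align_up(kPosOffset + 12, 4);    // float: align 4
constexpr std::size_t kArrayStride = align_up(kIntensityOffset + 4, 16);  // struct: align 16

// The C++ struct happens to have the same offsets and size, so copying
// PointLightData objects back to back matches the shader's view.
static_assert(offsetof(PointLightData, pos) == kPosOffset, "pos offset");
static_assert(offsetof(PointLightData, intensity) == kIntensityOffset, "intensity offset");
static_assert(sizeof(PointLightData) == kArrayStride, "array stride");
```

The match relies on the float filling the 4 bytes of padding that would otherwise follow the vec3; a struct with two consecutive vec3 members would not line up this way.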

How to extend vertex shader capabilities for GPGPU

I'm trying to implement a Scrypt hasher (for an LTC miner) in GLSL (don't ask me why).
And, actually, I'm stuck on the HMAC SHA-256 algorithm. Although I've implemented SHA-256 correctly (it returns the correct hash for its input), the fragment shader stops compiling when I add the last step (hashing the previous hash concatenated with oKey).
The shader can't do more than three rounds of SHA-256; it just stops compiling. What are the limits? It doesn't use much memory: 174 vec2 objects in total. It doesn't seem to be related to memory, because an extra SHA-256 round doesn't require new memory. And it doesn't seem to be related to viewport size: it fails on both 1x1 and 1x128 viewports.
I started writing the miner in WebGL, but after hitting the limit I tried running the same shader in Qt on full-featured OpenGL. As a result, desktop OpenGL allows one SHA-256 round fewer than OpenGL ES in WebGL (why?).
Forgot to mention: the shader fails at the link stage. The shader itself compiles fine, but linking the program fails.
I don't use any textures, extensions, slow things, etc. Just a simple square (4 vec2 vertices) and several uniforms for the fragment shader.
The input data is just 80 bytes and the result of the fragment shader is binary (black or white), so the task ideally fits GLSL principles.
My video card is a Radeon HD7970 with plenty of VRAM, which is able to fit hundreds of scrypt threads (scrypt uses 128kB per hash, but I can't even get HMAC-SHA-256 working). My card supports OpenGL 4.4.
I'm a newbie in OpenGL and may be misunderstanding something. I understand that the fragment shader runs for each pixel separately, but with a 1x128 viewport there are only 128x348 bytes used. Where is the limit of the fragment shader?
Here is the common code I use, to show how I'm trying to solve the problem.
uniform vec2 base_nonce[2];
uniform vec2 header[20]; /* Header of the block */
uniform vec2 H[8];
uniform vec2 K[64];
void sha256_round(inout vec2 w[64], inout vec2 t[8], inout vec2 hash[8]) {
for (int i = 0; i < 64; i++) {
if( i > 15 ) {
w[i] = blend(w[i-16], w[i-15], w[i-7], w[i-2]);
}
_s0 = e0(t[0]);
_maj = maj(t[0],t[1],t[2]);
_t2 = safe_add(_s0, _maj);
_s1 = e1(t[4]);
_ch = ch(t[4], t[5], t[6]);
_t1 = safe_add(safe_add(safe_add(safe_add(t[7], _s1), _ch), K[i]), w[i]);
t[7] = t[6]; t[6] = t[5]; t[5] = t[4];
t[4] = safe_add(t[3], _t1);
t[3] = t[2]; t[2] = t[1]; t[1] = t[0];
t[0] = safe_add(_t1, _t2);
}
for (int i = 0; i < 8; i++) {
hash[i] = safe_add(t[i], hash[i]);
t[i] = hash[i];
}
}
void main () {
vec2 key_hash[8]; /* Our SHA-256 hash */
vec2 i_key[16];
vec2 i_key_hash[8];
vec2 o_key[16];
vec2 nonced_header[20]; /* Header with nonce */
set_nonce_to_header(nonced_header);
vec2 P[32]; /* Padded SHA-256 message */
pad_the_header(P, nonced_header);
/* Hash HMAC secret key */
sha256(P, key_hash);
/* Make iKey and oKey */
for(int i = 0; i < 16; i++) {
if (i < 8) {
i_key[i] = xor(key_hash[i], vec2(0x3636, 0x3636));
o_key[i] = xor(key_hash[i], vec2(0x5c5c, 0x5c5c));
} else {
i_key[i] = vec2(0x3636, 0x3636);
o_key[i] = vec2(0x5c5c, 0x5c5c);
}
}
/* SHA256 hash of iKey */
for (int i = 0; i < 8; i++) {
i_key_hash[i] = H[i];
t[i] = i_key_hash[i];
}
for (int i = 0; i < 16; i++) { w[i] = i_key[i]; }
sha256_round(w, t, i_key_hash);
gl_FragColor = toRGBA(i_key_hash[0]);
}
What solutions can I use to improve the situation? Is there something cool in OpenGL 4.4 or in OpenGL ES 3.1? Is it even possible to do such calculations and keep so much (128kB) in a fragment shader? What are the limits for the vertex shader? Can I do the same in the vertex shader instead of the fragment shader?
I'll try to answer my own question.
A shader is a small processor with limited registers and cache memory. There are also limits on instruction execution. So the whole approach of fitting everything into one fragment shader is wrong.
On the other hand, you can switch shader programs tens or hundreds of times during a render. That is normal practice.
It is necessary to divide a big computation into smaller parts and render them separately. Use render-to-texture to save your work.
According to WebGL statistics, 96.5% of clients have MAX_TEXTURE_SIZE equal to 4096. That gives you 32 megabytes of memory, which can hold the working data for 256 threads of scrypt computation.

OpenGL - Calling glBindBufferBase with index = 1 breaks rendering (Pitch black)

There's an array of uniform blocks in my shader which is defined as such:
layout (std140) uniform LightSourceBlock
{
int shadowMapID;
int type;
vec3 position;
vec4 color;
float dist;
vec3 direction;
float cutoffOuter;
float cutoffInner;
float attenuation;
} LightSources[12];
To be able to bind my buffer objects to each LightSource, I've bound each uniform to a uniform block index:
for(unsigned int i=0;i<12;i++)
glUniformBlockBinding(program,locLightSourceBlock[i],i); // locLightSourceBlock contains the locations of each element in LightSources[]
When rendering, I'm binding my buffers to the respective index using:
glBindBufferBase(GL_UNIFORM_BUFFER,i,buffer);
This works fine, as long as I only bind a single buffer to the binding index 0. As soon as there's more, everything is pitch black (Even things that use entirely different shaders). (glGetError returns no errors)
If I change the block indices range from 0-11 to 2-13 (Skipping index 1), everything works as it should. I figured if I use index 1, I'm overwriting something, but I don't have any other uniform blocks in my shader, and I'm not using glUniformBlockBinding or glBindBufferBase anywhere else in my code, so I'm not sure.
What could be causing such behavior? Is the index 1 reserved for something?
1) Don't use multiple blocks. Use one block with an array. Something like this:
struct Light{
...
};
// uniform blocks take std140; std430 is only valid on buffer (SSBO) blocks
layout(std140, binding=0) uniform lightBuffer{
Light lights[42];
};
Skip glUniformBlockBinding and only call glBindBufferBase with the index specified in the shader.
2) Read up on alignment for std140 and std430. In short, buffer variables are aligned so that they don't cross 128-bit boundaries. So in your case position would start at byte 16 (not 8). This results in a mismatch between CPU-side and GPU-side access. (Reorder the variables or add padding.)
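The padding described in point 2 can be checked on the CPU side. This is a sketch, assuming the block keeps the member order from the question and that the application mirrors it with a plain C++ struct; Vec3, Vec4 and LightSourceStd140 are hypothetical names. alignas(16) reproduces the 16-byte base alignment that std140 gives to vec3 and vec4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

// CPU mirror of one LightSourceBlock under std140.
struct LightSourceStd140 {
    std::int32_t shadowMapID;    // offset 0
    std::int32_t type;           // offset 4
    alignas(16) Vec3 position;   // offset 16 (8 bytes of padding before it)
    alignas(16) Vec4 color;      // offset 32
    float dist;                  // offset 48
    alignas(16) Vec3 direction;  // offset 64
    float cutoffOuter;           // offset 76 (fills the slot after the vec3)
    float cutoffInner;           // offset 80
    float attenuation;           // offset 84
};

static_assert(offsetof(LightSourceStd140, position) == 16, "vec3 aligned to 16");
static_assert(offsetof(LightSourceStd140, color) == 32, "vec4 aligned to 16");
static_assert(offsetof(LightSourceStd140, dist) == 48, "float follows vec4");
static_assert(offsetof(LightSourceStd140, direction) == 64, "vec3 aligned to 16");
static_assert(offsetof(LightSourceStd140, cutoffOuter) == 76, "float packs after vec3");
static_assert(offsetof(LightSourceStd140, attenuation) == 84, "last member");
```

A tightly packed C++ struct without the alignas specifiers would place position at byte 8, so every member after type would be read from the wrong offset by the shader.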

GLSL Channel Selection

I have a GLSL shader that reads from one of the channels (e.g. R) of an input texture and then writes to the same channel in an output texture. This channel has to be selected by the user.
What I can think of right now is to just use an int uniform and tons of if-statements:
uniform sampler2D uTexture;
uniform int uChannelId;
varying vec2 vUv;
void main() {
//read in data from texture
vec4 t = texture2D(uTexture, vUv);
float data;
if (uChannelId == 0) {
data = t.r;
} else if (uChannelId == 1) {
data = t.g;
} else if (uChannelId == 2) {
data = t.b;
} else {
data = t.a;
}
//process the data...
float result = data * 2.0; //for example
//write out
if (uChannelId == 0) {
gl_FragColor = vec4(result, t.g, t.b, t.a);
} else if (uChannelId == 1) {
gl_FragColor = vec4(t.r, result, t.b, t.a);
} else if (uChannelId == 2) {
gl_FragColor = vec4(t.r, t.g, result, t.a);
} else {
gl_FragColor = vec4(t.r, t.g, t.b, result);
}
}
Is there any way of doing something like a dictionary access such as t[uChannelId]?
Or perhaps I should have 4 different versions of the same shader, each of which processes a different channel, so that I can avoid all the if-statements?
What is the best way to do this?
EDIT: To be more specific, I am using WebGL (Three.js)
There is such a way, and it is as simple as what you wrote in the question: just use t[uChannelId]. To quote the GLSL spec (this is from version 3.30, section 5.5, but it applies to other versions as well):
Array subscripting syntax can also be applied to vectors to provide numeric indexing. So in
vec4 pos;
pos[2] refers to the third element of pos and is equivalent to pos.z. This allows variable indexing into a
vector, as well as a generic way of accessing components. Any integer expression can be used as the
subscript. The first component is at index zero. Reading from or writing to a vector using a constant
integral expression with a value that is negative or greater than or equal to the size of the vector is illegal.
When indexing with non-constant expressions, behavior is undefined if the index is negative, or greater
than or equal to the size of the vector.
Note that in the first part of your code you use this to access a specific channel of a texture. You could also use the ARB_texture_swizzle functionality. In that case, you would just use a fixed channel, say r, for access in the shader and swizzle the actual texture channels so that whatever channel you want to access becomes r.
Update: as the target platform turned out to be WebGL, these suggestions are not available. However, a simple solution would be to use a vec4 uniform in place of uChannelId which is 1.0 for the selected component and 0.0 for all others. Say this variable is called uChannelSel. You could use data=dot(t, uChannelSel) in the first part and gl_FragColor=(vec4(1.0)-uChannelSel) * t + uChannelSel*result for the second part.
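The mask arithmetic can be tried out on the CPU. The following C++ is a small sketch that mirrors the two GLSL expressions with a std::array standing in for vec4; channel_mask, read_channel and write_channel are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Component-wise dot product, like GLSL dot(a, b).
float dot(const Vec4& a, const Vec4& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) sum += a[i] * b[i];
    return sum;
}

// One-hot mask: 1.0 for the selected channel, 0.0 for the others.
Vec4 channel_mask(std::size_t channel) {
    Vec4 sel{};  // value-initialized to all zeros
    sel[channel] = 1.0f;
    return sel;
}

// data = dot(t, uChannelSel)
float read_channel(const Vec4& t, const Vec4& sel) {
    return dot(t, sel);
}

// gl_FragColor = (vec4(1.0) - uChannelSel) * t + uChannelSel * result
Vec4 write_channel(const Vec4& t, const Vec4& sel, float result) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = (1.0f - sel[i]) * t[i] + sel[i] * result;
    }
    return out;
}
```

With t = (1, 2, 3, 4) and the mask for channel 1, read_channel yields 2 and write_channel with the doubled value leaves every other component untouched.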
As I'm sure you know, branching can be expensive in shaders. However, it sounds like it'll always be the same channel in a pass (yes?), so you might maintain enough coherence to see good performance.
It's been a good while since I've used GLSL, but if you're using a newer version, maybe you could do some bitwise shifting (<< or >>) magic? You would read the texture into an int instead of a vec4, then shift it by a number of bits depending on which channel you want to read.