Playing around with bindless rendering, I have one big static SSBO that holds my vertex data. The vertices are packed in memory as a contiguous array where each vertex has the following layout:
| Position (floats)                              | Normal (snorm shorts) | Pad   |
+---------------+---------------+---------------+-------+-------+-------+-------+
|      P.x      |      P.y      |      P.z      |  N.x  |  N.y  |  N.z  |       |
+---------------+---------------+---------------+-------+-------+-------+-------+
|     float     |     float     |     float     |      uint     |      uint     |
+---------------+---------------+---------------+---------------+---------------+
Note how each vertex is 20 bytes / 5 "words" / 1.25 vec4s. Not exactly a round number for a GPU. So instead of adding a bunch of padding and using unnecessary memory, I have opted to unpack the data "manually".
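For reference, a minimal C++ sketch of how one such packed vertex could be mirrored on the CPU side (the struct name is mine, not from the original code):

#include <cstdint>

// 12 bytes of position + 6 bytes of snorm16 normal + 2 bytes of padding = 20 bytes (5 words).
struct PackedVertex
{
    float        px, py, pz; // Position
    std::int16_t nx, ny, nz; // Normal, snorm16
    std::int16_t pad;        // Padding to reach a 4-byte boundary
};
static_assert(sizeof(PackedVertex) == 20, "expected 5 words per vertex");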
Vertex shader:
...
layout(std430, set = 0, binding = 1)
readonly buffer FloatStaticBuffer
{
    float staticBufferFloats[];
};

layout(std430, set = 0, binding = 1) // Using the same binding?!
readonly buffer UintStaticBuffer
{
    uint staticBufferUInts[];
};
...
void main()
{
    const uint vertexBaseDataI = gl_VertexIndex * 5u;

    // Unpack position
    const vec3 position = vec3(
        staticBufferFloats[vertexBaseDataI + 0u],
        staticBufferFloats[vertexBaseDataI + 1u],
        staticBufferFloats[vertexBaseDataI + 2u]);

    // Unpack normal
    const vec3 normal = vec3(
        unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 3u]),
        unpackSnorm2x16(staticBufferUInts[vertexBaseDataI + 4u]).x);
    ...
}
It is awfully convenient to be able to "alias" the buffer as both float and uint data.
The question: is "aliasing" a SSBO this way a terrible idea, and I'm just getting lucky, or is this actually a valid option that would work across platforms?
Alternatives:
Use just one buffer, say staticBufferUInts, and then use uintBitsToFloat to extract the positions. Not a big deal, but might have a small performance cost?
Bind the same buffer twice on the CPU to two different bindings. Again, not a big deal, just slightly annoying (a sketch of this is shown below).
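For the second alternative, a rough sketch of what binding the same VkBuffer to two different storage-buffer bindings might look like (the handles, function name and binding numbers are placeholders, not taken from the original setup):

#include <vulkan/vulkan.h>

// Hypothetical: expose one VkBuffer as two storage-buffer descriptors so the
// shader can declare a float view and a uint view on separate bindings.
void bindStaticBufferTwice(VkDevice device, VkDescriptorSet set, VkBuffer staticBuffer)
{
    VkDescriptorBufferInfo info{};
    info.buffer = staticBuffer;
    info.offset = 0;
    info.range  = VK_WHOLE_SIZE;

    VkWriteDescriptorSet writes[2]{};
    for (uint32_t i = 0; i < 2; ++i)
    {
        writes[i].sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
        writes[i].dstSet          = set;
        writes[i].dstBinding      = 1 + i; // e.g. binding 1 = float view, binding 2 = uint view
        writes[i].descriptorCount = 1;
        writes[i].descriptorType  = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        writes[i].pBufferInfo     = &info;
    }
    vkUpdateDescriptorSets(device, 2, writes, 0, nullptr);
}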
Vulkan allows incompatible resources to alias in memory as long as no malformed values are read from it. (Actually, I think it's allowed even when you read from the invalid sections - you should just get garbage. But I can't find the section of the standard right now that spells this out. The Vulkan standard is way too complicated.)
From the standard, section "Memory Aliasing":
Otherwise, the aliases interpret the contents of the memory differently, and writes via one alias make the contents of memory partially or completely undefined to the other alias. If the first alias is a host-accessible subresource, then the bytes affected are those written by the memory operations according to its addressing scheme. If the first alias is not host-accessible, then the bytes affected are those overlapped by the image subresources that were written. If the second alias is a host-accessible subresource, the affected bytes become undefined. If the second alias is not host-accessible, all sparse image blocks (for sparse partially-resident images) or all image subresources (for non-sparse image and fully resident sparse images) that overlap the affected bytes become undefined.
Note that the standard talks about bytes being written and becoming undefined in aliasing resources. It's not the entire resource that becomes invalid.
Look at it this way: you have two aliasing SSBOs (in reality just one buffer that's bound twice) with different element types (float and uint). Any bytes that you wrote floats into became valid in the "float view" and invalid in the "uint view" the moment you wrote into the buffer. The same goes for the uints: the bytes occupied by them became valid in the uint view but invalid in the float view. According to the standard, this means that both views have invalid sections in them; however, neither of them is fully invalid. In particular, the sections you care about are still valid and may be read from.
In short: It's allowed.
I have a C++ function (wrapped in Objective-C files in Xcode):
int64_t* findEdges(int64_t* pixels, int width, int height);
that I'd like to call from Swift 3 and pass in a buffer full of picture data. After hunting around I'm calling it with:
var ptr = (NSData(data: imageRep.tiffRepresentation!).bytes).bindMemory(to: Int64.self, capacity: 4 * height * width).pointee
let processor = findEdges(&ptr, width, height)
But after accessing around 30 or 40 addresses in the C++ file I get an EXC_BAD_ACCESS crash.
Is the problem that I'm passing unsafe pointers from Swift? What would be the correct call procedure?
Here are at least some of the problems with this approach. There may be more, since I don't know exactly what findEdges() expects as input or how that function finds the edges.
NSData's bytes property is a raw pointer not bound to any type. The call to bindMemory then indicates that the NSData's content is to be treated as a buffer of 4 * height * width of 8-byte integers, i.e. a buffer of 32 * height * width bytes. I haven't worked with the TIFF format lately, but I strongly suspect that a TIFF representation of an image of width x height size would contain a lot less bytes than that, so even if the buffer was successfully passed to findEdges, trying to treat it as much larger than it is would lead to an access violation.
The first 8 bytes of the image's TIFF representation are treated as an Int64 and copied to the ptr variable, the address of which is then passed to findEdges, which treats it as the address of a buffer of 4 * height * width Int64 values. However, only the 1st Int64 in the buffer has anything to do with the image (it contains the first 8 bytes of its TIFF representation). When findEdges accesses the 2nd Int64 in the pixels array, it accesses memory having nothing to do with the image. It may be lucky accessing a few more (garbage) Int64 values, but will eventually try to access something it can't.
The solution depends on whether the pixels buffer required by findEdges contains the exact same byte sequence as the image's TIFF representation, or whether some transformation is required. In other words, can we say that the first 8 bytes of the TIFF representation form the first element of the pixels Int64 array, the second 8 bytes the second element, and so on?
Assuming the buffer can be passed to the C++ function as is, here is a brief simplified example, which you can adapt to your needs. Let's say the C++ function takes an array of shorts with its size and looks like this:
void processBuffer(int16_t * buf, int count) {
...
}
We want to pass the contents of a Data from Swift as a buffer. Here is how one might go about doing that:
var myData = ...
myData.withUnsafeMutableBytes({ (p: UnsafeMutablePointer<Int16>) -> Void in
    processBuffer(p, Int32(myData.count / 2))
})
Please note that the buffer can be modified, not just read, in the C++ code, and the changes will be reflected in myData.
I'm writing a ray tracer using OpenGL compute shaders; to pass data to and from the shaders I use buffers.
When the size of the vec2 output buffer (which is equal to the number of rays multiplied by the number of faces) reaches ~30 MB, attempting to map the buffer consistently returns a NULL pointer. Range mapping also fails.
I can't find any info about GL_SHADER_STORAGE_BUFFER size limitations in the OpenGL documentation, but maybe someone can help me: is ~30 MB a limit, or could this mapping failure happen because of something different?
And is there any way to avoid this other than calling the shader multiple times?
Data declaration in shader:
#version 440
layout(std430, binding=0) buffer rays{
    vec4 r[];
};
layout(std430, binding=1) buffer faces{
    vec4 f[];
};
layout(std430, binding=2) buffer outputs{
    vec2 o[];
};
uniform int face_count;
uniform vec4 origin;
Calling code (using some Qt5 wrappers):
QOpenGLBuffer ray_buffer;
QOpenGLBuffer face_buffer;
QOpenGLBuffer output_buffer;
QVector<QVector2D> output;
output.resize(rays.size()*faces.size());
if(!ray_buffer.create()) { /*...*/ }
if(!ray_buffer.bind()) { /*...*/ }
ray_buffer.allocate(rays.data(), rays.size()*sizeof(QVector4D));
if(!face_buffer.create()) { /*...*/ }
if(!face_buffer.bind()) { /*...*/ }
face_buffer.allocate(faces.data(), faces.size()*sizeof(QVector4D));
if(!output_buffer.create()) { /*...*/ }
if(!output_buffer.bind()) { /*...*/ }
output_buffer.allocate(output.size()*sizeof(QVector2D));
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ray_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, face_buffer.bufferId());
ogl->glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, output_buffer.bufferId());
int face_count = faces.size();
compute.setUniformValue("face_count", face_count);
compute.setUniformValue("origin", pos);
ogl->glDispatchCompute(rays.size()/256, faces.size(), 1);
ray_buffer.destroy();
face_buffer.destroy();
QVector2D* data = (QVector2D*)output_buffer.map(QOpenGLBuffer::ReadOnly);
First of all, you have to understand that the OpenGL specification defines minimum maxima for a variety of values (the ones with a MAX_* prefix). That means that implementations are required to provide at least the specified amount as the maximum value, but are free to increase the limit as implementors see fit. This way, developers can rely on at least that much being available, but can still make provisions for possibly larger values.
Section 23 - State Tables summarizes what has been previously specified in the corresponding sections. The information you were looking for is found in table 23.64 - Implementation Dependent Aggregate Shader Limits (cont.). If you want to know about which state belongs where (because there is per-object state, quasi-global state, program state and so on), you go to section 23.
The minimum maximum size of a shader storage buffer is represented by the symbolic constant MAX_SHADER_STORAGE_BLOCK_SIZE as per section 7.8 of the core OpenGL 4.5 specification.
Since their adoption into core, the required size (i.e. the minimum maximum) has been significantly increased. In core OpenGL 4.3 and 4.4, the minimum maximum was pow(2, 24) bytes (i.e. 16 MB, with 1 MB = 1024^2 bytes); in core OpenGL 4.5 this value is pow(2, 27) bytes (128 MB).
Summary: When in doubt about OpenGL state, refer to section 23 of the core specification.
From OpenGL Wiki:
SSBOs can be much larger. The OpenGL spec guarantees that UBOs can be up to 16KB in size (implementations can allow them to be bigger). The spec guarantees that SSBOs can be up to 128MB. Most implementations will let you allocate a size up to the limit of GPU memory.
OpenGL < 4.5 guarantees only 16 MiB (OpenGL 4.5 increased the minimum to 128 MiB); you can use glGet() to query whether your implementation allows you to bind more:
GLint64 max;
glGetInteger64v(GL_MAX_SHADER_STORAGE_BLOCK_SIZE, &max);
In fact, the problem seems to be in the Qt wrappers. I didn't look into it in depth, but when I changed QOpenGLBuffer's create(), bind(), allocate() and map() to glCreateBuffers(), glBindBuffer(), glNamedBufferData() and glMapNamedBuffer(), all called through QOpenGLFunctions_4_5_Core, the memory problem was gone until I reached 2 GB (which is the GPU's physical memory limit).
A second mistake I had made was not using glMemoryBarrier(), though adding it didn't help while QOpenGLBuffer was still in use.
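For illustration, a minimal sketch of that direct GL 4.5 path for the output buffer (the names, sizes, usage hint and barrier bit are my assumptions; the calls are assumed to be resolved through a GL 4.5 function loader such as the QOpenGLFunctions_4_5_Core instance above):

GLuint outputBuffer = 0;
glCreateBuffers(1, &outputBuffer);                       // DSA: no bind needed to allocate
glNamedBufferData(outputBuffer, outputCount * 2 * sizeof(GLfloat), nullptr, GL_DYNAMIC_READ);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, outputBuffer);

glDispatchCompute(rayCount / 256, faceCount, 1);
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);           // make shader writes visible to the map below

const float *results = static_cast<const float *>(glMapNamedBuffer(outputBuffer, GL_READ_ONLY));
// ... read the vec2 results (two floats per element) ...
glUnmapNamedBuffer(outputBuffer);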
I've been writing something using OpenGL 3.3 which takes a uniform buffer and uses the information from it to select sprite tiles in a frag shader. It's working on my desktop, with an Nvidia GTX 780, but my AMD-based laptop (A6-4455M) has some issues with it. Both are on the latest (or very recent) drivers.
Back to the code: it first sets up a uniform buffer, which consists of two uints and a uint array. They then get filled and are accessed in the shader. At first I got a GL error on the laptop because I was not allocating enough, but a temporary change taking padding into account has sorted that out, and now data is actually being buffered.
The first two uints are no problem. I've also got the array somewhat readable in the shader; there is just one problem: the data is multiplied by four! At the moment the array is just some test data, initialized to its index, so spriteArr[1] == 1, spriteArr[34] == 34, etc. However, accessing it in the shader, spriteArr[10] gives 40. This goes all the way up to spriteArr[143] == 572; beyond that it's something else. I don't know exactly why this is, but it would appear to be an incorrect offset.
I am using the shared uniform layout, and getting the uniform offsets from GL itself, so they should be correct. I did notice that the offsets on the AMD card are much larger, as if it is adding more padding. They are always 0,4,8 on the desktop, but 0,16,32 on the laptop.
If it makes any difference, there is another UBO (binding point 0), which is used for the view and projection matrices. These work as intended. However it is not used in the fragment shader. It is also created before this UBO.
UBO initialisation code:
GLuint spriteUBO;
glGenBuffers(1, &spriteUBO);
glBindBuffer(GL_UNIFORM_BUFFER, spriteUBO);
unsigned maxsize = (2 + 576 + 24) * sizeof(GLuint);
/*Bad I know, but temporary. AMD's driver adds 24 bytes of padding. Nvidias has none.
Not the cause of this problem. At least ensures we have enough allocated. */
glBufferData(GL_UNIFORM_BUFFER, maxsize, NULL, GL_STATIC_DRAW);
glBindBuffer(GL_UNIFORM_BUFFER, 0);
//Set binding point
GLuint spriteUBOIndex = glGetUniformBlockIndex(programID, "SpriteMatchData");
glUniformBlockBinding(programID, spriteUBOIndex, 1);
static const GLchar *unames[] =
{
"width", "height",
//"size",
"spriteArr"
};
GLuint uindices[3];
GLint offsets[3];
glGetUniformIndices(programID,3,unames,uindices);
glGetActiveUniformsiv(programID, 3, uindices, GL_UNIFORM_OFFSET, offsets);
//buffer stuff
glBindBufferBase(GL_UNIFORM_BUFFER, 1, spriteUBO);
glBufferSubData(GL_UNIFORM_BUFFER,offsets[0], sizeof(GLuint), tm.getWidth());
glBufferSubData(GL_UNIFORM_BUFFER, offsets[1], sizeof(GLuint), tm.getHeight());
glBufferSubData(GL_UNIFORM_BUFFER, offsets[2], tm.getTileCount() * sizeof(GLuint), tm.getSpriteArray());
Fragment Shader:
layout (shared) uniform SpriteMatchData
{
    uint width, height;
    uint spriteArr[576];
};
Then later on I experiment with the array with something like this:
if(spriteArr[10] == uint(40))
{
    debug_colour = vec4(0.0, 1.0, 0.0, 0.0); // green
}
else
{
    debug_colour = vec4(1.0, 0.0, 0.0, 0.0); // red
}
With debug_colour turning green in this instance.
Is there any way to sort this out with something that works with both systems? Why is the AMD driver handling this so differently? Could it be a bug in the way it deals with uniform uint arrays?
Why is the AMD driver handling this so differently?
Because that's what you asked for:
layout (shared) uniform SpriteMatchData
You explicitly asked for shared layout. That layout is implementation defined. Therefore, two different implementations are allowed to give you two different layouts. As such, if you want to use SpriteMatchData in a platform-independent way, you must query its layout from the program after linking it.
While you did query the offsets for the values, you did not query the array stride: the byte offset from element to element within the array. There is nothing in the specification that requires that shared layouts tightly pack arrays.
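For example, a short sketch of that query (it reuses programID and offsets from the setup code above; the variable names are mine):

// Ask GL how far apart consecutive spriteArr elements are in this shared layout.
GLuint arrIndex = GL_INVALID_INDEX;
const GLchar *arrName = "spriteArr";
glGetUniformIndices(programID, 1, &arrName, &arrIndex);

GLint arrStride = 0;
glGetActiveUniformsiv(programID, 1, &arrIndex, GL_UNIFORM_ARRAY_STRIDE, &arrStride);

// Element i then lives at offsets[2] + i * arrStride,
// not at offsets[2] + i * sizeof(GLuint).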
Really though, there's pretty much no reason not to use std140 layout. You can avoid all of this querying of offsets and simply design C++ structs that can be directly consumed by GLSL.
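As a sketch of that approach (the struct name and padding fields are mine, and it assumes the block is switched to layout(std140)): under std140 a uint array has a 16-byte element stride, so a matching C++ layout would look roughly like this.

#include <cstdint>

// Hypothetical CPU-side mirror of SpriteMatchData under std140.
struct SpriteMatchDataStd140
{
    std::uint32_t width;       // offset 0
    std::uint32_t height;      // offset 4
    std::uint32_t pad0[2];     // offsets 8..15: the array must start on a 16-byte boundary
    struct Element
    {
        std::uint32_t value;
        std::uint32_t pad1[3]; // std140 array stride for a uint array is 16 bytes
    } spriteArr[576];
};
static_assert(sizeof(SpriteMatchDataStd140) == 16 + 576 * 16, "std140 layout mismatch");

If the wasted space bothers you, a common alternative is to declare the array as uvec4 spriteArr[144] in GLSL and index it as spriteArr[i / 4][i % 4], which keeps the data tightly packed on both sides.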
I'm trying to visualize a very large point cloud (700 million points), and on the glDrawArrays call the debugger throws an access violation writing location exception. I'm using the same code to render smaller clouds (100 million points) and everything works fine. I also have enough RAM (32 GB) to store the data.
To store the point cloud I'm using std::vector<Point3D<float>>, where Point3D is:
template <class T>
union Point3D
{
    T data[3];
    struct
    {
        T x;
        T y;
        T z;
    };
};
Vertex array and buffer initialization:
glBindVertexArray(pxCloudHeader.uiVBA);
glBindBuffer(GL_ARRAY_BUFFER, pxCloudHeader.xVBOs.uiVBO_XYZ);
glBufferData(GL_ARRAY_BUFFER, pxCloudHeader.iPointsCount * sizeof(GLfloat) * 3, &p3DfXYZ->data[0], GL_STREAM_DRAW);
glVertexAttribPointer((GLuint)0, 3, GL_FLOAT, GL_FALSE, 0, 0);
glEnableVertexAttribArray(0);
glBindVertexArray(0);
Drawing call:
glBindVertexArray(pxCloudHeader.uiVBA);
glDrawArrays(GL_POINTS, 0, pxCloudHeader.iPointsCount); // here exception is thrown
glBindVertexArray(0);
I also checked if there was OpenGL error thrown but I haven't found any.
I suspect your problem is due to the size of GLsizeiptr.
This is the data type used to represent sizes in OpenGL buffer objects, and on a 32-bit build it is only 32 bits wide.
700 million vertices * 4-bytes per-component * 3-components = 8,400,000,000 bytes
There is a serious issue with trying to allocate that many bytes in GL if it is using 32-bit pointers:
8400000000 & 0xFFFFFFFF = 4,105,032,704 (half as many bytes as you actually need)
If sizeof (GLsizeiptr) on your implementation is 4 then you will have no choice but to split your array up. A 32-bit GLsizeiptr only allows you to store 4 contiguous GiB of memory, but you can work around this if you use 3 single-component arrays instead. Using a vertex shader you can reconstruct these 3 separate (small enough) arrays like so:
#version 330
layout (location = 0) in float x; // Vertex Attrib Ptr. 0
layout (location = 1) in float y; // Vertex Attrib Ptr. 1
layout (location = 2) in float z; // Vertex Attrib Ptr. 2
void main (void)
{
gl_Position = vec4 (x,y,z,1.0);
}
Performance is going to be awful, but that is one way to approach the problem with minimal effort.
By the way, the amount of system memory here (32 GiB) is not your biggest issue. You should be thinking in terms of the amount of VRAM on your GPU, because ideally buffer objects are designed to be stored on the GPU. Any part of a buffer object that is too large to be stored in GPU memory will have to be transferred over the bus (PCIe these days) when it is used.
You could draw the data in smaller batches. While there is no predefined upper limit for the size of a buffer, storing 8 GBytes of data in a single buffer is a lot. I'm not very surprised that something would blow up.
I would probably start with storing something like 1 million, or at most a few million, points in each buffer. Then use a pool of buffers with this fixed size, enough to accommodate all your data points.
This might even be beneficial for your performance, because it allows you to start submitting draw calls before copying all your data into buffers. This will give you better overlap between CPU and GPU work.
With the amount of data you are shuffling around, you may also want to look into using glMapBuffer()/glUnmapBuffer() instead of glBufferData(). This generally avoids one copy operation for the data.
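A rough sketch of that batching scheme, using the Point3D type from the question (the chunk size, variable names and usage hints are arbitrary choices of mine; needs <vector> and <algorithm>):

// points: the std::vector<Point3D<float>> holding the whole cloud.
const size_t kPointsPerChunk = 1000000; // ~12 MB per buffer at 3 floats per point

std::vector<GLuint>  chunkVBOs;
std::vector<GLsizei> chunkCounts;

for (size_t first = 0; first < points.size(); first += kPointsPerChunk)
{
    const size_t count = std::min(kPointsPerChunk, points.size() - first);

    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, count * 3 * sizeof(GLfloat), &points[first], GL_STATIC_DRAW);

    chunkVBOs.push_back(vbo);
    chunkCounts.push_back(static_cast<GLsizei>(count));
}

// Drawing: one glDrawArrays per chunk, each comfortably below any 32-bit size limit.
for (size_t i = 0; i < chunkVBOs.size(); ++i)
{
    glBindBuffer(GL_ARRAY_BUFFER, chunkVBOs[i]);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glEnableVertexAttribArray(0);
    glDrawArrays(GL_POINTS, 0, chunkCounts[i]);
}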
My vertex shader is:
uniform Block1
{
    vec4 offset_x1;
    vec4 offset_x2;
} block1;
out float value;
in vec4 position;
void main()
{
    value = block1.offset_x1.x + block1.offset_x2.x;
    gl_Position = position;
}
The code I am using to pass values is:
GLfloat color_values[8];// contains valid values
glGenBuffers(1,&buffer_object);
glBindBuffer(GL_UNIFORM_BUFFER,buffer_object);
glBufferData(GL_UNIFORM_BUFFER,sizeof(color_values),color_values,GL_STATIC_DRAW);
glUniformBlockBinding(psId,blockIndex,0);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,0,16);
glBindBufferRange(GL_UNIFORM_BUFFER,0,buffer_object,16,16);
What I am expecting here is to pass 16 bytes for each vec4 uniform. I get a GL_INVALID_VALUE error for offset = 16, size = 16.
I am confused about the offset value. The spec says it is an offset into "buffer_object".
There is an alignment restriction for UBOs when binding. Any glBindBufferRange/Base's offset must be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. This alignment could be anything, so you have to query it before building your array of uniform buffers. That means you can't do it directly in compile-time C++ logic; it has to be runtime logic.
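For example (a sketch; the desiredOffset variable is mine):

GLint uboAlignment = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlignment);

// Round a desired offset up to the next multiple of the alignment
// before handing it to glBindBufferRange.
GLintptr desiredOffset = 16;
GLintptr alignedOffset = ((desiredOffset + uboAlignment - 1) / uboAlignment) * uboAlignment;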
Speaking of querying things at runtime, your code is horribly broken in many other ways. You did not define a layout qualifier for your uniform block; therefore, the default is used: shared. And you cannot use the shared layout without querying the layout of each block's members from OpenGL. Ever.
If you had done a query, you would have quickly discovered that your uniform block is at least 32 bytes in size, not 16. And since you only provided 16 bytes in your range, undefined behavior (which includes the possibility of program termination) results.
If you want to be able to define C/C++ objects that map exactly to the uniform block definition, you need to use std140 layout and follow the rules of std140's layout in your C/C++ object.
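As a sketch (the struct name is mine): with layout(std140), the block from the question is just two vec4s, 32 bytes total, and maps directly onto a plain C++ struct.

// Hypothetical CPU-side mirror of:
//   layout(std140) uniform Block1 { vec4 offset_x1; vec4 offset_x2; } block1;
struct Block1Std140
{
    float offset_x1[4]; // bytes 0..15
    float offset_x2[4]; // bytes 16..31
};
static_assert(sizeof(Block1Std140) == 32, "std140 size mismatch");

You would then bind the whole 32-byte range at once (glBindBufferRange with offset 0 and size 32, or simply glBindBufferBase), instead of two 16-byte ranges.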