OpenGL compute shader normal map generation poor performance - C++

I have a height cube map and I want to generate a normal cube map texture from it. My height cube map is just a 2048x2048 image that I load at application start-up for each face of the cube, and I can modify a "maximum height" value in real time, which is used as a multiplier when retrieving a pixel from the height map.
Initially I was calculating the normals in the vertex shader, but that gave me bad lighting results, so I moved the calculations into the fragment shader.
As the height map does not change every frame (only when I modify the "maximum height" value), I want to generate a normal map texture from it using a compute shader, because I don't need any rasterization, but it gives me very poor performance.
With the fragment shader I ran at 200 FPS, but with the compute shader I run at 40 FPS.
Here is how I bind my images and start the compute work:
_computeShaderProgram.use();
glUniform1f(_computeShaderProgram.getUniformLocation("maxHeight"), maxHeight);

glBindImageTexture(
    0,
    static_cast<GLuint>(heightMap),
    0,
    GL_TRUE,
    0,
    GL_READ_ONLY,
    GL_RGBA32F
);
glBindImageTexture(
    1,
    static_cast<GLuint>(normalMap),
    0,
    GL_TRUE,
    0,
    GL_WRITE_ONLY,
    GL_RGBA32F
);

// Start compute work
// I only compute for one face of the cube map
glDispatchCompute(normalMap.getWidth() / 16, normalMap.getWidth() / 16, 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
And the compute shader:
#version 430 core
#extension GL_ARB_compute_shader : enable

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

layout(rgba32f, binding = 0) readonly uniform imageCube heightMap;
layout(rgba32f, binding = 1) writeonly uniform imageCube normalMap;

uniform float maxHeight;

float getHeight(ivec3 heightMapCoord) {
    vec4 heightMapValue = imageLoad(heightMap, heightMapCoord);
    return heightMapValue.r * maxHeight;
}

void main() {
    ivec3 texCoord = ivec3(gl_GlobalInvocationID);

    // Calculate heights of the four neighbours
    float leftCubePosHeight   = getHeight(texCoord + ivec3(-1, 0, 0));
    float rightCubePosHeight  = getHeight(texCoord + ivec3(1, 0, 0));
    float topCubePosHeight    = getHeight(texCoord + ivec3(0, -1, 0));
    float bottomCubePosHeight = getHeight(texCoord + ivec3(0, 1, 0));

    // Calculate the normal using central differences
    vec3 horizontal = vec3(2.0, rightCubePosHeight - leftCubePosHeight, 0.0);
    vec3 vertical   = vec3(0.0, bottomCubePosHeight - topCubePosHeight, 2.0);
    vec3 normal     = normalize(cross(vertical, horizontal));

    imageStore(normalMap, texCoord, vec4(normal, 1.0));
}
I tried different work group counts (width, width / 8, width / 16, width / 32) and local sizes (1, 8, 16, 32), but the performance is always poor: around 40 FPS, or 20 FPS with a work group count equal to the full width.
I know I can use shared memory so that threads in the same work group don't fetch the same texture coordinate 4 times (see the sketch at the end of this question), but later the height map will be generated procedurally and will, I think, be larger than 2048x2048.
What is the difference between the fragment shader and the compute shader that makes it so slow? Am I doing something wrong?
Are there any other solutions to generate this normal map?
EDIT:
The FPS figures I gave above are not right, because I was only generating 1/16 of the normal map (when I got 40 FPS). I also used the central differences technique to calculate the normals, which is cheap but does not give good lighting results, so I switched to the Sobel technique, which is a little more expensive.
I made some tests to find out which technique gives the best performance.
Each frame I regenerate the normal map (this will not be the case later; it is just to test performance). Here are my results:
CPU side, single thread: 1.5 FPS
Compute shader with a local size of 1 and one work group per image pixel: 4 FPS
Compute shader with a local size of 16 and one work group per 16x16 pixel block: 11 FPS
Fragment shader using a framebuffer and MRT with 6 color attachments (one for each face of the normal map): 12.5 FPS
This is a little laggy when I modify the max height (which regenerates the normal map), but I think it's okay, as I won't modify it often.
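For reference, the shared-memory variant I mentioned would look roughly like this (an untested sketch: central differences as above, one 16x16 tile per work group, with a one-texel border cached in shared memory):

#version 430 core

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

layout(rgba32f, binding = 0) readonly uniform imageCube heightMap;
layout(rgba32f, binding = 1) writeonly uniform imageCube normalMap;

uniform float maxHeight;

// 16x16 tile plus a one-texel border on each side
shared float tile[18][18];

float fetchHeight(ivec3 coord) {
    return imageLoad(heightMap, coord).r * maxHeight;
}

void main() {
    ivec3 texCoord = ivec3(gl_GlobalInvocationID);
    ivec2 l = ivec2(gl_LocalInvocationID.xy) + 1;

    // Every invocation loads its own texel; edge invocations also load the border.
    tile[l.y][l.x] = fetchHeight(texCoord);
    if (l.x == 1)  tile[l.y][0]  = fetchHeight(texCoord + ivec3(-1, 0, 0));
    if (l.x == 16) tile[l.y][17] = fetchHeight(texCoord + ivec3(1, 0, 0));
    if (l.y == 1)  tile[0][l.x]  = fetchHeight(texCoord + ivec3(0, -1, 0));
    if (l.y == 16) tile[17][l.x] = fetchHeight(texCoord + ivec3(0, 1, 0));

    // Make the tile visible to the whole work group.
    memoryBarrierShared();
    barrier();

    vec3 horizontal = vec3(2.0, tile[l.y][l.x + 1] - tile[l.y][l.x - 1], 0.0);
    vec3 vertical   = vec3(0.0, tile[l.y + 1][l.x] - tile[l.y - 1][l.x], 2.0);
    imageStore(normalMap, texCoord, vec4(normalize(cross(vertical, horizontal)), 1.0));
}

This cuts the image loads from 4 per pixel to about 1.25 per pixel (256 own loads plus 64 border loads per 16x16 tile), though whether that wins anything in practice depends on the driver.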

Related

How to convert large arrays of quad primitives to triangle primitives?

I have an existing system, which provides 3D meshes. The provided data are an array of vertex coordinates with 3 components (x, y, z) and an index list.
The issue is that the index list is a consecutive array of quad primitives.
The system has to be made runnable with a core profile OpenGL context first, and later with OpenGL ES 3.x too.
I know that all the quads have the same winding order (counter-clockwise), but I have no further information about them. I don't know anything about their relations or adjacencies.
Since I want to use a core profile context for rendering, I cannot use the GL_QUADS primitive type. I have to convert the quads to triangles.
Of course the array of quad indices can easily be converted to an array of triangle indices:
std::vector<unsigned int> triangles;
triangles.reserve( no_of_indices * 6 / 4 );
for ( int i = 0; i < no_of_indices; i += 4 )
{
    int tri[] = { quad[i], quad[i+1], quad[i+2], quad[i], quad[i+2], quad[i+3] };
    triangles.insert( triangles.end(), tri, tri+6 );
}
If that has to be done only once, then that would be the solution. But the mesh data are not static. The data can change dynamically.
The data do not change continuously or every frame, but they do change unpredictably and randomly.
Another simple solution would be to create a vertex array object which directly refers to an element array buffer with the quads, and draw them in a loop with the GL_TRIANGLE_FAN primitive type:
for ( int i = 0; i < no_of_indices; i += 4 )
    glDrawElements( GL_TRIANGLE_FAN, 4, GL_UNSIGNED_INT, (void*)(sizeof(unsigned int) * i) );
But I hope there is a better solution. I'm searching for a possibility to draw the quads with one single draw call, or to transform the quads to triangles on the GPU.
If that has to be done only once, then that would be the solution. But the mesh data are not static.
The mesh data may be dynamic, but the topology of that list is the same. Every 4 vertices is a quad, so every 4 vertices represents the triangles (0, 1, 2) and (0, 2, 3).
So you can build an arbitrarily large static index buffer containing an ever increasing series of these numbers (0, 1, 2, 0, 2, 3, 4, 5, 6, 4, 6, 7, etc.). You can even use baseVertex rendering to offset them, rendering different series of quads with the same index buffer.
My suggestion would be to make the index buffer use GLushort as the index type. This way, your index data only takes up 12 bytes per quad. Using shorts limits you to 16384 quads in a single drawing command, but you can reuse the same index buffer to draw multiple series of quads with baseVertex rendering:
constexpr GLushort batchSize = 16384;
constexpr unsigned int vertsPerQuad = 6;

void drawQuads(GLuint quadCount)
{
    // Assume VAO is set up.
    int baseVertex = 0;
    while(quadCount > batchSize)
    {
        glDrawElementsBaseVertex(GL_TRIANGLES, batchSize * vertsPerQuad, GL_UNSIGNED_SHORT, 0, baseVertex * 4);
        baseVertex += batchSize;
        quadCount -= batchSize;
    }
    glDrawElementsBaseVertex(GL_TRIANGLES, quadCount * vertsPerQuad, GL_UNSIGNED_SHORT, 0, baseVertex * 4);
}
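For completeness, filling that static index buffer could look like this (a sketch; the function name is mine):

#include <vector>
#include <cstdint>

// Builds the repeating pattern 0,1,2, 0,2,3, 4,5,6, 4,6,7, ... for quadCount quads.
// uint16_t matches GL_UNSIGNED_SHORT / GLushort.
std::vector<uint16_t> buildQuadIndexBuffer(unsigned quadCount) // quadCount <= 16384
{
    std::vector<uint16_t> indices;
    indices.reserve(quadCount * 6);
    for (unsigned q = 0; q < quadCount; ++q)
    {
        const uint16_t b = static_cast<uint16_t>(q * 4);
        const uint16_t tri[] = { b, uint16_t(b + 1), uint16_t(b + 2),
                                 b, uint16_t(b + 2), uint16_t(b + 3) };
        indices.insert(indices.end(), tri, tri + 6);
    }
    return indices;
}

// Fill it once for the maximum batch size, upload with GL_STATIC_DRAW,
// and reuse it for every draw.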
If you want slightly less index data, you can use primitive restart indices. This allows you to designate an index to mean "restart the primitive". This allows you to use a GL_TRIANGLE_STRIP primitive and break the primitive up into pieces while still only having a single draw call. So instead of 6 indices per quad, you have 5, with the 5th being the restart index. So now your GLushort indices only take up 10 bytes per quad. However, the batchSize now must be 16383, since the index 0xFFFF is reserved for restarting. And vertsPerQuad must be 5.
Of course, baseVertex rendering works just fine with primitive restarting, so the above code works too.
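A sketch of the restart-index setup under those assumptions (0xFFFF reserved as the restart index, so at most 16383 quads per batch):

#include <vector>
#include <cstdint>

// 5 indices per quad: the strip (0, 1, 3, 2) plus the restart index.
std::vector<uint16_t> buildQuadStripIndexBuffer(unsigned quadCount) // quadCount <= 16383
{
    std::vector<uint16_t> indices;
    indices.reserve(quadCount * 5);
    for (unsigned q = 0; q < quadCount; ++q)
    {
        const uint16_t b = static_cast<uint16_t>(q * 4);
        const uint16_t strip[] = { b, uint16_t(b + 1), uint16_t(b + 3),
                                   uint16_t(b + 2), uint16_t(0xFFFF) };
        indices.insert(indices.end(), strip, strip + 5);
    }
    return indices;
}

// At draw time:
// glEnable(GL_PRIMITIVE_RESTART);
// glPrimitiveRestartIndex(0xFFFF);
// glDrawElementsBaseVertex(GL_TRIANGLE_STRIP, quadCount * 5, GL_UNSIGNED_SHORT, 0, baseVertex * 4);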
First, I want to mention that this is not a question I want to answer myself; I just want to present my current solution to this issue.
This means that I'm still looking for "the" solution, a perfectly acceptable one.
In my solution, I decided to use tessellation. I draw patches with a size of 4:
glPatchParameteri( GL_PATCH_VERTICES, 4 );
glDrawElements( GL_PATCHES, no_of_indices, GL_UNSIGNED_INT, 0 );
The Tessellation Control Shader can be omitted completely, because its default behavior is what we need here: the patch data is passed straight from the Vertex Shader invocations to the tessellation primitive generator.
The Tessellation Evaluation Shader uses a quadrilateral patch (quads) to create 2 triangles:
#version 450

layout(quads, ccw) in;

in TInOut
{
    vec3 pos;
} inData[];

out TInOut
{
    vec3 pos;
} outData;

uniform mat4 u_projectionMat44;

void main()
{
    // Map each tessellation coordinate (a corner of the quad) to the
    // matching patch vertex: (0,0)->0, (1,0)->1, (0,1)->3, (1,1)->2
    const int inx_map[4] = int[4](0, 1, 3, 2);
    float i_quad = dot( vec2(1.0, 2.0), gl_TessCoord.xy );
    int inx = inx_map[int(round(i_quad))];
    outData.pos = inData[inx].pos;
    gl_Position = u_projectionMat44 * vec4( outData.pos, 1.0 );
}
An alternative solution would be to use a Geometry Shader. The lines_adjacency input primitive type provides 4 vertices, which can be mapped to 2 triangles (triangle_strip). Of course this seems to be a hack, since a lines adjacency is something completely different from a quad, but it works anyway.
glDrawElements( GL_LINES_ADJACENCY, no_of_indices, GL_UNSIGNED_INT, 0 );
Geometry Shader:
#version 450

layout( lines_adjacency ) in;
layout( triangle_strip, max_vertices = 4 ) out;

in TInOut
{
    vec3 pos;
} inData[];

out TInOut
{
    vec3 pos;
} outData;

uniform mat4 u_projectionMat44;

void main()
{
    // Emit the 4 quad corners in triangle-strip order: 0, 1, 3, 2
    const int inx_map[4] = int[4](0, 1, 3, 2);
    for ( int i = 0; i < 4; ++i )
    {
        outData.pos = inData[inx_map[i]].pos;
        gl_Position = u_projectionMat44 * vec4( outData.pos, 1.0 );
        EmitVertex();
    }
    EndPrimitive();
}
An improvement would be to use Transform Feedback to capture new buffers containing triangle primitives.
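A rough sketch of that improvement, assuming the geometry shader variant above (the buffer names are mine; note that the captured varying is addressed by its block name):

// One-time setup: declare which output to capture, then re-link.
const char* varyings[] = { "TInOut.pos" };
glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);

// Capture pass: expand the quads once, without rasterizing anything.
glEnable(GL_RASTERIZER_DISCARD);
// capturedVbo must be sized for no_of_indices / 4 * 6 vec3 vertices.
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, capturedVbo);
glBeginTransformFeedback(GL_TRIANGLES);
glDrawElements(GL_LINES_ADJACENCY, no_of_indices, GL_UNSIGNED_INT, 0);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);

// On later frames, draw the captured triangles directly from capturedVbo
// until the mesh data changes again.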

Can an OpenGL shader do a mix of nearest and linear scaling?

I'm porting some old OpenGL 1.2 bitmap font rendering code to modern OpenGL (at least OpenGL 3.2+), and I'm wondering if I can use a GLSL shader to achieve what I've been doing manually.
When I want to draw the string "123", scaled to a particular size, I perform the following steps with the sprites below.
I draw the sprite to the screen, scaled 2x with GL_NEAREST. However, to get a black outline, I actually draw the sprite several times:
x + 1, y + 0, BLACK
x + 0, y + 1, BLACK
x - 1, y + 0, BLACK
x + 0, y - 1, BLACK
x + 0, y + 0, COLOR (RED)
After the sprites have been drawn to the screen, I copy the screen to a texture, via glCopyTexSubImage2D.
I draw that texture back to the screen, but with GL_LINEAR.
The end result is a more visually appealing form of scaling pixel sprites. When upscaling small pixel sprites to arbitrary dimensions, using just GL_NEAREST (bottom-right) or just GL_LINEAR (bottom-left) gives an effect I don't like. Pixel doubling with GL_NEAREST, followed by the remaining scaling with GL_LINEAR, gives the result I prefer (top).
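In the old pipeline, that sequence amounts to roughly the following (a sketch; drawSprite and drawScreenQuad are hypothetical helpers standing in for my actual drawing code):

// Pass 1: pixel-double the sprite with GL_NEAREST, outline first.
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
drawSprite(x + 1, y + 0, 2 /* scale */, BLACK);
drawSprite(x + 0, y + 1, 2, BLACK);
drawSprite(x - 1, y + 0, 2, BLACK);
drawSprite(x + 0, y - 1, 2, BLACK);
drawSprite(x + 0, y + 0, 2, RED);

// Pass 2: copy the framebuffer, then redraw it at the final size with GL_LINEAR.
glBindTexture(GL_TEXTURE_2D, screenCopyTex);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, copyWidth, copyHeight);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
drawScreenQuad(screenCopyTex, finalWidth, finalHeight);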
I'm pretty sure GLSL can do the black outline (thus saving me from having to do lots of draws), but could it also do the combination of GL_NEAREST and GL_LINEAR scaling?
You could achieve the effect of "2x nearest-neighbour upscaling followed by linear sampling" by pretending to sample a 4-texel neighbourhood from the upscaled texture while in reality sampling them from the original one. Then you'll have to implement bilinear interpolation manually. If you were targeting OpenGL 4+, textureGather() would be useful, though do keep this issue in mind. In my proposed solution below, I'll be using 4 texelFetch() calls, rather than textureGather(), as textureGather() would complicate things quite a bit.
Suppose you have an unscaled texture with the black borders around the glyphs already present. Let's assume you have a normalized texture coordinate vec2 pn = ... into that texture, where pn.x and pn.y are between 0 and 1. The following code should achieve the desired effect, though I haven't tested it:
ivec2 origTexSize = textureSize(sampler, 0);
int upscaleFactor = 2;
// Floating point texel coordinate into the upscaled texture.
vec2 ptu = pn * vec2(origTexSize * upscaleFactor);
// Decompose "ptu - 0.5" into its integer and fractional parts. Note that
// modf() returns the fractional part and writes the integer part to its
// second argument.
vec2 ptui;
vec2 ptuf = modf(ptu - 0.5, ptui);
// Integer texel coordinates into the upscaled texture.
ivec2 ptu00 = ivec2(ptui);
ivec2 ptu01 = ptu00 + ivec2(0, 1);
ivec2 ptu10 = ptu00 + ivec2(1, 0);
ivec2 ptu11 = ptu00 + ivec2(1, 1);
// Integer texel coordinates into the original texture.
ivec2 pt00 = clamp(ptu00 / upscaleFactor, ivec2(0), origTexSize - 1);
ivec2 pt01 = clamp(ptu01 / upscaleFactor, ivec2(0), origTexSize - 1);
ivec2 pt10 = clamp(ptu10 / upscaleFactor, ivec2(0), origTexSize - 1);
ivec2 pt11 = clamp(ptu11 / upscaleFactor, ivec2(0), origTexSize - 1);
// Sampled colours.
vec4 clr00 = texelFetch(sampler, pt00, 0);
vec4 clr01 = texelFetch(sampler, pt01, 0);
vec4 clr10 = texelFetch(sampler, pt10, 0);
vec4 clr11 = texelFetch(sampler, pt11, 0);
// Bilinear interpolation.
vec4 clr0x = mix(clr00, clr01, ptuf.y);
vec4 clr1x = mix(clr10, clr11, ptuf.y);
vec4 clrFinal = mix(clr0x, clr1x, ptuf.x);
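For context, wired into a complete fragment shader it could look like this (a sketch; spriteTex and the in/out names are my assumptions, and the fetch logic is condensed into a loop):

#version 330 core

uniform sampler2D spriteTex;  // the unscaled sprite texture
in vec2 vTexCoord;            // the normalized coordinate called "pn" above
out vec4 fragColor;

const int upscaleFactor = 2;

void main()
{
    ivec2 origTexSize = textureSize(spriteTex, 0);
    vec2 ptu = vTexCoord * vec2(origTexSize * upscaleFactor);

    // Integer and fractional parts of the upscaled texel coordinate.
    vec2 ptui;
    vec2 ptuf = modf(ptu - 0.5, ptui);
    ivec2 p00 = ivec2(ptui);

    // Fetch the 2x2 neighbourhood, mapping upscaled texels back to
    // original ones by integer division.
    vec4 clr[4];
    for (int i = 0; i < 4; ++i)
    {
        ivec2 pu = p00 + ivec2(i / 2, i % 2);  // (0,0), (0,1), (1,0), (1,1)
        ivec2 pt = clamp(pu / upscaleFactor, ivec2(0), origTexSize - 1);
        clr[i] = texelFetch(spriteTex, pt, 0);
    }

    // Manual bilinear interpolation.
    fragColor = mix(mix(clr[0], clr[1], ptuf.y),
                    mix(clr[2], clr[3], ptuf.y), ptuf.x);
}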

Mipmapping with compute shader

I have 3D textures of colors, normals, and other data for my voxelized scene, and because some of this data can't just be averaged, I need to calculate the mip levels on my own. The 3D texture size is (128+64) x 128 x 128; the additional 64 x 128 x 128 region is for the mip levels.
So when I take the base level, which is at (0, 0, 0) with a size of 128 x 128 x 128, and just copy voxels to the second level, which is at (128, 0, 0), the data appears there. But as soon as I copy the second level at (128, 0, 0) to the third at (128, 0, 64), the data doesn't appear at the third level.
Shader code:

#version 450 core

layout (local_size_x = 1,
        local_size_y = 1,
        local_size_z = 1) in;

layout (location = 0) uniform uint resolution;
layout (binding = 0, rgba32f) uniform image3D voxel_texture;

void main()
{
    ivec3 index = ivec3(gl_WorkGroupID);
    ivec3 spread_index = index * 2;

    // Level 0 -> level 1: this works
    vec4 voxel = imageLoad(voxel_texture, spread_index);
    imageStore(voxel_texture, index + ivec3(resolution, 0, 0), voxel);

    // Level 1 -> level 2: this isn't working
    voxel = imageLoad(voxel_texture, spread_index + ivec3(resolution, 0, 0));
    imageStore(voxel_texture, index + ivec3(resolution, 0, 64), voxel);
}
The shader program is dispatched with:

glUniform1ui(0, OCTREE_RES);
glBindImageTexture(0, voxel_textures[0], 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA32F);
glDispatchCompute(64, 64, 64);
I don't know if I missed some basic thing; this is my first compute shader. I also tried using memory barriers, but it didn't change a thing.
Well, you can't expect your second imageLoad to read texels that you just wrote with your first imageStore like that.
There is no way to synchronize image access outside of the 'local' work group within a single dispatch.
You'll need either:
to dispatch your kernel once per mip level, or
to rewrite your shader logic so that you always fetch from the 'original' zone (the base level).
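A sketch of the first option, splitting the copy into one dispatch per level with explicit offsets (the uniform layout and names are my assumptions, untested):

#version 450 core

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

// Offsets of the source and destination mip regions inside the texture.
layout (location = 0) uniform ivec3 src_offset;
layout (location = 1) uniform ivec3 dst_offset;
layout (binding = 0, rgba32f) uniform image3D voxel_texture;

void main()
{
    ivec3 index = ivec3(gl_WorkGroupID);
    vec4 voxel = imageLoad(voxel_texture, src_offset + index * 2);
    imageStore(voxel_texture, dst_offset + index, voxel);
}

Host side, with a barrier between the dispatches so the second one sees the first one's writes:

glUniform3i(0, 0, 0, 0);      // src: level 0 at (0, 0, 0)
glUniform3i(1, 128, 0, 0);    // dst: level 1 at (128, 0, 0)
glDispatchCompute(64, 64, 64);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

glUniform3i(0, 128, 0, 0);    // src: level 1
glUniform3i(1, 128, 0, 64);   // dst: level 2 at (128, 0, 64)
glDispatchCompute(32, 32, 32);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);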

Wrapping texture co-ordinates on a variable-size quad?

Here's my situation: I need to draw a rectangle on the screen for my game's GUI. I don't really care how big this rectangle is or might be; I want to be able to handle any situation. How I'm doing it right now: I store a single VAO containing only a very basic quad, then I re-draw this quad using uniforms to modify its size and position on the screen each time.
The VAO contains 4 vec4 vertices:
0, 0, 0, 0;
1, 0, 1, 0;
0, 1, 0, 1;
1, 1, 1, 1;
And then I draw it as a GL_TRIANGLE_STRIP. The XY of each vertex is its position, and the ZW is its texture co-ordinates*. I pass the rect for the GUI element I'm currently drawing as a uniform vec4, which offsets the vertex positions in the vertex shader like so:
vertex.xy *= guiRect.zw;
vertex.xy += guiRect.xy;
And then I convert the vertex from screen pixel co-ordinates into OpenGL NDC co-ordinates:
gl_Position = vec4(((vertex.xy / screenSize) * 2) -1, 0, 1);
This changes the range from [0, screenWidth | screenHeight] to [-1, 1].
My problem comes in when I want to do texture wrapping. Simply passing vTexCoord = vertex.zw; is fine when I want to stretch a texture, but not for wrapping. Ideally, I want to modify the texture co-ordinates such that 1 pixel on the screen is equal to 1 texel in the GUI texture. Texture co-ordinates going beyond [0, 1] are fine at this stage, and are in fact exactly what I'm looking for.
I plan to implement texture atlases for my GUI textures, but managing the offsets and bounds of the appropriate sub-texture will be handled in the fragment shader; as far as the vertex shader is concerned, our quad is using one solid texture with [0, 1] co-ordinates, and wrapping accordingly.
*Note: I'm aware that this particular vertex format isn't necessarily useful for this particular case; I could be using vec2 vertices instead. For the sake of convenience I'm using the same vertex format for all of my 2D rendering, and other objects (e.g. text) actually do need those ZW components. I might change this in the future.
TL/DR: Given the size of the screen, the size of a texture, and the location/size of a quad, how do you calculate texture co-ordinates in a vertex shader such that pixels and texels have a 1:1 correspondence, with wrapping?
That is really very easy math: you just need to relate the two spaces in some way. And you have already formulated a rule which allows you to do so: a window-space pixel is to map to a texel.
Let's assume we have both vec2 screenSize and vec2 texSize which are the unnormalized dimensions in pixels/texels.
I'm not 100% sure what exactly you want to achieve. There is something missing: you actually did not specify where the origin of the texture shall lie. Should it always be put at the bottom-left corner of the quad? Or should it just be globally at the bottom-left corner of the viewport? I'll assume the latter here, but it should be easy to adjust this for the first case.
What we now need is a mapping between the [-1,1]^2 NDC in x and y and s and t. Let's first map it to [0,1]^2. If we have that, we can simply multiply the coords by screenSize/texSize to get the desired effect. So in the end, you get:
vec2 texcoords = ((gl_Position.xy * 0.5) + 0.5) * screenSize/texSize;
You have of course already calculated ((gl_Position.xy * 0.5) + 0.5) * screenSize implicitly, so this can be simplified to:
vec2 texcoords = vertex.xy / texSize;
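Putting this together with the quad setup from the question, the full vertex shader could look like this (a sketch, untested; aVertex and the uniform names are my assumptions):

#version 330 core

in vec4 aVertex;          // xy = unit-quad corner (0..1), zw = base texture coords
uniform vec4 guiRect;     // xy = position in pixels, zw = size in pixels
uniform vec2 screenSize;  // viewport size in pixels
uniform vec2 texSize;     // texture size in texels

out vec2 vTexCoord;

void main()
{
    // Scale/offset the unit quad into window-pixel space.
    vec2 pixelPos = aVertex.xy * guiRect.zw + guiRect.xy;
    // Window pixels -> NDC.
    gl_Position = vec4((pixelPos / screenSize) * 2.0 - 1.0, 0.0, 1.0);
    // 1:1 pixel-to-texel wrapping, texture origin fixed to the viewport corner.
    vTexCoord = pixelPos / texSize;
}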

OpenGL pixel-perfect 2D drawing

I'm working on a 2d engine. It already works quite good, but I keep getting pixel-errors.
For example, my window is 960x540 pixels and I draw a line from (0, 0) to (959, 0). I would expect every pixel on scan line 0 to be set to a color, but no: the right-most pixel is not drawn. The same problem occurs when I draw vertically to pixel 539; I really need to draw to (960, 0) or (0, 540) to have those pixels drawn.
As I was born in the pixel era, I am convinced that this is not the correct result. When my screen was 320x200 pixels big, I could draw from 0 to 319 and from 0 to 199, and my screen would be full. Now I end up with a screen whose right/bottom pixel is not drawn.
This can be due to different things:
Where I expect the OpenGL line primitive to be drawn from one pixel to another pixel inclusive, is that last pixel actually exclusive? Is that it?
Is my projection matrix incorrect?
Am I under the false assumption that a backbuffer of 960x540 actually has one pixel more?
Something else?
Can someone please help me? I have been looking into this problem for a long time now, and every time I thought it was OK, I saw after a while that it actually wasn't.
Here is some of my code; I tried to strip it down as much as possible. When I call my line function, each coordinate has 0.375 added to it to make it render correctly on both ATI and NVIDIA adapters.
int width = resX();
int height = resY();

for (int i = 0; i < height; i += 2)
    rm->line(0, i, width - 1, i, vec4f(1, 0, 0, 1));
for (int i = 1; i < height; i += 2)
    rm->line(0, i, width - 1, i, vec4f(0, 1, 0, 1));
// when I do this, one pixel to the right remains undrawn

void rendermachine::line(int x1, int y1, int x2, int y2, const vec4f &color)
{
    // ... some code to decide which std::vector the coordinates should be pushed into

    // m_z is a z-coordinate; I use z-buffering to preserve correct drawing order
    // vec2f(0, 0) is a texture coordinate; the line is drawn without texturing
    target->push_back(vertex(vec3f((float)x1 + 0.375f, (float)y1 + 0.375f, m_z), color, vec2f(0, 0)));
    target->push_back(vertex(vec3f((float)x2 + 0.375f, (float)y2 + 0.375f, m_z), color, vec2f(0, 0)));
}

void rendermachine::update(...)
{
    // ... the render target object is queried for width and height; in my test it is just the
    // back buffer, so the window client resolution is returned

    mat4f mP;
    mP.setOrthographic(0, (float)width, (float)height, 0, 0, 8000000);

    // ... all vertices are copied to video memory
    // ... drawing
    if (there are lines to draw)
        glDrawArrays(GL_LINES, (int)offset, (int)lines.size());
    ...
}
// And the (very simple) shaders to draw these lines

// Vertex shader
#version 120

attribute vec3 aVertexPosition;
attribute vec4 aVertexColor;

uniform mat4 mP;

varying vec4 vColor;

void main(void) {
    gl_Position = mP * vec4(aVertexPosition, 1.0);
    vColor = aVertexColor;
}

// Fragment shader
#version 120

#ifdef GL_ES
precision highp float;
#endif

varying vec4 vColor;

void main(void) {
    gl_FragColor = vColor;
}
In OpenGL, lines are rasterized using the "Diamond Exit" rule. This is almost the same as saying that the end coordinate is exclusive, but not quite...
This is what the OpenGL spec has to say:
http://www.opengl.org/documentation/specs/version1.1/glspec1.1/node47.html
Also have a look at the OpenGL FAQ, http://www.opengl.org/archives/resources/faq/technical/rasterization.htm, item "14.090 How do I obtain exact pixelization of lines?". It says "The OpenGL specification allows for a wide range of line rendering hardware, so exact pixelization may not be possible at all."
Many will argue that you should not use lines in OpenGL at all. Their behaviour is based on how ancient SGI hardware worked, not on what makes sense. (And lines with widths >1 are nearly impossible to use in a way that looks good!)
Note that OpenGL coordinate space has no notion of integers; everything is a float, and the "centre" of an OpenGL pixel is really at 0.5,0.5 rather than its top-left corner. Therefore, if you want a 1px-wide line from 0,0 to 10,10 inclusive, you really have to draw a line from 0.5,0.5 to 10.5,10.5.
This will be especially apparent if you turn on anti-aliasing: if you try to draw from 50,0 to 50,100, you may see a blurry 2px-wide line, because the line falls in between two pixels.
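Applied to the question's code, that means addressing pixel centres instead of the legacy 0.375 trick; a sketch of the adjusted push_back calls inside rendermachine::line:

// Address pixel centres, per the half-pixel convention described above.
target->push_back(vertex(vec3f((float)x1 + 0.5f, (float)y1 + 0.5f, m_z), color, vec2f(0, 0)));
target->push_back(vertex(vec3f((float)x2 + 0.5f, (float)y2 + 0.5f, m_z), color, vec2f(0, 0)));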