I am using Nvidia Cg with Direct3D9 and have a question about the following code.
It compiles, but doesn't load (using the cgLoadProgram wrapper), and the resulting failure is reported simply as "D3D failure happened".
It's part of a pixel shader compiled with the shader model set to 3.0.
What may be interesting is that this shader loads fine in the following cases:
1) Manually unrolling the while statement (into many if { } statements).
2) Removing the line with the tex2D function in the loop.
3) Switching to shader model 2_X and manually unrolling the loop.
Problem part of the shader code:
float2 tex = float2(1, 1);
float2 dtex = float2(0.01, 0.01);

float h = 1.0 - tex2D(height_texture1, tex);
float height = 1.00;

while ( h < height )
{
    height -= 0.1;
    tex += dtex;

    // Remove the next line and it works (not as expected,
    // of course)
    h = tex2D( height_texture1, tex );
}
If someone knows why this can happen, or could test similar code in a non-Cg environment, or could help me in some other way, I'm waiting for you ;)
Thanks.
I think you need to determine the gradients before the loop, using ddx/ddy on the texture coordinates, and then use the gradient overload tex2D(sampler2D samp, float2 s, float2 dx, float2 dy) inside the loop.
The GPU always renders 2x2 pixel quads, not individual pixels (even on triangle borders; the superfluous pixels are discarded by the render backend). This is done so that it can always calculate the screen-space texture derivatives, even when you use calculated texture coordinates: it just takes the difference between the values at the neighboring pixel centers.
But this doesn't work with dynamic branching like in the code in the question, because the shader processors handling the individual pixels can diverge in control flow. So you need to calculate the derivatives manually via ddx/ddy before the program flow can diverge.
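A minimal sketch of that change against the loop from the question, assuming the gradient overload of tex2D is available in your profile (untested):

    float2 tex = float2(1, 1);
    float2 dtex = float2(0.01, 0.01);

    // take the screen-space derivatives while control flow is still uniform
    float2 dx = ddx(tex);
    float2 dy = ddy(tex);

    float h = 1.0 - tex2D(height_texture1, tex).r;
    float height = 1.00;

    while ( h < height )
    {
        height -= 0.1;
        tex += dtex;

        // explicit gradients: no implicit derivative is needed inside the divergent loop
        h = tex2D(height_texture1, tex, dx, dy).r;
    }

tex2Dlod with an explicit LOD of 0 would be another option if you don't care about mipmapping at all.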
So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is a three-layered technique, as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem:
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture, which is mapped to another framebuffer. The light data is stored in an array in a UBO, and the shader uses the occlusion map in its calculations. This is where the slowdown happens:
And lastly, the lightmap texture is multiplied and added to the regular world renderer, this also isn't affected by the number of lights, so it's not part of the problem:
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core

// per-light data
struct Light
{
    vec4 location;
    vec4 rangeAndstartColor;
};

const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference

layout(std140) uniform ubo_lights
{
    Light lights[MaxLightsCount];
};

uniform sampler2D occlusionSampler; // the occlusion texture sampler

in vec2 fs_tex0;        // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back

out vec4 color;

void main()
{
    vec3 resultColor = vec3(0.0);
    const vec2 size = fs_window_size;
    const vec2 pos = (size - vec2(1.0)) * fs_tex0;

    // process every light individually and add the resulting colors together
    // this should be branchless, is there any way to check?
    for(int idx = 0; idx < MaxLightsCount; ++idx)
    {
        const float range = lights[idx].rangeAndstartColor.x;
        const vec2 lightPosition = lights[idx].location.xy;
        const float dist = length(lightPosition - pos); // distance from current fragment to current light

        // early abort, the next part is expensive
        // this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
        if(dist > range)
            continue;

        const vec3 startColor = lights[idx].rangeAndstartColor.yzw;

        // walk between pos and lightPosition to find occlusions
        // standard line DDA algorithm
        vec2 tempPos = pos;
        int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
        const vec2 lineInc = (lightPosition - pos) / lineSteps;

        // can I get rid of this loop somehow? I need to check each position between
        // my fragment and the light position for occlusions, and this is the best I
        // came up with
        float lightStrength = 1.0;
        while(lineSteps --> 0)
        {
            const vec2 nextPos = tempPos + lineInc;
            const vec2 occlusionSamplerUV = tempPos / size;
            lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
            tempPos = nextPos;
        }

        // the contribution of this light to the fragment color is based on
        // its square distance from the light, and the occlusions between them
        // implemented as multiplications
        const float strength = max(0, range - dist) / range * lightStrength;
        resultColor += startColor * strength * strength;
    }

    color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is: is there a better way to look up (boolean) occlusions between my fragment position and the light position in the occlusion texture, without iterating over every pixel by hand? Could renderbuffers perhaps help here (from what I've read they're meant for reading data back to system memory, but I need the data in another shader)?
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple of routes for optimization:
Exact: apply a distance transform to the occlusion map: this gives you, at each pixel, the distance to the nearest occluder. After that you can safely step by that distance within the loop instead of taking baby steps, which drastically reduces the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU-side algorithms, which seem to be more complicated.
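A minimal GLSL sketch of the "step by the DT value" idea, assuming a hypothetical distanceSampler that stores, per texel, the distance in pixels to the nearest occluder (the function name and step limits are illustrative, not tested):

    uniform sampler2D distanceSampler; // hypothetical: distance transform of the occlusion map

    float traceOcclusion(vec2 from, vec2 to, vec2 size)
    {
        const float total = length(to - from);
        const vec2 dir = (to - from) / max(total, 0.0001);
        float t = 0.0;

        // jump by the distance to the nearest occluder instead of one texel at a time
        for (int i = 0; i < 64 && t < total; ++i)
        {
            const float d = texture(distanceSampler, (from + dir * t) / size).x;
            if (d < 0.5)
                return 0.0;    // an occluder is right here: fully shadowed
            t += max(d, 1.0);  // never advance by less than one texel
        }
        return 1.0;            // reached the light without hitting anything
    }

In the question's shader this would replace the inner while loop (lightStrength = traceOcclusion(pos, lightPosition, size)); note that it produces hard, binary shadows, since the DT only tells you whether the ray hit an occluder, not how much opacity it accumulated.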
Inexact: resort to soft shadows; it might be a compromise you are willing to make, and it can even be seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step size and sample coarser mip levels as you get farther from the point you are shading.
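A rough sketch of that variant, reusing the question's occlusionSampler and assuming mipmaps are generated for it; the step/LOD schedule here is arbitrary and purely illustrative:

    float traceOcclusionSoft(vec2 from, vec2 to, vec2 size)
    {
        const float total = length(to - from);
        const vec2 dir = (to - from) / max(total, 0.0001);

        float light = 1.0;
        float t = 1.0;
        float lod = 0.0;

        for (int i = 0; i < 32 && t < total; ++i)
        {
            // a coarser mip level averages several occlusion texels in one tap
            light *= 1.0 - textureLod(occlusionSampler, (from + dir * t) / size, lod).x;
            t += exp2(lod);  // step size grows together with the mip level
            lod += 0.5;      // widen the footprint as we move away from the fragment
        }
        return light;
    }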
You can go further and build an emitter map (into the same 4-channel texture as the occlusion). Then your entire shading pass becomes independent of the number of lights. This is the 2D equivalent of voxel cone tracing GI.
A quick summary:
I have a simple quadtree-based terrain rendering system that builds terrain patches, which then sample a heightmap in the vertex shader to determine the height of each vertex.
The exact same calculation is done on the CPU for object placement and so on.
Super straightforward, but after adding some systems to procedurally place objects I've discovered that they seem to be misplaced by just a small amount. To debug this I render a few crosses as single models over the terrain. The crosses (red, green, blue lines) represent the height read on the CPU, while the terrain mesh uses a shader to translate the vertices.
(I've also added a simple odd/even gap over each height value to rule out a simple offset issue. So those ugly cliffs are expected, the submerged crosses are the issue)
I'm explicitly using GL_NEAREST to be able to display the "raw" height value:
As you can see the crosses are sometimes submerged under the terrain instead of representing its exact height.
The heightmap is just a simple array of floats on the CPU and on the GPU.
How the data is stored
A simple vector<float> which is uploaded into a GL_RGB32F GL_FLOAT buffer. The floats are not normalized and my terrain usually contains values between -100 and 500.
How is the data accessed in the shader
I've tried a few things to rule out errors; the initial version:
vec2 terrain_heightmap_uv(vec2 position, Heightmap heightmap)
{
    return (position + heightmap.world_offset) / heightmap.size;
}

float terrain_read_height(vec2 position, Heightmap heightmap)
{
    return textureLod(heightmap.heightmap, terrain_heightmap_uv(position, heightmap), 0).r;
}
Basics of the vertex shader (the full shader code is very long, so I've extracted the part that actually reads the height):
void main()
{
    vec4 world_position = a_model * vec4(a_position, 1.0);
    vec4 final_position = world_position;

    // snap vertex to grid
    final_position.x = floor(world_position.x / a_quad_grid) * a_quad_grid;
    final_position.z = floor(world_position.z / a_quad_grid) * a_quad_grid;
    final_position.y = terrain_read_height(final_position.xz, heightmap);

    gl_Position = projection * view * final_position;
}
To rule out the slightly different way the position is determined, I also tested it using hardcoded values that match exactly how the C++ side reads the height:
return texelFetch(heightmap.heightmap, ivec2((position / 8) + vec2(1024, 1024)), 0).r;
Which gives the exact same result...
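For reference, with GL_NEAREST the two lookup paths should correspond roughly like this; the 2048 texture size, the /8 unit scale and the 1024 texel offset are only what the hardcoded line above implies, and the function names are purely illustrative:

    // normalized path: GL_NEAREST returns the texel whose cell the UV lands in,
    // i.e. texel index floor(uv * textureSize)
    float height_from_uv(vec2 position, sampler2D hm)
    {
        vec2 uv = (position / 8.0 + vec2(1024.0)) / 2048.0;
        return textureLod(hm, uv, 0.0).r;
    }

    // unnormalized path: the texel index is used directly; the float-to-int
    // conversion truncates toward zero rather than flooring
    float height_from_texel(vec2 position, sampler2D hm)
    {
        return texelFetch(hm, ivec2(position / 8.0 + vec2(1024.0)), 0).r;
    }

For positive, in-range positions the two rules agree, which matches the "exact same result" observed above.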
How is the data accessed in the application
In C++ the height is read like this:
inline float get_local_height_safe(uint32_t x, uint32_t y)
{
    // this macro simply clips x and y to the heightmap bounds
    // it does not interfere with the result
    BB_TERRAIN_HEIGHTMAP_BOUND_XY_TO_SAFE;

    uint32_t i = (y * _size1d) + x;
    return buffer->data[i];
}

inline float get_height_raw(glm::vec2 position)
{
    position = position + world_offset;
    uint32_t x = static_cast<int>(position.x);
    uint32_t y = static_cast<int>(position.y);
    return get_local_height_safe(x, y);
}

float BB::Terrain::get_height(const glm::vec3 position)
{
    return heightmap->get_height_raw({position.x / heightmap_unit_scale, position.z / heightmap_unit_scale});
}
What have I tried:
Comparing the Buffers
I've dumped the first few hundred values from the vector and compared them with the floating-point buffer uploaded to the GPU using Nvidia Nsight; they are equal, no rounding/precision errors there.
Sampling method
I've tried texture, textureLod and texelFetch to rule out an issue there; they all give me the same result.
Rounding
The super strange thing: when I round all the height values, they are perfectly aligned, which just screams floating-point precision issues.
Position snapping
I've tried rounding, flooring and ceiling the position, to ensure the position always maps to the same texel. I also tried adding an epsilon offset to rule out a positional precision error (probably stupid because the terrain is stable...)
Heightmap sizes
I've tried various heightmaps, also of different sizes.
Heightmap patterns
I've created a heightmap containing a pattern to ensure the position is not just offset.
I am rendering a point cloud using OSG. I followed the example in the OSG cookbook titled "Rendering point cloud data with draw instancing" that shows how to make one point with many instances and then transfer the point locations to the graphics card via a texture. It then uses a shader to pull the points out of the texture and move each instance to the right location. There appear to be two problems with what is getting rendered.
First, the points aren't in the right location compared to a more straightforward, working approach to rendering. It looks like they are scaled wrong relative to the origin, as if some multiplicative factor were applied to the positions.
Second, the imagery is blurry. Points tend to be generally in the right place; there are many points where a large object should be. However, I can't tell what the object is. Data rendered with my working (but slower) rendering method looks sharp.
I have verified that I have the same input data going into the texture and draw list in both methods so it seems it has to be something with the rendering.
Here is the code that sets up the Geometry, which is nearly directly copied from the book.
osg::Geometry* geo = new osg::Geometry;
osg::ref_ptr<osg::Image> img = new osg::Image;
img->allocateImage(w, h, 1, GL_RGBA, GL_FLOAT);

osg::BoundingBox box;
float* data = (float*)img->data();
for (unsigned long int k=0; k<NPoints; k++)
{
    *(data++) = cloud->x[k];
    *(data++) = cloud->y[k];
    *(data++) = cloud->z[k];
    *(data++) = cloud->meta[0][k];
    box.expandBy(cloud->x[k], cloud->y[k], cloud->z[k]);
}

geo->setUseDisplayList(false);
geo->setUseVertexBufferObjects(true);
geo->setVertexArray( new osg::Vec3Array(1) );
geo->addPrimitiveSet( new osg::DrawArrays(GL_POINTS, 0, 1, stop) );
geo->setInitialBound(box);

osg::ref_ptr<osg::Texture2D> tex = new osg::Texture2D;
tex->setImage( img );
tex->setInternalFormat( GL_RGBA32F_ARB );
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::LINEAR );
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::LINEAR );
And here is the shader code.
void main () {
    float row;
    row = float(gl_InstanceID) / float(width);
    vec2 uv = vec2( fract(row), floor(row) / float(height) );
    vec4 texValue = texture2D(defaultTex, uv);
    vec4 pos = gl_Vertex + vec4(texValue.xyz, 1.0);
    gl_Position = gl_ModelViewProjectionMatrix * pos;
}
After a bunch of experimenting, I found that the example code from the OSG Cookbook has some problems.
The scale issue (the first problem) is in the shader.
vec4 pos = gl_Vertex + vec4(texValue.xyz, 1.0);
Should be
vec4 pos = gl_Vertex + vec4(texValue.xyz, 0.0);
This is because gl_Vertex is a homogeneous 4-vector: xyz plus a w component that exists to make matrix transformations work, and w should always be 1 for a position. The example added another vector whose w is also 1, making the result's w equal to 2, so every point effectively renders at half its intended position (scaled toward the origin). Replace the 1 with a zero and the scale problem goes away.
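An equivalent way to write the fix, keeping w explicit (same behavior, arguably clearer):

    // build the offset position from the xyz parts and set w = 1 exactly once
    vec4 pos = vec4(gl_Vertex.xyz + texValue.xyz, 1.0);
    gl_Position = gl_ModelViewProjectionMatrix * pos;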
The blurriness (the second problem) was caused by texture interpolation.
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::LINEAR);
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::LINEAR);
needs to be
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::NEAREST);
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::NEAREST);
so that the interpolator will just take the values from the texture instead of interpolating them from neighboring texture pixels which may be points on the other side of the point cloud. After fixing these two issues, the example works as advertised and seems to be a bit faster in my limited testing.
(Edit) I got geometry picking working with a framebuffer. My goal is to draw a huge scene in one draw call, but I need to draw to a multisampled color texture attachment (GL_COLOR_ATTACHMENT0) and (edited) to a non-multisampled picking texture attachment (GL_COLOR_ATTACHMENT1). The problem is that if I use a multisampled texture for picking, the picking is corrupted because of the multisampling.
I write the geometry ID in the fragment shader like this:
//...
// Given geometry id
uniform int in_object_id;

// Drawn to screen (GL_COLOR_ATTACHMENT0)
out vec4 out_frag_color0;
// Drawn to pick texture (GL_COLOR_ATTACHMENT1)
out vec4 out_frag_color1;
// ...

void main() {
    out_frag_color0 = ...; // Calculating lighting and other stuff
    //...

    const int max_byte1 = 256;
    const int max_byte2 = 65536;
    const float fmax_byte = 255.0;

    int a1 = in_object_id % max_byte1;
    int a2 = (in_object_id / max_byte1) % max_byte1;
    int a3 = (in_object_id / max_byte2) % max_byte1;

    //out_frag_color0 = vec4(a3 / fmax_byte, a2 / fmax_byte, a1 / fmax_byte, 1);
    out_frag_color1 = vec4(a3 / fmax_byte, a2 / fmax_byte, a1 / fmax_byte, 1);
}
(The point of that code is to use the RGB channels to store the geometry ID, which is then read back and used to change the color of the picked cube.)
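For reference, the CPU side presumably decodes the picked pixel back into the ID roughly like this (a sketch using the standard glReadPixels call; the cursor coordinates and the FBO setup are not shown in the question):

    // read the pixel under the cursor from the picking attachment
    // (assumes the picking FBO is bound and GL_COLOR_ATTACHMENT1 is selected for reading)
    unsigned char rgba[4];
    glReadBuffer(GL_COLOR_ATTACHMENT1);
    glReadPixels(cursor_x, cursor_y, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, rgba);

    // invert the shader's encoding: r = high byte, g = middle byte, b = low byte
    int picked_id = rgba[0] * 65536 + rgba[1] * 256 + rgba[2];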
This happens when I move the cursor one pixel to the left:
Because of the alpha value of the cube's pixel:
Without multisampling it works well. But multisampling blends my output color, the geometry ID is then corrupted, and a random cube matching the blended value gets selected.
(Edit) I can't attach a multisampled texture target to color0 and a non-multisampled texture target to color1; that's not supported. How can I do this in one draw call?
Multisampling is not my friend; I'm not sure I understand it (or framebuffers in general) well. Anyway, this way of picking geometry looks horrible to me (I mean encoding the ID as a color). Am I doing it right? How can I solve the multisampling problem? Is there a better way?
PS: Sorry for my poor English. :)
Thanks.
You can't do multisampled and non-multisampled rendering in a single draw call.
As you already found, using two color targets in an FBO, with only one of them being multisampled, is not supported. From the "Framebuffer Completeness" section in the spec:
The value of RENDERBUFFER_SAMPLES is the same for all attached renderbuffers; the value of TEXTURE_SAMPLES is the same for all attached textures; and, if the attached images are a mix of renderbuffers and textures, the value of RENDERBUFFER_SAMPLES matches the value of TEXTURE_SAMPLES.
You also can't render to multiple framebuffers at the same time. There is always one single current framebuffer.
The only reasonable option I can think of is to do picking in a separate pass. Then you can easily switch the framebuffer/attachment to a non-multisampled renderbuffer, and avoid all these issues.
Using a separate pass for picking seems cleaner to me anyway. This also allows you to use a specialized shader for each case, instead of always producing two outputs even if one of them is mostly unused.
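A rough sketch of that flow, assuming two FBOs prepared elsewhere (pick_fbo single-sampled, msaa_fbo multisampled) and a stripped-down pick_shader that writes only the encoded ID; all of these names are hypothetical:

    // pass 1: picking into the single-sampled FBO
    glBindFramebuffer(GL_FRAMEBUFFER, pick_fbo);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glUseProgram(pick_shader);   // outputs only the encoded object ID
    draw_scene();                // same geometry, no lighting work

    // read back the pixel under the cursor and decode it as in the question

    // pass 2: normal multisampled rendering for display
    glBindFramebuffer(GL_FRAMEBUFFER, msaa_fbo);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glUseProgram(scene_shader);
    draw_scene();

Since picking usually only needs the pixel under the cursor, the picking pass can additionally be scissored down to a small region around it, or skipped entirely on frames where nothing is being picked.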
I think it is possible...
You have to make the picking texture multisampled, and after rendering the scene you render two triangles over the screen; inside another fragment shader you can then read out each individual sample of that texture (bound as a sampler2DMS) with the GLSL function:
texelFetch(sampler, pixelPosition /* in texels, [0, textureSize) */, sampleIndex /* important: which sample to read */);
Then you can render it into a single-sampled texture and read the color via glReadPixel.
I haven't tested it, but I think it works.
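A minimal sketch of that fullscreen pass, assuming the multisampled picking attachment is bound to a sampler2DMS uniform (pick_texture and out_pick are illustrative names):

    #version 330 core

    uniform sampler2DMS pick_texture; // the multisampled picking attachment

    out vec4 out_pick;

    void main()
    {
        // read one specific sample (sample 0) instead of letting a normal
        // resolve average the ID colors of different geometries together
        out_pick = texelFetch(pick_texture, ivec2(gl_FragCoord.xy), 0);
    }

Rendered over two fullscreen triangles into a single-sampled texture, this yields one clean ID per pixel that glReadPixels can then fetch, as described above.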
I had some fun making my first shaders, and my first test subject was a 100x100 quad-faced picture.
I thought I would learn how to use TRIANGLE_STRIP, so I switched to it and moved one of the vertex calls so it would look square again. I turned my shader on and there was a duplicate right behind it: only one face, but with the entire texture on it. I have only one set of draw calls for this shape...
Here's my shape code:
glBegin(GL_TRIANGLE_STRIP);
float vx;
float vy;
for(float x=0; x<100; x++){
    for(float y=0; y<100; y++){
        float vx=x/5.0;
        float vy=y/5.0;
        glTexCoord2f(0.01*x, 0.01*y);
        glVertex3f(vx, vy, 0);
        glTexCoord2f(0.01+0.01*x, 0.01*y);
        glVertex3f(.2+vx, vy, 0);
        glTexCoord2f(0.01*x, 0.01+0.01*y);
        glVertex3f(vx, .2+vy, 0);
        glTexCoord2f(0.01+0.01*x, 0.01+0.01*y);
        glVertex3f(.2+vx, .2+vy, 0);
    }
}
glEnd();
And my (vertex) shader code:
uniform float uTime, uWaveintensity, uWavespeed;
uniform float uZwave1, uZwave2, uXwave, uYwave;

void main(){
    vec4 position = gl_Vertex;
    gl_TexCoord[0] = gl_MultiTexCoord0;

    position.z = ((sin(position.x+uTime*uWavespeed)*uZwave1)+(sin(position.y+uTime*uWavespeed))*uZwave2)*uWaveintensity;
    position.x = position.x+(sin(position.x+uTime*uWavespeed)*uXwave)*uWaveintensity;
    position.y = position.y+(sin(position.y+uTime*uWavespeed)*uYwave)*uWaveintensity;

    gl_Position = gl_ModelViewProjectionMatrix * position;
}
If anyone has any info on drawing more efficiently with shared vertices (triangle strips), I'd like to know; I've googled but haven't understood any of it so far XD.
screenshot(s):
with 8x8 faces
same thing, same angle, lines = ghost
I see what's happening now, but I don't know how to fix it.
I don't think you can create a 100x100 quad plane with triangle strips this way. Right now you're going through rows and columns in just one direction, which means the last two vertices of the first row will form a triangle with the first vertex of the second row, and that's not what you want.
I'd suggest you start with a 2x2 pattern just to learn how triangle strips work, then move on to 3x3 and 4x4 to see the difference between the odd and even cases. Once you have some understanding of the problems, you can create a universal algorithm and change your size to 100.
After all this you can focus on the vertex shader to make it wave.
And for the future: never start with big data when you're learning how things work. :)
EDIT:
Since I wrote this answer I've learned that you CAN make a two-dimensional grid with one triangle strip, using degenerate triangles :).
When a triangle uses the same vertex twice, it is ignored by the rasterizer during rendering, so at the end of your first strip you can create degenerate triangles using the last vertex of the first strip and the first vertex of the second strip. It doesn't matter which of the two vertices you use as the third one, as long as they are in the correct order (e.g. 1,1,2 or 1,2,2). This way you've created triangles that won't be drawn, but they move the "starting point" to the beginning of your second strip, where you can continue building your mesh.
The drawback is that you create some triangles that are transformed but not drawn (there won't be many of them), but the advantage is that you issue just one "draw strip" command to the GPU, which is much faster.
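A minimal sketch of that idea in plain immediate mode, matching the style of the code in the question (the function name and the cell parameter are made up, and texture coordinates are omitted for brevity):

    // One continuous strip over a (w+1) x (h+1) vertex grid, stitching rows
    // with degenerate (zero-area) triangles formed by repeated vertices.
    void drawGridStrip(int w, int h, float cell)
    {
        glBegin(GL_TRIANGLE_STRIP);
        for (int y = 0; y < h; ++y)
        {
            for (int x = 0; x <= w; ++x)
            {
                glVertex3f(x * cell, (y + 1) * cell, 0.0f); // top vertex of this column
                glVertex3f(x * cell,  y      * cell, 0.0f); // bottom vertex of this column
            }
            if (y + 1 < h)
            {
                // repeat the last vertex of this row and the first vertex of the
                // next row: only zero-area triangles are produced at the seam
                glVertex3f(w * cell, y * cell, 0.0f);
                glVertex3f(0.0f, (y + 2) * cell, 0.0f);
            }
        }
        glEnd();
    }

Each row contributes 2*(w+1) vertices plus two repeated ones at the seam, so the whole grid goes out as a single strip.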