I am trying to write a shader that paints all edges black, as is common in cel shading. I have searched extensively and found many articles and code samples on creating black outlines. Unfortunately, I do not understand most of them:
I found this article about feature edge rendering and tried that approach. Unfortunately, only the silhouette turns black, not the edges that lie inside the mesh. The same goes for this article.
Then I found this article about the Frei-Chen edge detector, but I have no idea how the whole thing works, even after studying the description for quite a while.
Could someone help me program such a shader?
EDIT: I do not use textures for my meshes.
Since I got a few downvotes for being too unspecific, I want to refer to the Frei-Chen edge detector. Here's the fragment shader code from Rastergrid:
#version 330 core
uniform sampler2D image;
out vec4 color;
void main(void)
{
    mat3 I;
    float cnv[9];
    vec3 sample;
    /* fetch the 3x3 neighbourhood and use the RGB vector's length as intensity value */
    for (int i=0; i<3; i++)
        for (int j=0; j<3; j++) {
            sample = texelFetch( image, ivec2(gl_FragCoord) + ivec2(i-1,j-1), 0 ).rgb;
            I[i][j] = length(sample);
        }
    /* calculate the convolution values for all the masks */
    for (int i=0; i<9; i++) {
        float dp3 = dot(G[i][0], I[0]) + dot(G[i][1], I[1]) + dot(G[i][2], I[2]);
        cnv[i] = dp3 * dp3;
    }
    float M = (cnv[0] + cnv[1]) + (cnv[2] + cnv[3]);
    float S = (cnv[4] + cnv[5]) + (cnv[6] + cnv[7]) + (cnv[8] + M);
    color = vec4(sqrt(M/S));
}
I skipped the G[9] matrix since this would blow up the code too much.
So I would be very thankful if somebody could tell me how the assignment
color = vec4(sqrt(M/S));
is supposed to work, since sqrt(M/S) returns a single float but is passed to vec4()? Thanks!

This is covered in the GLSL specification: constructing a vec4 from a single scalar produces a vec4 with every component set to that scalar.
5.4.2 Vector and Matrix Constructors
Constructors can be used to create vectors or matrices from a set of scalars, vectors, or matrices. This includes the ability to shorten vectors.
If there is a single scalar parameter to a vector constructor, it is used to initialize all components of the constructed vector to that scalar’s value.
How this is useful, I could not say. Duplicating data across multiple channels of an image is a big waste of memory bandwidth...
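To make the shader from the question a bit more concrete, here is a minimal CPU-side sketch in Python of what it computes for one 3x3 neighbourhood. The question omits G[9], so the masks below are the standard Frei-Chen basis as commonly published; their exact ordering and orientation relative to the Rastergrid code is an assumption. The helper names (`frei_chen`, `vec4`) are made up for illustration.

```python
import math

R2 = math.sqrt(2.0)

def _scaled(mask, s):
    """Multiply every entry of a 3x3 mask by s."""
    return [[v * s for v in row] for row in mask]

# Assumed standard Frei-Chen basis: masks 0-3 span the edge subspace,
# masks 4-7 the line subspace, mask 8 is the plain average.
G = [
    _scaled([[1, R2, 1], [0, 0, 0], [-1, -R2, -1]], 1 / (2 * R2)),
    _scaled([[1, 0, -1], [R2, 0, -R2], [1, 0, -1]], 1 / (2 * R2)),
    _scaled([[0, -1, R2], [1, 0, -1], [-R2, 1, 0]], 1 / (2 * R2)),
    _scaled([[R2, -1, 0], [-1, 0, 1], [0, 1, -R2]], 1 / (2 * R2)),
    _scaled([[0, 1, 0], [-1, 0, -1], [0, 1, 0]], 1 / 2),
    _scaled([[-1, 0, 1], [0, 0, 0], [1, 0, -1]], 1 / 2),
    _scaled([[1, -2, 1], [-2, 4, -2], [1, -2, 1]], 1 / 6),
    _scaled([[-2, 1, -2], [1, 4, 1], [-2, 1, -2]], 1 / 6),
    _scaled([[1, 1, 1], [1, 1, 1], [1, 1, 1]], 1 / 3),
]

def frei_chen(patch):
    """Edge strength sqrt(M/S) for one 3x3 patch of intensities."""
    cnv = []
    for mask in G:
        dp = sum(mask[r][c] * patch[r][c] for r in range(3) for c in range(3))
        cnv.append(dp * dp)
    m = sum(cnv[:4])   # energy in the edge subspace
    s = sum(cnv)       # energy over all nine masks
    return math.sqrt(m / s) if s > 0.0 else 0.0

def vec4(scalar):
    """Mimic GLSL's vec4(float): the scalar is copied into all four components."""
    return (scalar, scalar, scalar, scalar)

flat = [[1.0, 1.0, 1.0]] * 3   # no edge
step = [[0.0, 0.0, 1.0]] * 3   # vertical step edge
flat_color = vec4(frei_chen(flat))
step_color = vec4(frei_chen(step))
```

The result is one grey value per pixel: close to zero in flat regions and larger where the neighbourhood looks like an edge, which is why the shader only needs a single float for the output colour.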


Stuck trying to optimize complex GLSL fragment shader

So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. I implemented rendering with a three-layer technique, as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem:
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture mapped to another framebuffer. The light data is stored in an array in an UBO and it uses the occlusion mapping in its calculations. This is where the slowdown happens:
Lastly, the lightmap texture is multiplied with and added to the regular world render. This also isn't affected by the number of lights, so it's not part of the problem:
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core
// per-light data
struct Light
{
    vec4 location;
    vec4 rangeAndstartColor;
};
const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference
layout(std140) uniform ubo_lights
{
    Light lights[MaxLightsCount];
};
uniform sampler2D occlusionSampler; // the occlusion texture sampler
in vec2 fs_tex0; // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back
out vec4 color;
void main()
{
    vec3 resultColor = vec3(0.0);
    const vec2 size = fs_window_size;
    const vec2 pos = (size - vec2(1.0)) * fs_tex0;
    // process every light individually and add the resulting colors together
    // this should be branchless, is there any way to check?
    for(int idx = 0; idx < MaxLightsCount; ++idx)
    {
        const float range = lights[idx].rangeAndstartColor.x;
        const vec2 lightPosition = lights[idx].location.xy;
        const float dist = length(lightPosition - pos); // distance from current fragment to current light
        // early abort, the next part is expensive
        // this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
        if(dist > range)
            continue;
        const vec3 startColor = lights[idx].rangeAndstartColor.yzw;
        // walk between pos and lightPosition to find occlusions
        // standard line DDA algorithm
        vec2 tempPos = pos;
        int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
        const vec2 lineInc = (lightPosition - pos) / lineSteps;
        // can I get rid of this loop somehow? I need to check each position between
        // my fragment and the light position for occlusions, and this is the best I
        // came up with
        float lightStrength = 1.0;
        while(lineSteps --> 0)
        {
            const vec2 nextPos = tempPos + lineInc;
            const vec2 occlusionSamplerUV = tempPos / size;
            lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
            tempPos = nextPos;
        }
        // the contribution of this light to the fragment color is based on
        // its square distance from the light, and the occlusions between them
        // implemented as multiplications
        const float strength = max(0, range - dist) / range * lightStrength;
        resultColor += startColor * strength * strength;
    }
    color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is: is there a better way to look up any (boolean) occlusions between my fragment position and the light position in the occlusion texture, without iterating through every pixel by hand? Could render buffers perhaps help here (from what I've read they're for reading data back to system memory, but I need it in another shader)?
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple of routes for optimization:
Exact: apply a distance transform on the occlusion map: this will give you the distance to the nearest occluder at each pixel. After that you can safely step by that distance within the loop, instead of doing baby steps. This will drastically reduce the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU side algorithms, which seem to be more complicated.
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and even seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step and sample lower levels as you go farther from the point you are shading.
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is an equivalent of voxel cone tracing GI applied to 2D.
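The "exact" route above can be sketched on the CPU as follows. This is only an illustration in Python under stated assumptions: the occlusion map is a small grid of 0/1 cells, the distance transform is computed by brute force (fine for a demo, far too slow for a real map), and `distance_transform` and `march` are made-up names. The step of `d - 1.5` is a conservative margin I chose because both the sample point and the occluder are only known to cell precision.

```python
import math

def distance_transform(occ):
    """Brute-force Euclidean distance (in cells) from each cell to the nearest occluder."""
    h, w = len(occ), len(occ[0])
    blockers = [(x, y) for y in range(h) for x in range(w) if occ[y][x]]
    far = float(w + h)  # used when the map contains no occluder at all
    return [[min((math.hypot(x - bx, y - by) for bx, by in blockers), default=far)
             for x in range(w)] for y in range(h)]

def march(dt, start, end):
    """Walk from start to end; return (occluded, number_of_samples)."""
    length = math.hypot(end[0] - start[0], end[1] - start[1])
    if length == 0.0:
        return dt[int(start[1] + 0.5)][int(start[0] + 0.5)] == 0.0, 1
    dx = (end[0] - start[0]) / length
    dy = (end[1] - start[1]) / length
    t, samples = 0.0, 0
    while t <= length:
        x, y = start[0] + dx * t, start[1] + dy * t
        d = dt[int(y + 0.5)][int(x + 0.5)]  # nearest cell
        samples += 1
        if d == 0.0:
            return True, samples
        # No occluder lies within d cells, so skip ahead; never less than one cell.
        t += max(1.0, d - 1.5)
    return False, samples

# 32x32 map with a vertical wall at x == 16
occ = [[1 if x == 16 else 0 for x in range(32)] for y in range(32)]
dt = distance_transform(occ)
blocked, blocked_samples = march(dt, (2.0, 10.0), (30.0, 10.0))
clear, clear_samples = march(dt, (2.0, 10.0), (12.0, 10.0))
```

In this toy map the ray that crosses the wall is resolved in a handful of samples instead of one per pixel, and the unobstructed ray needs a single lookup. In the shader the same idea would mean sampling a distance texture instead of the raw occlusion texture inside the while loop.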

Rendering point cloud data with draw instancing from OSG Cookbook not working

I am rendering a point cloud using OSG. I followed the example in the OSG cookbook titled "Rendering point cloud data with draw instancing" that shows how to make one point with many instances and then transfer the point locations to the graphics card via a texture. It then uses a shader to pull the points out of the texture and move each instance to the right location. There appear to be two problems with what is getting rendered.
First, the points aren't in the right location compared to a more straightforward, working approach to rendering. It looks like they are scaled wrongly about the origin, as if there were some multiplicative factor on position.
Second, the imagery is blurry. Points tend to be generally in the right place; there are many points in the place where a large object should be. However, I can't tell what the object is. Data rendered with my working (but slower) rendering method looks sharp.
I have verified that I have the same input data going into the texture and draw list in both methods so it seems it has to be something with the rendering.
Here is the code to set up the Geometry, which is copied nearly verbatim from the cookbook.
osg::Geometry* geo = new osg::Geometry;
osg::ref_ptr<osg::Image> img = new osg::Image;
img->allocateImage(w,h, 1, GL_RGBA, GL_FLOAT);
osg::BoundingBox box;
float* data = (float*)img->data();
for (unsigned long int k=0; k<NPoints; k++)
{
    *(data++) = cloud->x[k];
    *(data++) = cloud->y[k];
    *(data++) = cloud->z[k];
    *(data++) = cloud->meta[0][k];
}
geo->setVertexArray( new osg::Vec3Array(1));
geo->addPrimitiveSet( new osg::DrawArrays(GL_POINTS, 0, 1, stop) );
osg::ref_ptr<osg::Texture2D> tex = new osg::Texture2D;
tex->setImage( img);
tex->setInternalFormat( GL_RGBA32F_ARB );
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::LINEAR);
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::LINEAR);
And here is the shader code.
void main () {
    float row;
    row = float(gl_InstanceID) / float(width);
    vec2 uv = vec2( fract(row), floor(row) / float(height) );
    vec4 texValue = texture2D(defaultTex,uv);
    vec4 pos = gl_Vertex + vec4(texValue.xyz, 1.0);
    gl_Position = gl_ModelViewProjectionMatrix * pos;
}
After a bunch of experimenting, I found that the example code from the OSG Cookbook has some problems.
The scale issue (the first problem) is in the shader.
vec4 pos = gl_Vertex + vec4(texValue.xyz, 1.0);
should be
vec4 pos = gl_Vertex + vec4(texValue.xyz, 0.0);
This is because gl_Vertex is a 3-vector with an extra fourth element (the homogeneous w coordinate) to aid with matrix transformations. That element should always be 1. The example created another 3+1 vector and added it to gl_Vertex, making w equal to 2, which scales every projected position by one half. Replace the 1 with a zero and the scale problem goes away.
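A small Python sketch of why the extra 1 shows up as a scale factor. The point value is made up, and the model-view-projection matrix is left out on purpose: a matrix multiply is linear, so a vector with w = 2 represents the same point as the halved vector with w = 1 both before and after the transform.

```python
def add4(a, b):
    """Component-wise sum of two 4-vectors."""
    return tuple(p + q for p, q in zip(a, b))

def perspective_divide(v):
    """The 3D point that a homogeneous (x, y, z, w) vector represents."""
    x, y, z, w = v
    return (x / w, y / w, z / w)

gl_vertex = (0.0, 0.0, 0.0, 1.0)   # the single instanced vertex
point = (4.0, 6.0, 8.0)            # hypothetical position read from the texture

wrong = add4(gl_vertex, point + (1.0,))   # w becomes 2
right = add4(gl_vertex, point + (0.0,))   # w stays 1

wrong_point = perspective_divide(wrong)
right_point = perspective_divide(right)
```

With w = 2 the point lands at half its intended coordinates, which matches the "multiplicative factor on position" described in the question.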
The blurriness (the second problem) was caused by texture interpolation.
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::LINEAR);
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::LINEAR);
needs to be
tex->setFilter( osg::Texture2D::MIN_FILTER, osg::Texture2D::NEAREST);
tex->setFilter( osg::Texture2D::MAG_FILTER, osg::Texture2D::NEAREST);
so that the sampler returns the stored texel values as they are instead of interpolating between neighboring texels, which may hold points on the other side of the point cloud. After fixing these two issues, the example works as advertised and seems to be a bit faster in my limited testing.
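To illustrate the filtering problem, here is a Python sketch of a one-row texture lookup. It assumes the usual convention that texel centres sit at (i + 0.5) / width and that the lookup coordinate is instance / width, as in the shader above; the function names and the sample positions are made up.

```python
import math

def sample_nearest(texels, u):
    """Return the texel whose cell contains u (u in [0, 1])."""
    i = min(int(u * len(texels)), len(texels) - 1)
    return texels[i]

def sample_linear(texels, u):
    """Blend the two texels whose centres surround u, clamping at the borders."""
    n = len(texels)
    x = u * n - 0.5                 # position measured in texel centres
    i0 = math.floor(x)
    f = x - i0
    a = texels[min(max(i0, 0), n - 1)]
    b = texels[min(max(i0 + 1, 0), n - 1)]
    return tuple(p * (1.0 - f) + q * f for p, q in zip(a, b))

# Four unrelated point positions stored in consecutive texels
texels = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 50.0, 0.0), (7.0, 7.0, 7.0)]
instance = 2
u = instance / len(texels)

nearest = sample_nearest(texels, u)
linear = sample_linear(texels, u)
```

The linear lookup returns a position halfway between two stored points that have nothing to do with each other, so every instance is drawn somewhere between its own position and its neighbour's. That would explain a cloud that is roughly in the right place but blurry.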

Calculate surface normals from depth image using neighboring pixels cross product

As the title says, I want to calculate the surface normals of a given depth image by using the cross product of neighboring pixels. I would like to use OpenCV for that and avoid PCL; however, I do not really understand the procedure, since my knowledge of the subject is quite limited. Therefore, I would be grateful if someone could provide some hints. Note that I do not have any other information except the depth image and the corresponding RGB image, so no camera matrix K.
Thus, let's say that we have the following depth image:
and I want to find the normal vector at a corresponding point with a corresponding depth value like in the following image:
How can I do that using the cross product of the neighbouring pixels? I do not mind if the normals are not highly accurate.
OK, I tried to follow @timday's answer and port his code to OpenCV, with the following code:
Mat depth = <my_depth_image>; // of type CV_32FC1
Mat normals(depth.size(), CV_32FC3, Scalar::all(0));
// skip the one-pixel border, where the central differences would read out of bounds
for(int x = 1; x < depth.rows - 1; ++x)
{
    for(int y = 1; y < depth.cols - 1; ++y)
    {
        float dzdx = (depth.at<float>(x+1, y) - depth.at<float>(x-1, y)) / 2.0;
        float dzdy = (depth.at<float>(x, y+1) - depth.at<float>(x, y-1)) / 2.0;
        Vec3f d(-dzdx, -dzdy, 1.0f);
        Vec3f n = normalize(d);
        normals.at<Vec3f>(x, y) = n;
    }
}
imshow("depth", depth / 255);
imshow("normals", normals);
I am getting the following correct result (I had to replace double with float and Vecd with Vecf; I do not know why that would make any difference, though):
You don't really need to use the cross product for this, but see below.
Consider your range image as a function z(x,y).
The normal to the surface is in the direction (-dz/dx,-dz/dy,1). (Where by dz/dx I mean the differential: the rate of change of z with x). And then normals are conventionally normalized to unit length.
Incidentally, if you're wondering where that (-dz/dx,-dz/dy,1) comes from... if you take the 2 orthogonal tangent vectors in the plane parallel to the x and y axes, those are (1,0,dzdx) and (0,1,dzdy). The normal is perpendicular to the tangents, so it should be (1,0,dzdx)X(0,1,dzdy) - where 'X' is the cross product - which is (-dzdx,-dzdy,1). So there's your cross-product-derived normal, but there's little need to compute it so explicitly in code when you can just use the resulting expression for the normal directly.
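The derivation can be checked numerically with a short Python sketch. It assumes the depth image is a plain list of rows indexed as z[y][x]; `normal_at` is a made-up helper name. It goes through the cross product explicitly just to show that the result is the same as (-dzdx, -dzdy, 1).

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normal_at(z, x, y):
    """Unit normal of the height field z[y][x] at an interior pixel."""
    dzdx = (z[y][x + 1] - z[y][x - 1]) / 2.0   # central difference in x
    dzdy = (z[y + 1][x] - z[y - 1][x]) / 2.0   # central difference in y
    tangent_x = (1.0, 0.0, dzdx)
    tangent_y = (0.0, 1.0, dzdy)
    d = cross(tangent_x, tangent_y)            # equals (-dzdx, -dzdy, 1)
    length = math.sqrt(sum(c * c for c in d))
    return tuple(c / length for c in d)

# A tilted plane z = 2x + 3y, whose normal direction is (-2, -3, 1)
plane = [[2.0 * x + 3.0 * y for x in range(5)] for y in range(5)]
n = normal_at(plane, 2, 2)
```

For the plane the central differences are exact, so the computed normal is (-2, -3, 1) scaled to unit length.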
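Pseudocode to compute a unit-length normal at (x,y) would be something like
dzdx=(z(x+1,y)-z(x-1,y))/2.0
dzdy=(z(x,y+1)-z(x,y-1))/2.0
direction=(-dzdx,-dzdy,1.0)
magnitude=sqrt(direction.x**2 + direction.y**2 + direction.z**2)
normal=direction/magnitude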
Depending on what you're trying to do, it might make more sense to replace the NaN values with just some large number.
Using that approach, from your range image, I can get this:
(I'm then using the normal directions calculated to do some simple shading; note the "steppy" appearance due to the range image's quantization; ideally you'd have higher precision than 8-bit for the real range data).
Sorry, not OpenCV or C++ code, but just for completeness: the complete code which produced that image (GLSL embedded in a Qt QML file; can be run with Qt5's qmlscene) is below. The pseudocode above can be found in the fragment shader's main() function:
import QtQuick 2.2
Image {
  source: 'range.png' // The provided image
  ShaderEffect {
    anchors.fill: parent
    blending: false
    property real dx: 1.0/parent.width
    property real dy: 1.0/parent.height
    property variant src: parent
    vertexShader: "
      uniform highp mat4 qt_Matrix;
      attribute highp vec4 qt_Vertex;
      attribute highp vec2 qt_MultiTexCoord0;
      varying highp vec2 coord;
      void main() {
        coord=qt_MultiTexCoord0;
        gl_Position=qt_Matrix*qt_Vertex;
      }"
    fragmentShader: "
      uniform highp float dx;
      uniform highp float dy;
      varying highp vec2 coord;
      uniform sampler2D src;
      void main() {
        highp float dzdx=( texture2D(src,coord+vec2(dx,0.0)).x - texture2D(src,coord+vec2(-dx,0.0)).x )/(2.0*dx);
        highp float dzdy=( texture2D(src,coord+vec2(0.0,dy)).x - texture2D(src,coord+vec2(0.0,-dy)).x )/(2.0*dy);
        highp vec3 d=vec3(-dzdx,-dzdy,1.0);
        highp vec3 n=normalize(d);
        highp vec3 lightDirection=vec3(1.0,-2.0,3.0);
        highp float shading=0.5+0.5*dot(n,normalize(lightDirection));
        gl_FragColor=vec4(shading,shading,shading,1.0);
      }"
  }
}
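I think the following code (a matrix-based calculation) is correct:
import numpy as np
def normalization(data):
    mo_chang =np.sqrt(np.multiply(data[:,:,0],data[:,:,0])+np.multiply(data[:,:,1],data[:,:,1])+np.multiply(data[:,:,2],data[:,:,2]))
    mo_chang = np.dstack((mo_chang,mo_chang,mo_chang))
    return data/mo_chang
# pts_3d_world: array of shape (3, height, width) with the back-projected 3D points
# f: difference to the right-hand neighbour, t: difference to the neighbour below
f= pts_3d_world[:,1:height-1,2:width]-pts_3d_world[:,1:height-1,1:width-1]
t= pts_3d_world[:,2:height,1:width-1]-pts_3d_world[:,1:height-1,1:width-1]
# assumed step, not part of the posted snippet: the normal is the cross product of f and t
normal_map = normalization(np.cross(f, t, axisa=0, axisb=0))
alpha = np.full((height-2,width-2,1), (1.), dtype="float32")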
We should use the camera intrinsics matrix K. I think the values f and t should be computed from 3D points in camera coordinates.
As for the normal vector: (-1,-1,100) and (255,255,100) end up as the same color in 8-bit images, but they are totally different normals. So we should map the normal values to (0,1) with normal_map=normal_map*0.5+0.5.
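As a small illustration of that mapping, here is a Python sketch that packs a unit normal into 8-bit RGB and unpacks it again. The helper names are made up, and rounding to the nearest integer is my own choice.

```python
def encode_normal(n):
    """Map each component from [-1, 1] to [0, 1], then to an 8-bit value."""
    return tuple(int((c * 0.5 + 0.5) * 255.0 + 0.5) for c in n)

def decode_normal(rgb):
    """Inverse mapping from 8-bit values back to [-1, 1]."""
    return tuple(v / 255.0 * 2.0 - 1.0 for v in rgb)

facing_camera = encode_normal((0.0, 0.0, 1.0))
round_trip = decode_normal(facing_camera)
```

A normal pointing straight at the viewer becomes the familiar light blue (128, 128, 255) of normal maps, and decoding recovers it to within the 8-bit quantization step.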
Comments and discussion are welcome.

Marching Cubes Issues

I've been trying to implement the marching cubes algorithm with C++ and Qt. So far all the steps have been written, but I'm getting a really bad result. I'm looking for guidance or advice about what could be going wrong. I suspect one of the problems may be with my voxel convention, specifically which vertex goes in which corner (0, 1, ..., 7). Also, I'm not 100% sure how to interpret the input for the algorithm (I'm using datasets). Should I read it in ZYX order and move the marching cube in the same order, or doesn't it matter at all? (Leaving aside the fact that not every dimension has to have the same size.)
Here is what I'm getting against what it should look like...
Paul Bourke. "Overview and source code". Qt/OpenGL example courtesy Dr. Klaus Miltenberger.
The example requires boost, but looks like it probably should work.
In his example, marchingcubes.cpp contains a couple of different methods for calculating the marching cubes: vMarchCube1 and vMarchCube2.
In the comments it says vMarchCube2 performs the Marching Tetrahedrons algorithm on a single cube by making six calls to vMarchTetrahedron.
Below is the source for the first one vMarchCube1:
//vMarchCube1 performs the Marching Cubes algorithm on a single cube
GLvoid GL_Widget::vMarchCube1(const GLfloat &fX, const GLfloat &fY, const GLfloat &fZ, const GLfloat &fScale, const GLfloat &fTv)
{
    GLint iCorner, iVertex, iVertexTest, iEdge, iTriangle, iFlagIndex, iEdgeFlags;
    GLfloat fOffset;
    GLvector sColor;
    GLfloat afCubeValue[8];
    GLvector asEdgeVertex[12];
    GLvector asEdgeNorm[12];
    //Make a local copy of the values at the cube's corners
    for(iVertex = 0; iVertex < 8; iVertex++)
    {
        afCubeValue[iVertex] = (this->*fSample)(fX + a2fVertexOffset[iVertex][0]*fScale,fY + a2fVertexOffset[iVertex][1]*fScale,fZ + a2fVertexOffset[iVertex][2]*fScale);
    }
    //Find which vertices are inside of the surface and which are outside
    iFlagIndex = 0;
    for(iVertexTest = 0; iVertexTest < 8; iVertexTest++)
    {
        if(afCubeValue[iVertexTest] <= fTv) iFlagIndex |= 1<<iVertexTest;
    }
    //Find which edges are intersected by the surface
    iEdgeFlags = aiCubeEdgeFlags[iFlagIndex];
    //If the cube is entirely inside or outside of the surface, then there will be no intersections
    if(iEdgeFlags == 0)
    {
        return;
    }
    //Find the point of intersection of the surface with each edge
    //Then find the normal to the surface at those points
    for(iEdge = 0; iEdge < 12; iEdge++)
    {
        //if there is an intersection on this edge
        if(iEdgeFlags & (1<<iEdge))
        {
            fOffset = fGetOffset(afCubeValue[ a2iEdgeConnection[iEdge][0] ],afCubeValue[ a2iEdgeConnection[iEdge][1] ], fTv);
            asEdgeVertex[iEdge].fX = fX + (a2fVertexOffset[ a2iEdgeConnection[iEdge][0] ][0] + fOffset * a2fEdgeDirection[iEdge][0]) * fScale;
            asEdgeVertex[iEdge].fY = fY + (a2fVertexOffset[ a2iEdgeConnection[iEdge][0] ][1] + fOffset * a2fEdgeDirection[iEdge][1]) * fScale;
            asEdgeVertex[iEdge].fZ = fZ + (a2fVertexOffset[ a2iEdgeConnection[iEdge][0] ][2] + fOffset * a2fEdgeDirection[iEdge][2]) * fScale;
            vGetNormal(asEdgeNorm[iEdge], asEdgeVertex[iEdge].fX, asEdgeVertex[iEdge].fY, asEdgeVertex[iEdge].fZ);
        }
    }
    //Draw the triangles that were found. There can be up to five per cube
    for(iTriangle = 0; iTriangle < 5; iTriangle++)
    {
        if(a2iTriangleConnectionTable[iFlagIndex][3*iTriangle] < 0) break;
        for(iCorner = 0; iCorner < 3; iCorner++)
        {
            iVertex = a2iTriangleConnectionTable[iFlagIndex][3*iTriangle+iCorner];
            vGetColor(sColor, asEdgeVertex[iVertex], asEdgeNorm[iVertex]);
            glColor4f(sColor.fX, sColor.fY, sColor.fZ, 0.6);
            glNormal3f(asEdgeNorm[iVertex].fX, asEdgeNorm[iVertex].fY, asEdgeNorm[iVertex].fZ);
            glVertex3f(asEdgeVertex[iVertex].fX, asEdgeVertex[iVertex].fY, asEdgeVertex[iVertex].fZ);
        }
    }
}
UPDATE: Github working example, tested
Hope that helps.
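Regarding the question of which vertex goes in which corner: the two steps in vMarchCube1 that depend on the corner numbering are the flag index and the edge interpolation. Here is a minimal Python sketch of both, with made-up function names. It mirrors the logic of the listing above but is not the original code, and the zero-delta fallback of 0.5 is an assumption.

```python
def cube_index(corner_values, threshold):
    """Build the 8-bit case index: bit i is set when corner i is at or below the threshold."""
    index = 0
    for i, value in enumerate(corner_values):
        if value <= threshold:
            index |= 1 << i
    return index

def edge_offset(v1, v2, threshold):
    """Fraction along an edge (0 at v1, 1 at v2) where the value crosses the threshold."""
    delta = v2 - v1
    if delta == 0.0:
        return 0.5   # assumed fallback for a flat edge
    return (threshold - v1) / delta

# Corners 0-3 inside, corners 4-7 outside
index = cube_index([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0], 0.5)
offset = edge_offset(0.0, 1.0, 0.25)
```

Whatever order the dataset is read in, corner i must be the same physical corner that the edge and triangle tables assume for bit i. If the two disagree, the tables are looked up with the wrong case and the triangles come out scrambled.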
Finally, I found what was wrong.
I use a VBO indexer class to reduce the amount of duplicated vertices and make rendering faster. This class is implemented with a std::map to find and discard already existing vertices, using a tuple of < vec3, unsigned short >. As you may imagine, a marching cubes algorithm generates structures with thousands if not millions of vertices. The highest number a common unsigned short can hold is 65535, or 2^16 - 1. So, when the output geometry had more vertices than that, the map index started to overflow and the result was a mess, since it started to overwrite vertices with the new ones. I just changed my implementation to draw with a plain, non-indexed VBO while I fix my class to support millions of vertices.
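The wraparound is easy to reproduce. This Python sketch masks indices to a given bit width, which mimics what happens when a larger value is stored in a 16-bit unsigned short; the function names are made up.

```python
def to_uint16(i):
    """Value obtained when i is stored in a 16-bit unsigned integer."""
    return i & 0xFFFF

def build_index_buffer(vertex_count, index_bits):
    """Indices 0..vertex_count-1 as they would be stored with the given bit width."""
    mask = (1 << index_bits) - 1
    return [i & mask for i in range(vertex_count)]

narrow = build_index_buffer(70000, 16)
wide = build_index_buffer(70000, 32)
```

With 70000 vertices the 16-bit buffer only contains 65536 distinct indices, so vertex 65536 is drawn as vertex 0 and so on, while 32-bit indices keep every vertex distinct.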
The result, with some minor vertex normal issues, speaks for itself:

Linear sampled Gaussian blur quality issue

I recently implemented a linear sampled gaussian blur based on this article: Linear Sampled Gaussian Blur
It generally came out well; however, there appears to be slight aliasing on text and thinner collections of pixels. I'm pretty stumped as to what is causing this. Is it an issue with my shader or weight calculations, or is it an inherent drawback of using this method?
I'd like to add that I don't run into this issue when I sample each pixel regularly instead of using bilinear filtering.
Any insights are much appreciated. Here's a code sample of how I work out my weights:
int support = int(sigma * 3.0f);
float total = 0.0f;
// centre tap (i = 0)
weights.push_back(exp(-(0*0)/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma));
total += weights.back();
offsets.push_back(0.0f);
for (int i = 1; i <= support; i++)
{
    float w1 = exp(-(i*i)/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma);
    float w2 = exp(-((i+1)*(i+1))/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma);
    weights.push_back(w1 + w2);
    total += 2.0f * weights[i];
    offsets.push_back((i * w1 + (i + 1) * w2) / weights[i]);
}
for (int i = 0; i < support; i++)
{
    weights[i] /= total;
}
And here is the fragment shader (there is another vertical version of this shader too):
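void main()
{
    // coordinate expressions reconstructed; assumes a gl_FragCoord-based lookup
    vec3 acc = texture2D(tex_object, gl_FragCoord.xy/tex_size).rgb*weights[0];
    for (int i = 1; i < NUM_SAMPLES; i++)
    {
        acc += texture2D(tex_object, (gl_FragCoord.xy + vec2(offsets[i], 0.0))/tex_size).rgb*weights[i];
        acc += texture2D(tex_object, (gl_FragCoord.xy - vec2(offsets[i], 0.0))/tex_size).rgb*weights[i];
    }
    gl_FragColor = vec4(acc, 1.0);
}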
Here is a screenshot depicting the issue:
This looks like a correct gaussian blur to me. The extent to which text is disrupted depends on your sigma. What value are you using?
Also I would check the scaling matrix for the projection you are using.
If you want to blur without affecting text and thin pixel lines, you might think of
compositing the result with the output of a mild high-pass filter
using a smaller sigma
changing the shape of the kernel so it's not Gaussian: rather than exp(-i*i/(s*s)), you might try a function with higher excess kurtosis. You could try a linear up/down function, or one of the functions listed on this page instead. They will all lead to blurs that disrupt fine detail to varying degrees.
This is an inherent issue with the bilinear filtering. It's unavoidable.
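One more thing that may be worth checking in the weight code from the question: as posted, the loop advances i by 1, so the merged pairs are (1,2), (2,3), (3,4), and so on. Texels 2 and up are then counted twice, which distorts the kernel. The Python sketch below assumes the pairs should not overlap, i.e. (1,2), (3,4), ..., and checks that the merged taps then reproduce an ordinary discrete Gaussian when the bilinear fetch is exact. All function names are made up, and the 1/(sqrt(2*pi)*sigma) factor is dropped because it cancels in the normalization.

```python
import math

def gauss(i, sigma):
    """Unnormalized Gaussian weight for texel offset i."""
    return math.exp(-(i * i) / (2.0 * sigma * sigma))

def linear_sampled_kernel(sigma):
    """One-sided (weights, offsets); index 0 is the centre, the rest merge texel pairs."""
    support = int(sigma * 3.0)
    weights, offsets = [gauss(0, sigma)], [0.0]
    for i in range(1, support + 1, 2):          # step 2: pairs (1,2), (3,4), ...
        w1, w2 = gauss(i, sigma), gauss(i + 1, sigma)
        weights.append(w1 + w2)
        offsets.append((i * w1 + (i + 1) * w2) / (w1 + w2))
    total = weights[0] + 2.0 * sum(weights[1:])
    return [w / total for w in weights], offsets

def bilinear(row, x):
    """Exact linear interpolation between the two texels around position x."""
    i = math.floor(x)
    f = x - i
    return row[i] * (1.0 - f) + row[i + 1] * f

def blur_linear(row, c, sigma):
    """Blur at index c using the merged taps, one bilinear fetch per pair and side."""
    weights, offsets = linear_sampled_kernel(sigma)
    acc = row[c] * weights[0]
    for w, o in zip(weights[1:], offsets[1:]):
        acc += (bilinear(row, c + o) + bilinear(row, c - o)) * w
    return acc

def blur_discrete(row, c, sigma):
    """Reference: one fetch per texel over the same radius."""
    radius = 2 * (len(linear_sampled_kernel(sigma)[0]) - 1)
    ks = range(-radius, radius + 1)
    total = sum(gauss(k, sigma) for k in ks)
    return sum(gauss(k, sigma) / total * row[c + k] for k in ks)

row = [float((7 * i * i + 3) % 11) for i in range(21)]
merged = blur_linear(row, 10, 2.0)
reference = blur_discrete(row, 10, 2.0)
```

With exact interpolation the two results agree. On real hardware the bilinear weights usually have limited precision, so small differences from per-texel sampling can remain even with correct weights, which is in line with the answer above.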