Which mouse picking strategy for millions of primitives? - opengl

I am rendering models built from millions of triangles (up to ten million) using VBOs, and I need to detect which of these triangles the user clicks on.
I have tried to read up on how the "name stack" and "unique color" approaches work.
I found that the name stack can hold at most 128 names, while the unique-color approach allows up to 2^(8+8+8) = 16,777,216 different colors, although the colors written to the framebuffer can sometimes be altered by rounding or dithering, so the values read back may not match exactly.
Which is the best strategy for my case?

Basically, you have 3 classes of options:
1) The "unique color per triangle" way: you attach an id to every triangle and render the ids out to a separate render target. That gives you 32 bits (8 bits each for r, g, b and a), and you could add a second render target for even more ids. It's fiddly to get the ids per triangle, but it's relatively easy to implement. It can be quite detrimental to performance though (fillrate).
2) Proper ray tracing. You almost certainly want an acceleration structure (octree, kd-tree, ...), but you probably already have one for frustum culling. One ray really isn't a lot; this method should be very fast.
3) Hybrid, probably the easiest to implement. Render out the vertex buffer id ("unique color per buffer"), and once you know which vertex buffer was selected, trace a ray against just its triangles.
In the general case, I would say 2) is the best option. If you want to have something working quickly, go for 3). 1) is probably pretty useless.
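For option 1), a minimal fragment shader sketch of the id-to-color packing (GLSL 3.30; how the per-triangle id reaches the shader, here a flat varying, is an assumption, not part of the answer above):
#version 330
flat in int triangleId; // hypothetical per-triangle id forwarded by the vertex shader
out vec4 outColor;
void main()
{
    // pack the low 24 bits of the id into RGB, one byte per channel
    outColor = vec4(float( triangleId        & 0xFF) / 255.0,
                    float((triangleId >> 8)  & 0xFF) / 255.0,
                    float((triangleId >> 16) & 0xFF) / 255.0,
                    1.0);
}
A glReadPixels of the single pixel under the cursor then recovers the id by reassembling the three bytes.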

If your GPU supports OpenGL 4.2, you may use the GLSL function imageStore() to mark triangle ids in an image. In my case, I need to detect all triangles behind a predefined window on the screen; picking (choosing the rendered triangles in a window) works similarly. The selection runs in real time for me.
The maximum size of an image (or texture) should be at least 8192x8192 = 64 M texels, so this can handle up to 64 M primitives (and even more if we use 2 or 3 images).
Saving the ids of all triangles behind the screen region (including occluded ones) could be done with this fragment shader:
layout(r32ui) uniform uimage2D id_image;
out vec4 color_f;
void main()
{
    color_f = vec4(0.0);
    ivec2 p;
    p.x = gl_PrimitiveID % 2048; // the image is treated as a 2048-wide array of flags
    p.y = gl_PrimitiveID / 2048;
    imageStore(id_image, p, uvec4(255));
}
To save only the ids of the triangles actually visible on the screen: first precompute a depth buffer, then use a slightly different fragment shader:
layout(r32ui) uniform uimage2D id_image;
layout(early_fragment_tests) in; // fragments that fail the depth test are discarded before the shader runs
out vec4 color_f;
void main()
{
    color_f = vec4(0.0);
    ivec2 p;
    p.x = gl_PrimitiveID % 2048;
    p.y = gl_PrimitiveID / 2048;
    imageStore(id_image, p, uvec4(255));
}
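Reading the marks back on the CPU is then a matter of fetching the image and scanning it (a C++ sketch; id_image_tex is a hypothetical texture name bound to id_image, and the 2048-wide indexing matches the shaders above):
glMemoryBarrier(GL_TEXTURE_UPDATE_BARRIER_BIT);        // make the imageStore writes visible
std::vector<GLuint> marks(2048 * 2048);
glBindTexture(GL_TEXTURE_2D, id_image_tex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RED_INTEGER, GL_UNSIGNED_INT, marks.data());
std::vector<int> hitTriangles;
for (int id = 0; id < (int)marks.size(); ++id)
    if (marks[id] != 0)                                // the shader stored 255 here
        hitTriangles.push_back(id);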


glTexSubImage3D(GL_TEXTURE_2D_ARRAY, ...) and GL_TEXTURE_SWIZZLE_RGBA

I have a texture array (~512 layers).
Some of the textures I upload have 4 channels (RGBA), some have only one (RED).
When creating individual textures, I can do this:
GLint swizzleMask[] = { GL_ONE, GL_ONE, GL_ONE, GL_RED };
glTexParameteriv(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_RGBA, swizzleMask);
Can I do this for specific layers of my texture array? (Swizzling should apply to one texture in the array only, not the others).
I suspect this is not possible, and if so, what's the preferred method? (Vertex attributes would be my last resort option).
EDIT 1: I am looking preferably for an OpenGL 3.3 or below solution.
EDIT 2: The idea is that I have RGBA bitmaps for my game (grass, wall, etc.) and I also have font bitmaps. I'm trying to render these in the same draw call.
In my fragment shader, I have something like:
uniform sampler2DArray TextureArraySampler;
out vec4 FragmentColor;
in VertexOut
{
    vec2 UV;
    vec4 COLOR;
    flat uint TEXTURE_INDEX;
} In;
void main(void)
{
    FragmentColor = In.COLOR * texture(TextureArraySampler, vec3(In.UV.x, In.UV.y, In.TEXTURE_INDEX));
}
So, when rendering fonts, I would like the shader to sample like:
FragmentColor = In.COLOR * vec4(1, 1, 1, texture(TextureArraySampler, vec3(In.UV.x, In.UV.y, In.TEXTURE_INDEX)).r);
And, when rendering bitmaps:
FragmentColor = In.COLOR * texture(TextureArraySampler, vec3(In.UV.x, In.UV.y, In.TEXTURE_INDEX)).rgba;
To start with: no, there's no way to do what you want. Well, there is a way, but it involves putting a non-dynamically-uniform conditional branch in your fragment shader, which is not a cost worth paying.
I'm trying to render these in the same draw call.
Performance presentations around OpenGL often talk about reducing draw calls being an important aspect of performance. This is very true, particularly for high-performance applications.
That being said, this does not mean that one should undertake Herculean efforts to reduce the number of draw calls to 1. The point of the advice is to get people to structure their engines so that the number of draw calls does not increase with the complexity of the scene.
For example, consider your tile map. Issuing a draw call per-tile is bad because the number of draw calls increases linearly with the number of tiles being drawn. So it makes sense to draw the entire tile map in a single call.
Now, let's say that your scene consists of tile maps and font glyphs, and it will always be exactly that. You could render this in two calls (one for the maps and one for the glyphs), or you could do it in one. But the performance difference between them will be negligible. What matters is that adding more tiles/glyphs does not mean adding more draw calls.
So you should not be concerned about adding a new draw call to your engine. What should concern you is if you're adding a new draw call per-X to your engine.
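(If you do end up wanting a single draw call anyway, the swizzle can be emulated without a branch, e.g. with mix() driven by a per-vertex flag. A sketch; the IS_FONT flag is an assumption, not something the question's code contains:)
uniform sampler2DArray TextureArraySampler;
out vec4 FragmentColor;
in VertexOut
{
    vec2 UV;
    vec4 COLOR;
    flat uint TEXTURE_INDEX;
    flat float IS_FONT; // hypothetical: 1.0 for font layers, 0.0 for bitmap layers
} In;
void main(void)
{
    vec4 texel = texture(TextureArraySampler, vec3(In.UV, In.TEXTURE_INDEX));
    // select between the plain RGBA texel and the (1,1,1,r) font variant, branch-free
    FragmentColor = In.COLOR * mix(texel, vec4(1.0, 1.0, 1.0, texel.r), In.IS_FONT);
}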

Comparing two textures in OpenGL

I'm new to OpenGL and I want to compare two textures to understand how similar they are to each other. I know how to do this with two bitmap images, but I really need a method that compares two textures.
Question is: Is there any way to compare two textures as we compare two images? Like comparing two images pixel by pixel?
Actually, what you seem to be asking for is not possible, or at least not as easy as it seems, to accomplish on the GPU. The problem is that the GPU is designed to run as many small tasks as possible in the shortest amount of time. Iterating over an array of data such as pixels is not what it is built for, so reducing a whole texture to a single integer or floating-point value is a bit hard.
There is one very interesting procedure you may try, though I cannot say the result will be appropriate for your case:
You can first create a new texture that is the difference of the two input textures, and then keep downsampling the result until you reach a 1x1 pixel texture, whose single value tells you how different the images are.
To achieve this, it is best to use a fixed-size target buffer whose dimensions are a power of two (POT), for instance 256x256. If you didn't use a fixed size, the result could vary a lot depending on the image sizes.
So in the first pass you would draw the two textures into a third one (using an FBO - framebuffer object). The fragment shader body you would use is simply:
vec4 a = texture2D(iChannel0,uv);
vec4 b = texture2D(iChannel1,uv);
fragColor = abs(a-b);
So now you have a texture which represents the difference between the two images per pixel, per color component. If the two images will be the same, the result will be a totally black picture.
Now you need to create a new FBO, scaled by half in every dimension, which comes to 128x128 in this example. When drawing into this buffer you need to use GL_NEAREST as the texture parameter so that no interpolation is done on the texel fetches. Then, for each new pixel, sum the 4 nearest pixels of the source image:
vec2 originalTextCoord = varyingTextCoord;
vec2 textCoordRight = vec2(varyingTextCoord.x + 1.0/256.0, varyingTextCoord.y);
vec2 textCoordBottom = vec2(varyingTextCoord.x, varyingTextCoord.y + 1.0/256.0);
vec2 textCoordBottomRight = vec2(varyingTextCoord.x + 1.0/256.0, varyingTextCoord.y + 1.0/256.0);
fragColor = texture2D(iChannel0, originalTextCoord) +
            texture2D(iChannel0, textCoordRight) +
            texture2D(iChannel0, textCoordBottom) +
            texture2D(iChannel0, textCoordBottomRight);
The 256 value is the source texture size, so it should be passed in as a uniform, letting you reuse the same shader for every pass.
After this is drawn you need to drop down to 64, 32, 16... Then read the pixel back to the CPU and see the result.
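A minimal sketch of that final readback, assuming the last pass rendered into a 1x1 FBO that is still bound:
unsigned char pixel[4];
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);
// pixel[0..3] now hold the accumulated per-component difference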
Now, unfortunately, this procedure may produce very unwanted results. Since the colors are simply summed together, it will overflow for all image pairs which are not similar enough (resulting in a white pixel, or rather (1,1,1,0) when both inputs are opaque). This may be overcome by applying a scale in the first shader pass, dividing the output by a large enough value. Still, this might not be enough, and an average might need to be done in the second shader as well (multiply the sum of the texture2D calls by .25).
In the end the result might still be a bit strange. You get 4 color components on the CPU which represent the sum or the average of the image differential. I guess you could sum them up and choose a threshold at which you consider the images to be alike. But if you want the result to be more meaningful, you might want to treat the whole pixel as a single 32-bit floating-point value (this is a bit tricky, but you may find answers around Stack Overflow). This way you can compute the values without overflow and get quite exact results from the algorithm: you would write the floating value as if it were a color, starting with the first shader's output and continuing through every other draw call (get the texel, convert it to float, sum, convert back to vec4 and assign as output). GL_NEAREST is essential here.
If not, then you may simplify the procedure: use GL_LINEAR instead of GL_NEAREST and simply keep redrawing the differential texture until it gets down to a single-pixel size (no need for the 4 texture coordinates). This produces a pixel which is the average of all the pixels in the differential texture, i.e. the average per-pixel difference between the two images. This procedure should also be quite fast.
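One low-code way to realize that repeated linear halving is to let the driver build the mip chain of the 256x256 difference texture and read back the 1x1 top level (a sketch; diffTex is a hypothetical floating-point texture holding abs(a-b)):
glBindTexture(GL_TEXTURE_2D, diffTex);
glGenerateMipmap(GL_TEXTURE_2D);
float avg[4]; // average difference per color component
glGetTexImage(GL_TEXTURE_2D, 8, GL_RGBA, GL_FLOAT, avg); // level 8 of a 256x256 texture is 1x1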
Then, if you want a slightly smarter algorithm, you can do some wonders when creating the differential texture. Simply subtracting the colors may not be the best approach. It would make more sense to blur one of the images and then compare it to the other image. This loses precision for very similar images, but for everything else it gives a much better result. For instance, you could say you are only interested in pixels that differ by more than 30% from the other (blurred) image, so you discard and rescale that 30% for every component, such as: result.r = clamp(abs(a.r-b.r) - 30.0/100.0, .0, 1.0) / ((100.0-30.0)/100.0);
You can bind both textures to a shader and visit each pixel by drawing a full-screen quad, with something like this:
// Equal pixels are marked green. Different pixels are shown in red.
void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    vec2 uv = fragCoord.xy / iResolution.xy;
    vec4 a = texture2D(iChannel0, uv);
    vec4 b = texture2D(iChannel1, uv);
    if (a != b)
        fragColor = vec4(1, 0, 0, 1);
    else
        fragColor = vec4(0, 1, 0, 1);
}
You can test the shader on Shadertoy.
Or you can also bind both textures to a compute shader and visit every pixel by iteration.
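For example, a compute shader sketch that counts differing texels into an SSBO counter (GLSL 4.30; the names and the 16x16 work-group size are assumptions):
#version 430
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0) uniform sampler2D texA;
layout(binding = 1) uniform sampler2D texB;
layout(std430, binding = 0) buffer DiffCount { uint differingTexels; };
void main()
{
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(p, textureSize(texA, 0))))
        return; // invocation outside the image
    vec4 a = texelFetch(texA, p, 0);
    vec4 b = texelFetch(texB, p, 0);
    if (any(notEqual(a, b)))
        atomicAdd(differingTexels, 1u);
}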
Note that GLSL's == and != operators do work on vectors and return a single bool, so the a != b above is legal; the component-wise form
if (any(notEqual(a, b)))
is equivalent and more explicit. It is the ordered comparisons (<, <=, >, >=) that require the lessThan()/greaterThan() family. Check the GLSL language spec.

Better to update a small vertex buffer, or send a uniform?

I'm writing/planning a GUI renderer for my OpenGL (core profile) game engine, and I'm not completely sure how I should be representing the vertex data for my quads. So far, I've thought of 2 possible solutions:
1) The straightforward way: every GuiElement keeps track of its own vertex array object, containing 2D screen coordinates and texture coordinates, and is updated (glBufferSubData()) any time the GuiElement is moved or resized.
2) I globally store a single vertex array object whose coordinates are (0,0)(1,0)(0,1)(1,1), upload a rect as a vec4 uniform (x, y, w, h) every frame, and transform the vertex positions in the vertex shader (vertex.xy *= guiRect.zw; vertex.xy += guiRect.xy;).
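For reference, option 2's vertex shader might look like this (a sketch; it assumes guiRect is already expressed in normalized device coordinates):
in vec2 vertexPos;    // unit-quad corner: (0,0), (1,0), (0,1) or (1,1)
uniform vec4 guiRect; // x, y, w, h
void main()
{
    vec2 pos = vertexPos * guiRect.zw + guiRect.xy; // scale, then translate
    gl_Position = vec4(pos, 0.0, 1.0);
}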
I know that method #2 works, but I want to know which one is better.
I do like the idea of option two; however, it would be quite inefficient, because it requires a draw call for each element. As was mentioned in other replies, the biggest performance gains lie in batching geometry and reducing the number of draw calls (in other words, reducing the time your application spends communicating with the GL driver).
So I think the fastest possible way of drawing 2D objects with OpenGL is by using a technique similar to your option one, but adding batching to it.
The smallest possible vertex format you need in order to draw a quadrilateral on the screen is a simple vec2, with 4 vec2s per quadrilateral. The texture coordinates can be generated in a very lightweight vertex shader, such as this:
// xy = vertex position in normalized device coordinates ([-1,+1] range).
attribute vec2 vertexPositionNDC;
varying vec2 vTexCoords;
const vec2 scale = vec2(0.5, 0.5);
void main()
{
    vTexCoords = vertexPositionNDC * scale + scale; // scale vertex attribute to [0,1] range
    gl_Position = vec4(vertexPositionNDC, 0.0, 1.0);
}
On the application side, you can set up double buffering to optimize throughput, using two vertex buffers: write to one of them on a given frame, then flip the buffers and send it to GL, while you start writing to the next buffer right away:
// Update:
GLuint vbo = vbos[currentVBO];
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, dataOffset, dataSize, data);
// Draw:
glDrawElements(...);
// Flip the buffers:
currentVBO = (currentVBO + 1) % NUM_BUFFERS;
Another, simpler option is to use a single buffer but allocate new storage on every submission, to avoid blocking, like so:
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, dataSize, data, GL_STREAM_DRAW);
This is a well known and used technique for simple async data transfers. Read this for more.
It is also a good idea to use indexed geometry. Keep an index buffer of unsigned shorts with the vertex buffer. A 2-byte-per-element IB will reduce data traffic quite a bit and has an index range big enough for any number of 2D/UI elements you might wish to draw.
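A sketch of building that static index buffer, 6 indices per quad over 4 shared vertices (maxQuads and ibo are illustrative names; unsigned shorts cap a buffer at 65536/4 = 16384 quads):
std::vector<GLushort> indices;
for (int q = 0; q < maxQuads; ++q) {
    GLushort b = (GLushort)(q * 4);
    const GLushort quad[6] = { b, (GLushort)(b + 1), (GLushort)(b + 2),   // first triangle
                               (GLushort)(b + 2), (GLushort)(b + 3), b }; // second triangle
    indices.insert(indices.end(), quad, quad + 6);
}
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(GLushort), indices.data(), GL_STATIC_DRAW);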
For GUI elements you could use a dynamic vertex buffer (ring buffer) and just upload the geometry every frame, because this is quite a small amount of geometry data. Then you can batch your GUI element rendering, unlike in both of your proposed methods.
Batching is quite important if you render a large number of GUI elements, such as text. You can quite easily build a generic GUI rendering system with this, one which caches the GUI element draw calls and flushes the draws to the GPU upon state changes.
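A minimal sketch of that cache-and-flush idea (all names and the Vertex layout are illustrative; the vertex buffer is assumed pre-allocated and an index buffer with the quad pattern is assumed bound):
struct Vertex { float x, y, u, v; };

struct GuiBatch {
    std::vector<Vertex> vertices; // CPU-side staging for the current batch
    GLuint vbo = 0;               // dynamic (ring) vertex buffer
    GLuint boundTexture = 0;      // the state this batch is keyed on

    void addQuad(const Vertex corners[4], GLuint texture) {
        if (texture != boundTexture)
            flush();              // state change: submit what we have so far
        boundTexture = texture;
        vertices.insert(vertices.end(), corners, corners + 4);
    }

    void flush() {
        if (vertices.empty()) return;
        glBindTexture(GL_TEXTURE_2D, boundTexture);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferSubData(GL_ARRAY_BUFFER, 0, vertices.size() * sizeof(Vertex), vertices.data());
        // one indexed draw for all quads cached since the last state change
        glDrawElements(GL_TRIANGLES, (GLsizei)(vertices.size() / 4) * 6, GL_UNSIGNED_SHORT, 0);
        vertices.clear();
    }
};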
I would recommend doing it like DXUT does: it takes the rects from each element and renders them with a single universal method that takes an element as a parameter, where each element contains a rect and each control can have many elements. It adds the four points of the rect to a buffer in a specific order in STREAM_DRAW mode, with a constant index buffer. This does draw each rect individually, but performance is not completely vital here, because the geometry is simple, and when you are in a dialog you can usually put the rendering of the 3D scene on the back burner. EDIT: even when using this for HUD items, it has a negligible performance penalty.
This is a simple and organized way to do it. It works well with textures, and there are only two shaders: one for drawing textured components and one for non-textured. Then there is a special way to do text.
If you want to see how I did it, you can look at this:
https://github.com/kevinmackenzie/ObjGLUF
It is in GLUFGui.h/.cpp

Texture lookup into rendered FBO is off by half a pixel

I have a scene that is rendered to texture via FBO and I am sampling it from a fragment shader, drawing regions of it using primitives rather than drawing a full-screen quad: I'm conserving resources by only generating the fragments I'll need.
To test this, I am issuing exactly the same geometry as in my texture render, which means that the rasterization pattern produced should be exactly the same: when my fragment shader looks up its texture with the varying coordinate it was given, it should match up perfectly with the other values it was given.
Here's how I'm giving my fragment shader the coordinates to auto-texture the geometry with my fullscreen texture:
// Vertex shader
uniform mat4 proj_modelview_mat;
in vec2 in_pos;
out vec2 f_sceneCoord;
void main(void) {
    gl_Position = proj_modelview_mat * vec4(in_pos, 0.0, 1.0);
    f_sceneCoord = (gl_Position.xy + vec2(1, 1)) * 0.5;
}
I'm working in 2D so I didn't concern myself with the perspective divide here. I just set the sceneCoord value using the clip-space position scaled back from [-1,1] to [0,1].
uniform sampler2D scene;
in vec2 f_sceneCoord;
//in vec4 gl_FragCoord;
in float f_alpha;
out vec4 out_fragColor;
void main (void) {
    //vec4 color = texelFetch(scene, ivec2(gl_FragCoord.xy - vec2(0.5, 0.5)), 0);
    vec4 color = texture(scene, f_sceneCoord);
    if (color.a == f_alpha) {
        out_fragColor = vec4(color.rgb, 1);
    } else
        out_fragColor = vec4(1, 0, 0, 1);
}
Notice I output a red fragment if the alphas don't match up. The texture render sets the alpha of each rendered object to a specific index so I know what matches up with what.
Sorry I don't have a picture to show, but it's very clear that my pixels are off by (0.5, 0.5): I get a thin, one-pixel red border around my objects, on their bottom and left sides, that pops in and out. It looks quite "transient". The giveaway is that it only shows up on the bottom and left sides of objects.
Notice the line I commented out, which uses texelFetch: that method works, and I no longer get my red fragments showing up. However, I'd like to get this working with texture and normalized texture coordinates, because I think more hardware will support that. Perhaps the real question is: is it possible to get this right without sending in my viewport resolution via a uniform? There's gotta be a way to avoid that!
Update: I tried shifting the texture access by half a pixel, a quarter of a pixel, a hundredth of a pixel; it all made it worse and produced a solid border of wrong values all around the edges. It seems like my (gl_Position.xy + vec2(1,1)) * 0.5 trick sets the right values, but sampling is just off by a tiny amount somehow. This is quite strange: the red fragments shimmer in and out ever so slightly when objects are in motion, which means the alpha values I set aren't matching up perfectly on those pixels.
It's not critical for me to get pixel perfect accuracy for that alpha-index-check for my actual application but this behavior is just not what I expected.
Well, first consider dropping that f_sceneCoord varying and just using gl_FragCoord / screenSize as the texture coordinate (you already have this in your example, but the -0.5 is rubbish), with screenSize being a uniform (maybe pre-divided, i.e. 1.0/screenSize). This should be almost exact, because by default gl_FragCoord is at the pixel center (meaning i+0.5) and OpenGL returns exact texel values when sampling the texture at a texel center ((i+0.5)/textureSize).
This may still introduce very, very slight deviations from exact texel values (if any) due to finite precision and such. But then again, you will likely want to use a filtering mode of GL_NEAREST for such one-to-one texture-to-screen mappings anyway. Actually, your existing f_sceneCoord approach may already work well, and it may just be those small rounding issues, which GL_NEAREST would prevent, that create your artefacts. But then again, you still don't need that f_sceneCoord thing.
EDIT: Regarding the portability of texelFetch: that function was introduced with GLSL 1.30 (~SM4/GL3/DX10 hardware, ~GeForce 8), I think. But that version is already required by the new in/out syntax you're using (in contrast to the old varying/attribute syntax). So if you're not going to change that, assuming texelFetch is available is absolutely no problem, and it might even be slightly faster than texture (which also requires GLSL 1.30, in contrast to the old texture2D) by circumventing filtering completely.
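A minimal fragment shader sketch of the gl_FragCoord approach (screenSize is an assumed uniform holding the viewport size in pixels):
uniform sampler2D scene;
uniform vec2 screenSize;
out vec4 out_fragColor;
void main() {
    // gl_FragCoord sits at pixel centers (i + 0.5), so this lands exactly on the
    // texel centers of a texture that is the same size as the viewport
    vec2 uv = gl_FragCoord.xy / screenSize;
    out_fragColor = texture(scene, uv);
}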
If you are working in perfect X,Y [0,1] with no rounding errors, that's great. But sometimes, especially when working with polar coordinates, you might consider aligning your calculated coordinates to the texture 'grid'...
I use:
// align it to the nearest centered texel
curPt -= mod(curPt, (0.5 / vec2(imgW, imgH)));
works like a charm and I no longer get random rounding errors at the screen edges...

Is it possible to thicken a quadratic Bézier curve using the GPU only?

I draw lots of quadratic Bézier curves in my OpenGL program. Right now, the curves are one-pixel thin and software-generated, because I'm at a rather early stage, and it is enough to see what works.
Simply enough, given 3 control points (P0 to P2), I evaluate the following equation with t varying from 0 to 1 (with steps of 1/8) in software and use GL_LINE_STRIP to link them together:
B(t) = (1 - t)^2 P0 + 2(1 - t)t P1 + t^2 P2
Where B, obviously enough, results in a 2-dimensional vector.
This approach worked 'well enough', since even my largest curves don't need much more than 8 steps to look curved. Still, one pixel thin curves are ugly.
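For concreteness, the CPU-side evaluation described above might look like this (a sketch; the types and names are illustrative):
struct Vec2 { float x, y; };

// B(t) = (1 - t)^2 P0 + 2(1 - t)t P1 + t^2 P2
Vec2 quadraticBezier(Vec2 p0, Vec2 p1, Vec2 p2, float t)
{
    float u = 1.0f - t;
    return { u*u*p0.x + 2.0f*u*t*p1.x + t*t*p2.x,
             u*u*p0.y + 2.0f*u*t*p1.y + t*t*p2.y };
}
Evaluating this at t = 0, 1/8, ..., 1 and linking the nine points with GL_LINE_STRIP reproduces the approach above.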
I wanted to write a GLSL shader that would accept control points and a uniform thickness variable to, well, make the curves thicker. At first I thought about making a pixel shader only, that would color only pixels within a thickness / 2 distance of the curve, but doing so requires solving a third degree polynomial, and choosing between three solutions inside a shader doesn't look like the best idea ever.
I then tried to look up whether other people had already done it. I stumbled upon a white paper by Loop and Blinn from Microsoft Research where they show an easy way of filling the area under a curve. While it works well to that extent, I'm having trouble adapting the idea to drawing between two bounding curves.
Finding bounding curves that match a single curve is rather easy with a geometry shader. The problems come with the fragment shader that should fill the whole thing. Their approach uses the interpolated texture coordinates to determine if a fragment falls over or under the curve; but I couldn't figure a way to do it with two curves (I'm pretty new to shaders and not a maths expert, so the fact I didn't figure out how to do it certainly doesn't mean it's impossible).
My next idea was to separate the filled curve into triangles and only use the Bézier fragment shader on the outer parts. But for that I need to split the inner and outer curves at variable spots, and that means again that I have to solve the equation, which isn't really an option.
Are there viable algorithms for stroking quadratic Bézier curves with a shader?
This partly continues my previous answer, but is actually quite different since I got a couple of central things wrong in that answer.
To allow the fragment shader to only shade between two curves, two sets of "texture" coordinates are supplied as varying variables, to which the technique of Loop-Blinn is applied.
varying vec2 texCoord1, texCoord2;
varying float insideOutside;
varying vec4 col;
void main()
{
    float f1 = texCoord1[0] * texCoord1[0] - texCoord1[1];
    float f2 = texCoord2[0] * texCoord2[0] - texCoord2[1];
    float alpha = (sign(insideOutside * f1) + 1.0) * (sign(-insideOutside * f2) + 1.0) * 0.25;
    gl_FragColor = vec4(col.rgb, col.a * alpha);
}
So far, easy. The hard part is setting up the texture coordinates in the geometry shader. Loop-Blinn specifies them for the three vertices of the control triangle, and they are interpolated appropriately across the triangle. But, here we need to have the same interpolated values available while actually rendering a different triangle.
The solution to this is to find the linear function mapping from (x,y) coordinates to the interpolated/extrapolated values. Then, these values can be set for each vertex while rendering a triangle. Here's the key part of my code for this part.
vec2[3] tex = vec2[3]( vec2(0,0), vec2(0.5,0), vec2(1,1) );
mat3 uvmat;
uvmat[0] = vec3(pos2[0].x, pos2[1].x, pos2[2].x);
uvmat[1] = vec3(pos2[0].y, pos2[1].y, pos2[2].y);
uvmat[2] = vec3(1, 1, 1);
mat3 uvInv = inverse(transpose(uvmat));
vec3 uCoeffs = vec3(tex[0][0], tex[1][0], tex[2][0]) * uvInv;
vec3 vCoeffs = vec3(tex[0][1], tex[1][1], tex[2][1]) * uvInv;
float[3] uOther, vOther;
for (int i = 0; i < 3; i++) {
    uOther[i] = dot(uCoeffs, vec3(pos1[i].xy, 1));
    vOther[i] = dot(vCoeffs, vec3(pos1[i].xy, 1));
}
insideOutside = 1;
for (int i = 0; i < gl_VerticesIn; i++) {
    gl_Position = gl_ModelViewProjectionMatrix * pos1[i];
    texCoord1 = tex[i];
    texCoord2 = vec2(uOther[i], vOther[i]);
    EmitVertex();
}
EndPrimitive();
Here pos1 and pos2 contain the coordinates of the two control triangles. This part renders the triangle defined by pos1, but with texCoord2 set to the translated values from the pos2 triangle. Then the pos2 triangle needs to be rendered, similarly. Then the gap between these two triangles at each end needs to filled, with both sets of coordinates translated appropriately.
The calculation of the matrix inverse requires either GLSL 1.50 or it needs to be coded manually. It would be better to solve the equations for the mapping directly, without calculating the inverse. Either way, I don't expect this part to be particularly fast in the geometry shader.
You should be able to use technique of Loop and Blinn in the paper you mentioned.
Basically you'll need to offset each control point in the normal direction, both ways, to get the control points for two curves (inner and outer). Then follow the technique in Section 3.1 of Loop and Blinn - this breaks up sections of the curve to avoid triangle overlaps, and then triangulates the main part of the interior (note that this part requires the CPU). Finally, these triangles are filled, and the small curved parts outside of them are rendered on the GPU using Loop and Blinn's technique (at the start and end of Section 3).
An alternative technique that may work for you is described here:
Thick Bezier Curves in OpenGL
EDIT:
Ah, you want to avoid even the CPU triangulation - I should have read more closely.
One issue you have is the interface between the geometry shader and the fragment shader - the geometry shader will need to generate primitives (most likely triangles) that are then individually rasterized and filled via the fragment program.
In your case, with constant thickness, I think quite a simple triangulation will work, using Loop and Blinn for all the "curved bits". When the two control triangles don't intersect it's easy. When they do, the part outside the intersection is easy. So the only hard part is within the intersection (which should be a triangle).
Within the intersection you want to shade a pixel only if both control triangles lead to it being shaded via Loop and Blinn. So the fragment shader needs to evaluate the Loop-Blinn test for both triangles: one can be done as standard, and for the other you'll need to add a vec2 varying variable for the second set of texture coordinates, set appropriately for each vertex of the triangle. Then you just shade fragments that satisfy the checks for both control triangles (within the intersection).
I think this works in every case, but it's possible I've missed something.
I don't know exactly how to solve this, but it's very interesting. I think you need every different processing unit in the GPU:
Vertex shader
Feed a straight line of points to your vertex shader and let the vertex shader displace the points onto the Bézier curve.
Geometry shader
Let your geometry shader create an extra point per vertex.
foreach (point p in bezierCurve)
    new point(p + (0, thickness, 0)) // offset in the direction tangent to p1-p2
Fragment shader
To stroke your Bézier with a special stroke style, you can use a texture with an alpha channel. Check the alpha channel's value: if it's zero, discard the fragment. This way, you can still make the system think it is a solid line instead of a half-transparent one, and you can apply patterns via the alpha channel.
I hope this helps you on your way. You will have to figure out a lot yourself, but I think the geometry shader approach will speed your Bézier rendering up.
Still, for the stroking I would stick with my choice of creating a GL_QUAD_STRIP and an alpha-channel texture.