I am looking into using a VBO instead of immediate mode for performance reasons. I am creating a 2D orthographic scene filled with sprites. I do not want to draw sprites that are off-screen. I do this by checking their position against the screen size and position of the camera.
In immediate mode this is simple; there is a draw method for each sprite. Using a VBO this seems non-trivial; I render an entire section of a VBO at one time. There would be no way for me (that I can think of) to opt out of rendering sprites that are off-screen.

I'll just assume that you do indeed animate the sprites on the CPU, because that's the only thing that makes sense in light of your question (otherwise, how would you draw them in immediate mode initially, and how would you skip drawing some?).
AGP/PCIe behaves much like a hard disk from a performance point of view. Bandwidth is huge, but access time is quite noticeable. In other words, doing a transfer at all is painful, but once you do it, a few kilobytes more don't really make any difference. Uploading 500 sprites costs about the same as uploading 1000 sprites.
Since you animate the sprites on the CPU, you already must do one transfer (glBufferSubData or glMapBuffer/glUnmapBuffer) every frame; there is no other way.
Be sure to use a "fresh" buffer, e.g. by applying the glBufferData(null) idiom (calling glBufferData with a null data pointer before writing, also known as buffer orphaning). This avoids pipeline stalls by allowing OpenGL to continue using (drawing from) the old buffer while giving you a different buffer (without you knowing) at the same time. Later, when it is done drawing, it just secretly flips buffers and throws the old one away. That way, you achieve good parallelism (which is key to performance and much more important than culling a few thousand vertices).
Also, graphics cards are reasonably good at culling geometry (this includes discarding entire triangles that are off-screen before fragments are generated). Hundreds? Thousands? Hundreds of thousands? No issue. Let the graphics card do it.
Unless you have a million sprites of which half are visible at any time and the other half are not, writing the entire buffer sequentially and without branches is likely to be not just as fast, but even faster, due to cache and pipeline effects.
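For comparison, here is a minimal sketch of what the CPU side of either approach might look like. It only builds the flat vertex list that would be uploaded each frame; the `Sprite` fields, the function names, and the `cull` flag are all hypothetical, and no OpenGL calls are made.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sprite:
    x: float  # left edge in world units (assumed layout)
    y: float  # bottom edge in world units
    w: float
    h: float


def is_visible(s: Sprite, cam_x: float, cam_y: float,
               view_w: float, view_h: float) -> bool:
    """Axis-aligned overlap test between a sprite and the camera rectangle."""
    return (s.x < cam_x + view_w and s.x + s.w > cam_x and
            s.y < cam_y + view_h and s.y + s.h > cam_y)


def build_vertex_data(sprites: List[Sprite], cam_x: float, cam_y: float,
                      view_w: float, view_h: float,
                      cull: bool = True) -> Tuple[List[float], int]:
    """Pack two triangles (6 vertices, x/y only) per sprite into one flat list.

    Returns the float list and the number of vertices it contains.
    """
    data: List[float] = []
    for s in sprites:
        if cull and not is_visible(s, cam_x, cam_y, view_w, view_h):
            continue  # skipped sprites simply never reach the buffer
        x0, y0, x1, y1 = s.x, s.y, s.x + s.w, s.y + s.h
        data += [x0, y0, x1, y0, x1, y1,   # first triangle
                 x0, y0, x1, y1, x0, y1]   # second triangle
    return data, len(data) // 2
```

The returned vertex count is what would be passed to the draw call after the upload. With `cull=False` the loop has no per-sprite branch, which corresponds to the "just write everything" suggestion above; which variant is faster would have to be measured.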


Efficiently update Uniform Buffer Objects with instancing and culling

I've successfully updated my rendering engine to use uniform buffer objects and instancing.
The problem is that, since I do a first frustum culling pass every frame in order to know the objects I need to draw, I have to update the buffers every frame because the objects I draw could change every time, and this isn't the most efficient thing.
How could I make this more efficient?
The only thing I could think of is to skip the frustum culling so that all the buffers remain static and I don't need to update them all the time, but without frustum culling I'd end up drawing a lot of unnecessary objects.
Updating uniform buffers is fairly cheap, to be honest. You are quite limited in size and that prevents you from doing anything too crazy.
What you need to focus on to make this efficient is actually accommodating incomplete commands that are queued up. You are more likely to run into problems where the driver/GPU is forced to stop working on the next frame/command due to poor data write patterns than you are to run into data transfer rate limitations. The problem is always going to be avoiding situations where you might write to portions of data that are still in use by the GPU (it is often working on data 1-2 frames behind the CPU).
You have multiple options depending on your target version, and the OpenGL Wiki has a general overview of buffer streaming approaches.
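As a rough illustration of one such option (explicit multiple buffering), the sketch below rotates through equal regions of a single buffer so that the CPU never writes the region the GPU may still be reading. The class name, the choice of three regions, and the region size are assumptions made for the example; real code would still need a synchronization check (for instance a fence) before reusing a region, and the offsets would have to respect the implementation's uniform buffer offset alignment.

```python
class RingRegions:
    """Splits one buffer into equal regions and hands out a different one each frame."""

    def __init__(self, region_size: int, frames_in_flight: int = 3):
        # frames_in_flight = 3 is an assumption based on the GPU running
        # 1-2 frames behind the CPU.
        self.region_size = region_size
        self.frames_in_flight = frames_in_flight
        self.frame = 0

    @property
    def total_size(self) -> int:
        """Size in bytes the underlying buffer must be allocated with."""
        return self.region_size * self.frames_in_flight

    def begin_frame(self) -> int:
        """Return the byte offset to write this frame's data at, then advance."""
        offset = (self.frame % self.frames_in_flight) * self.region_size
        self.frame += 1
        return offset
```

With `RingRegions(256)`, successive frames would write at offsets 0, 256, 512 and then wrap back to 0; the value 256 is only a placeholder.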
You will have to do some performance testing to say for sure, but I suspect that CPU-side frustum culling combined with buffer orphaning of your instance UBO will give good results. Rather than reusing any data from previous frames, you would just stream the entire instance UBO from CPU to GPU each frame and let the GPU discard the old UBO when it finishes each frame.
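The CPU-side half of that suggestion could be sketched roughly as follows: test each object's bounding sphere against the frustum planes and append the per-instance data of the survivors to one contiguous list, which would then be streamed to the freshly orphaned buffer. The plane representation, the tuple layout of `objects`, and the function names are assumptions for illustration, and the planes are assumed to be normalized with normals pointing inward.

```python
from typing import List, Sequence, Tuple

# (a, b, c, d): a point is inside when a*x + b*y + c*z + d >= 0
Plane = Tuple[float, float, float, float]


def sphere_in_frustum(center: Sequence[float], radius: float,
                      planes: Sequence[Plane]) -> bool:
    """Conservative test: reject only if the sphere lies fully behind some plane."""
    x, y, z = center
    for a, b, c, d in planes:
        if a * x + b * y + c * z + d < -radius:
            return False
    return True


def pack_visible_instances(objects, planes: Sequence[Plane]) -> List[float]:
    """Concatenate the instance data of all visible objects into one flat list.

    Each object is (center, radius, instance_data), where instance_data is a
    sequence of floats (for example a 4x4 matrix flattened to 16 values).
    """
    packed: List[float] = []
    for center, radius, instance_data in objects:
        if sphere_in_frustum(center, radius, planes):
            packed.extend(instance_data)
    return packed
```

The instance count for the draw call would be the length of the packed list divided by the number of floats per instance.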

Performance of GL_POINTS on modern hardware

Is there any difference in performance between drawing a scene with full triangles (GL_TRIANGLES) instead of just drawing their vertices (GL_POINTS), on modern hardware?
Where GL_POINTS is initialized like this:
I have a somewhat low-end graphics card (9600 GT) and drawing vertices only can bring a 2x fps increase in certain scenes. I am not sure whether this also applies to more recent GPUs.
2x fps increase on
You lose 98% of the picture and get only a 2x fps increase. That's not impressive. If you take into account that you should be able to easily render 300..500 fps on any decent hardware (with vsync disabled and minor optimizations), that's probably not worth it.
Is there any difference in performance between drawing a scene with full triangles (GL_TRIANGLES) instead of just drawing their vertices (GL_POINTS), on modern hardware?
Well, if your scene has a LOT of alpha-blending and very "heavy" pixel shaders, then, obviously, displaying the scene as a point cloud will speed things up, because there are fewer pixels to fill.
On the other hand, this kind of "optimization" will be completely useless for any practical task. I mean, if you're using blending and shaders, you probably wouldn't want to display your scene as a point list in the first place, unless you're doing some kind of debug render (using glPolygonMode), and in the case of a debug render, you'll probably turn shaders off (because a shaded/lit point will be hard to see) and disable lighting.
Even if you're using point sprites as particles or something, I'd stick with triangles: they give more control and do not have a maximum size limit (unlike point sprites).
I can display more objects?
If you want more objects, you should probably try to optimize things elsewhere first. If you stop trying to draw invisible objects (outside the field of view, etc.), that'll be a start that can improve performance.
you have a mesh which is very far away from the camera. 1 million triangles and you know it is always in view. At this density ratio, triangles can't be bigger than a pixel,
When triangles are smaller than a pixel, and there are many of them, your mesh starts looking like garbage and turns into a pixelated mess of points. It will be ugly. It is roughly the same effect as when you disable mipmapping and texture filtering and then render a checkerboard pattern. Using points instead of triangles might even aggravate the effect.
If you have a mesh of 1 million triangles that is always visible, you already need a different kind of optimization. Reduce the number of triangles (level of detail, dynamic tessellation or some solution that can simplify geometry on the fly), use bump mapping (maybe parallax mapping) to simulate extra geometry details that aren't even there, or even turn it into a static background or a sprite. That'll work much better. Trying to render it using points will simply make it look ugly.
No. If the number of triangles is similar to the number of their shared vertices (assuming the glDrawElements rendering command is used), the geometry part of the rendering pipeline will be evaluated at roughly the same speed in both modes. The only benefit you can get from drawing GL_POINTS comes from the percentage of screen space left empty by not drawing faces, so it exists only at the fragment shader level.

OpenGL: Is it more efficient to use GL_QUADS or GL_TRIANGLES?

I know that OpenGL deprecated and got rid of GL_QUADS in the newer releases. I have heard this is due to the fact that modern GPUs only render with triangles so calling a quad would just make the GPU work harder to break it into two triangles (what I have heard anyway, I am not much of an expert on any of this topic).
I was wondering whether or not it is better (assuming the average person's CPU is faster, relatively, than their GPU) to just manually break the rendering of quads into two triangles yourself or to just let the GPU do it itself. Again, I have absolutely no real experience with OpenGL as I am just starting. I would rather know which is better for most machines these days so I could focus my attention on either rendering method*. Thanks.
*Yet I will probably utilize the 'triangle method' for the sake of it.
Even if you feed OpenGL quads, the triangulation is done by the driver on the CPU side before it even hits the GPU. Modern GPUs eat nothing except triangles. (Well, and lines and points.) So something will be triangulating, whether it's you or the driver; it doesn't matter too much where it happens.
This would be less efficient if, say, you don't reuse your vertex buffers, and instead refill them anew every time with quads (in which case the driver will have to retriangulate every vertex buffer), instead of refilling them with pretriangulated triangles every time, but that's pretty contrived (and the problem you should be fixing in that case is just the fact you're refilling your vertex buffers).
I would say, if you have the choice, stick with triangles, since that's what most content pipelines put out anyway, and you're less likely to run into problems with non-planar quads and the like. If you get to choose what format your content comes in, then use triangles for sure, and the triangulation step gets skipped altogether.
Any geometry can be represented with triangles, and that is why it was decided to use triangles instead of quads. Another reason is that two triangles do not have to be coplanar, which is not true of a quad.
Yes, you can choose to render quads, but the driver will convert each quad into two triangles.
Therefore, choosing to render quads will not make the GPU work less, but it will make your CPU work more, because it has to do the conversion.
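Doing the split yourself is cheap if indexed drawing is used. A minimal sketch, assuming each quad's four vertices are stored consecutively in perimeter order (the function name and that layout are assumptions for the example):

```python
from typing import List


def quads_to_triangle_indices(quad_count: int) -> List[int]:
    """Build an index list that draws each quad as two triangles.

    Quad q is assumed to occupy vertices 4q .. 4q+3 in perimeter order,
    so it is split along the diagonal from vertex 0 to vertex 2.
    """
    indices: List[int] = []
    for q in range(quad_count):
        base = 4 * q
        indices += [base, base + 1, base + 2,   # first triangle
                    base, base + 2, base + 3]   # second triangle
    return indices
```

The index list depends only on the number of quads, not on their positions, so it could be built once and reused while the vertex data changes.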

2D engine with OpenGL: Use Z buffer or own implementation for sprite sorting?

If I was making a 3D engine, the answer to this question would be clear: I'd go for using the depth buffer instead of thinking of sorting all my polygons on my own.
However, this is a different situation with 2D, because here layers can be implemented easily without the help of OpenGL - and you then could even sort and move sprites within layers. (Which isn't possible in OpenGL afaik)
(Why) should I use the OpenGL depth buffer instead of a C++ layer system running on the CPU?
How much slower would the depth buffer version be?
It is clear to me that making a layer system in C++ would impose as good as no performance impact at all, as I have to iterate over the sprites for rendering in any case.
I would suggest doing it in software, since you probably want to use transparency on your sprites, and that implies you render them from back to front. Also, sorting a couple of sprites shouldn't be that CPU demanding.
Use both, if you can.
Depth information is nice for post-processing and stuff like 3D-glasses, so you shouldn't throw it away. These kinds of effects can be very nice for 2D games.
Also, if you draw your (opaque) layers front to back, you can save fill-rate because the Z-Buffer can do the clipping for you (Depth tests are faster than actual drawing).
Depth testing is usually almost free, especially when you have hierarchical Z info. Because of this and the fill-rate savings, using depth testing will probably be even faster.
On the other hand, the software sorting is nice so you can actually do front to back rendering for opaque sprites and it's mandatory to do alpha-blending right (of course, you draw these sprites back to front).
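The combined ordering described above might be sketched like this: opaque sprites first, nearest to farthest, then blended sprites, farthest to nearest. The `Sprite` fields and the convention that a larger `depth` means farther away are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sprite:
    name: str
    depth: float   # larger = farther from the viewer (assumed convention)
    opaque: bool


def draw_order(sprites: List[Sprite]) -> List[Sprite]:
    """Opaque sprites front to back, then blended sprites back to front."""
    opaque = sorted((s for s in sprites if s.opaque),
                    key=lambda s: s.depth)
    blended = sorted((s for s in sprites if not s.opaque),
                     key=lambda s: s.depth, reverse=True)
    return opaque + blended
```

The opaque pass would run with depth writes enabled so that hidden pixels are rejected early; the blended pass would typically keep the depth test on but rely on the sort for correct compositing.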
Direct answers:
Allowing the GPU to use the depth buffer would let you dynamically adjust the draw order of things without any on-CPU shuffling, and would free you from having to assign things to different layers in situations where doing so is a bit of a fiction; for example, you could have effects like projectiles that come from the background towards and then in front of the player, without having to figure out which layer to assign them to all the time.
On the GPU, the use of a depth buffer would have no measurable effect, even if you're on an embedded chip, a plug-in card from more than a decade ago or an integrated part; depth buffers are so fundamental to modern GPUs that they've been optimised down to costing nothing in practical terms.
However, I'd imagine you actually want to do it on the CPU for the simple reason of treating transparency correctly. A depth buffer stores one depth per pixel, so if you draw a transparent object close to the viewer and then attempt to draw something behind it, the thing behind won't be drawn even though it should be visible. In a 2D game it's likely that anti-aliasing will give your sprites partially transparent edges; if you submit drawing to the GPU in draw order then your partial transparencies will always be composited correctly. If you leave the z-buffer to do it then you risk weird-looking fringing.

How to implement independent rendering layers in Direct3D9?

I'm working on a windowed Direct3D data plotting application that needs to display multiple overlays on top of the data (similar to HUDs in games). Since there could be a large amount of data that needs plotting, and not all overlays will be changed every time, I figured it wouldn't be a good idea to replot vertices when only one overlay in the display changes.
This led me to the idea of rendering the textures and vertices of the overlays to multiple textures with transparent backgrounds that could be overlaid in the render loop and updated independently (similar to layers in Photoshop).
Before I embark on changing a large portion of this program to render to textures as opposed to surfaces, I was just wondering if using textures is the best approach.
RTT works well; I used it in a game I did recently. Each scene (scene refers to layer: "HUD" was a scene, "Main" was the main scene, etc.) was rendered onto a texture, then each texture was rendered onto a quad, sorted back to front (for alpha blending). I chose this over just rendering the scenes directly onto the back buffer because it allowed me to do post-processing.
For your caching purposes this seems to be the best way to go, but just be aware that the textures can eat memory quickly, and sometimes it's just better to render everything again, making sure you sort back to front.
Render to texture will certainly work and could be a good route but it is probably overkill. Modern 3D hardware is very fast and I'd suggest you verify whether performance is really an issue re-rendering when you need an update before investing significant time making major changes to your program.
If performance is an issue your time might be better spent optimizing the code that renders your plot since that will benefit updates that involve changes to the data as well as those that just change an overlay. I'm a graphics programmer for games and generally with realtime 3D you want to focus your optimization efforts on your worst case (you have to redraw everything) rather than your best (only one overlay needs an update).
Rendering to texture render-target surfaces is a very good idea and can be used for a lot of things, e.g. optimization/caching, but beware of the blend operation with regular alpha (a*c1 + (1-a)*c2). If # is the ARGB blend, then (l1#l2)#l3 != l1#(l2#l3); i.e. it is not associative, so pre-compositing some layers into an intermediate texture changes the result. By using pre-multiplied alpha in all textures/layers, the blend operation can be made associative (the back-to-front order of the layers still matters).
The ultimate reference is the Porter/Duff article "Compositing Digital Images" from 1984.
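To make the premultiplied-alpha point concrete, here is a small sketch of the "over" operator on premultiplied RGBA tuples. The function names and the tuple representation are assumptions for illustration; the formula itself is the standard premultiplied form, result = top + (1 - top_alpha) * bottom, applied to all four channels.

```python
from typing import Tuple

# Premultiplied color: r, g, b are already scaled by a.
RGBA = Tuple[float, float, float, float]


def premultiply(r: float, g: float, b: float, a: float) -> RGBA:
    """Convert a straight-alpha color to premultiplied form."""
    return (r * a, g * a, b * a, a)


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """'Over' for premultiplied colors: top + (1 - top_alpha) * bottom."""
    k = 1.0 - top[3]
    return tuple(t + k * b for t, b in zip(top, bottom))
```

With this form, grouping does not matter: compositing two overlay layers into a cached texture first and then placing that over the data layer gives the same result as compositing all three in sequence, up to floating-point rounding. Swapping which layer is on top still changes the result, so the layers must keep their back-to-front order.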