Average triangles in DirectX 10 - c++

So I basically want to check some information for my project.
I have GTX 460 video card. I wrote DX10 program with 20k triangles printed on the screen and now I get 28 FPS in Release build. All those triangles call DrawIndexed inside them so this is ofcourse an overhead in calling so much draws.
But anyway, I would like to know: how much triangles could I draw on the screen with those capabilities and at which FPS? I think 20k triangles is not even nearly enough to load some good models on game scene.
Sorry for my terrible english.

Sounds like you are creating a single draw call per triangle primitive, this is very bad, hence the horrid FPS, you should aim to draw as many triangles as possible per draw call, this can be done in a few ways:
Profile your code, both nVidia and AMD have free to to you you find why your code is slow, allowing you to focus where it really matters, so use them.
Index buffers & triangle strips to reduce bandwidth
Grouping of verts by material type/state/texture to improve batching
Instancing of primitive groups: draw multiple models/meshs in one call
Remove as much redundant state change (setting of shaders, textures, buffers, paramters) as possible, this goes hand-in-hand with the group mentioned earlier
The DX SDK will have examples of implementing each of these. The exact amount for triangles you can draw and a decent FPS(either 30 or 60 if you want vsync) varies greatly depending on the complexity of shading the triangles, however, if draw most simply, you should be able to push a few million with ease.
I would recommend taking a good look at the innards of an open source DX11 (not many DX10 projects exists, but the API is almost identical) engine, such as heiroglyph 3, and going through the SDK tutorials.
There are also quite a few presentations on increasing performance with DX10, but profile your code before diving into the suggestions full-on, here are a few from the hardware vendors themselves (color coded hints for nVidia vs AMD hardware):
GDC '08
GDC '09

Related

drawing time series of millions of vertices using OpenGL

I'm working on a data visualisation application where I need to draw about 20 different time series overlayed in 2D, each consisting of a few million data points. I need to be able to zoom and pan the data left and right to look at it and need to be able to place cursors on the data for measuring time intervals and to inspect data points. It's very important that when zoomed out all the way, I can easily spot outliers in the data and zoom in to look at them. So averaging the data can be problematic.
I have a naive implementation using a standard GUI framework on linux which is way too slow to be practical. I'm thinking of using OpenGL instead (testing on a Radeon RX480 GPU), with orthogonal projection. I searched around and it seems VBOs to draw line strips might work, but I have no idea if this is the best solution (would give me the best frame rate).
What is the best way to send data sets consisting of millions of vertices to the GPU, assuming the data does not change, and will be redrawn each time the user interacts with it (pan/zoom/click on it)?
In modern OpenGL (versions 3/4 core profile) VBOs are the standard way to transfer geometric / non-image data to the GPU, so yes you will almost certainly end up using VBOs.
Alternatives would be uniform buffers, or texture buffer objects, but for the application you're describing I can't see any performance advantage in using them - might even be worse - and it would complicate the code.
The single biggest speedup will come from having all the data points stored on the GPU instead of being copied over each frame as a 2D GUI might be doing. So do the simplest thing that works and then worry about speed.
If you are new to OpenGL, I recommend the book "OpenGL SuperBible" 6th or 7th edition. There are many good online OpenGL guides and references, just make sure you avoid the older ones written for OpenGL 1/2.
Hope this helps.

How can I draw to the display, without OpenGL?

I've been learning OpenGL, and as I sit trying to write my VBOs, PBOs, VAOs, textures, quads, bindings, fragment shaders, vertex shaders, and a whole suite of other modern abstractions upon abstractions built after decades of evolution, I wonder: Isn't the display nothing but a large block of memory?
I've heard of tales, that in the "good ol' days" (such as the Commodore 64), all you had to do was assign a value to an arbitrary byte in memory, and the screen would change a pixel. Extremely simple and elegant. In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display is several hundred feet away.
This begs the question, is it possible in the modern day to just "update a pixel of the screen"? Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels? This is an extremely broad question, but I'm curious. The answer I'm looking for to this question would provide a rough outline of what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen, as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
Isn't the display nothing but a large block of memory?
Yes, it is called a framebuffer.
I've heard of tales, that in the "good ol' days" (such as the Commodore 64)
Your current PC works like that right when you power it up! If you use the CPU to write into video memory, that is called a software renderer.
In the modern day, this has changed with layers upon layers of abstractions and safeguards, such that changing a pixel on your display is several hundred feet away.
No, they are not abstractions/safeguards for "changing pixels". Nowadays software renderers are not used anymore. Instead, you have to tell the GPU (which is another computer on its own) how to draw. That "talk" is what the APIs (like OpenGL) do for you.
Now, the GPUs are meant to be fast at drawing, and that requires specialized code and data structures. Those are all the things you mention: VBOs, PBOs, VAOs, shaders, etc. (in OpenGL parlance). There is no way around that, because GPUs are different hardware.
is it possible in the modern day to just "update a pixel of the screen"?
Yes, but that will end up being drawn somehow by the GPU, even if it looks to you like a memory write.
Is it possible to write my own graphics driver or something, where I can send commands to some C wrapper which interfaces with the GPU to change those pixels?
Yes, but that "C wrapper" is the graphics driver. A graphics driver for a modern GPU is very complex.
what you'd have to do in order to be able to arbitrarily get some C code to set a pixel on the screen
You cannot write a "C program" to write to a graphical screen because the C standard does not concern itself with graphical displays.
So it depends on your operating system, your hardware, whether you want 2D or 3D acceleration support, the API you choose...
as well as a rough outline of why OpenGL has progressed the way it has - what problems did VBOs, PBOs, VAOs, bindings, shaders, etc. solve, and how we got to where we are today.
See above.
You can make your own frame buffer - that is just an integer array - and do rasterization on it, then use for example the Windows GDI function SetBitmapBits() to draw it to the display in one go. The final draw-to-display command depends on the operating system.
How you do the rasterization on your framebuffer is completely up to you. You can use the CPU to draw individual pixels or rasterize lines and triangles, see for example this demo of my old CPU graphics engine using Windows GDI: https://youtu.be/GFzisvhtRS4.
Using the CPU is fine as long as you do not rasterize large datasets. From my experience, the limit to real-time 60fps rendering on the CPU is ~50k lines per frame.
If you want to rasterize really large datasets, you have to use a GPU in some way. Since the framebuffer is just an integer array, you can transfer it to/from the GPU using OpenCL or CUDA and on the GPU - if your dataset happens to already be in video memory - do all the rasterization extremely fast in parallel. For this you will need an additional z-buffer to decide which pixels to overdraw by occluding geometries. This way you can rasterize approximately 30 Million lines per frame at 60fps. This demo is rendered on the GPU in real time using OpenCL: https://youtu.be/lDsz2maaZEo
Is it possible in the modern day to just "update a pixel of the screen"?
Yes. In Windows for example, you can use SetPixel() to draw a pixel or BitBlt() to draw in bulk. See this Q/A
This works fine, but this means you're using the CPU for rendering and you'll find the GPU is much more effective for this task, especially if you require decent framerate and non-trivial graphics. The reason there's these "whole suite of other modern abstractions upon abstractions" is to serve as an interface to the GPU since it has an independent set of memory and totally different execution model. Other GPU libraries (OpenCL, DirectX, Vulkan, etc) all have the same kind of abstractions.
I've glossed over many nuances but I hope the point gets across.

What is state-of-the-art for text rendering in OpenGL as of version 4.1? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
The community reviewed whether to reopen this question last month and left it closed:
Original close reason(s) were not resolved
Improve this question
There are already a number of questions about text rendering in OpenGL, such as:
How to do OpenGL live text-rendering for a GUI?
But mostly what is discussed is rendering textured quads using the fixed-function pipeline. Surely shaders must make a better way.
I'm not really concerned about internationalization, most of my strings will be plot tick labels (date and time or purely numeric). But the plots will be re-rendered at the screen refresh rate and there could be quite a bit of text (not more than a few thousand glyphs on-screen, but enough that hardware accelerated layout would be nice).
What is the recommended approach for text-rendering using modern OpenGL? (Citing existing software using the approach is good evidence that it works well)
Geometry shaders that accept e.g. position and orientation and a character sequence and emit textured quads
Geometry shaders that render vector fonts
As above, but using tessellation shaders instead
A compute shader to do font rasterization
Rendering outlines, unless you render only a dozen characters total, remains a "no go" due to the number of vertices needed per character to approximate curvature. Though there have been approaches to evaluate bezier curves in the pixel shader instead, these suffer from not being easily antialiased, which is trivial using a distance-map-textured quad, and evaluating curves in the shader is still computationally much more expensive than necessary.
The best trade-off between "fast" and "quality" are still textured quads with a signed distance field texture. It is very slightly slower than using a plain normal textured quad, but not so much. The quality on the other hand, is in an entirely different ballpark. The results are truly stunning, it is as fast as you can get, and effects such as glow are trivially easy to add, too. Also, the technique can be downgraded nicely to older hardware, if needed.
See the famous Valve paper for the technique.
The technique is conceptually similar to how implicit surfaces (metaballs and such) work, though it does not generate polygons. It runs entirely in the pixel shader and takes the distance sampled from the texture as a distance function. Everything above a chosen threshold (usually 0.5) is "in", everything else is "out". In the simplest case, on 10 year old non-shader-capable hardware, setting the alpha test threshold to 0.5 will do that exact thing (though without special effects and antialiasing).
If one wants to add a little more weight to the font (faux bold), a slightly smaller threshold will do the trick without modifying a single line of code (just change your "font_weight" uniform). For a glow effect, one simply considers everything above one threshold as "in" and everything above another (smaller) threshold as "out, but in glow", and LERPs between the two. Antialiasing works similarly.
By using an 8-bit signed distance value rather than a single bit, this technique increases the effective resolution of your texture map 16-fold in each dimension (instead of black and white, all possible shades are used, thus we have 256 times the information using the same storage). But even if you magnify far beyond 16x, the result still looks quite acceptable. Long straight lines will eventually become a bit wiggly, but there will be no typical "blocky" sampling artefacts.
You can use a geometry shader for generating the quads out of points (reduce bus bandwidth), but honestly the gains are rather marginal. The same is true for instanced character rendering as described in GPG8. The overhead of instancing is only amortized if you have a lot of text to draw. The gains are, in my opinion, in no relation to the added complexity and non-downgradeability. Plus, you are either limited by the amount of constant registers, or you have to read from a texture buffer object, which is non-optimal for cache coherence (and the intent was to optimize to begin with!).
A simple, plain old vertex buffer is just as fast (possibly faster) if you schedule the upload a bit ahead in time and will run on every hardware built during the last 15 years. And, it is not limited to any particular number of characters in your font, nor to a particular number of characters to render.
If you are sure that you do not have more than 256 characters in your font, texture arrays may be worth a consideration to strip off bus bandwidth in a similar manner as generating quads from points in the geometry shader. When using an array texture, the texture coordinates of all quads have identical, constant s and t coordinates and only differ in the r coordinate, which is equal to the character index to render.
But like with the other techniques, the expected gains are marginal at the cost of being incompatible with previous generation hardware.
There is a handy tool by Jonathan Dummer for generating distance textures: description page
Update:
As more recently pointed out in Programmable Vertex Pulling (D. Rákos, "OpenGL Insights", pp. 239), there is no significant extra latency or overhead associated with pulling vertex data programmatically from the shader on the newest generations of GPUs, as compared to doing the same using the standard fixed function.
Also, the latest generations of GPUs have more and more reasonably sized general-purpose L2 caches (e.g. 1536kiB on nvidia Kepler), so one may expect the incoherent access problem when pulling random offsets for the quad corners from a buffer texture being less of a problem.
This makes the idea of pulling constant data (such as quad sizes) from a buffer texture more attractive. A hypothetical implementation could thus reduce PCIe and memory transfers, as well as GPU memory, to a minimum with an approach like this:
Only upload a character index (one per character to be displayed) as the only input to a vertex shader that passes on this index and gl_VertexID, and amplify that to 4 points in the geometry shader, still having the character index and the vertex id (this will be "gl_primitiveID made available in the vertex shader") as the sole attributes, and capture this via transform feedback.
This will be fast, because there are only two output attributes (main bottleneck in GS), and it is close to "no-op" otherwise in both stages.
Bind a buffer texture which contains, for each character in the font, the textured quad's vertex positions relative to the base point (these are basically the "font metrics"). This data can be compressed to 4 numbers per quad by storing only the offset of the bottom left vertex, and encoding the width and height of the axis-aligned box (assuming half floats, this will be 8 bytes of constant buffer per character -- a typical 256 character font could fit completely into 2kiB of L1 cache).
Set an uniform for the baseline
Bind a buffer texture with horizontal offsets. These could probably even be calculated on the GPU, but it is much easier and more efficient to that kind of thing on the CPU, as it is a strictly sequential operation and not at all trivial (think of kerning). Also, it would need another feedback pass, which would be another sync point.
Render the previously generated data from the feedback buffer, the vertex shader pulls the horizontal offset of the base point and the offsets of the corner vertices from buffer objects (using the primitive id and the character index). The original vertex ID of the submitted vertices is now our "primitive ID" (remember the GS turned the vertices into quads).
Like this, one could ideally reduce the required vertex bandwith by 75% (amortized), though it would only be able to render a single line. If one wanted to be able to render several lines in one draw call, one would need to add the baseline to the buffer texture, rather than using an uniform (making the bandwidth gains smaller).
However, even assuming a 75% reduction -- since the vertex data to display "reasonable" amounts of text is only somewhere around 50-100kiB (which is practically zero to a GPU or a PCIe bus) -- I still doubt that the added complexity and losing backwards-compatibility is really worth the trouble. Reducing zero by 75% is still only zero. I have admittedly not tried the above approach, and more research would be needed to make a truly qualified statement. But still, unless someone can demonstrate a truly stunning performance difference (using "normal" amounts of text, not billions of characters!), my point of view remains that for the vertex data, a simple, plain old vertex buffer is justifiably good enough to be considered part of a "state of the art solution". It's simple and straightforward, it works, and it works well.
Having already referenced "OpenGL Insights" above, it is worth to also point out the chapter "2D Shape Rendering by Distance Fields" by Stefan Gustavson which explains distance field rendering in great detail.
Update 2016:
Meanwhile, there exist several additional techniques which aim to remove the corner rounding artefacts which become disturbing at extreme magnifications.
One approach simply uses pseudo-distance fields instead of distance fields (the difference being that the distance is the shortest distance not to the actual outline, but to the outline or an imaginary line protruding over the edge). This is somewhat better, and runs at the same speed (identical shader), using the same amount of texture memory.
Another approach uses the median-of-three in a three-channel texture details and implementation available at github. This aims to be an improvement over the and-or hacks used previously to address the issue. Good quality, slightly, almost not noticeably, slower, but uses three times as much texture memory. Also, extra effects (e.g. glow) are harder to get right.
Lastly, storing the actual bezier curves making up characters, and evaluating them in a fragment shader has become practical, with slightly inferior performance (but not so much that it's a problem) and stunning results even at highest magnifications.
WebGL demo rendering a large PDF with this technique in real time available here.
http://code.google.com/p/glyphy/
The main difference between GLyphy and other SDF-based OpenGL renderers is that most other projects sample the SDF into a texture. This has all the usual problems that sampling has. Ie. it distorts the outline and is low quality. GLyphy instead represents the SDF using actual vectors submitted to the GPU. This results in very high quality rendering.
The downside is that the code is for iOS with OpenGL ES. I'm probably going to make a Windows/Linux OpenGL 4.x port (hopefully the author will add some real documentation, though).
The most widespread technique is still textured quads. However in 2005 LORIA developed something called vector textures, i.e. rendering vector graphics as textures on primitives. If one uses this to convert TrueType or OpenType fonts into a vector texture you get this:
http://alice.loria.fr/index.php/publications.html?Paper=VTM#2005
I'm surprised Mark Kilgard's baby, NV_path_rendering (NVpr), was not mentioned by any of the above. Although its goals are more general than font rendering, it can also render text from fonts and with kerning. It doesn't even require OpenGL 4.1, but it is a vendor/Nvidia-only extension at the moment. It basically turns fonts into paths using glPathGlyphsNV which depends on the freetype2 library to get the metrics, etc. Then you can also access the kerning info with glGetPathSpacingNV and use NVpr's general path rendering mechanism to display text from using the path-"converted" fonts. (I put that in quotes, because there's no real conversion, the curves are used as is.)
The recorded demo for NVpr's font capabilities is unfortunately not particularly impressive. (Maybe someone should make one along the lines of the much snazzier SDF demo one can find on the intertubes...)
The 2011 NVpr API presentation talk for the fonts part starts here and continues in the next part; it is a bit unfortunate how that presentation is split.
More general materials on NVpr:
Nvidia NVpr hub, but some material on the landing page is not the most up-to-date
Siggraph 2012 paper for the brains of the path-rendering method, called "stencil, then cover" (StC); the paper also explains briefly how competing tech like Direct2D works. The font-related bits have been relegated to an annex of the paper. There are also some extras like videos/demos.
GTC 2014 presentation for an update status; in a nutshell: it's now supported by Google's Skia (Nvidia contributed the code in late 2013 and 2014), which in turn is used in Google Chrome and [independently of Skia, I think] in a beta of Adobe Illustrator CC 2014
the official documentation in the OpenGL extension registry
USPTO has granted at least four patents to Kilgard/Nvidia in connection with NVpr, of which you should probably be aware of, in case you want to implement StC by yourself: US8698837, US8698808, US8704830 and US8730253. Note that there are something like 17 more USPTO documents connected to this as "also published as", most of which are patent applications, so it's entirely possible more patents may be granted from those.
And since the word "stencil" did not produce any hits on this page before my answer, it appears the subset of the SO community that participated on this page insofar, despite being pretty numerous, was unaware of tessellation-free, stencil-buffer-based methods for path/font rendering in general. Kilgard has a FAQ-like post at on the opengl forum which may illuminate how the tessellation-free path rendering methods differ from bog standard 3D graphics, even though they're still using a [GP]GPU. (NVpr needs a CUDA-capable chip.)
For historical perspective, Kilgard is also the author of the classic "A Simple OpenGL-based API for Texture Mapped Text", SGI, 1997, which should not be confused with the stencil-based NVpr that debuted in 2011.
Most if not all the recent methods discussed on this page, including stencil-based methods like NVpr or SDF-based methods like GLyphy (which I'm not discussing here any further because other answers already cover it) have however one limitation: they are suitable for large text display on conventional (~100 DPI) monitors without jaggies at any level of scaling, and they also look nice, even at small size, on high-DPI, retina-like displays. They don't fully provide what Microsoft's Direct2D+DirectWrite gives you however, namely hinting of small glyphs on mainstream displays. (For a visual survey of hinting in general see this typotheque page for instance. A more in-depth resource is on antigrain.com.)
I'm not aware of any open & productized OpenGL-based stuff that can do what Microsoft can with hinting at the moment. (I admit ignorance to Apple's OS X GL/Quartz internals, because to the best of my knowledge Apple hasn't published how they do GL-based font/path rendering stuff. It seems that OS X, unlike MacOS 9, doesn't do hinting at all, which annoys some people.) Anyway, there is one 2013 research paper that addresses hinting via OpenGL shaders written by INRIA's Nicolas P. Rougier; it is probably worth reading if you need to do hinting from OpenGL. While it may seem that a library like freetype already does all the work when it comes to hinting, that's not actually so for the following reason, which I'm quoting from the paper:
The FreeType library can rasterize a glyph using sub-pixel anti-aliasing in RGB mode.
However, this is only half of the problem, since we also want to achieve sub-pixel
positioning for accurate placement of the glyphs. Displaying the textured quad at
fractional pixel coordinates does not solve the problem, since it only results in texture
interpolation at the whole-pixel level. Instead, we want to achieve a precise shift
(between 0 and 1) in the subpixel domain. This can be done in a fragment shader [...].
The solution is not exactly trivial, so I'm not going to try to explain it here. (The paper is open-access.)
One other thing I've learned from Rougier's paper (and which Kilgard doesn't seem to have considered) is that the font powers that be (Microsoft+Adobe) have created not one but two kerning specification methods. The old one is based on a so-called kern table and it is supported by freetype. The new one is called GPOS and it is only supported by newer font libraries like HarfBuzz or pango in the free software world. Since NVpr doesn't seem to support either of those libraries, kerning might not work out of the box with NVpr for some new fonts; there are some of those apparently in the wild, according to this forum discussion.
Finally, if you need to do complex text layout (CTL) you seem to be currently out of luck with OpenGL as no OpenGL-based library appears to exist for that. (DirectWrite on the other hand can handle CTL.) There are open-sourced libraries like HarfBuzz which can render CTL, but I don't know how you'd get them to work well (as in using the stencil-based methods) via OpenGL. You'd probably have to write the glue code to extract the re-shaped outlines and feed them into NVpr or SDF-based solutions as paths.
I think your best bet would be to look into cairo graphics with OpenGL backend.
The only problem I had when developing a prototype with 3.3 core was deprecated function usage in OpenGL backend. It was 1-2 years ago so situation might have improved...
Anyway, I hope in the future desktop opengl graphics drivers will implement OpenVG.

How to speed up offscreen OpenGL rendering with large textures on Win32?

I'm developing some C++ code that can do some fancy 3D transition effects between two images, for which I thought OpenGL would be the best option.
I start with a DIB section and set it up for OpenGL, and I create two textures from input images.
Then for each frame I draw just two OpenGL quads, with the corresponding image texture.
The DIB content is then saved to file.
For example one effect is to locate the two quads (in 3d space) like two billboards, one in front of the other(obscuring it), and then swoop the camera up, forward and down so you can see the second one.
My input images are 1024x768 or so and it takes a really long time to render (100 milliseconds) when the quads cover most of the view. It speeds up if the camera is far away.
I tried rendering each image quad as hundreds of individual tiles, but it takes just the same time, it seems like it depends on the number of visible textured pixels.
I assumed OpenGL could do zillions of polygons a second. Is there something I am missing here?
Would I be better off using some other approach?
Thanks in advance...
Edit :
The GL strings show up for the DIB version as :
Vendor : Microsoft Corporation
Version: 1.1.0
Renderer : GDI Generic
The Onscreen version shows :
Vendor : ATI Technologies Inc.
Version : 3.2.9756 Compatibility Profile Context
Renderer : ATI Mobility Radeon HD 3400 Series
So I guess I'll have to use FBO's , I'm a bit confused as to how to get the rendered data out from the FBO onto a DIB, any pointers (pun intended) on that?
It sounds like rendering to a DIB is forcing the rendering to happen in software. I'd render to a frame buffer object, and then extract the data from the generated texture. Gamedev.net has a pretty decent tutorial.
Keep in mind, however, that graphics hardware is oriented primarily toward drawing on the screen. Capturing rendered data will usually be slower that displaying it, even when you do get the hardware to do the rendering -- though it should still be quite a bit faster than software rendering.
Edit: Dominik Göddeke has a tutorial that includes code for reading back texture data to CPU address space.
One problem with your question:
You provided no actual rendering/texture generation code.
Would I be better off using some other approach?
The simplest thing you can do is to make sure your textures have sizes equal to power of two. I.e. instead of 1024x768 use 1024x1024, and use only part of that texture. Explanation: although most of modern hardware supports non-pow2 textures, they are sometimes treated as "special case", and using such texture MAY produce performance drop on some hardware.
I assumed OpenGL could do zillions of polygons a second. Is there something I am missing here?
Yes, you're missing one important thing. There are few things that limit GPU performance:
1. System memory to video memory transfer rate (probably not your case - only for dynamic textures\geometry when data changes every frame).
2. Computation cost. (If you write a shader with heavy computations, it will be slow).
3. Fill rate (how many pixels program can put on screen per second), AFAIK depends on memory speed on modern GPUs.
4. Vertex processing rate (not your case) - how many vertices GPU can process per second.
5. Texture read rate (how many texels per second GPU can read), on modern GPUs depends on GPU memory speed.
6. Texture read caching (not your case) - i.e. in fragment shader you can read texture few hundreds times per pixel with little performance drop IF coordinates are very close to each other (i.e. almost same texel in each read) - because results are cached. But performance will drop significantly if you'll try to access 100 randomly located texels for every pixels.
All those characteristics are hardware dependent.
I.e., depending on some hardware you may be able to render 1500000 polygons per frame (if they take a small amount of screen space), but you can bring fps to knees with 100 polygons if each polygon fills entire screen, uses alpha-blending and is textured with a highly-detailed texture.
If you think about it, you may notice that there are a lot of videocards that can draw a landscape, but fps drops when you're doing framebuffer effects (like blur, HDR, etc).
Also, you may get performance drop with textured surfaces if you have built-in GPU. When I fried PCIEE slot on previous motherboard, I had to work with built-in GPU (NVidia 6800 or something). Results weren't pleasant. While GPU supported shader model 3.0 and could use relatively computationally expensive shaders, fps rapidly dropped each time when there was a textured object on screen. Obviously happened because built-in GPU used part of system memory as video memory, and transfer rates in "normal" GPU memory and system memory are different.

How to get games' FPS (with OpenGL) to like 800 FPS

How can we run a OpenGL applications (say a games) in higher frame rate like 500 - 800 FPS ?
For a example AOE 2 is running with more than 700 FPS (I know it is about DirectX). Eventhough I just clear buffers and swap buffers within the game loop, I can only get about 200 (max) FPS. I know that FPS isn't a good messurenment (and also depend on the hardware), but I feel I missed some concepts in OpenGL. Did I ? Pls anyone can give me a hint ?
I'm getting roughly 5.600 FPS with an empty display loop (GeForce 260 GTX, 1920x1080). Adding glClear lowers it to 4.000 FPS which is still way over 200...
A simple graphics engine (AoE2 style) should run at about 100-200 FPS (GeForce 8 or similar). Probably more if it's multi-threaded and fully optimized.
I don't know what exactly you do in your loop or what hardware that is running on, but 200 FPS sounds like you are doing something else besides drawing nothing (sleep? game logic stuff? greedy framework? Aero?). The swapbuffer function should not take 5ms even if both framebuffers have to be copied. You can use a profile to check where the most CPU time is spent (timing results from gl* functions are mostly useless though)
If you are doing something with OpenGL (drawing stuff, creating textures, etc.) there is a nice extension to measure times called GL_EXT_timer_query.
Some general optimization tips:
don't use immediate mode (glBegin/glEnd), use VBO and/or display lists+vertex arrays instead
use some culling technique to remove objects outside your view (opengl would have to cull every polygon separately)
try minimizing state changes, especially changing the bound texture or vertex buffer
AOE 2 is a DirectDraw application, not Direct3D. There is no way to compare OpenGL and DirectDraw.
Also, check the method you're using for swapping buffers. In Direct3D there are flip method, copy method, and discard method. The best one is discard, which means that you don't care about previous contents in the buffer, and allow the driver to manage them efficiently.
One of the things you seem to miss (judging from your answer/comments, so correct me if I'm wrong) is that you need to determine what to render.
For example as you said you have multiple layers and such, well the first thing you need to do is not render anything that is off screen (which is possible and is sometimes done). What you should also do is not render things that you are certain are not visible, for example if some area of the top layer is not transparent (or filled up) you should not render the layers below it.
In general what I'm trying to say is that it is in most cases better to eliminate invisible things in the logic than to render all things and just let the things on top end up in the rendered image.
If your textures are small, try to combine them in one bigger texture and address them via texture coordinates. That will save you a lot of state changes. If your textures are e.g. 128x128, you can put 16 of them in one 512x512 texture, bringing your texture related state changes down by a factor of 16.