I am trying to optimize some OpenGL code and I was wondering if someone knows of a table that would give a rough approximation of the relative costs of various OpenGL functions ?
Something like (these numbers are probably completely wrong) :
method cost
glDrawElements(100 indices) 1
glBindTexture(512x512) 2
glGenBuffers(1 buffer) 1.2
If that doesn't exist, would it be possible to build one or are the various hardware/OS too different for that to be even meaningful ?
There certainly is no such list. One of the problems in creating such a list is answering the question, "what kind of cost?"
All rendering functions have a GPU-time cost. That is, the GPU has to do rendering. How much of a cost depends on the shaders in use, the number of vertices provided, and the textures being used.
Even for CPU time cost, the values are not clear. Take glDrawElements. If you changed the vertex attribute bindings before calling it, then it can take more CPU time than if you didn't. Similarly, if you changed uniform values in a program since you last used it, then rendering with that program may take longer. And so forth.
The main problem with assembling such a list is that it encourages premature optimization. If you have such a list, then users will be encouraged to take steps to avoid using functions that cost more. They may take too many steps along this route. No, it's better to just avoid the issue entirely and encourage users to actually profile their applications before optimizing them.
The relative costs of different OpenGL functions will depend heavily on the arguments to the function, the active OpenGL environment when they are called, and the GPU, drivers, and OS you're running on. There's really no good way to do a comparison like what you're describing -- your best bet is simply to test out the different possibilities and see what performs best for you.
Related
I have read that instancing in OpenGL makes drawing thousands of objects faster. But, if you use instancing and only draw one object, is it much slower? If so, what order of magnitude of objects do you need for instancing to be an improvement? Just a few? Tens? Hundreds?
Some context (in case I have an X-Y problem); if I have to write code for instancing anyway, it would be easier to just leave it on all the time.
Answers to these types of questions tend to be somewhat repetitive: Try different options, and benchmark them on the platform(s) you care about. There's really no way to give a definitive answer that will necessarily apply to every possible platform.
That being said, I would not expect instanced rendering to add significant overhead on hardware that fully supports it. Instanced rendering is not a very recent feature. Based on the history I could find, it was part of DX10 (released in 2006) and OpenGL 3.1 (released in 2009). So it seems very likely that any moderately recent hardware (DX10 level and later) can support it efficiently.
On recent hardware, non-instanced rendering could be just a special case of instanced rendering where only a single instance is drawn. There might be a little more state setup, but overall it could be basically no additional overhead.
In general, it's not uncommon that features are supported on hardware that does not really have full support for the feature. In those cases, the driver will sometimes have to jump through hoops to provide the feature, often with lower efficiency and additional CPU overhead. It's not impossible that this could be the case for instanced rendering on some platform, which brings us back to the start: Benchmark!
I'm trying to make a game or 3D application using openGL. The game/program will have many objects in them and drawn to the screen(around 7000 of them). When I render them, I would need to calculate the distance between the camera and the object and sort them in order to correctly render the objects within the scene. Knowing this, what is the best way to sort them? I really want the sorting to be done really fast, but I've heard there are "trade off" for them, so what algorithm should I use to get the best performance out of it?
Any help would be greatly appreciated.
Edit: a lot of people are talking about the z-buffer/depth buffer. This doesn't work in some cases like a few people talked about. This is why I asked this question.
Sorting by distance doesn't solve the transparency problem perfectly. Consider the situation where two transparent surfaces intersect and each has a part which is closer to you. Perhaps rare in games, but still something to consider if you don't want an occasional glitched look to your renderer.
The better solution is order-independent transparency. With the latest graphics hardware supporting atomic operations, you can use an A-buffer to do this with little memory overhead and in a single pass so it is pretty efficient. See for example this article.
The issue of sorting your scene is still a valid one, though, even if it isn't for transparency -- it is still useful to sort opaque objects front to back to to allow depth testing to discard unseen fragments. For this, Vaughn provided the great solution of BSP trees -- these have been used for this purpose for as long as 3D games have been around.
Use http://en.wikipedia.org/wiki/Insertion_sort which has O(n) complexity for nearly sorted arrrays.
In your case by exploiting temporal cohesion insertion sort gives fastest results.
It is used for http://en.wikipedia.org/wiki/Sweep_and_prune
From link above:
In many applications, the configuration of physical bodies from one time step to the next changes very little. Many of the objects may not move at all. Algorithms have been designed so that the calculations done in a preceding time step can be reused in the current time step, resulting in faster completion of the calculation.
So in such cases insertion sort is best(or similar sorts with O(n) at best case)
Why do people tend to mix deprecated fixed-function pipeline features like the matrix stack, gluPerspective(), glMatrixMode() and what not when this is meant to be done manually and shoved into GLSL as a uniform.
Are there any benefits to this approach?
There is a legitimate reason to do this, in terms of user sanity. Fixed-function matrices (and other fixed-function state tracked in GLSL) are global state, shared among all uniforms. If you want to change the projection matrix in every shader, you can do that by simply changing it in one place.
Doing this in GLSL without fixed function requires the use of uniform buffers. Either that, or you have to build some system that will farm state information to every shader that you want to use. The latter is perfectly doable, but a huge hassle. The former is relatively new, only introduced in 2009, and it requires DX10-class hardware.
It's much simpler to just use fixed-function and GLSL state tracking.
No benefits as far as I'm aware of (unless you consider not having to recode the functionality a benefit).
Most likely just laziness, or a lack of knowledge of the alternative method.
Essentially because those applications requires shaders to run, but programmers are too lazy/stressed to re-implement those features that are already available using OpenGL compatibility profile.
Notable features that are "difficult" to replace are the line width (greater than 1), the line stipple and separate front and back polygon mode.
Most tutorials teach deprecated OpenGL, so maybe people don't know better.
The benefit is that you are using well-known, thoroughly tested and reliable code. If it's for MS Windows or Linux proprietary drivers, written by the people who built your GPU and therefore can be assumed to know how to make it really fast.
An additional benefit for group projects is that There Is Only One Way To Do It. No arguments about whether you should be writing your own C++ matrix class and what it should be called and which operators to overload and whether the internal implementation should be a 1D or 2D arrary...
I am in a project to process an image using CUDA. The project is simply an addition or subtraction of the image.
May I ask your professional opinion, which is best and what would be the advantages and disadvantages of those two?
I appreciate everyone's opinions and/or suggestions since this project is very important to me.
General answer: It doesn't matter. Use the language you're more comfortable with.
Keep in mind, however, that pycuda is only a wrapper around the CUDA C interface, so it may not always be up-to-date, also it adds another potential source of bugs, …
Python is great at rapid prototyping, so I'd personally go for Python. You can always switch to C++ later if you need to.
If the rest of your pipeline is in Python, and you're using Numpy already to speed things up, pyCUDA is a good complement to accelerate expensive operations. However, depending on the size of your images and your program flow, you might not get too much of a speedup using pyCUDA. There is latency involved in passing the data back and forth across the PCI bus that is only made up for with large data sizes.
In your case (addition and subtraction), there are built-in operations in pyCUDA that you can use to your advantage. However, in my experience, using pyCUDA for something non-trivial requires knowing a lot about how CUDA works in the first place. For someone starting from no CUDA knowledge, pyCUDA might be a steep learning curve.
Take a look at openCV, it contains a lot of image processing functions and all the helpers to load/save/display images and operate cameras.
It also now supports CUDA, some of the image processing functions have been reimplemented in CUDA and it gives you a good framework to do your own.
Alex's answer is right. The amount of time consumed in the wrapper is minimal. Note that PyCUDA has some nice metaprogramming constructs for generating kernels which might be useful.
If all you're doing is adding or subtracting elements of an image, you probably shouldn't use CUDA for this at all. The amount of time it takes to transfer back and forth across the PCI-E bus will dwarf the amount of savings you get from parallelism.
Any time you deal with CUDA, it's useful to think about the CGMA ratio (computation to global memory access ratio). Your addition/subtraction is only 1 float point operation for 2 memory accesses (1 read and 1 write). This ends up being very lousy from a CUDA perspective.
I'm just learning about them, and find it discouraging that they have been deprecated. Should I keep investing into learning them? Would I learn something useful for the current model?
I think, though I may be wrong, that since most high-performance graphics apps (mostly games) pretty much only used vertex buffers and the like (in order to squeeze every drop of performance out of the card), that there was pressure to stop worrying about "frivolous" items such as display lists (and even good-old glVertex calls). IMHO, this provides a huge barrier to people learning to write OpenGL code, and (for my own purposes) is a big impediment to whipping up some quick, legible, and reasonably well performing code.
Note that these features were deprecated in 3.0, and actually removed in 3.1 (but still provided compatibility via an ARB extension). In OpenGL 3.2, they moved these features into a 'compatibility' profile that is optional for driver writers to implement.
So what does this mean? NVidia, at least, has vowed to continue support for the old-school compatibility mode for the forseeable future - there is a large wealth of legacy code out there that they need to support. You can find the discussion of their support in a slideshow at:
http://www.slideshare.net/Mark_Kilgard/opengl-32-and-more
starting at about slide #32. I don't know ATI/AMD's stance on this, but I would assume that it would be similar.
So, while display lists are technically removed from the required portion of the OpenGL 3.2 standard, I think that you are safe using them for quite a while. Eventually, you may wish to learn the buffer/shader-centric interface to OpenGL, especially if your end-goal is envelope-pushing game writing, but it really is a lot less intuitive (no glRotate, even!), so I would recommend starting with good old OpenGL 2.x.
-matt
Display lists were removed, because with opengl 3+ all vertex, texture and lighting data are stored on the graphics card, in what is called retained mode rendering (the data is retained, allowing you to send a single command to the card to draw a mesh, rather than sending vertex data to the card every frame). A major bottleneck in computer graphics is data bandwidth between RAM and gpuRAM. by generating meshes once, and retaining that data, we can transform it using homogeneous transform matrices, and draw it easily. This effectively reduces the bottleneck, with the drawback of longer loading times.
Immediate mode, however (pre 3.0) uses massive amounts of graphics bandwidth to send vertex data every frame, pre-transformed, with recalculated normals etc.
The problems with this approach are twofold:
1) excessive bandwidth use, and too much gpu idle time.
2) Excessive use of cpu time for calculations that could be done in parallel on 100+ cores, on the gpu
The simple solution to this, is retained mode.
With retained mode, display lists are no longer necessary. Hence their removal from the core profile.
Immediate mode is still very good for learning the theory of computer graphics. (and it's loads of fun, to boot) It just suffers in terms of maximum possible performance.
VBOs & VAOs may be, at first, less intuitive, but in terms of speed, it is far superior.
There are several easy to understand opengl 3.0 tutorials on the internet. Once you have openGL 2.0 down, you should consider moving on to 3.0+, as it allows you to build very fast 3d graphics applications.
While Matthew Hall has a good answer and covers most things, there are a few things I'll add.
If you look at what's been deprecated, you'll see it's a lot of client side and fixed functionality. So it's obvious that they're trying to move people away from client side centered code and have people do everything possible server side on the GPU instead.
When it comes to which context to use, well, that's up to you. Though if performance is a major concerned then 3.x is probably the way to go. I personally definitely want to learn opengl 3.x, but I doubt I'll be giving up 1.x/2.x. It's just so much easier to put together a quick app with what's available in a 1.x or 2.x context.
If you want a list of what's been deprecated, download the 3.0 specification and look under "The Deprecation Model"
A note from the future: latest Direct-X, Metal, and Vulkan apis have command buffers and command queues, which allow to record commands in the CPU, then sent them to the GPU to execute them there. Thus, perhaps, display lists was not a so old-fashioned idea. In fact, compiling display list is something orthogonal to the use of shaders and VBOs, and display lists can improve performance further....I wonder if a Vulkan or Metal to OpenGL translator could use display lists for command buffers....
Because VBOs (vertex buffer objects) are much more efficient and can do everything display lists can do. They're not really any more complex, either, just a little different. Unless you're already more familiar with the old style glBegin/glEnd stuff, you're probably best off learning about buffers from the get go.