I am trying to implement a volume renderer with OpenGL and ray casting. Everything works well, but I get a performance problem when I look in a negative direction. That is, if I look in the positive x direction (viewing vector 1, 0, 0), the performance is fine. But if I look in the negative x direction (-1, 0, 0), the framerate drops to 2-3 fps.
I use a 3D texture to hold the data of my dataset. Could there be a problem with the caching of the texture on the GPU? Or what else could cause the framerate to drop when I look in a negative direction?
It would be great to get some tips on what the problem could be.
There are two things to consider in this situation: the memory access pattern, and texture data swapping.
The performance of a GPU is strongly influenced by the pattern in which data is addressed and accessed in memory. A ray caster casts its rays from front to back of the view (or in the opposite direction, depending on the implementation and internal blending mode), so depending on which side you look at a 3D texture from, you get completely different access patterns. And that can have a very large influence.
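As a rough CPU-side analogy (the GPU's real texture layout is implementation specific, often tiled or swizzled, so this sketch is only meant to show that the traversal direction changes the memory access pattern, not to reproduce GPU behaviour):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat, row-major volume: x varies fastest in memory.
inline float sampleVolume(const std::vector<std::uint8_t>& vol,
                          int dim, int x, int y, int z)
{
    return vol[(static_cast<std::size_t>(z) * dim + y) * dim + x] / 255.0f;
}

// Accumulate samples along one row. stepX = +1 walks consecutive addresses;
// stepX = -1 walks them in reverse. On a CPU the difference is small, but on
// a GPU the equivalent change in ray direction interacts with texture tiling
// and caching, which is exactly the kind of effect described above.
float marchAlongX(const std::vector<std::uint8_t>& vol, int dim,
                  int y, int z, int stepX)
{
    float acc = 0.0f;
    for (int x = (stepX > 0 ? 0 : dim - 1); x >= 0 && x < dim; x += stepX)
        acc += sampleVolume(vol, dim, x, y, z);
    return acc;
}
```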
3D textures consume very large amounts of memory. OpenGL uses an abstract memory model in which there is no explicit limit on the size of objects like textures: either loading them works or it does not. If the driver can manage it, you can even load textures larger than what fits into GPU memory; effectively, GPU memory is a cache for OpenGL data. So if your 3D texture happens to be larger than the available GPU memory, the OpenGL implementation may swap texture data in and out while it is being accessed during rendering. If your access pattern is lucky, this swapping process fits nicely "into the gaps" and can more or less stream the texture data into the cache (by which I mean GPU memory) right before it is required, without stalling the rendering process. A different access pattern (another view direction), however, will not work well with this on-demand swapping of data, thrashing performance.
I suggest you reduce the size of your 3D texture.
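One way to do that without lowering the resolution, assuming your volume only needs a single channel, is to pick a smaller internal format: a 512³ volume is roughly 512 MiB as GL_RGBA8 but roughly 128 MiB as GL_R8. A minimal sketch, assuming a GL 3.x context and a loader such as GLEW for glTexImage3D:

```cpp
#include <GL/glew.h>  // assumed loader for the GL 1.2+/3.0 entry points and enums

GLuint createVolumeTexture(const unsigned char* voxels, int dim)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_3D, tex);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    // Single-channel internal format: 1 byte per voxel instead of 4.
    glTexImage3D(GL_TEXTURE_3D, 0, GL_R8, dim, dim, dim, 0,
                 GL_RED, GL_UNSIGNED_BYTE, voxels);
    return tex;
}
```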
Related
I am creating a simple framework for teaching fundamental graphics concepts under C++/D3D11. The framework is required to enable direct manipulation of the screen raster contents via a simple interface function (e.g. Putpixel( x,y,r,g,b )).
Under D3D9 this was a relatively simple goal achieved by allocating a surface buffer on the heap where the CPU would compose a surface. Then the backbuffer would be locked and the heap buffer's contents transferred to the backbuffer. As I understand it, it is not possible to access the backbuffer directly from the CPU under D3D11. One must prepare a texture resource and then draw it to the backbuffer via some fullscreen geometry.
I have considered two systems for such a procedure. The first comprises a D3D11_USAGE_DEFAULT texture and a D3D11_USAGE_STAGING texture. The staging texture is first mapped and then drawn to from the CPU. When the scene is complete, the staging texture is unmapped and copied to the default texture with CopyResource (which uses the GPU to perform the copy if I am not mistaken), and then the default texture is drawn to the backbuffer via a fullscreen textured quad.
The second system comprises a D3D11_USAGE_DYNAMIC texture and a frame buffer allocated on the heap. When the scene is composed, the dynamic texture is mapped, the contents of the heap buffer are copied over to the dynamic texture via the CPU, the dynamic texture is unmapped, and then it is drawn to the backbuffer via a fullscreen textured quad.
I was under the impression that textures created with read and write access and D3D11_USAGE_STAGING would reside in system memory, but the performance tests I have run seem to indicate that this is not the case. Namely, drawing a simple 200x200 filled rectangle via CPU is about 3x slower with the staging texture than with the heap buffer (exact same disassembly for both cases (a tight rep stos loop)), strongly hinting that the staging texture resides in the graphics adapter memory.
I would prefer to use the staging texture system, since it would allow both the work of rendering to the backbuffer and the work of copying from system memory to graphics memory to be offloaded onto the GPU. However, I would like to prioritize CPU access speed over such an ability in any case.
So what method would be optimal for this use case? Any hints, modifications of my two approaches, or suggestions of altogether different approaches would be greatly appreciated.
The dynamic and staging textures are both likely to be in system memory, but there is a good chance that your issue is write-combined memory. It is a cache mode where individual writes are coalesced together, but because the memory is uncached, any attempt to read from it pays the price of a full memory access on every load. You even have to be very careful, because a C++ *data = something; may sometimes also lead to unwanted reads.
There is nothing wrong with a dynamic texture; the GPU can read system memory. But you need to be careful: create a few of them and cycle through them each frame with a no-overwrite map, to avoid the costly driver buffer renaming that a discard map triggers. Of course, never map for read and write, only for write, or you will introduce a GPU/CPU sync and kill the parallelism.
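As a sketch of the "write-only, sequential writes" idea (the helper name is made up, and WRITE_DISCARD is used here for brevity, even though the advice above is to cycle several textures with a no-overwrite map): never read from the mapped pointer, and respect RowPitch, which may be larger than width * 4.

```cpp
#include <d3d11.h>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy a CPU-composed frame into a dynamic texture.
void UploadFrame(ID3D11DeviceContext* ctx, ID3D11Texture2D* dynTex,
                 const std::uint32_t* cpuPixels, UINT width, UINT height)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // Write-only map: no read flags, so the write-combined memory is never read.
    if (FAILED(ctx->Map(dynTex, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        return;

    auto* dst = static_cast<std::uint8_t*>(mapped.pData);
    for (UINT y = 0; y < height; ++y)
    {
        // Sequential, row-sized writes; never dereference dst for reading.
        std::memcpy(dst + y * mapped.RowPitch,
                    cpuPixels + y * width,
                    width * sizeof(std::uint32_t));
    }
    ctx->Unmap(dynTex, 0);
}
```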
Lastly, if you want a persistent surface and only a few putpixel calls per frame (or even a lot of them), I would go with an unordered access view and a compute shader that consumes a buffer of pixel positions and colors to update. That buffer would again be a dynamic buffer with no-overwrite mapping. With that solution, the main surface will reside in video memory.
On a personal note, I would not even bother teaching CPU surface manipulation; it is almost always bad practice and a performance killer, and it is not the way to go on a modern GPU architecture. It had already stopped being a fundamental graphics concept a decade ago.
What does gl.glTexImage2D do? The docs say it "uploads texture data". But does this mean the whole image is in GPU memory? I'd like to use one large image file for texture mapping. Further: can I simply use a VBO for uv and position coordinates to draw the texture?
Right, I am using words the wrong way here. What I meant was carrying a 2D array of UV coordinates and a 2D array of models to subsample a larger PNG image (in texture memory) onto individual tile models. My confusion lies in not knowing how long these fetches can take. Let's say I have a 5000x5000 pixel image. I load it as a texture. Then I create my own algorithm for fetching portions of it to draw. Where do I save bandwidth when drawing these tiles? If I implement an LOD algorithm to determine which tiles are close, which are far, and which are outside the camera frustum, how do I manage each of these tiles in memory? Loaded question, I know, but I am struggling to find the best implementation to get started. I am developing for mobile devices with OpenGL ES 2.0.
What exactly happens when you call glTexImage2D() is system dependent, and there's no way for you to know, unless you have developer tools that allow you to track GPU and memory usage.
The only thing guaranteed is that the data you pass to the call has been consumed by the time the call returns (since the API definition allows you to modify/free the data after the call), and that the data is accessible to the GPU when it's used for rendering. Between that, anything is fair game. Keep in mind that OpenGL is a very asynchronous API. When you make API calls, the corresponding work is mostly queued up for later execution by the GPU, and is generally not completed by the time the calls return. This can include calls for uploading data.
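A minimal sketch of what that guarantee means in practice (texture parameters here are arbitrary): the caller may free or reuse the pixel buffer as soon as the call returns, even though the actual transfer to the GPU may not have happened yet.

```cpp
#include <GL/gl.h>

GLuint uploadTexture(const unsigned char* pixels, int width, int height)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    // 'pixels' has been consumed by the implementation at this point; the
    // caller may free it, even if nothing has reached VRAM (or a cache) yet.
    return tex;
}
```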
Also, not all GPUs have "GPU memory". In fact, if you look at them by quantity, very few of them do. Mobile GPUs have caches, but mostly not VRAM in the sense of traditional discrete GPUs. How VRAM and caches are managed is highly system dependent.
With all the caveats above, and picturing a GPU that has VRAM: while it is possible for an implementation to load the data into VRAM during the glTexImage2D() call, I would be surprised if that were commonly done. It just wouldn't make much sense to me. When a texture is loaded, you have no idea how soon it will be used for rendering. Since you don't know whether all textures will fit in VRAM (and they often will not), you might have to evict it from VRAM before it was ever used, which would obviously be very wasteful. As a general strategy, I think it is much more efficient to load the texture data into VRAM only when a draw call actually uses it.
Things would be somewhat different if the driver could be very confident that all texture data will fit in VRAM. But with OpenGL, there's really no reasonable way to know this ahead of time. And things get even more complicated since at least on desktop computers, you can have multiple applications running at the same time, while VRAM is a shared resource.
You are correct.
glTexImage2D() is the function that actually moves the texture data across to the GPU.
You will need to create the texture object first using glGenTextures() and then bind it using glBindTexture().
There is a good example of this process in the OpenGL Red Book.
You can then use this texture with a VBO. There are many ways to accomplish this, but interleaving your vertex coordinates, texture coordinates, and vertex normals, and then telling the GPU how to unpack them with several calls to glVertexAttribPointer, is the best bet as far as performance goes; for example:
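A minimal sketch of such an interleaved layout (the struct layout and the attribute locations 0/1/2 are assumptions and must match your shader; a loader such as GLEW is assumed for the buffer and attribute entry points):

```cpp
#include <GL/glew.h>  // assumed loader for glBindBuffer, glVertexAttribPointer, ...
#include <cstddef>

// Assumed interleaved layout: position (3 floats), texcoord (2), normal (3).
struct Vertex {
    float pos[3];
    float uv[2];
    float normal[3];
};

void setupInterleavedVBO(GLuint vbo, const Vertex* vertices, GLsizei count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, count * sizeof(Vertex), vertices, GL_STATIC_DRAW);

    const GLsizei stride = sizeof(Vertex);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(Vertex, pos)));
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(Vertex, uv)));
    glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(Vertex, normal)));
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}
```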
You are on the right track with VBOs; the old fixed-pipeline GL stuff is deprecated, so you should just learn VBOs from the outset.
This book is not 100% up to date, but it is complete and free, and should serve as a great place to start learning VBOs: OpenGL Book
When rotating a scene in a 3D modeling interface, which part of such a task is the CPU responsible for, and which part does the GPU take on? (mesh vertices moving, shading, keeping track of UV coords - perhaps offsetting them, lighting the triangles and rendering transparency correctly)
What rendering mode is normally used via such modeling programs (realtime) - immediate or retained?
First and foremost, the GPU is responsible for putting points, lines and triangles on the screen. However, this involves a certain amount of calculation.
The usual pipeline is that for each vertex (which is a combination of attributes that usually includes, but is not limited to, position, normals, texture coordinates and so on), the vertex position is transformed from model local space into normalized device coordinates. In most implementations this is a 3-stage process:
transformation from model local space into view (eye) space; eye space coordinates are later reused for things like illumination calculations
transformation from view space into clip space, also called projection; this determines which part of view space will later be visible in the viewport, and it is also where the perspective projection is introduced
mapping into normalized device coordinates by coordinate homogenization (it is this last step that actually creates the perspective effect, if a perspective rather than an affine projection is used)
The above calculations are normally carried out by the GPU.
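As a CPU-side illustration of those three stages (GLM is assumed here purely for the matrix math; on the GPU the first two stages are done in the vertex shader and the divide by w is done by fixed-function hardware):

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

glm::vec3 toNDC(const glm::vec3& localPos,
                const glm::mat4& model, const glm::mat4& view,
                const glm::mat4& projection)
{
    // 1) model local space -> eye (view) space
    glm::vec4 eye  = view * model * glm::vec4(localPos, 1.0f);
    // 2) eye space -> clip space (projection)
    glm::vec4 clip = projection * eye;
    // 3) homogenization: clip space -> normalized device coordinates
    return glm::vec3(clip) / clip.w;
}
```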
When rotating a scene in a 3D modeling interface, which part of such a task is the CPU responsible for, and which part does the GPU take on?
Well, that depends on what kind of rotation you mean. If you mean a change of the viewpoint, where nothing in the scene input data is actually modified, then the only thing that gets altered is a parameter used in the first transformation step. This parameter is normally a 4×4 matrix. When rotating the viewpoint, a new modelview transformation matrix is calculated on the CPU. This matrix is then passed to the GPU and the whole scene is redrawn.
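A minimal sketch of that flow, assuming a shader-based pipeline, GLM for the math, and a hypothetical uniform named u_modelview; note that the scene data itself is never touched:

```cpp
#include <GL/glew.h>  // assumed loader for the shader/uniform entry points
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <glm/gtc/type_ptr.hpp>

// Rotate the view by angleRadians around the Y axis and hand the new
// modelview matrix to the shader, then redraw the whole scene.
void updateViewRotation(GLuint program, const glm::mat4& baseView,
                        float angleRadians)
{
    glm::mat4 rotated = glm::rotate(baseView, angleRadians,
                                    glm::vec3(0.0f, 1.0f, 0.0f));
    glUseProgram(program);
    GLint loc = glGetUniformLocation(program, "u_modelview");  // assumed name
    glUniformMatrix4fv(loc, 1, GL_FALSE, glm::value_ptr(rotated));
    // ...issue the draw calls for the scene here...
}
```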
If however a model is actually modified in a modeller, then the calculations are usually carried out on the CPU.
(mesh vertices moving, shading, keeping track of UV coords - perhaps offsetting them, lighting the triangles and rendering transparency correctly)
In an online renderer this is normally done mostly by the GPU, but certain parts may be precalculated by the CPU.
It's impossible to make a definitive statement, because how the workload is shared depends on the actual application.
This question is really, I mean REALLY, vague.
Generally, anything goes, because it really depends on the program and its makers. Older programs were mostly all CPU, since GPUs were nonexistent or too weak to handle massive scenes.
Today, however, GPUs are powerful enough to handle massive scenes, and a program's creators can offer various solutions. It is usually an abstracted system where you have your data and your view, and they allow you to specify what you want within the editing viewport: realtime/immediate, preprocessed, high or low detail. Programs usually sacrifice accuracy for speed so you can edit with ease.
For example, 3ds Max uses rendering devices, viewport handlers and renderers. Renderers handle the production-quality output, but today they are not limited to the CPU, since they can take advantage of the GPU (think OpenCL or CUDA) while maintaining quality and lowering rendering times. Secondly, anyone can write plugins to implement a viewport renderer the way they want it, be it CPU, GPU, or a mixed renderer.
So, abstraction being common in modelling tools, the scene information is fed into a viewport renderer which is usually very similar to a game engine's renderer. Because such a tool has a UI and various other systems it needs to handle, they try to offload the CPU as much as possible by doing as much rendering work as possible on the GPU; editing takes place in memory on the "general" data structure, which is then displayed to you. I'm not sure, but I think they may also use the graphics API (be it OpenGL or DirectX) to do the picking (selection).
The rendering mode is, as I would describe it, "on-demand": it usually renders when it needs to. So in a scene where realtime preview is off, it would only render if you modify something or move the camera, and as soon as you do something that needs constant updating, it will do exactly that.
On top of all this there are hybrid methods, where users want production-like quality, which even today is difficult on GPUs, so they settle for a fast, watered-down version of a real renderer to get as close to production quality as possible. One simple example is 3ds Max's 'Realistic' viewport: it does ambient occlusion, and while it's not "realtime", it is fast enough to actually be useful. In more advanced cases they make special extension cards to handle fast raytracing, to get fast, good-quality graphics. But even in these cases the main idea is the same: the editable data is stored in a generic internal format and fed into some sort of renderer that outputs something, not necessarily what you would get from a very high quality offline renderer, but still a good outline of what it will be like.
As we all know, OpenGL uses a pixel-data orientation that has 0/0 at left/bottom, whereas the rest of the world (including virtually all image formats) uses left/top.
This has been a source of endless worries (at least for me) for years, and I still have not been able to come up with a good solution.
In my application I want to support the following image data as textures:
image data from various image sources (including still-images, video-files and live-video)
image data acquired via copying the framebuffer to main memory (glReadPixels)
image data acquired via grabbing the framebuffer to texture (glCopyTexImage)
(case #1 delivers images with top-down orientation (in about 98% of the cases; for the sake of simplicity let's assume that all "external images" have top-down orientation); #2 and #3 have bottom-up orientation)
I want to be able to apply all of these textures to various arbitrarily complex objects (e.g. 3D models read from disk that have texture coordinate information stored).
Thus I want a single representation of the texture_coords of an object. When rendering the object, I do not want to be bothered with the orientation of the image source.
(Until now, I have always carried a top-down flag alongside the texture ID that gets used when the texture coordinates are actually set. I want to get rid of this clumsy hack!)
Basically I see three ways to solve the problem:
make sure all image data is in the "correct" (in OpenGL terms this is upside-down) orientation, converting all the "incorrect" data before passing it to OpenGL
provide different texture-coordinates depending on the image-orientation (0..1 for bottom-up images, 1..0 for top-down images)
flip the images on the gfx-card
In the old days I did #1, but it turned out to be too slow. We want to avoid the copy of the pixel buffer at all cost.
So I switched to #2 a couple of years ago, but it is way too complicated to maintain. I don't really understand why I should carry metadata of the original image around once I have transferred the image to the graphics card and have a nice little abstract "texture" object.
I'm in the process of finally converting my code to VBOs, and would like to avoid having to update my texcoord arrays just because I'm using an image of the same size but with a different orientation!
Which leaves #3, which I never managed to get working (but I believe it must be quite simple).
Intuitively I thought about using something like glPixelZoom().
This works great with glDrawPixels() (but who uses that in real life?), and AFAIK it should also work with glReadPixels().
The latter is great, as it allows me to at least force a reasonably fast, homogeneous pixel orientation (top-down) for all images in main memory.
However, it seems that glPixelZoom() has no effect on data transferred via glTexImage2D(), let alone glCopyTexImage2D(), so the textures generated from main-memory pixels will all be upside down (which I could live with, as this only means that I have to convert all incoming texcoords to top-down when loading them).
Now the remaining problem is that I haven't yet found a way to copy a framebuffer to a texture (using glCopyTex(Sub)Image) that can be used with those top-down texcoords (that is: how do I flip the image when using glCopyTexImage()?).
Is there a solution to this simple problem? Something that is fast, easy to maintain, and runs on OpenGL 1.1 through 4.x?
Ah, and ideally it would work with both power-of-two and non-power-of-two (or rectangle) textures (as far as this is possible...).
Is there a solution to this simple problem? Something that is fast, easy to maintain, and runs on OpenGL 1.1 through 4.x?
No.
There is no method to change the orientation of pixel data at upload time. There is no method to change the orientation of a texture in situ. The only way to change the orientation of a texture (besides downloading, flipping and re-uploading) is to use an upside-down framebuffer blit from a framebuffer containing the source texture to a framebuffer containing the destination texture. And glBlitFramebuffer is not available on hardware so old that it doesn't even support GL 2.x.
So you're going to have to do what everyone else does: flip your textures before uploading them. Or better yet, flip the textures on disk, then load them without flipping them.
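If you do flip in memory before uploading, it is a cheap row swap; a minimal sketch, assuming tightly packed pixel data:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Flip an image vertically in place before handing it to glTexImage2D().
void flipVertically(std::uint8_t* pixels, int width, int height, int bytesPerPixel)
{
    const int rowSize = width * bytesPerPixel;
    std::vector<std::uint8_t> tmp(rowSize);
    for (int y = 0; y < height / 2; ++y)
    {
        std::uint8_t* top    = pixels + y * rowSize;
        std::uint8_t* bottom = pixels + (height - 1 - y) * rowSize;
        std::copy(bottom, bottom + rowSize, tmp.data());
        std::copy(top, top + rowSize, bottom);
        std::copy(tmp.data(), tmp.data() + rowSize, top);
    }
}
```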
However, if you really, really want to not flip data, you could simply have all of your shaders take a uniform that tells them whether or not to invert the Y of their texture coordinate data. Inversion shouldn't be anything more than a multiply/add operation. This could be done in the vertex shader to minimize processing time.
Or, if you're coding in the dark ages of fixed-function, you can apply a texture matrix that inverts the Y.
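For the fixed-function case, a sketch of such a texture matrix (it maps t to 1 - t, so top-down images sample correctly without touching the texcoord arrays):

```cpp
#include <GL/gl.h>

void setFlippedTextureMatrix()
{
    glMatrixMode(GL_TEXTURE);
    glLoadIdentity();
    glTranslatef(0.0f, 1.0f, 0.0f);  // applied second: t' = 1 + (-t)
    glScalef(1.0f, -1.0f, 1.0f);     // applied first to the texcoord
    glMatrixMode(GL_MODELVIEW);
}
```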
Why don't you change the way you map the texture to the polygon?
I use these mapping coordinates { 0, 1, 1, 1, 0, 0, 1, 0 } for an origin at the top left,
and these mapping coordinates { 0, 0, 1, 0, 0, 1, 1, 1 } for an origin at the bottom left.
Then you don't need to manually flip your pictures.
More details about mapping textures to a polygon can be found here:
http://iphonedevelopment.blogspot.de/2009/05/opengl-es-from-ground-up-part-6_25.html
If I was making a 3D engine, the answer to this question would be clear: I'd go for using the depth buffer instead of thinking of sorting all my polygons on my own.
However, the situation is different with 2D, because here layers can be implemented easily without the help of OpenGL, and you could then even sort and move sprites within layers (which isn't possible in OpenGL, AFAIK).
(Why) should I use the OpenGL depth buffer instead of a C++ layer system running on the CPU?
How much slower would the depth buffer version be?
It is clear to me that making a layer system in C++ would have practically no performance impact at all, as I have to iterate over the sprites for rendering in any case.
I would suggest you do it in software, since you probably want to use transparency on your sprites, and that implies rendering them from back to front. Also, sorting a couple of sprites shouldn't be that CPU-demanding.
Use both, if you can.
Depth information is nice for post-processing and stuff like 3D-glasses, so you shouldn't throw it away. These kinds of effects can be very nice for 2D games.
Also, if you draw your (opaque) layers front to back, you can save fill-rate because the Z-Buffer can do the clipping for you (Depth tests are faster than actual drawing).
Depth testing is usually almost free, especially when you have hierarchical Z information. Because of this and the fill-rate savings, using depth testing will probably be even faster.
On the other hand, the software sorting is nice because it lets you actually do front-to-back rendering for opaque sprites, and it is mandatory for getting alpha blending right (of course, you draw those sprites back to front).
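Putting both points together, a minimal sketch of that two-pass approach (Sprite, drawSprite and the depth convention are placeholders; a smaller depth value is assumed to mean nearer to the camera):

```cpp
#include <GL/gl.h>
#include <algorithm>
#include <vector>

struct Sprite { float depth; /* texture, position, ... */ };
void drawSprite(const Sprite&);  // assumed to exist elsewhere

void renderScene(std::vector<Sprite> opaque, std::vector<Sprite> transparent)
{
    // Opaque sprites front to back: the depth test rejects hidden pixels early.
    std::sort(opaque.begin(), opaque.end(),
              [](const Sprite& a, const Sprite& b) { return a.depth < b.depth; });
    // Transparent sprites back to front so alpha blending composites correctly.
    std::sort(transparent.begin(), transparent.end(),
              [](const Sprite& a, const Sprite& b) { return a.depth > b.depth; });

    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_TRUE);
    glDisable(GL_BLEND);
    for (const Sprite& s : opaque) drawSprite(s);

    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);  // still test against opaque depth, but do not write
    for (const Sprite& s : transparent) drawSprite(s);
    glDepthMask(GL_TRUE);
}
```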
Direct answers:
allowing the GPU to use the depth buffer would allow you to dynamically adjust the draw order of things without any on-CPU shuffling and would free you from having to assign things to different layers in situations where doing so is a bit of a fiction — for example, you could have effects like projectiles that come from the background towards and then in front of the player, without having to figure out which layer to assign them to all the time
on the GPU, using a depth buffer would have no measurable cost, even if you're on an embedded chip, a plug-in card from more than a decade ago, or an integrated part; depth buffers are so fundamental to modern GPUs that they've been optimised down to costing nothing in practical terms
However, I'd imagine you actually want to do it on the CPU for the simple reason of treating transparency correctly. A depth buffer stores one depth per pixel, so if you draw a nearby transparent object and then attempt to draw something behind it, the thing behind won't be drawn even though it should be visible. In a 2D game it's likely that anti-aliasing will give your sprites partially transparent edges; if you submit drawing to the GPU in draw order, then your partial transparencies will always be composited correctly. If you leave the z-buffer to do it, then you risk weird-looking fringing.