How to use the native pointer to a texture on the GPU? - c++

I'm currently doing some GPGPU on my GPU. I've written a shader that performs all the calculations I want it to do and this gives the right results. However, the engine I'm using (Unity), requires me to use a slow and cumbersome way to load the values from the GPU to the CPU, which is also memory-inefficient and loses precision. In short, it works, but it also sucks.
However, Unity also gives me the option to retrieve the texture's ID (OpenGL-specific?), or the texture's pointer (apparently not platform-specific), after which I can write a DLL in native code (C++) to get the data from the GPU to the CPU. On the GPU it's a texture in RGBAFloat (so 4 floats per pixel, but I could easily change this to just 1 float per pixel if necessary), and on the CPU I just want a two-dimensional array of floats. It seems to me that this would be pretty trivial, yet I can't seem to find useful information.
Does anyone have any ideas how I can retrieve the floats in the texture using the pointer, and let C++ output it as an array of floats?
Please ask for clarification if needed.
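The CPU-side half of this can be sketched without touching the graphics API. The following is a minimal sketch, assuming the pixels have already been read back into a tightly packed, row-major buffer of 4 floats per pixel; on an OpenGL backend that readback might be done with glGetTexImage, but that part depends on the platform and is not shown. The helper name extractChannel is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies one channel out of a tightly packed RGBA float
// image (row-major, 4 floats per pixel) into a height x width array.
// `rgba` is assumed to have been filled by a readback call elsewhere, e.g.
// glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, rgba.data()) on OpenGL.
std::vector<std::vector<float>> extractChannel(const std::vector<float>& rgba,
                                               std::size_t width,
                                               std::size_t height,
                                               std::size_t channel = 0) {
    assert(channel < 4);
    assert(rgba.size() == width * height * 4);
    std::vector<std::vector<float>> out(height, std::vector<float>(width));
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Each pixel occupies 4 consecutive floats: R, G, B, A.
            out[y][x] = rgba[(y * width + x) * 4 + channel];
        }
    }
    return out;
}
```

If the texture were switched to one float per pixel, the stride of 4 would become 1 and the channel offset would disappear.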

Related

What is GPU driven rendering? [closed]

Nowadays I'm hearing from different places about so-called GPU driven rendering, a new rendering paradigm which doesn't require draw calls at all and which is supported by the new versions of the OpenGL and Vulkan APIs. Can someone explain how it actually works on a conceptual level, and what the main differences from the traditional approach are?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
In the "traditional" method, the CPU handles this whole process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and accesses some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc.), shaders, etc., transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or, failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
Update the scene graph in GPU memory.
Issue one or more compute shaders that generate multi-draw indirect rendering commands.
Issue a single multi-draw indirect call that draws everything.
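The middle step can be sketched on the CPU to show what the compute pass has to produce. This is only a stand-in, not GPU code: the struct mirrors the field layout OpenGL reads for each sub-draw of glMultiDrawElementsIndirect, while SceneObject, its visible flag, and buildDrawCommands are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field layout read by glMultiDrawElementsIndirect for each sub-draw.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices to draw
    std::uint32_t instanceCount;  // number of instances of this sub-draw
    std::uint32_t firstIndex;     // offset into the shared index buffer
    std::int32_t  baseVertex;     // value added to every fetched index
    std::uint32_t baseInstance;
};

// Hypothetical scene entry; `visible` stands in for a real culling test.
struct SceneObject {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    bool          visible;
};

// CPU stand-in for the compute pass: emit one command per visible object.
std::vector<DrawElementsIndirectCommand>
buildDrawCommands(const std::vector<SceneObject>& scene) {
    std::vector<DrawElementsIndirectCommand> commands;
    for (const SceneObject& obj : scene) {
        if (!obj.visible) continue;  // culled objects produce no command
        commands.push_back({obj.indexCount, 1u, obj.firstIndex, obj.baseVertex, 0u});
    }
    return commands;
}
```

In the GPU-driven version the resulting array would live in a buffer that the GPU wrote itself, so the CPU never needs to know how many objects survived culling.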
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers and textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style, where you have a very small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However, things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations, which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
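A minimal CPU-side sketch of that packing step follows, assuming every mesh already uses the same vertex format. The Vertex layout, MeshRange, and packMeshes are hypothetical names; the recorded firstVertex and vertexCount are the kind of values a sub-draw's parameters would later use to select its slice of the shared array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical shared vertex format; all meshes must agree on it.
struct Vertex { float x, y, z; };

// Where one mesh ended up inside the shared array.
struct MeshRange {
    std::uint32_t firstVertex;
    std::uint32_t vertexCount;
};

// Appends every mesh to `packed` and returns each mesh's range within it.
std::vector<MeshRange> packMeshes(const std::vector<std::vector<Vertex>>& meshes,
                                  std::vector<Vertex>& packed) {
    std::vector<MeshRange> ranges;
    ranges.reserve(meshes.size());
    for (const std::vector<Vertex>& mesh : meshes) {
        // The current end of the shared array is where this mesh starts.
        ranges.push_back({static_cast<std::uint32_t>(packed.size()),
                          static_cast<std::uint32_t>(mesh.size())});
        packed.insert(packed.end(), mesh.begin(), mesh.end());
    }
    return ranges;
}
```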
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
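The pairing between a sub-draw and its per-object data can be sketched on the CPU as two arrays that are always written at the same index. Everything here is illustrative: DrawCommand and PerObjectData are simplified, hypothetical stand-ins (a real table would hold matrices and so on), and fetchPerObject plays the role of the vertex shader indexing the SSBO with gl_DrawID.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified, hypothetical stand-ins for the real GPU-side structures.
struct DrawCommand   { std::uint32_t vertexCount; std::uint32_t firstVertex; };
struct PerObjectData { float offsetX; std::uint32_t materialIndex; };

struct DrawList {
    std::vector<DrawCommand>   commands;   // what the multi-draw call consumes
    std::vector<PerObjectData> perObject;  // what the shader reads per sub-draw
};

// Writes the command and its data at the same slot and returns that slot,
// which is the value gl_DrawID would have for this sub-draw.
std::uint32_t appendDraw(DrawList& list, const DrawCommand& cmd,
                         const PerObjectData& data) {
    list.commands.push_back(cmd);
    list.perObject.push_back(data);
    return static_cast<std::uint32_t>(list.commands.size() - 1);
}

// Stand-in for the vertex shader's lookup: perObject[gl_DrawID].
const PerObjectData& fetchPerObject(const DrawList& list, std::uint32_t drawID) {
    return list.perObject.at(drawID);
}
```

Keeping both writes in one function is the design point: if the two arrays could be filled independently, a sub-draw might pick up another object's data.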
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

OpenGL program with Intel HD and NVidia GPU usage

I am new to OpenGL and would like someone to explain to me how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried to use only the Intel HD or only the NVidia GTX 860M (the default NVidia program allows choosing the GPU), but both run very slowly. Maybe the Intel HD is even a bit faster.
So, why is there no difference between these two GPUs?
And will the program run faster if I use shaders?
The probable bottleneck is looping over the vertices, accessing the array and pulling out the vertex data 50000 times per render, then sending the data to the GPU for rendering.
Using a VBO would indeed be faster: it reduces the cost of extracting the data and sending it to the GPU to a one-time cost at initialization.
Even using a user memory buffer would speed it up, because instead of you making 50k function calls, the driver can just do a memcpy of the relevant data.
When the size of the array is bigger than 50k, the program runs very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred every frame from your program's memory to GPU memory. The bus between GPU and CPU is limited in the amount of data it can transfer, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So, why is there no difference between these two GPUs?
There is (in general) a huge performance difference between the Intel HD card and an NVIDIA card, but the bus between them might be the same.
And will the program run faster if I use shaders?
It will not benefit directly from the use of shaders, but definitely from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO using only one draw call, which decreases the number of instructions the CPU has to handle.
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
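The arithmetic above can be checked with a few lines of C++. The function names are made up for this sketch; the 24 bytes per vertex correspond to three 8-byte double coordinates.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of vertex data submitted in one frame.
constexpr std::uint64_t vertexBytesPerFrame(std::uint64_t triangles,
                                            std::uint64_t verticesPerTriangle,
                                            std::uint64_t bytesPerVertex) {
    return triangles * verticesPerTriangle * bytesPerVertex;
}

// Bytes of vertex data submitted per second at a given frame rate.
constexpr std::uint64_t vertexBytesPerSecond(std::uint64_t bytesPerFrame,
                                             std::uint64_t framesPerSecond) {
    return bytesPerFrame * framesPerSecond;
}

// 50,000 triangles * 3 vertices * 24 bytes = 3,600,000 bytes per frame.
static_assert(vertexBytesPerFrame(50000, 3, 24) == 3600000, "per-frame size");
// At 60 frames/second that is 216,000,000 bytes/second, a little over 200 MB/s.
static_assert(vertexBytesPerSecond(3600000, 60) == 216000000, "per-second size");
```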
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
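As a rough sketch of the data preparation this implies, the triangles can be flattened once into a contiguous array of floats, which is the form a VBO upload expects. Point3d, Triangle, and flattenToFloats are hypothetical stand-ins for the asker's classes (which use accessor methods rather than plain fields); the OpenGL upload itself appears only in a comment and is not executed here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical plain-data versions of the asker's point and triangle classes.
struct Point3d  { double x, y, z; };
struct Triangle { Point3d p1, p2, p3; };

// Converts all coordinates to float and lays them out as x,y,z per vertex.
std::vector<float> flattenToFloats(const std::vector<Triangle>& triangles) {
    std::vector<float> out;
    out.reserve(triangles.size() * 9);  // 3 vertices * 3 floats per triangle
    auto push = [&out](const Point3d& p) {
        out.push_back(static_cast<float>(p.x));
        out.push_back(static_cast<float>(p.y));
        out.push_back(static_cast<float>(p.z));
    };
    for (const Triangle& t : triangles) {
        push(t.p1);
        push(t.p2);
        push(t.p3);
    }
    return out;
}

// With a GL context, the result would be uploaded once, for example:
//   glBufferData(GL_ARRAY_BUFFER, data.size() * sizeof(float),
//                data.data(), GL_STATIC_DRAW);
```

Doing the double-to-float conversion here, once, also removes the per-frame conversion work described above.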
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.

Should I make my raytracer with GLSL or OpenCL, and how do I get a large 1 GB buffer?

Right now, I have implemented a GLSL raytracer that uses a buffer texture to access the acceleration structure used for ray tracing.
I'm traversing the texture with a while loop, and it's very costly, but I think there's hope for making it faster. But there seems to be a wall that I'm going to hit that I can't seem to fix. Buffer textures have a limited size; on my GPU it was around 200 MB, I forget exactly what it was.
I need my data structure to be around 1 GB.
Someone recommended OpenCL to me to solve the problem, so I studied OpenCL and now I'm familiar with the API. However, I discovered that OpenCL also has a similar problem with maximum buffer sizes. Most GPUs will only give you access to 1/4 of total VRAM in a single buffer. Most GPUs have 1 or 2 GB of VRAM, so creating 1 buffer for my structure will not work.
It seems like the only way to get my data structure on the GPU is to split it up into multiple buffers now. My question is, what's the most effective and fast way to do this, and would it be wise to continue in OpenCL or GLSL? I know branching buffer/texture reads can be costly, and it seems like that's what I would have to do if I split it up. You could avoid the branch if you somehow put the buffer to read in an array and index the buffer somehow; however, I have experienced indexing with GLSL to be EXTREMELY slow, even if it's just indexing a local array (why is this?). I wonder if the same slowness would occur if you grouped buffers into an array in a kernel, if that's even possible.
Current devices with updated drivers can access more than that. AMD has an envvar that lets you set it even higher.
OpenCL could be a good solution for you.
Also, OpenGL 4.3 added Compute Shaders which are extremely OpenCL-like and perfect for folks with OpenGL experience and an existing OpenGL application.
Regarding performance, looping in your kernel can be a problem due to work group divergence, and if you don't have many work items active, it can reduce device occupancy.
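If the structure does end up split across several buffers, as the question proposes, the addressing itself is just a division and a remainder. The sketch below shows only that index arithmetic on the CPU; ChunkAddress, buffersNeeded, and locate are hypothetical names, and it does not remove the per-read buffer selection the question worries about.

```cpp
#include <cassert>
#include <cstdint>

// Which buffer an element lives in, and where inside that buffer.
struct ChunkAddress {
    std::uint64_t buffer;
    std::uint64_t offset;
};

// Number of equally sized buffers needed to hold totalElements (rounded up).
inline std::uint64_t buffersNeeded(std::uint64_t totalElements,
                                   std::uint64_t elementsPerBuffer) {
    return (totalElements + elementsPerBuffer - 1) / elementsPerBuffer;
}

// Maps an index into the logical structure to a (buffer, offset) pair.
inline ChunkAddress locate(std::uint64_t globalIndex,
                           std::uint64_t elementsPerBuffer) {
    return {globalIndex / elementsPerBuffer, globalIndex % elementsPerBuffer};
}
```

Choosing a power-of-two buffer size lets the division and remainder reduce to a shift and a mask.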

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.
If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying known sub-regions (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture (introduction here).
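For the atlas approach, each small texture is addressed by a rectangle of texture coordinates inside the large one. Here is a minimal sketch, assuming the simplest possible layout: equally sized tiles arranged in a regular grid with no padding. UvRect and atlasTileUv are hypothetical names.

```cpp
#include <cassert>

// Texture-coordinate rectangle of one tile inside the atlas, in [0, 1].
struct UvRect { float u0, v0, u1, v1; };

// Tiles are numbered row by row, starting at 0 in the first row.
UvRect atlasTileUv(unsigned tileIndex, unsigned tilesPerRow, unsigned tilesPerColumn) {
    assert(tilesPerRow > 0 && tilesPerColumn > 0);
    assert(tileIndex < tilesPerRow * tilesPerColumn);
    const unsigned column = tileIndex % tilesPerRow;
    const unsigned row    = tileIndex / tilesPerRow;
    const float tileWidth  = 1.0f / static_cast<float>(tilesPerRow);
    const float tileHeight = 1.0f / static_cast<float>(tilesPerColumn);
    return {column * tileWidth, row * tileHeight,
            (column + 1) * tileWidth, (row + 1) * tileHeight};
}
```

A quad that should show tile 5 of a 4x4 atlas would then use the corners of atlasTileUv(5, 4, 4) as its texture coordinates instead of the full 0 to 1 range.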
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe that it is guaranteed to actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means will vary across machines: it can be discrete and used only by the GPU, shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.

How does OpenGL convert single component textures?

I am confused as to how OpenGL stores single component textures (like GL_RED).
The GL converts it to floating point and assembles it into an RGBA element by attaching 0 for green and blue, and 1 for alpha.
Does this mean that my texture will take 32 bpp in graphic memory even though I only give 8 bpp?
Also, I would like to know how OpenGL converts bytes to float for operations in the shader. It doesn't seem logical to simply divide by 255.
You don't know, and you have no way of knowing (ok ok, I kind of lied... there exists documentation which tells you those details for some particular hardware. But in general you have no way of knowing, because you don't know in advance what hardware your program will run on).
OpenGL stores textures somewhat following your request, but it finally chooses something that the hardware supports. If that means that it converts your input data to something completely different, it does that silently.
For example, most implementations convert RGB to RGBA because that's more convenient for memory accesses. The same goes for 5-5-5 data being converted to 8-8-8 and similar.
Usually, an 8 bpp texture will take only 1 byte per pixel nowadays (since pretty much every card supports that, and for software implementations it does not matter), though this is not something you can 100% rely on. You should not worry either, though... it will make sure that it somehow works.
Something similar can happen with non-power-of-two textures too, by the way. On all modern versions of OpenGL, this is supported (beginning with 2.0 if I remember right). Though, at least in theory, some older cards might not support this feature.
In that case, OpenGL would just silently make the texture the next bigger power-of-two size and only use a part of it (without telling you!).
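On the byte-to-float part of the question: for unsigned normalized formats, the OpenGL specification defines the conversion as f = c / (2^b - 1), so for 8-bit data dividing by 255 really is the rule; 0 maps to 0.0 and 255 maps to exactly 1.0. The sketch below mimics on the CPU what a shader sees when sampling a single-channel 8-bit texture, independent of how the driver stores it. The names Rgba, unorm8ToFloat, and expandRed are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// What a shader receives from sampling: four floating-point components.
struct Rgba { float r, g, b, a; };

// Unsigned normalized 8-bit to float: c / (2^8 - 1).
inline float unorm8ToFloat(std::uint8_t c) {
    return static_cast<float>(c) / 255.0f;
}

// A red-only texel is expanded with 0 for green and blue and 1 for alpha.
inline Rgba expandRed(std::uint8_t red) {
    return {unorm8ToFloat(red), 0.0f, 0.0f, 1.0f};
}
```

This expansion describes the value delivered to the shader, not the storage: it does not by itself imply that the texture occupies 32 bits per pixel in memory.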