I have written an emulator which I am in the process of porting to Linux. At the moment to do the video I am using Direct3D 11, which I am porting to OpenGL (which I'm running on Windows for now). I render to a 1024x1024 texture which I upload to memory every frame (the original hardware doesn't really lend itself to modern hardware acceleration, so I just do it all in software). However, I have found that uploading the texture in OpenGL is a lot slower.
In Direct3D uploading the texture every frame drops the frame rate from 416 to 395 (a 5% drop). In OpenGL it drops from 427 to 297 (a 30% drop!).
Here's the relevant code from my draw function.
Direct3D:
D3D11_MAPPED_SUBRESOURCE resource;
deviceContext_->Map(texture, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource);
uint32_t *buf = reinterpret_cast<uint32_t *>(resource.pData);
memcpy(buf, ...);
deviceContext_->Unmap(texture, 0);
OpenGL:
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA,
GL_UNSIGNED_BYTE, textureBuffer);
Can anyone suggest what may be causing this slowdown?
If it makes any odds, I'm running Windows 7 x64 with an NVIDIA GeForce GTX 550 Ti.
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, textureBuffer);
You're doing several things wrong here. First, glTexImage2D is the equivalent of creating a Direct3D texture resource every frame. But you're not creating it; you're just uploading to it. You should use glTexImage2D only once per mipmap level of interest; after that, all uploading should happen with glTexSubImage2D.
Second, your internal format (third parameter from the left) is GL_RGBA. You should always use explicit sizes for your image formats. So use GL_RGBA8. This isn't really a problem, but you should get into the habit now.
Third, you're using GL_RGBA ordering for your pixel transfer format (the third parameter from the right, not the left). This is generally not the optimal pixel transfer format, as lots of hardware tends to prefer GL_BGRA ordering. But if whatever is producing your data does not emit it in that order, then there's not much that can be done.
Fourth, if you have something else you can do between starting the upload and actually rendering with it, you can employ asynchronous pixel transfer operations. You write your data to a buffer object (which can be mapped, so that you don't have to copy into it). Then you use glTexSubImage2D to transfer this data to OpenGL. Because the source data and the destination image are part of OpenGL's memory, it doesn't have to copy the data out of client memory before glTexSubImage2D returns.
Granted, that's probably not going to help you much, since you're already effectively doing that copy in the D3D case.
In OpenGL it drops from 427 to 297 (a 30% drop!)
The more important statistic is that it's a 1 millisecond difference. You should look at your timings in absolute time, not in frames-per-second, nor in percentage drops of FPS.
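To make the point about absolute time concrete, here is a minimal sketch that converts the frame rates quoted in the question into per-frame cost. The helper names `frameTimeMs` and `addedCostMs` are made up for this illustration.

```cpp
#include <cassert>

// Time one frame takes, in milliseconds, at a given frame rate.
double frameTimeMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Absolute per-frame cost added by some work, given the frame rate
// measured without it (fpsBefore) and with it (fpsAfter).
double addedCostMs(double fpsBefore, double fpsAfter) {
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```

With the numbers from the question, `addedCostMs(427.0, 297.0)` comes out at roughly 1.03 ms for OpenGL and `addedCostMs(416.0, 395.0)` at roughly 0.13 ms for Direct3D. The gap is real, but it is about one millisecond per frame, which is easier to reason about than "30%".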
glTexImage2D reallocates the texture's memory as well as updating it. Try using glTexSubImage2D instead.
Related
As the question implies, I'm trying to transfer some buffer data from the gpu to the cpu and I want to do it fast.
Specifically, I'd like to transfer a 640x480 float buffer in less than 1ms.
Question 1: Is this possible in less than 1ms?
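As a rough sanity check on Question 1, the sketch below works out how many bytes a 640x480 float buffer holds and what sustained throughput a 1 ms budget implies. The function names are hypothetical, and the calculation covers raw transfer size only; it ignores any synchronization or pipeline stall the readback may cause.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a tightly packed width x height buffer.
std::size_t bufferBytes(std::size_t width, std::size_t height, std::size_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Throughput (bytes per second) needed to move `bytes` within `budgetMs` milliseconds.
double requiredBytesPerSecond(std::size_t bytes, double budgetMs) {
    assert(budgetMs > 0.0);
    return static_cast<double>(bytes) / (budgetMs / 1000.0);
}
```

`bufferBytes(640, 480, sizeof(float))` is 1,228,800 bytes, so finishing in 1 ms needs about 1.23 GB/s of effective throughput.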
Whether it is or not, I'd like to find out what the fastest way is. Everything I've tried up to this point is by using FBOs. Here are the different methods and their respective average time for transfer. All of these are run right after binding to the FBO and rendering on the textures. As I'm no expert, there might be mistakes in my code or I might be doing unnecessary steps so please let me know. The transfers, however, have all been checked to be successful. I transfer everything to cv::Mat objects.
1) Using glReadPixels - < 3.1 ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT,mat.data);
2) Using glGetTexImage - < 2.9 ms
glBindTexture(GL_TEXTURE_2D,depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, mat.data);
3) Using PBO with glGetTexImage - < 2.3 ms
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, depthTexture);
glGetTexImage(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
mat.data = (uchar*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
Before I go on, I understand that I don't use PBOs to their full potential since I immediately call glMapBuffer but there is no other process for the cpu to do at the moment. The texture is drawn the moment I have the necessary data and the texture data is necessary for me to move on. Despite all this, PBOs still seem faster.
Here's something interesting (to me at least). These are measured in debug mode. In release mode they are all 1 ms slower.
Question 2: Why are they slower in release mode? Can I change this?
Question 3: Are there any other ways I can try to do the transfer?
Extra notes on Q3:
I read somewhere that the integrated graphics card can have faster access. Is this a thing? How would I make use of this? Is this connected to GL_INTEL_map_texture?
I barely know what CUDA is but it seems there is a way to do the transfer faster using it. Is this true?
Will reading the depth buffer instead of a texture be faster?
I was trying to make ffmpeg decode frames, convert the pixels to RGB8 format, and write them into a mapped pixel buffer, then use streaming to update an OpenGL texture, which is rendered to an SDL window.
The decoding and uploading happen in a dedicated thread (sws_scale writes directly into the mapped buffer), and the rendering is done in the render thread, in another context with sharing. (The PBO actually holds several frames, and the texture is a 2D array texture, to decouple the positions.)
Things work fine if I flush the mapped range in the decoding thread and use glTextureSubImage3D in the render thread to update the texture at the needed index. The integrated Intel GPU works pretty fast (as it should) in this scenario, but the NV driver complains: "Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering."
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation. This time the NV gpu works fine, and the warning disappears, whereas the intel gpu gives a black window, and only shows decoded content on closing.
The code is something like this:
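// Render thread
void RenderFrame(SDL_Window* window, GLobjects& glo, int index, int width, int height) {
    glUniform1f(glo.index_location, index);
    // The function in question. With a buffer bound to GL_PIXEL_UNPACK_BUFFER,
    // the last argument is a byte offset into that buffer, not a client pointer.
    glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE,
                        reinterpret_cast<void*>(static_cast<size_t>(index) * width * height * 4));
    glClear(GL_COLOR_BUFFER_BIT);
    glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    SDL_GL_SwapWindow(window);
}
// Decode thread (width and height added as parameters here; assumed to be known to the decoder)
int DecodeFrameToPBO(GLobjects& glo, int index, int width, int height) {
    // Fill the mapped range needed, then flush exactly that range
    glFlushMappedBufferRange(GL_PIXEL_UNPACK_BUFFER,
                             static_cast<GLintptr>(index) * width * height * 4,
                             static_cast<GLsizeiptr>(4) * width * height);
    // The function in question
    // glTextureSubImage3D(glo.texture, 0, 0, 0, index, width, height, 1, GL_RGBA, GL_UNSIGNED_BYTE, (void*)(index * width * height * 4));
    return 0;  // assumed convention: 0 means success
}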
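Since the PBO holds several frames, the last argument of glTextureSubImage3D is a byte offset into the bound GL_PIXEL_UNPACK_BUFFER, not a real pointer. The sketch below shows one way to compute that offset without `int` overflow and to turn it into the pointer-typed argument. It assumes tightly packed RGBA8 frames; `frameOffsetBytes` and `offsetAsPointer` are hypothetical helper names, not part of OpenGL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset of frame `index` in a buffer holding `frameCount`
// tightly packed RGBA8 frames (4 bytes per pixel). The arithmetic is done
// in size_t so large frames or indices do not overflow a 32-bit int.
std::size_t frameOffsetBytes(std::size_t index, std::size_t width,
                             std::size_t height, std::size_t frameCount) {
    assert(index < frameCount);
    return index * width * height * 4u;
}

// Converts a byte offset to the pointer-typed value that the pixel
// transfer functions expect while an unpack buffer is bound.
const void* offsetAsPointer(std::size_t offsetBytes) {
    return reinterpret_cast<const void*>(static_cast<std::uintptr_t>(offsetBytes));
}
```

For 1920x1080 frames, each frame is 8,294,400 bytes, so `frameOffsetBytes(2, 1920, 1080, 4)` is 16,588,800. The same product computed in 32-bit `int` would overflow once the index gets large enough.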
I'm really confused by the idea of client-side memory and how the driver asynchronously uploads the texture. Where exactly is the upload supposed to happen, and what does glTextureSubImage3D actually do when a GL_PIXEL_UNPACK_BUFFER is bound?
EDIT:
After adding a glFlush() call to flush the upload context's command queue after each upload, the Intel version works properly, without the black screen.
UPDATE:
Adding the glFlush() seems to make the NV GPU emit the warning 'Pixel-path performance warning: Pixel transfer is synchronized with 3D rendering.' again, and GPU utilization for the same video sample grew from 8% to 10%. It seems the glFlush() triggers some internal synchronization that perhaps results in a busy wait? The Intel GPU does not work without the glFlush(), even with the glClientWaitSync version with the flush-commands bit set, and calling flush explicitly on the render side does not work either. So what should be done to keep both drivers happy (and reduce utilization)?
I think you are misled by the NVIDIA warning. It does not imply that there is a CPU-GPU synchronization; it only tells you that the rendering of the texture is synchronized with (has to wait for) the upload of the texture, which is fine. See this answer for more details.
So my answer is: there is no issue, hence the solution is to not change anything.
I thought that might be that only glTextureSubImage3D actually does the upload, so I moved the glTextureSubImage3D right after the flush operation [into the decode thread].
If you do that, you now have to manually synchronize the rendering with the texture upload, or you will encounter half-written frames or even undefined content at times, and basically have a race condition.
You could do such synchronization with OpenGL Sync Objects. But in the end, you would not get more performance than in your original variant.
whereas the intel gpu gives a black window, and only shows decoded content on closing
It is not clear if this is only a result of the missing synchronization, a bug in your code, or even a driver bug.
I have a 2D graphics library that I want to use with OpenGL, to be able to mix 2D and 3D graphics. The simplest way seems to be glDrawPixels, but many recent tutorials and forum posts suggest uploading to a texture with glTexSubImage2D and then drawing a square with that texture.
My question is: why? where is the advantage? It just adds one more step (memory buffer->texture->video buffer, instead of memory buffer->video buffer).
There are two main reasons:
glDrawPixels() is deprecated, and not available in the OpenGL core profile, or in OpenGL ES.
When drawing the image multiple times, a lot of repeated work can be saved by storing the image data in a texture.
It's quite rare that you would have to draw an image only once. Much more commonly, you'll draw it repeatedly, on each redraw. With glDrawPixels() you have to pass the image data into OpenGL each time. If you store it in a texture, you can draw it repeatedly, and OpenGL can reuse the same data each time.
To draw the content of a texture, you don't necessarily have to set up a shader, draw a quad, etc. You can use glBlitFramebuffer() to copy the texture content to the display.
Since OpenGL works out of video memory, a simple "draw pixels" call is bound to be slow, because each draw involves a lot of GPU/CPU synchronisation.
When you use glTexSubImage2D, your image can stay resident in video memory, which is fast.
One way to load a texture into video memory could be:
// Create the texture object (direct state access, OpenGL 4.5).
glCreateTextures(GL_TEXTURE_2D, 1, &texture->mId);
glTextureParameteri(texture->mId, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTextureParameteri(texture->mId, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
// Full mip chain: floor(log2(largest side)) + 1 levels.
GLsizei numMipmaps = (GLsizei)log2(std::max(surface->w, surface->h)) + 1;
// Allocate immutable storage once, upload level 0, then generate the other levels.
glTextureStorage2D(texture->mId, numMipmaps, internalFormat, surface->w, surface->h);
glTextureSubImage2D(texture->mId, 0, 0, 0, surface->w, surface->h,
                    format, GL_UNSIGNED_BYTE, surface->pixels);
glGenerateTextureMipmap(texture->mId);
Don't forget binding if you do not want to use direct state access.
However, if you still want to draw individual pixels (for example for procedural rendering), you should write your own fragment shader to make it as fast as possible.
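The mipmap count in the snippet above is computed with a floating-point log2. A minimal integer-only sketch of the same formula, floor(log2(max(w, h))) + 1, could look like this; `mipLevelCount` is a name made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Number of levels in a full mip chain for a width x height texture:
// keep halving the largest side until it reaches 1.
int mipLevelCount(int width, int height) {
    assert(width > 0 && height > 0);
    int largest = std::max(width, height);
    int levels = 1;  // level 0 always exists
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}
```

For a 512x512 image this gives 10 levels (512 down to 1), and for 1024x1024 it gives 11.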
I know Direct3D 9, at least, has a texture object that lets you map only a small portion of the texture into CPU-accessible memory; I believe the function was called "LockRect". OpenGL has glGetTexImage(), but it grabs the entire image, and if the requested format isn't the same as the texture's, it has to convert the entire texture to the new pixel format on top of transferring it. This function is also not in OpenGL ES. Framebuffers are another option: I could bind a framebuffer whose color attachment is connected to the texture. glReadPixels reads from the framebuffer, so it should effectively be reading from the texture. glReadPixels has limited pixel format options, so a conversion will have to happen, but I can read just the pixels I need (which is only 1 pixel). I haven't used this method, but it seems possible. Can anyone confirm that the framebuffer method is a working alternative? If so, it would also work for OpenGL ES 2+.
Are there any other methods? How efficient is the framebuffer method (if it works), does it end up having to convert the entire texture to the desired format before it reads the pixels or is it entirely implementation defined?
Edit: @Nicol_Bolas, please stop removing OpenGL from the tags and adding OpenGL-ES; OpenGL-ES isn't applicable, OpenGL is. This is for OpenGL specifically, but I would like it to be OpenGL ES 2+ compatible if possible, though it doesn't have to be. If an OpenGL-only solution is available, I will consider whether it is worth the trade-off. Thank you.
Please note, I do not have that much experience with ES in particular, so there might be better ways to do this specifically in that context. The general gist applies in either plain OpenGL or ES, though.
First off, the most important performance consideration should be when you are doing the reading. If you request data from the video card while you are rendering, your program (the CPU end) will have to halt until the video card returns the data, which will slow rendering due to your inability to issue further render commands. As a general rule, you should always upload, render, download - do not mix any of these processes, it will impact speed immensely, and how much so can be very driver/hardware/OS dependent.
I suggest using glReadPixels( ) at the end of your render cycle. I suspect the limitations on formats for that function are connected to limitations on framebuffer formats; besides, you really should be using 8 bit unsigned or floating point, both of which are supported. If you have some fringe case not allowing any of those supported formats, you should explain what that is, as there may be a way to handle it specifically.
If you need the contents of the framebuffer at a specific point in rendering (rather than the end), create a second texture + framebuffer (again with the same format) to be an effective "backbuffer" and then copy from the target framebuffer to that texture. This occurs on the video card, so it does not impose the bus latency directly reading does. Here is something I wrote that does this operation:
glActiveTexture( GL_TEXTURE0 + unit );
glBindTexture( GL_TEXTURE_2D, backbufferTextureHandle );
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glCopyTexSubImage2D(
GL_TEXTURE_2D,
0, // level
0, 0, // offset
0, 0, // x, y
screenX, screenY );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, framebufferHandle );
Then when you want the data, bind the backbuffer to GL_READ_FRAMEBUFFER and use glReadPixels( ) on it.
Finally, you should keep in mind that a download of data will still halt the CPU end. If you download before displaying the framebuffer, you will put off displaying the image until after you can again execute commands, which might result in visible latency. As such, I suggest still using a non-default framebuffer even if you only care about the final buffer state, and ending your render cycle to the effect of:
(1.) Blit to the default framebuffer:
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 ); // Default framebuffer
glBindFramebuffer( GL_READ_FRAMEBUFFER, framebufferHandle );
glBlitFramebuffer(
0, 0, screenX, screenY,
0, 0, screenX, screenY,
GL_COLOR_BUFFER_BIT,
GL_NEAREST );
(2.) Call whatever your swap buffers command may be in your given situation.
(3.) Your download call from the framebuffer (be it glReadPixels( ) or something else).
As for the speed impact of the blit/texcopy operations, it's quite good on most modern hardware and I have not found it to have a noticeable impact even done 10+ times a frame, but if you are dealing with antiquated hardware, it might be worth a second thought.
In the perfect case, if the screen resolution is 1024x768, only
786432 texels are needed at a given moment, and 2 megabytes memory
is enough. But in real world there is managing cost, so textures
take much more memory.
Texture streaming could reduce the memory cost of texture, that
is, not all the mipmaps of textures are needed at a given moment.
A texture needs level 0 mipmap because it's near the camera, if
it's far from the current camera, level 5 to 11 mipmaps may be
enough. Camera moves in the scene a while, some mipmaps can be
loaded and some mipmaps can be unloaded.
My question is how to do it effectively.
Say I have a 512x512 OpenGL texture in the scene, so it will have
10 mipmaps. From level 0 to level 9, these are: 512x512, 256x256,
128x128... and 1x1. Simply upload the data like this:
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 512, 512, 0, GL_RGBA, GL_UNSIGNED_BYTE, p1);
glTexImage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 256, 256, 0, GL_RGBA, GL_UNSIGNED_BYTE, p2);
glTexImage2D(GL_TEXTURE_2D, 2, GL_RGBA8, 128, 128, 0, GL_RGBA, GL_UNSIGNED_BYTE, p3);
...
glBindTexture(GL_TEXTURE_2D, 0);
After a while, camera goes far from this texture in the scene,
64x64 mipmap is enough, so the top 3 mipmaps are going to be
unloaded:
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 64, 64, 0, GL_RGBA, GL_UNSIGNED_BYTE, p4);
glTexImage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 32, 32, 0, GL_RGBA, GL_UNSIGNED_BYTE, p5);
glTexImage2D(GL_TEXTURE_2D, 2, GL_RGBA8, 16, 16, 0, GL_RGBA, GL_UNSIGNED_BYTE, p6);
...
glBindTexture(GL_TEXTURE_2D, 0);
And then, camera moves towards this texture, 256x256 mipmap is
needed:
glBindTexture(GL_TEXTURE_2D, texId);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 256, 0, GL_RGBA, GL_UNSIGNED_BYTE, p4);
glTexImage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 128, 128, 0, GL_RGBA, GL_UNSIGNED_BYTE, p5);
glTexImage2D(GL_TEXTURE_2D, 2, GL_RGBA8, 64, 64, 0, GL_RGBA, GL_UNSIGNED_BYTE, p6);
...
glBindTexture(GL_TEXTURE_2D, 0);
This is just inefficient. Basically it recreates the texture
every time, although the texture id does not change. PBOs could make
it faster, but the data copying still has a cost.
For a 1024x1024 OpenGL texture, can I make it use only the lower
mipmaps, say level 1 to 9, and leave the level 0 mipmap empty (not
allocated in video memory)? In other words: always keep a
subset of mipmaps from low level to high level, and load or unload
higher level mipmaps without changing the lower mipmaps of the
texture. I think from the hardware's perspective this might be possible.
This is what I tried: If I don't call glTexImage2D for level 0
mipmap, this texture may be in an incomplete state. But if I call
glTexImage2D with a null data pointer, it will be allocated in
video memory with zero data (profiling via gDEBugger).
If your goal with all of this is limiting the amount of active texture memory you have available, nothing in OpenGL 4.5 is going to help you with that. Reallocation of the texture as a smaller one is a terrible idea (and no, the Unreal engine does not do that).
There are two cases where what you are talking about may matter. Case 1 would be what the Unreal engine uses it for: loading performance. It allocates the whole texture, all of its mipmaps, but it only loads the lower mipmaps first. This saves time on loading in the level. This can also be used to speed up streaming performance by doing the same thing.
This is done easily in OpenGL 4.5. That's what the mipmap range setting is for; you only load in the lower mipmaps and set the GL_TEXTURE_BASE_LEVEL and GL_TEXTURE_MAX_LEVEL to that range. OpenGL guarantees it will not try to access memory outside of that range.
But dynamically evicting mipmaps from memory is not something that OpenGL 4.5 has a mechanism for.
ARB_sparse_texture however does. It allows you to declare that certain mipmaps are "sparse". You allocate storage for them as normal, but you can declare that the higher levels may be evicted from memory. You decide when levels are available by giving them virtual pages. And you can remove these pages, which allows the GPU to use that memory for someone else.
The goal in doing this is, in theory, to be able to use more textures without running out of GPU memory (and thus thrashing). However, you don't really have the tools to be able to do this effectively.
This is because OpenGL doesn't tell you how much memory you have available. Nor does it say how much memory is allocated for buffers/textures. Nor does it say how many of those allocated buffers/textures are currently active and resident on the GPU. So you don't really know when you're close to being about to thrash.
This means that, if you're "close to" a number of large textures, you can still blow memory limits. And OpenGL will not tell you when you've done so.
Even so, this can still be helpful, if you plan your level layouts around it. Oh, and sparse textures are useful for the Unreal engine's case too.
I am thinking about that every mipmap of an OpenGL texture is stored independently, we can attach a lower level mipmap to a texture at runtime, why cannot attach a higher one.
Do not make the mistake of thinking that hardware follows what the API says.
Can you only issue glTexImage*D calls to mipmaps other than level 0? Yes; just use the base/max levels to prevent access outside of the allocated range (that will keep the texture complete).
Will that guarantee that the implementation only allocates memory for those particular mipmap levels? No; implementations may allocate mipmap levels outside that range.
Indeed, there's evidence that it doesn't work that way. Consider ARB_texture_storage. This extension, core in OpenGL 4.2+, provides a single function that allocates all of the mipmap levels of the texture at once. So instead of calling glTexImage2D for each level, you call glTexStorage2D once, and it will allocate all specified mipmap levels for the size you specified. You can leave some off the small end of the range, but not off the top.
This also makes the texture immutable, so you cannot change that texture's storage ever again. You can upload to it, but you cannot call glTexStorage*D or glTexImage*D on it. So no reallocation.
Why would the ARB create an extension whose whole purpose is to prevent you from allocating individual mipmaps, if individual allocations were something that hardware actually supported? And if you think that's a fluke, consider this too.
When ARB_direct_state_access was created, they obviously added DSA-style functions for texture allocation. But notice that they didn't add DSA-style functions to make non-immutable textures; they didn't add glTextureImage*D. Their reasoning? "Immutable texture is a more robust approach to handle textures".
Clearly, the ARB sees no value in an API that lets the user say which mipmaps are allocated and which are not.
And it should be noted that nothing in Direct3D 12, a much lower level API, allows this either. The answer to the question of how much memory a texture needs is purely a byte-count + alignment. It is not a series of byte counts, one-per-mipmap. And the functions to allocate resources similarly do not allow expanding mipmaps or anything of the like.
And I even checked Mantle's documentation. It has no facility to have an image's memory be non-contiguous. grBindObjectMemory (the function that associates memory storage with a texture) does not take a parameter specifying the mipmap level.
And for the sake of completeness, neither does Vulkan. It treats an image's mipmap pyramid as a single, contiguous block of storage. Sparse images can work, but that works on page boundaries, not at the level of whole mipmaps.
So I would assume nothing about the nature of the hardware based on old OpenGL APIs.
so textures take much more memory.
Actually no. For a 1-dimensional texture, the zeroth mipmap level consumes N bytes, the first N/2, the second N/4, and so on. So the total number of bytes consumed is
sum(i in 0…){ N * 2^-i } = N * sum(i in 0…){2^-i}
That's a geometric series whose sum converges to 2. So a mipmapped 1D texture consumes at most twice the memory of a non-mipmapped one. For a 2D texture each level is 1/4th, 1/16th and so on of the base level, so the sum converges to 4/3, i.e. about one third extra. The higher the dimensionality of the texture, the smaller the mipmap overhead.
That's not "a lot".
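To put a number on the 2D case, here is a small sketch that adds up the texels of a full mip chain. `mipChainTexels` is a hypothetical helper. It assumes each level halves both sides (rounding down, never below 1), which is how OpenGL sizes mip levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Total texel count of a full 2D mip chain, from width x height down to 1x1.
std::uint64_t mipChainTexels(std::uint32_t width, std::uint32_t height) {
    assert(width > 0 && height > 0);
    std::uint64_t total = 0;
    for (;;) {
        total += static_cast<std::uint64_t>(width) * height;
        if (width == 1 && height == 1) {
            break;
        }
        // Each level halves both sides, clamped so neither drops below 1.
        width = std::max<std::uint32_t>(width / 2, 1);
        height = std::max<std::uint32_t>(height / 2, 1);
    }
    return total;
}
```

For a 512x512 texture the base level has 262,144 texels and the whole chain has 349,525, a ratio of about 1.33, in line with the 4/3 limit of the geometric series.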
A texture needs level 0 mipmap because it's near the camera, if it's far from the current camera, level 5 to 11 mipmaps may be enough.
There is no camera in OpenGL and that's not how the mipmapping level is determined. The mipmap level being used is determined by the rate of change of the texture coordinate in screen coordinates (the gradient of the texture coordinate).
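As an illustration of "determined by the gradient", the following sketch estimates a mip level from texture-coordinate derivatives, loosely following the isotropic level-of-detail formula in the OpenGL specification. It is a simplification: real implementations may approximate the footprint differently, apply LOD bias, or use anisotropic filtering. The function name and parameters are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// dudx, dvdx: change in normalized texture coordinates per one-pixel step in x;
// dudy, dvdy: the same for a one-pixel step in y.
// Returns an approximate (fractional) mip level in [0, maxLevel].
double mipLevelFromGradients(double dudx, double dvdx, double dudy, double dvdy,
                             int texWidth, int texHeight, int maxLevel) {
    assert(texWidth > 0 && texHeight > 0 && maxLevel >= 0);
    // Size of one screen pixel's footprint, measured in texels, along each screen axis.
    const double lenX = std::hypot(dudx * texWidth, dvdx * texHeight);
    const double lenY = std::hypot(dudy * texWidth, dvdy * texHeight);
    const double rho = std::max(lenX, lenY);
    if (rho <= 1.0) {
        return 0.0;  // one texel or less per pixel: base level
    }
    return std::min(std::log2(rho), static_cast<double>(maxLevel));
}
```

With a 512x512 texture, a gradient of 1/512 per pixel (one texel per pixel) selects level 0, while 4/512 per pixel (four texels per pixel) selects level 2, regardless of where any "camera" is.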
After a while, camera goes far from this texture in the scene, 64x64 mipmap is enough
And maybe you're using the same texture for rendering on a primitive with a very steep texture coordinate gradient (low mipmap level) right after drawing it with a shallow texture coordinate gradient.
But in real world there is managing cost,
What I'm trying to say is that the management overhead of trying to properly stream mipmaps is much higher, and consumes much more GPU resources, than simply loading all the mipmaps and calling it a day.
Also modern GPUs are capable of data fetches on their own; the graphics RAM is just a cache for the system RAM and when you load a texture it actually doesn't go to the graphics memory in the first place. The GPU will fetch the data it needs.