KTX vs DDS images in OpenGL

I have used DDS (DXT5) until now for fast loading of texture data. Now I read that since OpenGL 4.3 (and for ES2) the standard compressed format is KTX (ETC1/ETC2). I integrated the Khronos libktx SDK and benchmarked.
Updating the texture 3000 times with glCompressedTexSubImage2D (the loop is sketched below) gives:
DDS: 1450 milliseconds
KTX: practically forever.
In fact, a loop of only 300 KTX updates already takes 24 seconds!
Now I have two questions:
Is this the expected speed of KTX?
If the answer to the first question is "yes", what is the advantage of ETC other than a smaller file size than DDS?
I use OpenGL 4.3 with a Quadro 4000 GPU.
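For reference, the timing loop is essentially the following (a simplified sketch rather than my exact code; it assumes a current GL 4.3 context, a function loader already initialized, and a texture created up front with glCompressedTexImage2D in the matching format):

    #include <chrono>
    // Assumes an OpenGL 4.3 context is current and a loader such as GLAD provides the GL symbols.

    // DXT5 and ETC2-EAC both use 16-byte 4x4 blocks, so the payload size is identical.
    GLsizei compressedSize(GLsizei w, GLsizei h) {
        return ((w + 3) / 4) * ((h + 3) / 4) * 16;
    }

    // tex    : texture created with glCompressedTexImage2D in the same internal format
    // format : GL_COMPRESSED_RGBA_S3TC_DXT5_EXT or GL_COMPRESSED_RGBA8_ETC2_EAC
    // blocks : pointer to the compressed payload for the whole image
    double benchmarkUpdates(GLuint tex, GLenum format, GLsizei w, GLsizei h,
                            const void* blocks, int iterations) {
        glBindTexture(GL_TEXTURE_2D, tex);
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                                      format, compressedSize(w, h), blocks);
        }
        glFinish();  // make sure the driver has actually consumed all uploads
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();  // milliseconds
    }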

I asked this question on the Khronos KTX forum. Here is the answer I got from the forum moderator:
I have been told by the NVIDIA OpenGL driver team that the Quadro 4000
does not support ETC in hardware while it does support DXTC. This
means the ETC-compressed images will be decompressed by the OpenGL
driver in software then loaded into GPU memory while the
DXTC-compressed images will simply be loaded into GPU memory. I
believe that is the source of the performance difference you are
observing.
So it seems like my card's hardware doesn't support ETC.
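There is no query that tells you outright "this format is decompressed in software", but with the internal-format queries that are core in OpenGL 4.3 you can at least ask whether a compressed format is supported and which internal format the driver would prefer. A small sketch follows; reading a differing GL_INTERNALFORMAT_PREFERRED as an emulation hint is my own assumption, not something the spec promises:

    #include <cstdio>
    // Assumes a current OpenGL 4.3 context (ARB_internalformat_query2 is core there).
    void reportEtc2Support() {
        GLint supported = GL_FALSE;
        glGetInternalformativ(GL_TEXTURE_2D, GL_COMPRESSED_RGBA8_ETC2_EAC,
                              GL_INTERNALFORMAT_SUPPORTED, 1, &supported);

        GLint preferred = 0;
        glGetInternalformativ(GL_TEXTURE_2D, GL_COMPRESSED_RGBA8_ETC2_EAC,
                              GL_INTERNALFORMAT_PREFERRED, 1, &preferred);

        // A preferred format different from the requested one can hint (but does not
        // prove) that the driver will not keep the data in that format natively.
        std::printf("ETC2 supported: %d, preferred internal format: 0x%X\n",
                    supported, static_cast<unsigned>(preferred));
    }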

Related

QOpenGLWidget video rendering performance in multiple processes

My problem may seem vague without code, but it actually isn't.
So, I've got an almost properly working widget that renders video frames.
Qt 5.10 and QOpenGLWidget subclassing worked fine; I didn't make any sophisticated optimizations -- there are two textures and a couple of shaders converting the YUV pixel format to RGB -- glTexImage2D() plus shaders, no buffers.
Video frames are obtained from FFmpeg, which shows great performance thanks to hardware acceleration... as long as there is only one video window.
The piece of software is a "video wall" -- multiple independent video windows on the same screen. Of course, multi-threading would be the preferred solution, but legacy code holds for now; I can't change it.
So, 1 window with Full HD video consumes ~2% CPU and 8-10% GPU regardless of the size of the window. But 7-10 similar windows, launched from the same executable at the same time, consume almost all of the CPU. My math says that 2 x 8 != 100...
My best guesses are:
This is an FFmpeg decoder issue; hardware acceleration still is not magic, and maybe some hardware pipeline stalls.
7-9 independent OpenGL contexts cost a lot more than N times the cost of one.
I'm not using PBOs (pixel unpack buffers) or other more advanced techniques to improve the OpenGL uploads (a sketch of what I mean is at the end of this post). That by itself explains nothing, but at least it is a guess.
The behavior is the same on Ubuntu, where decoding uses a different codec (I mean that using GPU-accelerated or CPU-accelerated codecs makes no difference!), which makes it more probable that I'm missing something about OpenGL... or not, because launching 6-7 Qt examples with dynamic textures shows normal growth of CPU usage -- it is approximately the sum over the number of windows.
Anyway, it is getting quite tricky for me to profile this case, so I hope someone has solved a similar problem before and can share their experience. I'd appreciate any ideas on how to deal with the described riddle.
I can add any pieces of code if that helps.
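For illustration, the kind of PBO-based upload I meant above (but am not using yet) would look roughly like this; it is only a sketch in plain OpenGL calls, with made-up names, double-buffered unpack buffers, and a single-channel texture standing in for the Y plane:

    #include <cstdint>
    #include <cstring>
    // Assumes a current OpenGL context and a function loader; all names are illustrative.

    // Double-buffered pixel unpack buffers (PBOs): while the GPU pulls last frame's data
    // out of one PBO, the CPU writes the new frame into the other one.
    struct PlaneUploader {
        GLuint pbo[2] = {0, 0};
        int    frame  = 0;

        void init(GLsizei bytes) {
            glGenBuffers(2, pbo);
            for (int i = 0; i < 2; ++i) {
                glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
                glBufferData(GL_PIXEL_UNPACK_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
            }
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
        }

        // tex is a GL_TEXTURE_2D with single-channel (GL_R8) storage of w x h, e.g. the
        // Y plane of a YUV frame; the U/V planes would get their own (smaller) uploader.
        void upload(GLuint tex, const uint8_t* src, GLsizei w, GLsizei h) {
            const GLsizei bytes = w * h;
            const int cur  = frame & 1;        // PBO the CPU fills this frame
            const int prev = (frame + 1) & 1;  // PBO filled last frame, consumed by the GPU now
            ++frame;                           // (the very first frame uploads an empty buffer)

            // Fill the current PBO; orphaning first avoids stalling on its previous use.
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
            glBufferData(GL_PIXEL_UNPACK_BUFFER, bytes, nullptr, GL_STREAM_DRAW);
            if (void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY)) {
                std::memcpy(dst, src, static_cast<size_t>(bytes));
                glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
            }

            // Kick off the texture update from the previously filled PBO (asynchronous copy).
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[prev]);
            glBindTexture(GL_TEXTURE_2D, tex);
            glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                            GL_RED, GL_UNSIGNED_BYTE, nullptr);  // nullptr = offset 0 into the PBO
            glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
        }
    };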

Same Direct2D application performs better on a "slower" machine

I wrote a Direct2D application that displays a number of graphic elements.
When I run this application it takes about 4 seconds to display 700,000 graphic elements on my notebook:
Intel Core i7 CPU Q 720 1.6 GHz
NVIDIA Quadro FX 880M
According to the Direct2D MSDN page:
Direct2D is a user-mode library that is built using the Direct3D 10.1
API. This means that Direct2D applications benefit from
hardware-accelerated rendering on modern mainstream GPUs.
I was expecting that the same application (without any modification) should perform better on a different machine with better specs. So I tried it on a desktop computer:
Intel Xeon(R) CPU 2.27 GHz
NVIDIA GeForce GTX 960
But it took 5 seconds (1 second more) to display the same graphics (same number and type of elements).
I would like to know how this is possible and what the causes are.
It's impossible to say for sure without measuring. However, my gut tells me that melak47 is correct: there is no lack of GPU acceleration, but rather a lack of bandwidth. Integrated GPUs have access to the same memory as the CPU. They can skip the step of having to transfer bitmaps and drawing commands across the bus to dedicated graphics memory for the GPU.
With a primarily 2D workload, any GPU will be spending most of its time waiting on memory. In your case, the integrated GPU has an advantage. I suspect that extra second you feel, is your GeForce waiting on graphics coming across the motherboard bus.
But, you could profile and enlighten us.
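A minimal way to start measuring would be wall-clock timing around the draw calls and EndDraw() (just a sketch; renderTarget and DrawScene() are placeholders for the application's own render target and drawing code):

    #include <windows.h>
    #include <d2d1.h>
    #include <cstdio>

    void DrawScene(ID2D1HwndRenderTarget* rt);  // placeholder: issues the 700,000 elements

    void timedFrame(ID2D1HwndRenderTarget* renderTarget) {
        LARGE_INTEGER freq, t0, t1, t2;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        renderTarget->BeginDraw();
        DrawScene(renderTarget);               // record the drawing commands
        QueryPerformanceCounter(&t1);

        HRESULT hr = renderTarget->EndDraw();  // batched commands are flushed here;
        QueryPerformanceCounter(&t2);          // the GPU may still finish asynchronously

        const double toMs = 1000.0 / static_cast<double>(freq.QuadPart);
        std::printf("record: %.2f ms, EndDraw: %.2f ms, hr=0x%08lX\n",
                    (t1.QuadPart - t0.QuadPart) * toMs,
                    (t2.QuadPart - t1.QuadPart) * toMs,
                    static_cast<unsigned long>(hr));
    }

If most of the time sits in EndDraw(), the bottleneck is on the submission/GPU side; if it sits in the recording part, it is more likely CPU-bound command generation or tessellation.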
Some good points in the comments and other replies (I can't add a comment yet).
Your results don't surprise me, as there are some differences between your two setups.
Let's have a look here: http://ark.intel.com/fr/compare/47640,43122
A shame we can't see which SSE version is supported by your Xeon CPU; those are often used for code optimization. Is the model I chose for the comparison even the right one?
No integrated GPU in that Core i7, but 4 cores + hyper-threading = 8 threads, against 2 cores with no hyper-threading for the Xeon.
Quadro cards rock when it comes to real-time rendering. As your scene seems to be quite simple, it could be well optimized for that, but that's just a "maybe" - I'm guessing here... could someone with experience comment on that? :-)
So it's not so simple. What appears to be a better graphics card doesn't guarantee better performance. If you have a bottleneck somewhere else, you're screwed!
Since the difference is small, you should compare every single element of your two setups: CPU, RAM, HDD, GPU, and motherboard with its PCIe type and chipset.
So again, a lot of guessing, some tests are needed :)
Have fun and good luck ;-)

Direct3D 11.1's target-independent rasterization (TIR) equivalent in OpenGL (including extensions)

Target-independent rasterization (TIR) is a new hardware feature in DirectX 11.1, which Microsoft used to improve Direct2D in Windows 8. AMD claimed that TIR improved performance in 2D vector graphics by some 500%. And there was a "war of words" with Nvidia because Kepler GPUs apparently don't support TIR (among other DirectX 11.1 features). The idea of TIR appears to have originated at Microsoft, because they have a patent application for it.
Now Direct2D is fine if your OS is Windows, but is there some OpenGL (possibly vendor/AMD) extension that provides access to the same hardware/driver TIR feature? I think AMD is in a bit of a weird spot because there is no vendor-independent 2D vector graphics extension for OpenGL; only Nvidia is promoting NV_path_rendering for now, and its architecture is rather different from Direct2D. So it's unclear where anything made by AMD to accelerate 2D vector graphics could plug in (or show up) in OpenGL, unlike in the Direct2D+Direct3D world. I hope my pessimism is going to be dispelled by a simple answer below.
I'm actually posting an update of sorts here because there's not enough room in comment-style posts for this. There seems to be a little confusion as to what TIR does; it is not simply "a framebuffer with no storage attached". This might be because I've only linked above to the mostly awful patentese (which is, however, the most detailed document I could find on TIR). The best high-level overview of TIR I found is the following snippet from Sinofsky's blog post:
to improve performance when rendering irregular geometry (e.g. geographical borders on a map), we use a new graphics hardware feature called Target Independent Rasterization, or TIR.
TIR enables Direct2D to spend fewer CPU cycles on tessellation, so it can give drawing instructions to the GPU more quickly and efficiently, without sacrificing visual quality. TIR is available in new GPU hardware designed for Windows 8 that supports DirectX 11.1.
Below is a chart showing the performance improvement for rendering anti-aliased geometry from a variety of SVG files on a DirectX 11.1 GPU supporting TIR: [chart snipped]
We worked closely with our graphics hardware partners [read AMD] to design TIR. Dramatic improvements were made possible because of that partnership. DirectX 11.1 hardware is already on the market today and we’re working with our partners to make sure more TIR-capable products will be broadly available.
It's this bit of hardware that I'm asking how to use from OpenGL. (Heck, I would settle even for invoking it from Mantle, because that will also be usable outside of Windows.)
The OpenGL equivalent of TIR is EXT_raster_multisample.
It's mentioned in the new features page for Nvidia's Maxwell architecture: https://developer.nvidia.com/content/maxwell-gm204-opengl-extensions.
I believe TIR is just a repurposing of a feature Nvidia and AMD use for antialiasing.
Nvidia calls it coverage sample antialiasing, and their GL extension is GL_NV_framebuffer_multisample_coverage.
AMD calls it EQAA, but they don't seem to have a GL extension for it.
Just to expand a bit on Nikita's answer, there's a more detailed Nvidia (2017) extension page that says:
(6) How do EXT_raster_multisample and NV_framebuffer_mixed_samples
interact? Why are there two extensions?
RESOLVED: The functionality in EXT_raster_multisample is equivalent to
"Target-Independent Rasterization" in Direct3D 11.1, and is expected to be
supportable today by other hardware vendors. It allows using multiple
raster samples with a single color sample, as long as depth and stencil
tests are disabled, with the number of raster samples controlled by a
piece of state.
NV_framebuffer_mixed_samples is an extension/enhancement of this feature
with a few key improvements:
- Multiple color samples are allowed, with the requirement that the number
of raster samples must be a multiple of the number of color samples.
- Depth and stencil buffers and tests are supported, with the requirement
that the number of raster/depth/stencil samples must all be equal for
any of the three that are in use.
- The addition of the coverage modulation feature, which allows the
multisample coverage information to accomplish blended antialiasing.
Using mixed samples does not require enabling RASTER_MULTISAMPLE_EXT; the
number of raster samples can be inferred from the depth/stencil
attachments. But if it is enabled, RASTER_SAMPLES_EXT must equal the
number of depth/stencil samples.
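Wiring this up looks roughly like the sketch below, based on my reading of the EXT_raster_multisample spec (untested; it assumes the extension is advertised and that you render color-only with depth and stencil tests disabled, as the spec requires):

    #include <cstring>
    // Assumes a current OpenGL context with GL_EXT_raster_multisample available.
    bool hasExtension(const char* name) {
        GLint n = 0;
        glGetIntegerv(GL_NUM_EXTENSIONS, &n);
        for (GLint i = 0; i < n; ++i) {
            const char* ext = reinterpret_cast<const char*>(glGetStringi(GL_EXTENSIONS, i));
            if (ext && std::strcmp(ext, name) == 0) return true;
        }
        return false;
    }

    void drawWithTargetIndependentRaster() {
        if (!hasExtension("GL_EXT_raster_multisample")) return;

        // Rasterize at 8 "virtual" samples while writing to a single-sample color buffer.
        // Depth and stencil tests must be disabled for this state to be valid.
        glDisable(GL_DEPTH_TEST);
        glDisable(GL_STENCIL_TEST);
        glRasterSamplesEXT(8, GL_TRUE);       // 8 raster samples, fixed sample locations
        glEnable(GL_RASTER_MULTISAMPLE_EXT);

        // ... issue the 2D geometry draw calls here ...

        glDisable(GL_RASTER_MULTISAMPLE_EXT);
    }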

Is 1024x1024 a widely supported OpenGL maximum texture size on the desktop?

I am creating a sprite engine that uses the concept of "graphics banks" to speed up rendering using batches. I was wondering if anyone knows whether 1024x1024 textures are widely supported enough these days to count on, and/or if there is a way to find out how far back graphics chipsets/cards have supported 1024x1024 -- a chart or something -- and from that decide, based on an estimated compatibility percentage, whether I should go with 1024x1024 or limit banks to 512x512 (which I have a feeling is more widely supported, if you count integrated accelerators).
According to this SO answer, "what is the max size of the texture iphone can support?", the iPhone 3G can support 1024x1024 textures. That phone was released in 2008, which leads me to believe that most desktops could already support the same size 5 years ago.
GL_MAX_TEXTURE_SIZE is defined by the implementation. Wild Fire Games has also been kind enough to offer their OpenGL capabilities report, which includes max texture sizes from their data set. As of this writing, only 18 out of 33,085 users (0.05%) have reported a GL_MAX_TEXTURE_SIZE below 1024.
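In practice, the simplest safeguard is to query the limit at startup and choose the bank size from it (a tiny sketch, assuming a context is already current):

    // GL_MAX_TEXTURE_SIZE is implementation-defined, so ask the driver once at startup.
    GLint maxTex = 0;
    glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxTex);
    const GLint bankSize = (maxTex >= 1024) ? 1024 : 512;  // fall back to 512x512 banks just in case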

Why is CGLFlushDrawable so slow? (I am using VBOs)

My application details:
Running on: MacBook Pro with 4GB RAM, ATI Radeon X1600 with 128MB VRAM, OpenGL version 2.1 ATI-7.0.52
Using vertical sync (via CVDisplayLink): YES
Programming language: Lisp (LispWorks) with FFI to OpenGL
Pixel Format information
ns-open-gl-pfa-depth-size 32
ns-open-gl-pfa-sample-buffers 1
ns-open-gl-pfa-samples 6
ns-open-gl-pfa-accelerated 1
ns-open-gl-pfa-no-recovery 1
ns-open-gl-pfa-backing-store 0
ns-open-gl-pfa-virtual-screen-count 1
[1 = YES, 0 = NO] for boolean attribs
I have in my application the following meshes:
14 static meshes (which do not change): I have defined a VBO for each mesh with a static draw type.
2 dynamic meshes (which change every frame): I have defined a VBO for each mesh with a stream draw type.
For these dynamic meshes, each frame I re-specify the buffer data with a null pointer, then map the buffer, update the mapped memory, and unmap the buffer.
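In C-like pseudocode, that per-frame update is along these lines (a simplified sketch of what the Lisp code does through FFI; vbo, vertices, and byteCount stand in for my actual data):

    #include <cstring>
    // Simplified sketch of the per-frame dynamic-mesh update; the real code is Lisp via FFI.
    void updateDynamicMesh(GLuint vbo, const void* vertices, GLsizeiptr byteCount) {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Re-specify the store with a null pointer ("orphaning") so the driver can hand
        // back fresh memory instead of waiting for the GPU to finish with the old data.
        glBufferData(GL_ARRAY_BUFFER, byteCount, nullptr, GL_STREAM_DRAW);

        if (void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY)) {
            std::memcpy(dst, vertices, static_cast<size_t>(byteCount));  // new frame's vertex data
            glUnmapBuffer(GL_ARRAY_BUFFER);
        }
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }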
When I run the app and check with OpenGL Profiler, the Statistics view shows the following for CGLFlushDrawable:
Average time: 52990.63 microseconds (about 53 ms)
% GL time: 98.55
% App time: 43.96
No wonder I get a very poor frame rate of around 6-7 FPS.
What is the way to optimize CGLFlushDrawable? I just invoke flushBuffer, which in turn invokes CGLFlushDrawable, I believe.
Well, it turns out that there is a problem with my ATI Radeon X1600 graphics card.
Without any change, when I test the same code on another, newer 13" MacBook Pro with Intel HD Graphics 3000 and 384MB of DDR3 SDRAM, the application works fine at around 30 FPS, which is what I expect given the dynamic meshes I have.
Also, there is no bottleneck whatsoever in CGLFlushDrawable, as there was on my old MBP. Further, the amount of VRAM available after VBO allocation remains the same (again, what I was expecting). This is not what was happening on my old MBP.
And finally, my MBP's display has crashed (though not regularly), and an external LCD display also does not work properly, which points to problems with my graphics card.
@Brad, thanks for all your input.