Can you explain why hardware acceleration required textures to be a power of two for such a long time? On PCs, since the GeForce 6 we have had NPOT textures with no mips and simplified filtering. OpenGL ES 2.0 also supports NPOT textures without mipmaps, etc. What is the hardware restriction behind this? Just simplified arithmetic?
I imagine it has to do with being able to use bitwise shift-left operations, rather than multiplication, to convert an (x, y) coordinate to a memory offset from the start of the texture. So yes, simplified arithmetic, from the processor's point of view.
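To make the shift-versus-multiply point concrete, here is a minimal sketch. It assumes a plain row-major (linear) texture layout and 4 bytes per texel; real GPUs typically use tiled layouts, and the function names are made up for illustration.

```python
def offset_mul(x, y, width, bytes_per_texel=4):
    """General case: works for any width, but needs multiplications."""
    return (y * width + x) * bytes_per_texel


def offset_shift(x, y, log2_width, log2_bytes_per_texel=2):
    """Power-of-two case: both multiplications become left shifts.

    Assumes 0 <= x < (1 << log2_width), so the OR cannot carry into y's bits.
    """
    return ((y << log2_width) | x) << log2_bytes_per_texel
```

For a 256-wide texture, `offset_shift(5, 3, 8)` and `offset_mul(5, 3, 256)` give the same byte offset; the shift version only exists when the width is a power of two.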
I'm guessing that it was to make mipmap generation easier, because it allows you to just average 2x2 pixels into one pixel all the way from NxN down to 1x1.
Now that doesn't matter if you're not using mipmapping, but it's easier to have just one rule, and I think that mipmapping was the more common use case.
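A rough sketch of the 2x2 averaging described above, assuming a square grayscale image stored as a list of rows (the helper names are hypothetical, and this is plain CPU code, not what a driver actually does):

```python
def downsample_2x2(level):
    """Average each 2x2 block of a square, even-sized grayscale image."""
    n = len(level) // 2
    return [
        [
            (level[2 * y][2 * x] + level[2 * y][2 * x + 1]
             + level[2 * y + 1][2 * x] + level[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(n)
        ]
        for y in range(n)
    ]


def build_mip_chain(base):
    """Return [NxN, N/2 x N/2, ..., 1x1]; N is assumed to be a power of two."""
    n = len(base)
    if n < 1 or (n & (n - 1)) != 0:
        raise ValueError("N must be a power of two")
    chain = [base]
    while len(chain[-1]) > 1:
        chain.append(downsample_2x2(chain[-1]))
    return chain
```

With a power-of-two size every level halves exactly, so no block is ever left with an odd row or column to special-case.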
Related
Since imageAtomicAdd (which seems to be the only real atomic "read-modify-store" function that operates on images) is only available for 32bit integers, I don't see any sensible way to accumulate multiple color values from different shader invocations in one pixel.
The only somewhat reasonable way to do this that I can see is to use 32bit per color (128bit per RGBA pixel), add 8bit color values up, hope that it doesn't overflow and clamp to 8bit afterwards.
This seems wasteful and restrictive (only pure additive blending?)
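For reference, the accumulate-then-clamp idea can be sketched like this. It is sequential Python standing in for shader code: `atomic_add_u32` is a hypothetical placeholder for a 32-bit integer atomic add, and the buffer layout (four 32-bit counters per pixel) is an assumption.

```python
UINT32_MAX = 0xFFFFFFFF


def atomic_add_u32(buffer, index, value):
    """Stand-in for a 32-bit integer atomic add (wraps on overflow)."""
    buffer[index] = (buffer[index] + value) & UINT32_MAX


def accumulate(buffer, pixel, rgba8):
    """Add one 8-bit RGBA sample into the four 32-bit channels of a pixel."""
    for channel, value in enumerate(rgba8):
        atomic_add_u32(buffer, 4 * pixel + channel, value)


def resolve(buffer, pixel):
    """Clamp the accumulated sums back to 8 bits per channel."""
    return tuple(min(buffer[4 * pixel + c], 255) for c in range(4))
```

On the overflow worry: a 32-bit counter can absorb 16,843,009 samples of value 255 before it wraps, so overflow is unlikely in practice, but the memory cost and the additive-only restriction remain.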
Accumulating in other data structures doesn't solve the issue either, since shared variables and SSBOs also only seem to support atomicAdd, and again only on integers.
There are two reasons that make me think I am probably missing something:
1. Every pathtracer that allows for concurrent intersection testing (for example for shadow rays) has to solve this issue so it seems like there must be a solution.
2. All kinds of fancy blending can be done in fragment shaders, so the hardware is definitely capable of doing this.
Is everyone just writing pathtracers that have a 1:1 shader invocation:pixel mapping?
OpenGL uses power-of-two textures.
This is because some GPUs only accept power-of-two textures due to mipmapping. Using these power-of-two textures causes problems when drawing a texture at a larger size than its actual size.
I had thought of one way to work around this, which is to only use the PO2 ratios when we're making the texture smaller than it actually is, and to use a 1:1 ratio when we're making it bigger, but will this create compatibility issues with some GPUs?
If anybody knows whether issues would occur (I cannot check this as my GPU accepts NPO2 Textures), or a better workaround, I would be grateful.
Your information is outdated. Textures of arbitrary dimensions have been supported since OpenGL 2.0, which was released in 2004. All contemporary GPUs support NPOT2 textures very well, and without any significant performance penalty.
There's no need for any workarounds.
I have been hearing conflicting opinions on whether it is safe to use non-power-of-two textures in OpenGL applications. Some say all modern hardware supports NPOT textures perfectly; others say it doesn't, or that there is a big performance hit.
The reason I'm asking is because I want to render something to a frame buffer the size of the screen (which may not be a power of two) and use it as a texture. I want to understand what is going to happen to performance and portability in this case.
Arbitrary texture sizes have been specified as a core part of OpenGL ever since OpenGL 2.0, which was a long time ago (2004). All GPUs designed since then support NP2 textures just fine. The only question is how good the performance is.
However, ever since GPUs became programmable, optimizations based on the predictable access patterns of fixed-function texture fetches have become more or less obsolete. GPUs now have caches optimized for general data locality, so performance is not much of an issue either. In fact, with P2 textures you may need to upscale the data to fit the format, which increases the required memory bandwidth, and memory bandwidth is the #1 bottleneck of modern GPUs. So using a slightly smaller NP2 texture may actually improve performance.
In short: you can use NP2 textures safely, and performance is not much of an issue either.
All modern APIs (except some versions of OpenGL ES, I believe) on modern graphics hardware (the last 10 or so generations from ATi/AMD/nVidia and the last couple from Intel) support NP2 textures just fine. They've been in use, particularly for post-processing, for quite some time.
However, that's not to say they're as convenient as power-of-2 textures. One major case is memory packing; drivers can often pack textures into memory far better when they are powers of two. If you look at a texture with mipmaps, the base and all mips can be packed into an area 150% the original width and 100% the original height. It's also possible that certain texture sizes will line up memory pages with stride (texture row size, in bytes), which would provide an optimal memory access situation. NP2 makes this sort of optimization harder to perform, and so memory usage and addressing may be a hair less efficient. Whether you'll notice any effect is very much driver and application-dependent.
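As a quick check of the "150% width, 100% height" figure, here is a sketch of one possible packing: the base level on the left and all mips stacked in a column to its right. This layout and the function names are assumptions for illustration; actual drivers are free to lay memory out differently.

```python
def mip_sizes(width, height):
    """Sizes of all mip levels below the base, down to 1x1."""
    sizes = []
    while width > 1 or height > 1:
        width, height = max(width // 2, 1), max(height // 2, 1)
        sizes.append((width, height))
    return sizes


def pack_beside_base(width, height):
    """Place the base at (0, 0) and stack the mips in a column to its right.

    Returns (rects, total_width, total_height); each rect is (x, y, w, h).
    """
    rects = [(0, 0, width, height)]
    y = 0
    column_width = 0
    for w, h in mip_sizes(width, height):
        rects.append((width, y, w, h))
        y += h
        column_width = max(column_width, w)
    return rects, width + column_width, max(height, y)
```

For a 256x256 base this gives a 384x256 bounding area: the mip heights sum to 128 + 64 + ... + 1 = 255, which just fits under the base height.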
Offscreen effects are perhaps the most common use case for NP2 textures, especially screen-sized textures. Almost every game on the market now that performs any kind of post-processing or deferred rendering has 1-15 offscreen buffers, many of which are the same size as the screen (for some effects, half or quarter size is useful). These are generally well supported, even with mipmaps.
Because NP2 textures are widely supported and almost a sure bet on desktops and consoles, using them should work just fine. If you're worried about platforms or hardware where they may not be supported, easy fallbacks include using the nearest power-of-2 size (may cause slightly lower quality, but will work) or dropping the effect entirely (with obvious consequences).
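For the power-of-2 fallback, picking the size is simple integer arithmetic. A small sketch (the helper names are made up, and whether you round up, down, or to the nearest is a quality/memory trade-off you have to choose):

```python
def next_power_of_two(n):
    """Smallest power of two >= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n - 1).bit_length()


def previous_power_of_two(n):
    """Largest power of two <= n (n must be positive)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n.bit_length() - 1)


def nearest_power_of_two(n):
    """Whichever of the two neighbours is closer; ties round up."""
    lo, hi = previous_power_of_two(n), next_power_of_two(n)
    return lo if (n - lo) < (hi - n) else hi
```

For a 1920x1080 screen, rounding to the nearest gives 2048x1024, while rounding down gives 1024x1024, which is cheaper but blurrier.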
I have a lot of experience making games (4+ years) and using texture atlases on iOS & Android through cross-platform development using OpenGL 2.0.
Stick with PoT textures with a maximum size of 2048x2048, because some devices (especially the cheap ones with cheap hardware) still don't support arbitrary texture sizes. I know this from real-life testers and from seeing it first hand. There are so many devices out there now, you never know what sort of GPU you'll be facing.
Your iOS devices will also show black squares and artefacts if you are not using PoT textures.
Just a tip.
Even if arbitrary texture sizes are required by a given OpenGL version, certain video cards are still not fully compliant with OpenGL. I had a friend with an Intel card who had problems with NPOT2 textures (I assume Intel cards are fully compliant by now).
Do you have a reason for using NPOT2 textures? Then do it, but remember that some old hardware may not support them, and you'll probably need a software fallback that can convert your textures to POT2.
Don't have any reason for using NPOT2 textures? Then just use POT2 textures (certain compressed formats still require POT2 textures).
I am experimenting with several ways to draw a lot of sprites (e.g. for a particle system) and I have some inconclusive results. So this is what I tried and what I got:
This is done drawing 25k sprites:
Using regular glBegin/glEnd and using trig to calculate vertex points - 17-18fps.
Using regular glBegin/glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 14-15fps.
Using vertex arrays instead of glBegin and glEnd, but still using trig to calculate vertex point position - 10-11fps.
Using vertex arrays instead of glBegin and glEnd, but using glRotate, glTranslate and glScale to transform the sprite - 10-11fps.
So my question is, why is using vertex arrays slower than using glBegin/glEnd while I have read (here even) that it should be faster?
And why is using your own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought they are done on the GPU, so it should be much, much faster.
I do get more promising results when drawing fewer sprites. When I draw 10k sprites, vertex arrays can be about 5fps faster, but it's still inconsistent. Also note that these fps figures could be higher overall, because I have other calculations going on, so I am not really looking at the fps itself but at the difference between them. If vertex arrays and GL transforms were 5-10fps faster than glBegin/glEnd with manual trig, I would be happy, but for now it just doesn't seem to be worth the hassle. They would help with porting to GLES (as it doesn't have glBegin/glEnd), but I guess I will make a separate implementation for that.
So is there any way to speed this up without using geometry shaders? I don't really understand them (maybe some great tutorial?), and they could break compatibility with older hardware, so I want to squeeze all the juice I can without using shaders.
So my question is, why is using vertex arrays slower than using glBegin/glEnd while I have read (here even) that it should be faster?
Who says that they are slower?
All you can say is that, for your particular hardware and your current driver, vertex arrays are slower. Have you verified this on other hardware?
More importantly, there is the question of how you are drawing these. Do you draw a single sprite from the vertex array, then draw another, then draw another? Or do you draw all of them with a single glDrawArrays or glDrawElements call?
If you're not drawing all of them in one go (or at least large groups of them at once), then you're not going as fast as you should be.
And why is using your own trigonometry (which in my case is 5 cos, 5 sin, more than 5 divisions, 15 multiplications and about 10 additions/subtractions) faster than using 5 functions (glPushMatrix(), glTranslated(), glRotated(), glScaled(), glPopMatrix())? I thought they are done on the GPU, so it should be A LOT faster.
Well, let's think about this. glPushMatrix costs next to nothing. glTranslated creates a double-precision floating-point matrix and then does a matrix multiply. glRotated does at least one sin and one cos, does some additions and subtractions to compute a matrix (all in double precision) and then does a matrix multiply. glScaled computes a matrix and does a matrix multiply.
Each "does a matrix multiply" is a 4x4 by 4x4 product, which in the general case is 64 floating-point multiplies and 48 floating-point adds (16 multiplies and 12 adds is the cost of a single matrix-vector product). And since you asked for double-precision math, you can forget about SSE vector math or whatever; this is doing standard math. And you're doing 3 of these for every sprite.
What happens on the GPU is the multiplication of that matrix with the vertex positions. And since you're only passing 4 positions before changing the matrix, it's not particularly surprising that this is slower.
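Putting the two points above together (no per-sprite matrix change, and one draw call for everything), a CPU-side sketch might look like the following. The sprite tuple layout `(cx, cy, width, height, angle)` and the function names are assumptions; the resulting flat list is what you would hand to a single vertex-array draw call instead of issuing one call per sprite.

```python
import math


def sprite_corners(cx, cy, width, height, angle):
    """Four corners of a rotated quad, using one sin and one cos."""
    c, s = math.cos(angle), math.sin(angle)
    hw, hh = width / 2.0, height / 2.0
    corners = []
    for dx, dy in ((-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)):
        # Standard 2D rotation of the local offset, then translate to centre.
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return corners


def build_vertex_array(sprites):
    """Flatten every sprite's corners into one [x0, y0, x1, y1, ...] list."""
    vertices = []
    for sprite in sprites:
        for x, y in sprite_corners(*sprite):
            vertices.extend((x, y))
    return vertices
```

That is one sin and one cos per sprite rather than five of each, and the matrix stack is never touched inside the loop.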
Have you considered using glPoints...() instead? This is kinda what they were designed to do, depending on which version of OpenGL you are supporting.
Have you tried VBOs instead? They're the current standard, so most cards are optimized in their favor.
Also:
you should use your own math calculations
consider offloading as much calculation as possible to a shader
The fps figures you posted are contrary to what one might expect, so you are probably doing something wrong. Can you paste some of your rendering code?
Do you have a specific reason to use double precision matrix functions? They are usually a lot slower than single precision ones.
What's the most efficient way to do image pyramiding in CUDA? I have written my own kernels to do so but imagine we can do better.
Binding to an OpenGL texture using OpenGL interop and using the hardware mipmapping would probably be much faster. Any pointers on how to do this or other
Mipmaps are set up when accessed/initialized in OpenGL/DirectX. A CUDA kernel can do the same thing if you allocate a texture 50% wider (or higher) than the initial texture and use the kernel to down-sample the texture and write the result beside the original texture. The kernel will probably work best when each thread evaluates one pixel of the next down-sampled image. It's up to you to determine the sampling scheme and choose appropriate weights for combining the pixels. Try bilinear to start with; once that's working you can move on to a cubic filter or other sampling schemes. Simple sampling (linear and cubic) will likely be more efficient since coalesced memory access will occur (refer to the CUDA SDK programming guide). You will probably need to tile the kernel execution, since the thread count is limited for parallel invocation (too many pixels, too few threads = use tiling to chunk parallel execution). You might find Mesa3D useful as a reference (it's an open-source implementation of OpenGL).