I was wondering if it's possible for two mipmap chains to share the data for the top level mip.
My use case is the following. I have two environment maps with mipmapping. One is blurred, and one isn't. But the level 0 mip is the same for both.
If sharing the top level mip was possible, that would allow me to save a big chunk of memory.
Related
How does the glGenerateMipmap function create the mipmaps? Does it do some kind of interpolation, does it take the average of 4 pixels (this is what I assume), or does it skip every second pixel in the next mipmap level?
And is there a way to influence its underlying algorithm? Or choose the way mipmaps are created?
The OpenGL Spec, Section 8.14.4 states about mipmap generation:
No particular filter algorithm is required, though a box filter is recommended as the default filter
It is not possible to modify the algorithm used by glGenerateMipmap, but you can create mipmap levels on your own with any algorithm you like and then upload them to the texture using glTexImage2D(...) with the appropriate level. Level 0 is the full resolution, Level 1 is the one with 1/2 of the original size and so on.
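As a rough illustration of building the levels yourself, here is a minimal CPU-side sketch of a 2x2 box filter, the filter the spec recommends as a default. The function names (next_mip_level, build_mip_chain) and the list-of-rows grayscale image layout are assumptions made for this example, not part of any OpenGL API. For odd sizes this sketch simply drops the last row or column, which a real implementation may handle differently.

```python
def next_mip_level(pixels):
    """Downsample a 2D grayscale image (list of rows) with a 2x2 box filter."""
    height, width = len(pixels), len(pixels[0])
    new_h, new_w = max(1, height // 2), max(1, width // 2)
    result = []
    for y in range(new_h):
        row = []
        for x in range(new_w):
            # Clamp source coordinates so images that are 1 pixel wide or high still work.
            y0, y1 = min(2 * y, height - 1), min(2 * y + 1, height - 1)
            x0, x1 = min(2 * x, width - 1), min(2 * x + 1, width - 1)
            total = pixels[y0][x0] + pixels[y0][x1] + pixels[y1][x0] + pixels[y1][x1]
            row.append(total / 4.0)
        result.append(row)
    return result


def build_mip_chain(level0):
    """Return [level0, level1, ...] down to a 1x1 image."""
    chain = [level0]
    while len(chain[-1]) > 1 or len(chain[-1][0]) > 1:
        chain.append(next_mip_level(chain[-1]))
    return chain
```

Each entry of the returned chain would then be uploaded with glTexImage2D, using its index in the list as the level argument. Swapping next_mip_level for a different filter is how you would "choose the algorithm".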
I have a grayscale texture (8000*8000). The value of each pixel is an ID (actually, this ID is the ID of the triangle to which the fragment belongs; I want to use this method to calculate how many triangles, and which triangles, are visible in my scene).
Now I need to count how many unique IDs there are and what they are. I want to implement this with GLSL and minimize the data transfer between GPU RAM and system RAM.
The initial idea I came up with is to use a shader storage buffer bound to an array in GLSL whose size is totalTriangleNum, then iterate through the ID texture in the shader and increase by 1 the array element whose index equals the ID in the texture.
After that, I would read the buffer back into the OpenGL application and get what I want. Is this an efficient way to do so? Or are there better solutions, like a compute shader (I'm not familiar with those) or something else?
I want to use this method to calculate how many triangles, and which triangles, are visible in my scene
Given your description of your data let me rephrase that a bit:
You want to determine how many distinct values there are in your dataset, and how often each value appears.
This is commonly known as a histogram. Unfortunately (for you), generating histograms is among the problems that are not trivially solved on GPUs. Essentially you have to divide your image into smaller and smaller subimages (BSP, quadtree, etc.) until you are down to single pixels, on which you perform the evaluation. Then you backtrack, propagating the sub-histograms upwards, essentially performing an insertion or merge sort on the histogram.
Generating histograms with GPUs is still actively researched, so I suggest you read up on the published academic works (usually accompanied by source code). Keywords: Histogram, GPU
This one is a nice paper done by the AMD GPU researchers: https://developer.amd.com/wordpress/media/2012/10/GPUHistogramGeneration_preprint.pdf
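As a point of reference, the per-ID counter array described in the question can be sketched on the CPU as below. This is only a sequential model of the result, not a GPU implementation: the function name count_visible_ids and the list-of-rows image layout are assumptions for this example. On the GPU, many shader invocations would hit the same counter at once, so the increment would presumably have to be atomic (for example GLSL atomicAdd on a buffer variable), and that contention is one reason histograms are awkward there.

```python
def count_visible_ids(id_image, total_triangle_num):
    """Count how often each triangle ID appears in a 2D ID image.

    id_image: list of rows, each a list of integer IDs in [0, total_triangle_num).
    Returns (counts, visible_ids).
    """
    counts = [0] * total_triangle_num
    for row in id_image:
        for tri_id in row:
            # On a GPU this increment would need to be an atomic operation.
            counts[tri_id] += 1
    visible_ids = [tri_id for tri_id, c in enumerate(counts) if c > 0]
    return counts, visible_ids
```

If only visibility matters and not the pixel count per triangle, writing a constant 1 instead of incrementing would give the same visible_ids list.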
I am using GLSL as a framework for GPGPU for real-time image processing. I am currently trying to "shave off" a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I), a second calculates g(I), and the last one calculates h(f(I),g(I)).
My question is regarding efficiently calculating f(I),g(I): does it matter if I use 2 separate fragment shaders (and therefore 2 rendering passes), or if I use a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the "how-to"; not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-d filter, and also filter the rows of the squared image (each pixel is squared). f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? @derhass suggests that the answer is positive, because both operations access the same texture locations and benefit from locality. But if it wasn't for the texture-locality: would I still be enjoying a performance boost? Or is a single shader rendering to two outputs basically equivalent to two render passes?
Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can profit greatly from the texture cache between the different operations: you have to sample the input texture just once, instead of twice with the multipass approach.
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could just put h(f(I),g(I)) directly into one shader, so that you need neither multiple passes nor MRTs. If you want to be able to combine your operations dynamically, you can still use that approach and programmatically combine different shader code parts to implement the operations (where possible), and use multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an n x n filter kernel, you would have to access n x n different input texels, and since those inputs are not directly known, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation will be greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this.
use shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching the shaders might be the most expensive extra step here compared to the combined shader approach, as it might force a GPU pipeline flush. Also, for each draw call, you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, a lot of this overhead goes away, and the per-fragment calculations described so far can be "shared" between both filters.
But if it wasn't for the texture-locality: would I still be enjoying a
performance boost?
Because of what I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses here, especially if we assume reasonably high-resolution images, so that the overhead appears small relative to the total amount of work. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases, you do the image squaring and the filtering in one shader. You could also split that into two different shaders and render passes. In that case, you would get additional overhead (on top of what was already mentioned) for writing the intermediate results and having to read them back. However, since you run a filter over the intermediate result, you do not have to square any input texel more than once, whereas in the combined approach you do. If the squaring operation is expensive enough, and your filter size is big enough, you could in theory save more time than is introduced by the overhead of multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking with MRT vs. multiple render passes myself in the past, although the image processing operations I was interested in are a bit different from yours. What I found is that in such scenarios, the texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I think that your "But if it wasn't for the texture-locality" is a bit unrealistic, since texture access is the major contribution to the overall running time. And it isn't just the locality, it is also the total number of texture accesses: with your multiple-shader approach, an image of size w*h, and a 1D filter of size n, you will end up with 2*w*h*n texture accesses overall, while with the combined approach you reduce that to w*h*n, and that makes a huge difference.
For an AMD FirePro V9800, an image size of 1920x1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320 ms with two passes (even without switching shaders) vs. ~0.230 ms with one MRT pass. So execution time was reduced by "only" about 30%, but this was with just one texture fetch per shader invocation. With filter kernels, I'd expect this figure to get closer to a 50% reduction with increasing kernel size (though I haven't measured that).
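The access counts and the measured reduction above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for this example, the image size and timings are the ones quoted in the answer, and the 9-tap filter is an arbitrary illustrative value.

```python
def texture_fetches(width, height, taps, passes):
    """Total texture fetches when each pass reads `taps` texels per output pixel."""
    return passes * width * height * taps


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Two separate filter shaders vs. one combined MRT shader, 9-tap 1D filter (assumed size).
two_shaders = texture_fetches(1920, 1080, taps=9, passes=2)
combined = texture_fetches(1920, 1080, taps=9, passes=1)

# Timings quoted above, in milliseconds.
measured = relative_reduction(0.320, 0.230)
```

With these numbers the combined shader issues exactly half the fetches, while the measured copy-only timing corresponds to a reduction of a bit under 30%.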
Let us ignore any potential benefits from hardware-specific things like data cache, register re-use, etc. that might occur if you do your entire algorithm in a single shader invocation and focus entirely on algorithm complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O:
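Two-Pass Gaussian Blur (Two 1D blurs): roughly 2 * n samples per pixel for an n-tap kernel, i.e. O(w * h * n) for a w x h image
Single-Pass Gaussian Blur (Single 2D blur): n * n samples per pixel for an n x n kernel, i.e. O(w * h * n^2)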
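To make the comparison concrete, here is a small CPU sketch, assuming an odd-length separable kernel and clamp-to-edge borders (both are choices made for this example). It shows that a horizontal 1D blur followed by a vertical one gives the same image as the full 2D blur while reading far fewer samples per pixel. All function names are hypothetical.

```python
def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def blur_1d(img, kernel, horizontal):
    """One 1D blur pass over a list-of-rows image, along rows or along columns."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2  # kernel length is assumed to be odd
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, weight in enumerate(kernel):
                sx = clamp(x + k - r, 0, w - 1) if horizontal else x
                sy = y if horizontal else clamp(y + k - r, 0, h - 1)
                acc += weight * img[sy][sx]
            out[y][x] = acc
    return out


def blur_two_pass(img, kernel):
    """Horizontal pass, then vertical pass: 2 * n samples per pixel."""
    return blur_1d(blur_1d(img, kernel, True), kernel, False)


def blur_single_pass(img, kernel):
    """Full 2D blur with the outer product of the kernel: n * n samples per pixel."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i, wy in enumerate(kernel):
                for j, wx in enumerate(kernel):
                    sy = clamp(y + i - r, 0, h - 1)
                    sx = clamp(x + j - r, 0, w - 1)
                    acc += wy * wx * img[sy][sx]
            out[y][x] = acc
    return out


def samples_per_pixel(n):
    """Sample counts per output pixel for an n-tap separable kernel."""
    return {"two_pass": 2 * n, "single_pass": n * n}
```

For a 9-tap kernel that is 18 samples per pixel against 81, which is the kind of gap that can outweigh the cost of an extra pass.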
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations will do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing. When it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often good.
Your results may vary, but if you can show an appreciable difference in algorithm complexity in Big O notation using one approach over the other, it is worth exploring the run-time performance of both implementations.
I have a process that accumulates mostly static data over time--and a lot of it, millions of data elements. It is possible that small parts of the data may change occasionally, but mostly, it doesn't change.
However, I want to allow the user the freedom to change how this data is viewed, both in shape and color.
Is there a way that I could store the data on the GPU just as data, and then have a number of ways to convert that data to something renderable on the GPU? The user could then choose between those algorithms and we would swap one in efficiently without having to touch the data at all. Also, color ids would be in the data, but the user could change which color each id maps to, again without touching the data.
So, for example, maybe there are the following data:
[1000, 602, 1, 1]
[1003, 602.5, 2, 2]
NOTE: the data is NOT vertices, but rather may require some computation or lookup to be converted to vertices.
The user can choose between visualization algorithms. Let's say one would display 2 cubes each at (0, 602, 0) and (3, 602.5, 100). The user chooses that color id 1 = blue and 2 = green. So the origin cube is shown as blue and the other as green.
Then, without any modification to the data at all, the user chooses a different visualization, and now spheres are shown at (10, 602, 10) and (13, 602.5, 20), and the colors are different because the user changed the color mapping.
Yet another visualization might show lines between all the data elements, or a rectangle for each set of 4, etc.
Is the above description something that can be done in a straightforward way? How would it best be done?
Note that we would be adding new data, appending to the end, a lot. Bursts of thousands per second are likely. Modifications of existing data would be more rare and taking a performance hit for those cases is acceptable. User changing algorithm and color mapping would be relatively rare.
I'd prefer to do this using a cross platform API (across OS and GPU's), so I'm assuming OpenGL.
You can store your data in a VBO (in GPU memory) and update it when it changes.
On the GPU side, you can use a geometry shader to generate more geometry. Not sure how to switch from line to cube to sphere, but if you are drawing a triangle at each location, your GS can output "extra" triangles (ditto for lines and points).
As for the color change feature, you can bake that logic into the vertex shader. The idx (1, 2, ...) should be a vertex attribute; have the VS look up a table giving idx -> color mappings (this could be stored as a small texture). You can update the texture to change the color mapping on the fly.
For applications like yours there are special GPGPU programming infrastructures: CUDA and OpenCL. OpenCL is the cross-vendor system. CUDA is cross-platform, but supports only NVIDIA GPUs. OpenGL also introduced general-purpose compute functionality in OpenGL 4.3 (compute shaders).
and a lot of it, millions of data elements
Millions is not a lot. Even if a single element consumed 100 bytes, that would be only roughly 100 MiB to transfer per million elements. Modern GPUs can transfer about 10 GiB/s from/to host system memory.
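The estimate can be written out as a quick calculation. The function name is made up for this example, and the 100 bytes per element and 10 GiB/s figures are the rough assumptions from the paragraph above, not measurements.

```python
def transfer_time_seconds(num_elements, bytes_per_element, gib_per_second):
    """Time to move the whole data set over a link with the given bandwidth."""
    total_bytes = num_elements * bytes_per_element
    return total_bytes / (gib_per_second * 1024 ** 3)


# One million elements of 100 bytes each over a 10 GiB/s link.
seconds = transfer_time_seconds(1_000_000, 100, 10)
total_mib = 1_000_000 * 100 / 1024 ** 2
```

Under these assumptions a full re-upload of a million elements is a bit under 100 MiB and takes on the order of 10 ms, so appending bursts of a few thousand elements per second is unlikely to be limited by transfer bandwidth.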
Is the above description something that can be done in a straightforward way? How would it best be done?
Yes, it can be done. However, you'll only really see performance if you can parallelize your problem and make its memory access pattern cater to what GPUs prefer. Bad memory access patterns in particular can cause a performance loss of several orders of magnitude.
I am currently working on some kind of virtual texture implementation. The mipmap levels are used as a level-of-detail controlling structure. (Every texel in the virtual texture relates to a block of data in the 'real' texture.)
The data exists in several detail levels which result in different block counts in the virtual texture.
Example:
level   size of data   number of blocks
0       60             4
1       30             2
2       15             1
My idea was to call glTexImage for every detail level in the virtual texture to create the different mipmap levels.
The problem is that although there are no errors when creating or updating/loading, I can't get any data from the texture. Creating only the base level and calling glGenerateMipmap works fine but results in the wrong sizes for some base sizes. (Technically they are correct, but not for my case.)
I read somewhere that mipmap level sizes must be a division by two (or by two and floor).
The questions:
Is it possible to load 'custom' mipmap levels?
Are there any constraints on mipmap level sizes?
You can load custom mipmap levels, but you cannot choose their sizes. OpenGL specifies the size it expects for each mipmap level and does not allow any deviation from it.
Taking width as example, the required width for mipmap level i is max(1, floor(w_b / 2^i)), where w_b is the width of the first mip level (the base). It is the same for the other dimensions (GL spec 2.1, section 3.8.8, paragraph mipmapping).
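Make sure you load mipmap levels all the way down to 1x1, or set GL_TEXTURE_MAX_LEVEL to the last level you actually provide; otherwise the texture is not mipmap complete and sampling it will not return your data.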
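The size rule above can be sketched as a small helper that lists the sizes OpenGL expects for every level of a chain. The function names are hypothetical; only the formula max(1, floor(w_b / 2^i)) comes from the spec text cited above.

```python
def mip_level_size(base_size, level):
    """Expected size of one dimension at a given mip level: max(1, floor(base / 2^level))."""
    return max(1, base_size // (2 ** level))


def mip_chain_sizes(base_width, base_height):
    """List (width, height) for every level from the base down to 1x1."""
    sizes = []
    level = 0
    while True:
        w = mip_level_size(base_width, level)
        h = mip_level_size(base_height, level)
        sizes.append((w, h))
        if w == 1 and h == 1:
            break
        level += 1
    return sizes
```

For a base level of 60x4, for example, this yields (60, 4), (30, 2), (15, 1), (7, 1), (3, 1), (1, 1): the sizes keep halving with rounding down, and each dimension stays at 1 once it gets there.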