Optimizing data visualization on the GPU? - opengl

I have a process that accumulates mostly static data over time--and a lot of it, millions of data elements. It is possible that small parts of the data may change occasionally, but mostly, it doesn't change.
However, I want to allow the user the freedom to change how this data is viewed, both in shape and color.
Is there a way that I could store the data on the GPU just as data, and then have a number of ways to convert that data into something renderable, also on the GPU? The user could then choose between those algorithms, and we would swap them in efficiently without having to touch the data at all. Also, color ids would be in the data, but the user could change which color each id maps to, again without touching the data.
So, for example, maybe the data contains the following elements:
[1000, 602, 1, 1]
[1003, 602.5, 2, 2]
NOTE: the data is NOT vertices, but rather may require some computation or lookup to be converted to vertices.
The user can choose between visualization algorithms. Let's say one would display two cubes, at (0, 602, 0) and (3, 602.5, 100). The user chooses that color id 1 = blue and 2 = green, so the origin cube is shown as blue and the other as green.
Then, without any modification to the data at all, the user chooses a different visualization, and now spheres are shown at (10, 602, 10) and (13, 602.5, 20), and the colors are different because the user changed the color mapping.
Yet another visualization might show lines between all the data elements, or a rectangle for each set of 4, etc.
Is the above description something that can be done in a straightforward way? How would it best be done?
Note that we would be adding new data, appending to the end, a lot; bursts of thousands of elements per second are likely. Modifications of existing data would be rarer, and taking a performance hit for those cases is acceptable. The user changing the algorithm or color mapping would be relatively rare.
I'd prefer to do this using a cross-platform API (across OSes and GPUs), so I'm assuming OpenGL.

You can store your data in a VBO (in GPU memory) and update it when it changes.
On the GPU side, you can use a geometry shader to generate more geometry. I'm not sure how you'd switch from line to cube to sphere, but if you are drawing a triangle at each location, your GS can output "extra" triangles (ditto for lines and points).
As for the color-change feature, you can bake that logic into the vertex shader. The id (1, 2, ...) should be a vertex attribute; have the VS look it up in a table giving the id -> color mapping (this could be stored as a small texture). You can update that texture to change the color mapping on the fly.
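A minimal sketch of that vertex-shader lookup, assuming the positions have already been derived from the data and the color table is a small 1-D texture indexed by the id attribute (all names are illustrative, not taken from the question):

#version 330 core

// Per-vertex inputs (illustrative names): a position derived from the data
// element and the color id stored with it.
layout(location = 0) in vec3 inPosition;
layout(location = 1) in int  inColorId;   // needs glVertexAttribIPointer on the host side

uniform mat4      uMvp;         // model-view-projection matrix
uniform sampler1D uColorTable;  // tiny 1-D texture: color id -> RGBA

out vec4 vColor;

void main()
{
    // Changing the contents of uColorTable re-colors everything
    // without touching the vertex data.
    vColor      = texelFetch(uColorTable, inColorId, 0);
    gl_Position = uMvp * vec4(inPosition, 1.0);
}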

For applications like yours there are special GPGPU programming infrastructures: CUDA and OpenCL. OpenCL is the cross-vendor system; CUDA is cross-platform but supports only NVidia GPUs. OpenGL also introduced general-purpose compute functionality (compute shaders) in OpenGL 4.3.
and a lot of it, millions of data elements
Millions is not very much; even if a single element consumed 100 bytes, a million elements would be only about 100 MB to transfer, and modern GPUs can transfer about 10 GiB/s to and from host system memory.
Is the above description something that can be done in a straightforward way? How would it best be done?
Yes, it can be done. However, you will only really see performance if you can parallelize your problem and make its memory access pattern cater to what GPUs prefer. Bad memory access patterns in particular can cause a performance loss of several orders of magnitude.

Related

how to calculate the number of specified colored pixels using GLSL?

I have a grayscale texture (8000×8000); the value of each pixel is an ID (actually, this ID is the ID of the triangle to which the fragment belongs; I want to use this method to calculate how many triangles, and which triangles, are visible in my scene).
Now I need to count how many unique IDs there are and what they are. I want to implement this with GLSL and minimize the data transfer between GPU memory and host RAM.
The initial idea I came up with is to use a shader storage buffer, bound to an array of size totalTriangleNum in GLSL, then iterate over the ID texture in a shader and increment the array element whose index equals the ID in the texture.
After that, read the buffer back to the OpenGL application and get what I want. Is this an efficient way to do it? Or are there better solutions, like a compute shader (which I'm not familiar with) or something else?
I want to use this method to calculate how many triangles and which triangles are visible in my scene
Given your description of your data let me rephrase that a bit:
You want to determine how many distinct values there are in your dataset, and how often each value appears.
This is commonly known as a histogram. Unfortunately (for you), generating histograms is among the problems that are not trivially solved on GPUs. Essentially you have to divide your image into smaller and smaller sub-images (BSP, quadtree, etc.) until you are down to single pixels, on which you perform the evaluation. Then you backtrack, propagating the sub-histograms upwards, essentially performing an insertion or merge sort on the histogram.
Generating histograms on GPUs is still actively researched, so I suggest you read up on the published academic work (usually accompanied by source code). Keywords: histogram, GPU.
This is a nice paper by AMD's GPU researchers: https://developer.amd.com/wordpress/media/2012/10/GPUHistogramGeneration_preprint.pdf
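For reference, the shader-storage idea from the question can be sketched as a compute shader on OpenGL 4.3+ hardware with atomic operations. This is only a sketch under assumptions not in the question: the IDs live in an unsigned-integer image, the buffer holds at least totalTriangleNum counters and was zeroed beforehand, and bindings, names and workgroup size are illustrative. Whether it is fast still depends on contention, since many threads hitting the same popular ID serialize on the atomic.

#version 430

layout(local_size_x = 16, local_size_y = 16) in;

// One counter per possible triangle id; must be cleared to zero before dispatch.
layout(std430, binding = 0) buffer Histogram {
    uint counts[];
};

// The 8000x8000 id texture, assumed to be bound as an unsigned-integer image.
layout(r32ui, binding = 0) readonly uniform uimage2D uIdImage;

void main()
{
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);
    if (p.x >= imageSize(uIdImage).x || p.y >= imageSize(uIdImage).y)
        return;

    uint id = imageLoad(uIdImage, p).r;
    atomicAdd(counts[id], 1u);   // counts[id] > 0 afterwards means triangle id is visible
}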

Draw large number (10M) of 2d polygons with opengl

I'm working on CAD software which needs to show a circuit blueprint containing more than 10M 2D polygons. Each polygon is simple; 95% of them are just rectangles, and the others have fewer than 10 vertices.
In order to show the whole design, I would need to create a huge vertex buffer which would definitely blow past the graphics memory limit. However, since most of the polygons won't be seen clearly at that scale, I'm thinking of using some pre-screening algorithm to minimize the number of polygons to draw. But if I do so, many polygons (each no larger than one pixel) will be gone, and the final image will be wrong.
Another thought is to separate the polygons into groups, each of which is strongly connected (touching), and then construct a large polygon for each group. Some level-of-detail algorithm could be used to reduce the number of points without changing the shapes. I'm not sure how fast these algorithms are, or whether I need to pre-calculate them for different zoom levels.
Is there any standard way to deal with this problem? I'm pretty sure it has been solved lots of times...
To clarify, we need to make this work on OpenGL 2.1.
You're targeting OpenGL 2.1, so client-side vertex arrays are available. This effectively means you don't have to upload anything to the GPU at all; the data is fetched from your program's address space on demand.
Of course 10M triangles is not a lot; some professions use programs in which a single frame ends up with 1G triangles. The amount of data required is easy enough to calculate:
10M # number of primitives
* 4 # number of vertices in a quad
* 4B # sizeof GLfloat
* 2 # number of elements in a 2D vector
= 320MB
That's not a lot. Most GPUs you can buy these days come with at least 512 MiB of memory, into which this fits nicely. However, even if your GPU doesn't have that much memory available, OpenGL's memory model is abstract, and data is swapped to and from the GPU as needed.

Cost of using multiple render targets

I am using GLSL as a framework for GPGPU real-time image processing. I am currently trying to "shave off" a few more milliseconds to make my application real-time. Here's the basic setup:
I take an input image, calculate several transformations of it, and then output a result image. For instance, let the input image be I. Then one fragment shader calculates f(I), a second calculates g(I), and the last one calculates h(f(I),g(I)).
My question is about efficiently calculating f(I) and g(I): does it matter if I use 2 separate fragment shaders (and therefore 2 rendering passes), or if I use a single fragment shader with 2 outputs? Will the latter run faster? I have mostly found discussions about the "how-to", not about the performance.
Edit
Thanks for the replies so far. Following several remarks, here's an example for my use-case with some more details:
I want to filter the rows of image I with a 1-D filter, and also filter the rows of the squared image (each pixel is squared): f(I) = filter rows and g(I) = square and filter rows:
shader1: (input image) I --> filter rows --> I_rows (output image)
shader2: (input image) I --> square pixels and filter rows--> I^2_rows (output image)
The question is: will writing a single shader that does both operations be faster than running these two shaders one after the other? @derhass suggests that the answer is yes, because both access the same texture locations and benefit from locality. But if it weren't for the texture locality, would I still get a performance boost? Or is a single shader rendering to two outputs basically equivalent to two render passes?
Using multiple render passes is usually slower than using one pass with MRT output, but this will also depend on your situation.
As I understand it, both f(I) and g(I) sample the input image I, and if each samples the same (or closely neighboring) locations, you can greatly profit from the texture cache between the different operations: you have to sample the input texture just once, instead of twice as with the multipass approach.
Taking this approach one step further: do you even need the intermediate results f(I) and g(I) separately? Maybe you could just put h(f(I),g(I)) directly into one shader, so you need neither multiple passes nor MRTs. If you want to be able to dynamically combine your operations, you can still use that approach and programmatically combine different shader code parts to implement the operations (where possible), and use multiple passes only where absolutely necessary.
EDIT
As the question has been updated in the meantime, I think I can give some more specific answers:
What I said so far, especially about putting h(f(I),g(I)) into one shader, is only a good idea if h (or f and g) does not need any neighboring pixels. If h is an n×n filter kernel, you would have to access n×n different input texels, and since those inputs are not directly known, you would have to calculate f and g for each of them. If both f and h are filter kernels, the effective filter size of the compound operation becomes greater, and it is much better to calculate the intermediate results first and use multiple passes.
Looking at the specific issue you describe, it comes down to this.
If you use two separate shaders in the most naive way, your rendering will look like this:
use shader1
select some output color buffer
draw a quad
use shader2
select some different color buffer
draw a quad
Every draw call has its overhead. The GL will have to do some extra validation. Switching the shaders might be the most expensive extra step here compared to the combined-shader approach, as it might force a GPU pipeline flush. Also, for each draw call you have the vertex processing, rasterization, and per-fragment attribute interpolation operations. With just one shader, a lot of this overhead goes away, and the per-fragment calculations described so far can be "shared" between both filters.
But if it wasn't for the texture-locality: would I still be enjoying a performance boost?
Because of the things I said so far, and specific to the shaders you presented, I tend to say: yes. But the effect will be very small to negligible if we ignore the texture accesses, especially if we assume reasonably high-resolution images, so that this relative overhead appears small compared to the total amount of work. I would at least say that using a single-pass MRT setup will not be slower. However, only benchmarking/profiling the very specific implementation on a specific GPU will give a definitive answer.
Why did I say "the shaders you presented"? Because in both cases you do the image squaring in one shader. You could split that into two different shaders and render passes as well. In that case, you would get additional overhead (on top of what was already mentioned) for writing the intermediate result and having to read it back. However, since you run a filter over the intermediate result, you do not have to square any input texel more than once, whereas in the combined approach you do. If the squaring operation is expensive enough and your filter size is big enough, you could in theory save more time than is introduced by the overhead of multiple passes. Again, only benchmarking/profiling can tell you where the break-even point lies.
I have done some benchmarking of MRT vs. multiple render passes myself in the past, although the image-processing operations I was interested in are a bit different from yours. What I found is that in such scenarios the texture access is the key factor, and you can hide a lot of other calculations (like squaring a color value) in the texture access latency. I think that your "but if it wasn't for the texture-locality" is a bit unrealistic, since it is the major contribution to the overall running time. And it isn't just the locality, it is also the total number of texture accesses: with your multiple-shader approach, an image of size w*h, and a 1-D filter of size n, you will end up with 2*w*h*n texture accesses overall, while the combined approach reduces this to w*h*n, and that can make a huge difference.
On an AMD FirePro V9800, with an image size of 1920×1080, and just copying the pixels to two output buffers by rendering textured quads, I got ~0.320 ms with two passes (even without switching shaders) vs. ~0.230 ms with one MRT pass. So execution time was reduced by "only" 30%, but this was with just one texture fetch per shader invocation. With filter kernels, I'd expect this figure to get closer to a 50% reduction as the kernel size increases (though I haven't measured that).
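For illustration, here is a minimal sketch of the combined single-pass shader with two outputs, assuming a horizontal 1-D filter whose weights arrive in a uniform array (the kernel size, names and locations are illustrative, not taken from the question):

#version 330 core

#define KERNEL_SIZE 9            // illustrative 1-D kernel size n

in vec2 vTexCoord;

uniform sampler2D uInput;        // the input image I
uniform float     uKernel[KERNEL_SIZE];
uniform float     uTexelWidth;   // 1.0 / image width

// Two render targets: f(I) = filtered rows, g(I) = squared then filtered rows.
layout(location = 0) out vec4 outFiltered;
layout(location = 1) out vec4 outFilteredSquared;

void main()
{
    vec4 sumF = vec4(0.0);
    vec4 sumG = vec4(0.0);
    for (int i = 0; i < KERNEL_SIZE; ++i) {
        vec2 offset = vec2(float(i - KERNEL_SIZE / 2) * uTexelWidth, 0.0);
        vec4 texel  = texture(uInput, vTexCoord + offset);  // sampled once, used twice
        sumF += uKernel[i] * texel;
        sumG += uKernel[i] * texel * texel;                 // squaring hides in the fetch latency
    }
    outFiltered        = sumF;
    outFilteredSquared = sumG;
}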
Let us ignore any potential benefits from hardware-specific things like data cache, register re-use, etc. that might occur if you do your entire algorithm in a single shader invocation and focus entirely on algorithm complexity for a minute.
A Gaussian Blur on a 2D image is a separable filter (X and Y can be blurred as a much simpler series of 1D blurs), and you can actually get better performance if you split the horizontal and vertical application into two passes.
Consider the complexity of two 1D blurs vs. one 2D blur in Big O:
Two-Pass Gaussian Blur (two 1-D blurs):   O(w * h * 2n)
Single-Pass Gaussian Blur (one 2-D blur): O(w * h * n^2)
(for an image of w × h pixels and a kernel width of n)
Deferred shading is another example. Instead of one massive loop over all lights in a single pass, many implementations will do one pass per light, shading only the area of the screen that each individual light actually covers.
Multi-pass is not always a bad thing; when it simplifies your algorithm, as in the case of a separable filter or light coverage, it is often a good thing.
Your results may vary, but if you can show an appreciable difference in algorithm complexity in Big O notation using one approach over the other, it is worth exploring the run-time performance of both implementations.
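To make the complexity argument concrete, here is a hedged sketch of the horizontal half of the two-pass approach; the vertical pass is the same shader with the offset applied to y, giving 2n fetches per pixel instead of the n² a single-pass 2-D kernel would need (names and kernel size are illustrative):

#version 330 core

// One half of the separable (two-pass) blur: n texture reads per pixel.
// Running it again with the offset applied to .y gives 2*n reads per
// pixel total, versus n*n reads for a single-pass 2-D kernel.
#define N 9                      // illustrative kernel width

in vec2 vTexCoord;

uniform sampler2D uImage;
uniform float     uWeights[N];   // 1-D Gaussian weights
uniform float     uTexelWidth;   // 1.0 / image width

out vec4 outColor;

void main()
{
    vec4 sum = vec4(0.0);
    for (int i = 0; i < N; ++i) {
        float dx = float(i - N / 2) * uTexelWidth;
        sum += uWeights[i] * texture(uImage, vTexCoord + vec2(dx, 0.0));
    }
    outColor = sum;
}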

I need advice how to improve graphics

I have a file containing 23 million records of the form {atomName, x, y, z, transparency}. I decided to use OpenGL to visualize it.
My task is to render it. In the first iteration, I used glBegin/glEnd blocks and drew every atom as a point of some color. This solution worked, but I got 0.002 fps.
Then I tried using VBOs. I created three buffers: vertex, color and index. This solution worked; I got 60 fps, but binding the buffers is not comfortable and I am drawing points, not spheres.
Then I read about VAOs, which can simplify binding the buffers. OK, that worked; the binding is comfortable now.
Now I want to draw spheres, not points. My thought was to generate, around each point, a set of vertices from which a sphere could be built (with some accuracy). But with 23 million points I would have to calculate ~12 or more vertices per point; 23,000,000 × 12 × 4 B (float) ≈ 1 GB of data, which is probably not a good solution.
What is the best next move? I cannot fully work out whether shaders are applicable to this task, or whether there are other ways.
About your drawing process
My task is to render it. In the first iteration, I used glBegin/glEnd blocks and drew every atom as a point of some color. This solution worked, but I got 0.002 fps.
Think about it: for every one of your 23 million records you make at least one function call directly (glVertex), and probably several more function calls implicitly through it. Even worse, glVertex likely causes a context switch. What this means is that your CPU hits several speed bumps for every vertex it has to process. A top-notch CPU these days has a clock rate of about 3 GHz and a pipeline length on the order of 10 instructions. When you make a context switch, that pipeline gets stalled; in the worst case it then takes one pipeline length to actually process one single instruction. Let's assume you have to perform at least 1000 instructions to process a single glVertex call (which is actually a rather optimistic estimate). That alone means you're limited to processing at most 3 million vertices per second, so at 23 million vertices that's already less than one FPS.
But you also have those context switches in there, which add a further penalty, and probably a lot of branching, which causes further pipeline flushes.
And that's just the glVertex call. You also have colors in there.
And you wonder that immediate mode is slow?
Of course it's slow. Using immediate mode has been discouraged for well over 15 years. Vertex arrays have been available since OpenGL 1.1.
This solution worked. I got 60 fps,
Yes, because all the data now resides in the GPU's own memory. GPUs are massively parallel and optimized to crunch exactly this kind of data.
but binding the buffers is not comfortable
Well, OpenGL is not a high level scene graph library. It's a mid to low level drawing API. You use it like a sophisticated pencil to draw on a digital canvas.
Then i read about VAO
Well, VAOs are meant to coalesce buffer objects that belong together, so it makes sense to use them.
Now i want to draw spheres, not points.
You have two options:
Using textured point sprites. This means that your points get an area when drawn, and that area gets a texture applied. I think this is the best method for you. Given the right shader you can even give your point sprites the right kind of depth values, so that your "spheres" will actually intersect like spheres in the depth buffer. (A rough fragment-shader sketch follows after these two options.)
The other option is instancing a single sphere mesh, using your atom records as control data for the instancing process. This would produce real sphere geometry. However, I fear that implementing an instanced drawing process might be a bit too advanced for your skill level at the moment.
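A rough sketch of the point-sprite route (option 1), assuming GL_PROGRAM_POINT_SIZE is enabled, the vertex shader sets gl_PointSize from the atom's projected radius, and the per-atom color is forwarded as vColor (all names are illustrative):

#version 330 core

// Fragment shader for a sphere "impostor" drawn as a point sprite.

in vec4 vColor;                  // per-atom color from the vertex shader

out vec4 outColor;

void main()
{
    // gl_PointCoord runs from (0,0) to (1,1) across the point's square.
    vec2  p  = gl_PointCoord * 2.0 - 1.0;
    float r2 = dot(p, p);
    if (r2 > 1.0)
        discard;                 // outside the circle: keep only the disc

    // Reconstruct the sphere normal and apply a simple directional light.
    vec3  normal = vec3(p, sqrt(1.0 - r2));
    float light  = max(dot(normal, normalize(vec3(0.3, 0.5, 1.0))), 0.1);
    outColor = vec4(vColor.rgb * light, vColor.a);

    // For correct intersections one would also write gl_FragDepth here,
    // derived from normal.z, the point size and the projection matrix.
}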
About drawing 23 million points
Seriously, what kind of display do you have available that can show 23 million distinguishable points? A typical computer screen has about 2000×1500 pixels. The highest-resolution displays you can buy these days have about 4k×2.5k pixels, i.e. 10 million individual pixels. Even if your atoms were evenly distributed in a plane, with 23 million atoms to draw each pixel would be overdrawn several times. You simply can't display 23 million individual atoms that way. Another way to look at it: the display's pixel grid implies spatial sampling, and you can't reproduce anything smaller than twice the average sampling distance (sampling theorem).
So it absolutely makes sense to draw only a subset of the data, namely the subset that's actually in view. Also, if you're zoomed very far out (i.e. you have the full dataset in view), it makes sense to coalesce nearby atoms.
It definitely makes sense to sort your data into a spatial subdivision structure. In your case I think an octree would be a good choice.

OpenGL - A way to display lot of points dynamically

I have a question regarding a subject that I am now working on.
I have an OpenGL view in which I would like to display points.
So far, this is something I can handle ;)
For every point, I have its coordinates (X ; Y ; Z) and a value (unsigned char).
I have a color array giving the link between one value and a color.
For example, 255 is red, 0 is blue, and so on...
I want to display those points in an OpenGL view.
I want to use a threshold value so that, depending on it, I can modify the transparency of a point's color based on that point's value.
I also want performance not to degrade even if I have a lot of points (5 billion in the worst case, but 1-2 million in a standard case).
I am now looking for an effective way to handle this.
I am interested in VBOs. I have read that they allow good performance and also that I can modify the buffer as I want without recalculating it from scratch (as I would with a display list).
So that I can solve the threshold issue.
However, doing this dynamically on a million points will involve some heavy calculation (at least a pretty bad for loop), no?
I am open to any suggestions and would like to discuss any of your ideas!
Trying to display a billion points or more is generally (forgive the pun) pointless.
Even an extremely high resolution screen has only a few million pixels. Nothing you can do will get it to display more points than that.
As such, your first step is almost undoubtedly to figure out a way to restrict your display to a number of points that's at least halfway reasonable. OpenGL can (and will) oblige if you ask it to display more, but your monitor won't, and neither will mine or anybody else's.
Not directly related to the OpenGL part of your question, but if you are looking at rendering massive point clouds you might want to read up on space partitioning hierarchies such as octrees to keep performance in check.
Put everything into one VBO. Draw it as an array of points: glDrawArrays(GL_POINTS, 0, num). Calculate the alpha in a pixel shader (using a threshold passed as a uniform).
If you want to change a small subset of the points, you can map a sub-range of the VBO. If you need to update large parts frequently, you can use transform feedback to keep the work on the GPU.
If you need to simulate something for the updates, you should consider using CUDA or OpenCL to run the update completely on the GPU. This will give you the best performance. Otherwise, you can use a single VBO and update it once per frame from the CPU. If this gets too slow, you could try multiple buffers and distribute the updates across several frames.
For the threshold, you should use a shader uniform variable instead of modifying the vertex buffer. This allows you to set a value per frame, which can then be combined with the data from the vertex buffer (for instance, you set a float minVal, and every vertex whose attribute is less than minVal gets discarded in the geometry shader).
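The same idea can also be applied per fragment rather than in a geometry shader. A minimal sketch, assuming the per-point value is forwarded from the vertex shader, the value-to-color table lives in a small 1-D lookup texture, and the threshold arrives as a uniform (names are illustrative):

#version 330 core

// Per-point value (0..255) forwarded from the vertex shader as a float.
in float vValue;

uniform sampler1D uColorMap;   // 256-entry value -> color table
uniform float     uThreshold;  // set once per frame; no VBO update needed

out vec4 outColor;

void main()
{
    if (vValue < uThreshold)
        discard;               // or fade instead: modulate alpha with smoothstep()

    // Normalize the 0..255 value into the lookup texture's 0..1 range.
    outColor = texture(uColorMap, vValue / 255.0);
}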