Extract overlapping image patches from an image in CUDA - c++

I am currently planning on writing a function that extracts overlapping image patches from a 2D image (width x height) into a 3D batch of these patches (batch_id x patch_width x patch_height). As far as I know, there are no utilities in CUDA or OpenCV CUDA which make that very easy. (Please correct me if I am wrong here)
Since I need to resort to writing my own CUDA kernel for this task, I need to decide how to tackle it. As far as I can see, there are two ways to write the kernel:
1. Create a GPU thread for each pixel in the image and map this pixel to potentially multiple locations in the 3D batch.
2. Create a GPU thread for each pixel in the 3D batch and let it fetch its corresponding pixel from the image.
I didn't find a clear answer in the CUDA Programming Guide to whether any of these approaches has specific advantages or disadvantages. Would you favour one of these approaches or is there an even easier way of doing this?

I think 1 is better, because it can minimize memory transactions. Memory transactions are performed at a fixed granularity (e.g. 128 bytes for L1), so grouping data loads to issue as few cache transactions as possible can reduce processing time.
Of course, it's possible that both approaches result in the same number of memory transactions. I'm not certain about my choice, but keep this in mind when you write your kernel.
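Whichever mapping you pick, the core of the kernel is the index arithmetic between image coordinates and batch coordinates. Below is a minimal CPU reference of that mapping (the function name, `float` pixels, and the single stride parameter are assumptions for illustration); in a CUDA kernel, the loops over `b`, `py`, and `px` would become the thread index:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the batch-indexed mapping: every output pixel
// (b, py, px) fetches exactly one input pixel. Patches are laid out
// contiguously, one after another, in the output buffer.
std::vector<float> extractPatches(const std::vector<float>& img,
                                  int width, int height,
                                  int patchW, int patchH, int stride) {
    int patchesX = (width  - patchW) / stride + 1;
    int patchesY = (height - patchH) / stride + 1;
    std::vector<float> out(static_cast<std::size_t>(patchesX) * patchesY
                           * patchW * patchH);
    for (int b = 0; b < patchesX * patchesY; ++b) {
        int ox = (b % patchesX) * stride;   // top-left corner of patch b
        int oy = (b / patchesX) * stride;
        for (int py = 0; py < patchH; ++py)
            for (int px = 0; px < patchW; ++px)
                out[(static_cast<std::size_t>(b) * patchH + py) * patchW + px] =
                    img[(oy + py) * width + (ox + px)];
    }
    return out;
}
```

Note that with this layout, consecutive output indices map to consecutive input pixels within a row, which is what makes coalesced accesses possible in either kernel formulation.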

Related

drawing time series of millions of vertices using OpenGL

I'm working on a data visualisation application where I need to draw about 20 different time series overlaid in 2D, each consisting of a few million data points. I need to be able to zoom and pan the data left and right to look at it, and need to be able to place cursors on the data for measuring time intervals and inspecting data points. It's very important that when zoomed out all the way, I can easily spot outliers in the data and zoom in to look at them. So averaging the data can be problematic.
I have a naive implementation using a standard GUI framework on Linux which is way too slow to be practical. I'm thinking of using OpenGL instead (testing on a Radeon RX480 GPU), with orthogonal projection. I searched around and it seems using VBOs to draw line strips might work, but I have no idea whether this is the best solution (i.e. the one that would give me the best frame rate).
What is the best way to send data sets consisting of millions of vertices to the GPU, assuming the data does not change, and will be redrawn each time the user interacts with it (pan/zoom/click on it)?
In modern OpenGL (versions 3/4 core profile) VBOs are the standard way to transfer geometric / non-image data to the GPU, so yes you will almost certainly end up using VBOs.
Alternatives would be uniform buffers, or texture buffer objects, but for the application you're describing I can't see any performance advantage in using them - might even be worse - and it would complicate the code.
The single biggest speedup will come from having all the data points stored on the GPU instead of being copied over each frame as a 2D GUI might be doing. So do the simplest thing that works and then worry about speed.
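As a sketch of what "all the data on the GPU" means in practice: the CPU-side packing below builds the flat interleaved (x, y) array that a single glBufferData upload and one GL_LINE_STRIP draw per series would consume. Using the sample index as x is an assumption for illustration; substitute real timestamps if you have them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack one time series into a flat interleaved (x, y) float array,
// suitable for a one-time glBufferData upload with GL_STATIC_DRAW.
std::vector<float> packLineStrip(const std::vector<float>& samples) {
    std::vector<float> vbo;
    vbo.reserve(samples.size() * 2);
    for (std::size_t i = 0; i < samples.size(); ++i) {
        vbo.push_back(static_cast<float>(i)); // x: sample index
        vbo.push_back(samples[i]);            // y: measured value
    }
    return vbo;
}
```

After the one-time upload, pan and zoom only change the projection matrix uniform; the vertex data never crosses the bus again.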
If you are new to OpenGL, I recommend the book "OpenGL SuperBible" 6th or 7th edition. There are many good online OpenGL guides and references, just make sure you avoid the older ones written for OpenGL 1/2.
Hope this helps.

OpenGL rendering/updating loop issues

I'm wondering how e.g. graphics (/game) engines do their job with lots of heterogeneous data, while a customized simple rendering loop turns into a nightmare as soon as you have some small changes.
Example:
First, let's say we have some blocks in our scene.
Graphic-Engine: create cubes and move them
Customized: create a cube template for vertices, normals, etc., copy it and translate each copy to its position, and upload everything into e.g. a VBO. One glDraw* call does the job.
Second, some weird logic. We want block 1, 4, 7, ... to rotate on x-axis, 2, 5, 8, ... on y-axis and 3, 6, 9 on z-axis with a rotation speed linear to the camera distance.
Graphic-Engine: manipulating object's matrix and it works
Customized: (I think) a glDraw* call per object with a changing model-matrix uniform is not a good idea, so the transformation matrix should be something like a per-instance attribute? I'd have to update them every frame.
Third, a block should disappear if the distance to the camera is lower than any const value Q.
Graphic-Engine: if (object.distance(camera) < Q) scene.drop(object);
Customized: (I think) our vbo is invalid and we have to recreate it?
Again, back to the very first sentence: it feels like engines do those manipulations for free, while we have to rethink how to provide and update data. And while we do so, the engine (might, but I actually don't know) just says: 'update whatever you want, at least I'm going to send all the matrices'.
Another Example: What about a voxel-based world (e.g. Minecraft) where we only draw the visible surface, and we are able to throw a bomb and destroy many voxels. If the world's view data is in one huge buffer we only have one glDraw*-call but have to recreate the buffer every time then. If there are smaller chunks, we have many glDraw*-calls and also have to manipulate buffers, which are smaller.
So is it a good deal to send let's say 10MB of buffer update data instead of 2 gl*-calls with 1MB? How many updates are okay? Should a rendering loop deal with lazy updates?
I'm searching for a guideline on what a 60fps application should be able to update/draw per frame, to get a feeling for what is possible. In my tests, every attempted optimization just becomes another bottleneck.
And I don't want those tutorials that say: hey, there is a new cool gl*Instanced call which is super-fast, buuuuut you have to check whether your GPU supports it. I'd rather consider that an optimization than a meaningful first implementation anyway.
Do you have any ideas, sources, best practices or rule of thumb how a rendering/updating routine best play together?
My questions are all nearly the same:
How many updates per frame are okay on today's hardware?
Can I lazy-load data so that it's available after a few frames, but without freezing my application?
Do I have to do small updates and profile my loop to see whether there are some microseconds left until the next render?
Maybe I should implement a real-time profiler that develops a feeling over time for how expensive updates are, and determines the number of updates per frame?
Thank you.
It's unclear how any of your questions relate to your "graphics engine" vs "customized" examples. All the updates you do with a "graphics engine" are translated into those OpenGL calls in the end.
In brief:
How many updates per frame are okay on today's hardware?
Today's PCIe bandwidth is huge (it can go as high as 30 GB/s). However, to utilize it fully you have to reduce the number of transactions by consolidating OpenGL calls. The exact number of updates you can afford depends entirely on the hardware, the drivers, and the way you use them, and graphics hardware is diverse.
This is the kind of answer you didn't want to hear, but unfortunately you have to face the truth: to reduce the number of OpenGL calls you have to use the newer-version APIs. E.g. instead of setting each uniform individually, it's better to submit a bunch of them through uniform buffer objects. Instead of submitting the MVP of each model individually, it's better to use instanced rendering. And so on.
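As a sketch of the instancing idea applied to the rotating-blocks example from the question: the per-instance data is computed once into a contiguous array, so a single buffer update replaces one uniform change per object. The axis rule follows the question's description; the speed constant is an assumption for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-instance data for the questioner's rule: block 1, 4, 7, ... rotates
// on x, block 2, 5, 8, ... on y, block 3, 6, 9, ... on z, with rotation
// speed linear in camera distance. Packed contiguously, this array can be
// uploaded in one call and consumed by an instanced draw.
struct InstanceData { int axis; float speed; };  // axis: 0=x, 1=y, 2=z

std::vector<InstanceData> buildInstanceData(const std::vector<float>& camDist) {
    std::vector<InstanceData> out(camDist.size());
    for (std::size_t i = 0; i < camDist.size(); ++i) {
        out[i].axis  = static_cast<int>(i % 3);  // block 1 -> x, 2 -> y, 3 -> z
        out[i].speed = 0.1f * camDist[i];        // assumed linear coefficient
    }
    return out;
}
```

The shader would then select the rotation axis and angle per instance from this data, so the "weird logic" lives in one CPU-side pass plus one draw call.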
An even more radical approach would be to move to a lower-level (and newer) API, i.e. Vulkan, which aims to solve exactly this problem: the cost of submitting work to the GPU.
Can I lazy-load data to have it after a few frames, but without freezing my application
Yes, you can upload buffer objects asynchronously. See Buffer Object Streaming for details.
Do I have to do small updates and profile my loop to see whether there are some microseconds left until the next render?
Maybe I should implement a real-time profiler that develops a feeling over time for how expensive updates are, and determines the number of updates per frame?
You don't need any of these if you do it asynchronously.

Blitting surfaces in OpenGL

Both SDL and Game Maker have the concept of surfaces: images that you may modify on the fly and then display. I'm using OpenGL 1 and I'd like to know whether OpenGL has this concept of a surface.
The only ways that I came up with were:
Every frame create / destroy a new texture based on needs.
Every frame, update said texture based on needs.
These approaches don't seem very performant, but I see no alternative. Maybe this is how they are implemented in the mentioned engines.
Yes, these two are the ways you would do it in OpenGL 1.0. I don't think there are any other means as far as the 1.0 spec is concerned.
Link : https://www.opengl.org/registry/doc/glspec10.pdf
Do note that the textures are stored on the device memory (GPU) which is fast to access for shading. And the above approaches copy it between host (CPU) memory and device memory. Hence the performance hit is the speed of host-device copy.
Why are you limited to the OpenGL 1.0 spec? If you can go higher, you start getting more options:
Use GLSL shaders to directly edit content from one texture and output the same to another texture. Processing will be done on the GPU and a device-device copy is as fast as it gets.
Use CUDA. Map a texture to a CUDA array, use your kernel to modify the content. Or use OpenCL for non-NVIDIA cards.
Either of these would be the better scenario: as long as the modification can be executed in parallel, it will benefit from running on the GPU.
I would suggest trying the CPU copy method first, as it might be fast enough for your needs. The host-device copy is getting faster with the latest hardware. You might be able to get real-time 60 fps or higher even with this copy, unless it's a lot of textures you plan to do this for.
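To see why the copy can be fast enough, here is a back-of-envelope estimate of the host-device transfer cost (the 8 GB/s effective PCIe rate is an assumed figure for illustration, not a measurement; benchmark your own hardware):

```cpp
#include <cassert>

// Rough cost of one host->device texture upload: bytes / bandwidth,
// returned in milliseconds.
double uploadMillis(int width, int height, int bytesPerPixel,
                    double bandwidthGBps) {
    double bytes = static_cast<double>(width) * height * bytesPerPixel;
    return bytes / (bandwidthGBps * 1e9) * 1e3;
}
```

For a 1920x1080 RGBA texture at the assumed 8 GB/s, this gives roughly 1 ms per upload, comfortably inside a 16.7 ms frame budget for a single texture.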

Large 3D scene streaming

I'm working on a 3D engine suitable for very large scene display.
Apart from the rendering itself (frustum culling, occlusion culling, etc.), I'm wondering what the best solution for scene management is.
Data is given as a huge list of 3D meshs, with no relation between them, so I can't generate portals, I think...
The main goal is to be able to run this engine on systems with low RAM (500MB-1GB), and the scenes loaded into it are very large and can contain millions of triangles, which leads to very intensive memory usage. I'm currently working with a loose octree, constructed on loading; it works well on small and medium scenes, but many scenes are just too huge to fit entirely in memory, so here comes my question:
How would you handle scenes to load and unload chunks dynamically (and ideally seamlessly), and what would you base on to determine if a chunk should be loaded/unloaded? If needed, I can create a custom file format, as scenes are being exported using a custom exporter on known 3D authoring tools.
Important information: Many scenes can't be effectively occluded, because of their construction.
Example: A very huge pipe network, so there isn't so much occlusion but very high number of elements.
I think the best solution will be a "solution pack": a pack of different techniques.
Level of detail (LOD) can reduce the memory footprint if unused levels are not loaded. It can be changed more or less seamlessly by alpha-mixing between the old and the new detail level. The easiest controller uses mesh distance to the camera.
Free the host memory (RAM) once an object has been uploaded to the GPU (device), and obviously free all unused memory (OpenGL resources too). Valgrind can help you with this one.
Use low-quality meshes and use tessellation to increase visual quality.
Use VBO indexing; this should reduce VRAM usage and increase performance.
Don't use meshes if possible; terrain can be rendered using heightmaps. Some things can be procedurally generated.
Use bump and/or normal maps. They improve perceived quality, so you can reduce the vertex count.
Divide those "pipes" into different meshes.
Fake 3D meshes with 2D images: impostors, skydomes...
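The distance-driven LOD controller from the first point can be sketched like this (the threshold values and fade width are assumptions; alpha is the blend factor toward the next-coarser level, for the alpha-mix transition):

```cpp
#include <cassert>
#include <cmath>

// Pick an LOD level from camera distance, and return a cross-fade alpha
// that ramps from 0 to 1 over the last fadeWidth units before a switch.
struct LodChoice { int level; float alpha; };

LodChoice chooseLod(float distance, const float* thresholds, int numLevels,
                    float fadeWidth) {
    int level = 0;
    while (level < numLevels - 1 && distance > thresholds[level]) ++level;
    float alpha = 0.0f;
    if (level < numLevels - 1) {
        float toNext = thresholds[level] - distance; // distance until switch
        if (toNext < fadeWidth) alpha = 1.0f - toNext / fadeWidth;
    }
    return {level, alpha};
}
```

During the fade, both levels are drawn blended by alpha; outside it, only one level needs to be resident, which is where the memory savings come from.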
If the vast majority of the RAM is going to be used by textures, there are commercial packages available, such as the GraniteSDK, that offer seamless LOD-based texture streaming using a virtual texture cache. See http://graphinesoftware.com/granite . Alternatively you can look at http://ir-ltd.net/
In fact you can use the same technique to construct poly's on the fly from texture data in the shader, but it's going to be a bit more complicated.
For voxels there are techniques to construct octrees entirely in GPU memory and page in/out the parts you really need. The rendering can then be done using raycasting. See this post: Use octree to organize 3D volume data in GPU , http://www.icare3d.org/research/GTC2012_Voxelization_public.pdf and http://www.cse.chalmers.se/~kampe/highResolutionSparseVoxelDAGs.pdf
It comes down to how static the scene is going to be and, following from that, how well you can pre-bake the data according to your visualization needs. It would already help if you could determine visibility constraints up front (e.g. google Potential Visibility Sets) and organize the data so that you can stream it on request. Since the visualizer will have limits, you always end up with a strategy to fit a section of the data into GPU memory as quickly and accurately as possible.
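One simple way to decide what to load and unload, in the spirit of the above: sort chunks by camera distance and keep them resident nearest-first until a memory budget is exhausted; everything past the budget is a candidate for unloading. The Chunk fields and the budget value are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Chunk { int id; float distance; std::size_t bytes; };

// Return the ids of chunks that should stay resident: nearest chunks
// first, stopping when the memory budget would be exceeded.
std::vector<int> chooseResident(std::vector<Chunk> chunks,
                                std::size_t budgetBytes) {
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b) {
                  return a.distance < b.distance;
              });
    std::vector<int> resident;
    std::size_t used = 0;
    for (const Chunk& c : chunks) {
        if (used + c.bytes > budgetBytes) break;
        used += c.bytes;
        resident.push_back(c.id);
    }
    return resident;
}
```

Running this each time the camera moves a significant distance (rather than every frame), and loading/unloading the set difference asynchronously, keeps the streaming seamless.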

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
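For reference, the dependency structure being discussed looks like this on the CPU: rows are independent (and so can run in parallel), but within a row every pixel needs the state left behind by its left neighbour. The running sum here is a stand-in for whatever per-row state the real algorithm carries.

```cpp
#include <cassert>
#include <vector>

// In-situ row-wise pass: per-row state is reset at the start of each row,
// then each pixel is visited strictly left-to-right and modified in place.
void processRows(std::vector<float>& img, int width, int height) {
    for (int y = 0; y < height; ++y) {     // each row could be its own work-item
        float state = 0.0f;                // reset at the start of every row
        for (int x = 0; x < width; ++x) {  // sequential: depends on left pixel
            state += img[y * width + x];
            img[y * width + x] = state;    // modify in situ
        }
    }
}
```

In OpenCL this maps naturally to one work-item per row, with the inner loop kept sequential inside the kernel.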
Yes, you can do it. No, you don't need 4.x hardware for that; you need fragment shaders (with flow control), framebuffer objects, and floating-point texture support.
You need to encode your data into 2D texture.
Store "state variable" in 1st pixel for each row, and encode the rest of the data into the rest of the pixels. It goes without saying that it is recommended to use floating point texture format.
Use two framebuffers, and render them onto each other in a loop using fragment shader that updates "state variable" at the first column, and performs whatever operation you need on another column, which is "current". To reduce amount of wasted resources you can limit rendering to columns you want to process. NVidia OpenGL SDK examples had "game of life", "GDGPU fluid", "GPU partciles" demos that work in similar fashion - by encoding data into texture and then using shaders to update it.
However, just because you can do it doesn't mean you should, and it doesn't mean it is guaranteed to be fast. Some GPUs have a very high texture memory read speed but relatively slow computation speed (and vice versa), and not all GPUs have many pipelines for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.