Understanding how textures work with CUDA - C++

I'm confused about how textures work with CUDA.
When I run deviceQuery on my GTX 780 I get this:
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
When I investigated the CUDA "particles" example, I found this:
checkCudaErrors(cudaBindTexture(0, oldPosTex, sortedPos, numParticles*sizeof(float4)));
where in my case I have raised numParticles to 1024 * 1024 * 2 (around 2.1 million).
How does this fit in the 1D texture?
Also, inside the kernels I found this (please explain further, as everything here is connected):
texture<float4, 1, cudaReadModeElementType> oldPosTex;
#define FETCH(t, i) tex1Dfetch(t##Tex, i)
and in a kernel:
float4 pos = FETCH(oldPos, sortedIndex);
What I also need to know: can I use this texture (with its defined size of numParticles*sizeof(float4)) in a framebuffer draw instead of drawing a VBO?

how does this fit in the 1D texture?
The texture hardware consists of two main parts, the texture filtering hardware and the texture cache. Texture filtering includes functionality such as interpolation, addressing by normalized floating point coordinates and handling out-of-bounds addresses (clamp, wrap, mirror and border addressing modes). The texture cache can store data in a space filling curve to maximize 2D spatial locality (and thereby the cache hit rate). It can also store data in a regular flat array.
The Maximum Texture Dimension Size refers to limitations in the texture filtering hardware, not the texture caching hardware. So it refers to limits you may hit when using functions like tex2D(), but not when using functions like tex1Dfetch(), which performs an unfiltered texture lookup. The code you gave is probably setting things up for tex1Dfetch().
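For illustration, here is a minimal sketch of that pattern: a texture reference bound to plain linear device memory and read with tex1Dfetch() by integer index. The kernel and variable names are my own assumptions, not the particles sample's actual code; the limit that matters for this path is cudaDeviceProp::maxTexture1DLinear (about 2^27 elements on that hardware generation), which easily covers ~2.1 million float4 elements.

#include <cuda_runtime.h>

texture<float4, 1, cudaReadModeElementType> oldPosTex;   // as in the sample

__global__ void readPositions(float4 *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(oldPosTex, i);   // unfiltered fetch by integer index
}

void launch(float4 *d_sortedPos, float4 *d_out, unsigned int numParticles)
{
    // Bind plain device memory; the applicable limit is maxTexture1DLinear,
    // not the 65536 reported for filtered 1D textures.
    cudaBindTexture(0, oldPosTex, d_sortedPos, numParticles * sizeof(float4));
    readPositions<<<(numParticles + 255) / 256, 256>>>(d_out, numParticles);
    cudaUnbindTexture(oldPosTex);
}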
please explain further, as everything here is connected
This question is too broad and may be why your question was downvoted.
What I also need to know: can I use this texture (with its defined size of numParticles*sizeof(float4)) in a framebuffer draw instead of drawing a VBO?
This is not a CUDA question as CUDA cannot draw anything. You should look into CUDA OpenGL interop to see if your question is answered there. If it's not, you should create a new question and describe your question more clearly.
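For orientation, the usual CUDA/OpenGL interop pattern looks like this (a sketch with assumed names, not the particles sample's exact code): register an existing VBO with CUDA, map it to get a device pointer a kernel can write, then unmap it before OpenGL draws the buffer.

#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

cudaGraphicsResource *posRes = 0;

void registerPosVbo(GLuint posVbo)   // posVbo: an existing OpenGL buffer object
{
    cudaGraphicsGLRegisterBuffer(&posRes, posVbo, cudaGraphicsMapFlagsWriteDiscard);
}

float4 *mapPosVbo()
{
    float4 *dPos = 0;
    size_t numBytes = 0;
    cudaGraphicsMapResources(1, &posRes, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&dPos, &numBytes, posRes);
    return dPos;                     // write particle positions here from a kernel
}

void unmapPosVbo()
{
    cudaGraphicsUnmapResources(1, &posRes, 0);   // after this, draw the VBO with OpenGL
}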

Related

Are Shader Storage Buffer Objects the right tool to have persistent memory between shader loops?

Context
I have a fragment shader that processes a 2D image. Sometimes a pixel may be considered "invalid" (RGB value 0/0/0) for a few frames, while being valid the rest of the frames. This causes temporal noise as these pixels flicker.
I'd like to implement a sort of temporal filter where each rendering loop, each pixel is "shown" (RGB value not 0/0/0) if and only if this pixel was "valid" in the last X loops, where X might be 5, 10, etc. I figured if I could have an array of the same size as the image, I could set the element corresponding to a pixel to 0 when that pixel is invalid and increment it otherwise. And if the value is >= X, then the pixel can be displayed.
Image latency caused by the temporal filter is not an issue, but I want to minimize performance costs.
The question
So that's the context. I'm looking for a mechanism that allows me to read and write data (uniforms are therefore out) between different rendering loops of the same fragment shader. Reading back the data from my OpenGL application is a plus but not necessary.
I came across Shader Storage Buffer Object, would it fit my needs?
Are there other concerns I should be aware of? Performances? Coherency/memory barriers?
Yes, SSBOs are a suitable tool to have persistent memory between shader loops.
As I couldn't find a reason why it wouldn't work, I implemented it, and I was indeed able to use an SSBO as an array with each element mapped to a pixel in order to do temporal filtering on each pixel.
I had to do a few things to not have artifacts in the image:
Use GL_DYNAMIC_COPY when uploading the data with glBufferData.
Declare my SSBO as volatile in the shader.
Use a barrier (memoryBarrierBuffer();) in my shader to separate the writing and reading of the SSBO.
As mentioned by @user253751 in a comment, I had to convert texture coordinates to array indices.
I checked the performance costs of using the SSBO and they were negligible in my case: <0.1 ms for a 848x480 frame.
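For reference, a minimal sketch of that setup (buffer layout, binding index and names are my own assumptions, not the exact code used above): one uint counter per pixel, allocated with GL_DYNAMIC_COPY and bound to the binding point the shader declares as volatile.

#include <GL/glew.h>
#include <vector>

GLuint validCountSsbo = 0;

void createCounterBuffer(int width, int height)
{
    std::vector<GLuint> zeros(width * height, 0);            // one counter per pixel
    glGenBuffers(1, &validCountSsbo);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, validCountSsbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, zeros.size() * sizeof(GLuint),
                 zeros.data(), GL_DYNAMIC_COPY);             // as in point 1 above
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, validCountSsbo);  // binding = 0
}

// Matching declaration in the fragment shader (GLSL), kept here as a string:
const char *ssboDeclaration = R"(
    layout(std430, binding = 0) volatile buffer ValidCount { uint count[]; };
    // uint index = uint(gl_FragCoord.y) * imageWidth + uint(gl_FragCoord.x);
    // count[index] = isValid ? count[index] + 1u : 0u;
    // memoryBarrierBuffer();   // separate the write from any later read
)";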

How to do particle binning in OpenGL?

Currently I'm creating a particle system and I would like to transfer most of the work to the GPU using OpenGL, both to gain experience and for performance reasons. At the moment, there are multiple particles scattered through space (these are currently still created on the CPU). I would more or less like to create a histogram of them. If I understand correctly, for this I would first translate all the particles from world coordinates to screen coordinates in a vertex shader. However, now I want the following:
For each pixel, a hit count of how many particles are inside it. Each particle also has several properties (e.g. a colour) and I would like to sum them for every pixel (as shown in the lower-right corner). Would this be possible using OpenGL? If so, how?
The best tool I can recommend for keeping the whole data set (if it fits in GPU memory) is an SSBO.
Nevertheless, you need the data after transforming it (e.g. by a projection). An SSBO is still your best option:
In the fragment shader you read the properties of the particles already handled (say, at the rendered pixel) and write the modified properties (number of particles at this pixel, colour, etc.) to the same index in the buffer.
Due to the parallel nature of the GPU, several shader invocations coming from different particles may be doing the work for the same index concurrently. You need to handle this yourself; read about the memory model and atomic operations.
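As an illustration of that atomic path (a sketch under my own naming assumptions, not a drop-in solution), the per-pixel bins can be updated with atomicAdd so concurrent invocations hitting the same index don't lose updates:

// Fragment shader (GLSL), kept here as a C++ string for reference:
const char *binningFs = R"(
    #version 430
    layout(std430, binding = 0) buffer Bins {
        uint hitCount[];          // one bin per screen pixel
    };
    uniform int screenWidth;
    void main()
    {
        uint index = uint(gl_FragCoord.y) * uint(screenWidth) + uint(gl_FragCoord.x);
        atomicAdd(hitCount[index], 1u);   // safe under concurrent invocations
        // Colour sums would need integer atomics on fixed-point values, or
        // imageAtomicAdd on an integer image, handled the same way.
    }
)";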
Another, more limited, approach is blending.
The idea is that each fragment increments the colour value already in the framebuffer. This can be done using GL_FUNC_ADD for glBlendEquationSeparate and outputting a fragment colour of 1/255 (normalized integer) for each RGBA component.
Limitations come from the [0-255] range: only up to 255 particles can be counted in the same pixel; anything beyond that is clamped to this range and so "lost".
You have four components (RGBA), so four properties can be handled, but you can have several renderbuffers in an FBO.
You can read the FBO back with glReadPixels. Call glReadBuffer first with GL_COLOR_ATTACHMENTi if you use an FBO instead of the default framebuffer.
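A sketch of the blending variant (constants and names are illustrative, not a complete renderer): additive blending while the fragment shader outputs 1/255 per channel, followed by the readback.

#include <GL/glew.h>
#include <vector>

void drawParticleHistogram(GLsizei width, GLsizei height, GLsizei particleCount)
{
    glEnable(GL_BLEND);
    glBlendEquationSeparate(GL_FUNC_ADD, GL_FUNC_ADD);       // accumulate, don't mix
    glBlendFuncSeparate(GL_ONE, GL_ONE, GL_ONE, GL_ONE);
    // The fragment shader outputs vec4(1.0 / 255.0), so each particle hitting a
    // pixel adds one step to every 8-bit channel (saturating at 255).
    glDrawArrays(GL_POINTS, 0, particleCount);

    // Read the result back; use GL_COLOR_ATTACHMENT0 when rendering to an FBO.
    std::vector<unsigned char> counts(size_t(width) * height * 4);
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, counts.data());
}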

OpenGL: Why do square textures take less memory

Question:
Why does the same amount of pixels take dramatically less video memory if stored in a square texture than in a long rectangular texture?
Example:
I'm creating 360 4x16384 size textures with the glTexImage2D command. Internal format is GL_RGBA. Video memory: 1328 MB.
If I'm creating 360 256x256 textures with the same data, the memory usage is less than 100MB.
Using an integrated Intel HD4000 GPU.
It's not about the texture being rectangular. It's about one of the dimensions being extremely small.
In order to select texels from textures in an optimal fashion, hardware will employ what's known as swizzling. The general idea is that it will restructure the bytes in the texture so that pixels that neighbor each other in 2 dimensions will be neighbors in memory too. But doing this requires that the texture be of a certain minimum size in both dimensions.
Now, the texture filtering hardware can ignore this minimum size and only fetch from pixels within the texture's actual size. But that extra storage is still there, taking up space to no useful purpose.
Given what you're seeing, there's a good chance that Intel's swizzling hardware has a base minimum size of 32 or 64 pixels.
In OpenGL, there's not much you can do to detect this incongruity other than what you've done here.

Rendering visualization of spectrogram efficiently

I'm trying to find a clever way to render a large spectrogram (say, fullscreen). A spectrogram is a coordinate-system, where the x-axis is time, the y-axis is frequency and the colour intensity is the magnitude of the frequency component, and it looks like this (youtube).
What's interesting to note is that each frame only a new one-pixel-wide column is new; the rest of the spectrogram is the same, only shifted left by one pixel. Currently I'm just writing to a circular software buffer acting like an image and drawing that, but it is obviously slow at high framerates and screen sizes.
Is there any obvious solution to this problem, using OpenGL (or some software trick - it has to be cross-platform, though)? Perhaps through some use of a buffer in GPU memory, with a shader that fills it (admittedly, I have a very vague understanding of OpenGL beyond drawing simple stuff)? As I see it, it revolves around keeping the old data in GPU memory.
Use a single-channel texture for the waterfall (this is what you're drawing, a waterfall plot) in which you update one column or row at a time using glTexSubImage. By using the GL_REPEAT wrap mode you can simply advance the texture coordinates beyond the bounds of the texture and it will, well, wrap. By moving the texture opposite to the updates you get the waterfall effect (i.e. a moving spectrogram, with the updates coming in at the right edge).
To give the whole thing color, use the texture's values as an index into a transfer function LUT texture using a fragment shader.
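A sketch of the per-frame update under assumed names (a GL_R32F texture with GL_REPEAT wrapping on the s axis): upload one new column with glTexSubImage2D and shift the texture coordinates so the newest column sits at the right edge.

#include <GL/glew.h>

void uploadSpectrumColumn(GLuint waterfallTex, int texWidth, int texHeight,
                          long frameIndex, const float *column /* texHeight magnitudes */)
{
    int x = int(frameIndex % texWidth);                 // write position wraps around
    glBindTexture(GL_TEXTURE_2D, waterfallTex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, x, 0, 1, texHeight,
                    GL_RED, GL_FLOAT, column);          // one-pixel-wide column
    // When drawing the fullscreen quad, let the s texture coordinate run from
    // (x + 1.0f) / texWidth to (x + 1.0f) / texWidth + 1.0f; with GL_REPEAT this
    // scrolls the plot left while the newest column stays at the right edge.
}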
You can use a GPU library for the spectrogram calculations: nnAudio
https://github.com/KinWaiCheuk/nnAudio

Check GPU OpenGL Limits

I was wondering if there is an easy way to query (programmatically) the GPU OpenGL limits for the following features:
- maximum 2D texture size
- maximum 3D texture size
- maximum number of vertex shader attributes
- maximum number of varying floats
- number of texture image units (in vertex shader, and in fragment shader)
- maximum number of draw buffers
I need to know these numbers in advance before writing my GPU Research Project.
glGet() is your friend, with:
GL_MAX_3D_TEXTURE_SIZE
GL_MAX_TEXTURE_SIZE
GL_MAX_VERTEX_ATTRIBS
GL_MAX_VARYING_FLOATS
GL_MAX_VERTEX_TEXTURE_IMAGE_UNITS (texture image units in the vertex shader)
GL_MAX_TEXTURE_IMAGE_UNITS (texture image units in the fragment shader)
GL_MAX_DRAW_BUFFERS
e.g.:
GLint result;
glGetIntegerv(GL_MAX_VARYING_FLOATS, &result);
Not quite sure what your project is setting out to achieve, but you might be interested in OpenCL if it's general-purpose computing and you weren't already aware of it, in particular CL/GL interop if there is a graphics element too and your hardware supports it.
As Damon pointed out in the comments, in practice it may be more complex than this for texture sizes. The problems arise because rendering may fall back from hardware to software for some texture sizes, and because the size of a texture varies depending upon the pixel format used. To work around this it is possible to use GL_PROXY_TEXTURE_* with glTexImage*.
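For example, a sketch of the proxy-texture check (GL_RGBA8 chosen here purely for illustration): the driver reports zero dimensions when the requested size/format combination would not be supported.

#include <GL/glew.h>

bool texture2DFits(GLsizei width, GLsizei height)
{
    GLint gotWidth = 0;
    glTexImage2D(GL_PROXY_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);      // no data is uploaded
    glGetTexLevelParameteriv(GL_PROXY_TEXTURE_2D, 0, GL_TEXTURE_WIDTH, &gotWidth);
    return gotWidth != 0;                               // 0 means it won't work
}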
As a complement to what was said by awoodland, and if you do not already know it, I think you should take a look at GLEW.
GLEW provides efficient run-time mechanisms for determining which OpenGL extensions are supported on the target platform.
http://glew.sourceforge.net/