CUDA cube map textures - opengl

How to deal with OpenGL cube map textures in CUDA?
When one wants to use OpenGL textures in a CUDA kernel, one of the steps is to retrieve a CUDA array from the registered and mapped resource, in this case a texture. In the driver API this is done with the cuGraphicsSubResourceGetMappedArray call, which is not a problem for a 2D texture. But for the aforementioned cube map, the third parameter of this function takes a face enum (such as CU_CUBEMAP_FACE_POSITIVE_X). So some questions arise: when one passes such an enum, the returned texture array will contain only the data of that particular face, right? How, then, can the cube texture be used as a whole to perform cube mapping, like this:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in a CUDA kernel, so that one needs to use 2D textures with the proper calculations and sampling in user code?

OK - I managed to solve the problem myself, though the solution isn't as simple as using another CUDA function.
To bind a CUDA texture reference to any texture, be it one obtained from OpenGL or D3D, one has to provide a CUDA array that is mapped to a resource, using cuGraphicsSubResourceGetMappedArray to retrieve it. As I mentioned in the question, this is simple for a one- or two-dimensional texture. With the other available types it is more complicated.
In every case we need a CUDA array for the reference to be bound to. The same goes for a cube map texture, but there the array has to be a 3D one. The problem is that the CUDA driver API provides only the aforementioned function, which retrieves a single layer from such a texture resource and maps it to a single two-dimensional array. To get what we want, we have to build the 3D array containing all the layers (or faces, in the case of a cube map) ourselves.
First of all, we get the arrays for each layer/face using the above function. The next step is to create the 3D array with a call to cuArray3DCreate, fed with the proper set of parameters (size/number of layers, level of detail, data format, number of channels per texel and some flags). Then we copy the layers' arrays into the 3D one with a series of calls to cuMemcpy3D, one for each layer/face array.
Finally, we set our target CUDA texture reference with cuTexRefSetArray, fed with the 3D array we created and copied to. Inside of the device code we create a reference with proper texture type and mode (float4 and cube map) and sample it with texCubemap.
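To make the copy step concrete, here is a minimal CPU-side sketch in plain C, not CUDA code. It assumes tightly packed texels, and the helper names layer_offset and pack_layers are hypothetical. The only point it illustrates is that face i ends up in slice i of one contiguous block, which is what setting dstZ = i in each cuMemcpy3D call is meant to achieve on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of layer/face `layer` inside a tightly packed layered image.
   This plays the role of dstZ = layer in the copy descriptor. */
size_t layer_offset(size_t width, size_t height, size_t bytes_per_texel, size_t layer)
{
    return layer * width * height * bytes_per_texel;
}

/* Copy `num_layers` separate 2D images into one contiguous layered buffer.
   `dst` must hold num_layers * width * height * bytes_per_texel bytes. */
void pack_layers(unsigned char *dst, const unsigned char *const *layers,
                 size_t num_layers, size_t width, size_t height,
                 size_t bytes_per_texel)
{
    size_t layer_bytes = width * height * bytes_per_texel;
    assert(dst != NULL && layers != NULL);
    for (size_t i = 0; i < num_layers; ++i) {
        assert(layers[i] != NULL);
        /* One copy per layer, mirroring one cuMemcpy3D call per face. */
        memcpy(dst + layer_offset(width, height, bytes_per_texel, i),
               layers[i], layer_bytes);
    }
}
```

For a cube map num_layers would be 6, with the source pointers ordered +X, -X, +Y, -Y, +Z, -Z.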
Below I put a fragment of the function which does all that, available in full length in CIRT Repository (cirt_server.c file, function cirtTexImage3D).
if (result)
{
    // Create a 3D array...
    CUDA_ARRAY3D_DESCRIPTOR layeredTextureDescr;
    layeredTextureDescr.Width = w;
    layeredTextureDescr.Height = h;
    layeredTextureDescr.Depth = d;
    layeredTextureDescr.Format = map_type_to_format(type);
    layeredTextureDescr.NumChannels = format == CIRT_RGB ? CIRT_RGBA : format;
    layeredTextureDescr.Flags = map_target_to_flags(target);
    if (result) result = LogCUDADriverCall(cuArray3DCreate(&hTexRefArray, &layeredTextureDescr),
        FUN_NAME(": cuArray3DCreate_tex3D"), __FILE_LINE__);
    // Copy the acquired layer/face arrays into the collective 3D one...
    CUDA_MEMCPY3D layerCopyDescr;
    layerCopyDescr.srcMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.srcXInBytes = 0;
    layerCopyDescr.srcZ = 0;
    layerCopyDescr.srcY = 0;
    layerCopyDescr.srcLOD = 0;
    layerCopyDescr.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.dstLOD = 0;
    layerCopyDescr.WidthInBytes = layeredTextureDescr.NumChannels * w;
    layerCopyDescr.Height = h;
    layerCopyDescr.Depth = target == CIRT_TEXTURE_CUBE_MAP ? 1 : d;
    layerCopyDescr.dstArray = hTexRefArray;
    for (i = 0; i < num_layers; ++i)
    {
        layer = ((num_layers == 6) ? CU_CUBEMAP_FACE_POSITIVE_X + i : i);
        layerCopyDescr.dstXInBytes = 0;
        layerCopyDescr.dstY = 0;
        layerCopyDescr.dstZ = i; // each face/layer goes into its own slice
        layerCopyDescr.srcArray = hLayres[i];
        if (result) result = LogCUDADriverCall(cuMemcpy3D(&layerCopyDescr),
            FUN_NAME(": cuMemcpy3D_tex3D"), __FILE_LINE__);
    }
    // Finally bind the 3D array with texture reference...
    if (result) LogCUDADriverCall(cuTexRefSetArray(hTexRef, hTexRefArray, CU_TRSA_OVERRIDE_FORMAT),
        FUN_NAME(": cuTexRefSetArray_tex3D"), __FILE_LINE__);
}
if (hLayres)
{
    // (body not shown in this fragment)
}
if (result)
{
    current->m_oTextureManager.m_cuTextureRes[current->m_oTextureManager.m_nTexCount++] = hTexResource;
}
I've checked it with cube maps only for now but it should work just fine with 3D texture as well.
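For the alternative raised in the question, doing the lookup by hand on separate 2D faces, a rough sketch of the face selection follows. It assumes the major-axis convention that OpenGL documents for cube maps and the face order of CUarray_cubemap_face (+X, -X, +Y, -Y, +Z, -Z). The names CubeCoord and cube_lookup are made up for illustration, and the tie-breaking on exact diagonals is my own choice, so treat it as a sketch, not as what texCubemap does internally.

```c
#include <assert.h>

typedef struct {
    int   face;  /* 0..5 in the order +X, -X, +Y, -Y, +Z, -Z */
    float s, t;  /* coordinates within the selected face, in [0, 1] */
} CubeCoord;

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Map a direction vector to a cube face and 2D coordinates on that face.
   The component with the largest magnitude (the major axis) picks the face. */
CubeCoord cube_lookup(float x, float y, float z)
{
    float ax = absf(x), ay = absf(y), az = absf(z);
    float ma, sc, tc;
    CubeCoord c;
    assert(ax > 0.0f || ay > 0.0f || az > 0.0f); /* zero vector has no direction */
    if (ax >= ay && ax >= az) {
        ma = ax; c.face = x >= 0.0f ? 0 : 1;
        sc = x >= 0.0f ? -z : z;  tc = -y;
    } else if (ay >= az) {
        ma = ay; c.face = y >= 0.0f ? 2 : 3;
        sc = x;  tc = y >= 0.0f ? z : -z;
    } else {
        ma = az; c.face = z >= 0.0f ? 4 : 5;
        sc = z >= 0.0f ? x : -x;  tc = -y;
    }
    /* Project onto the face and remap from [-1, 1] to [0, 1]. */
    c.s = (sc / ma + 1.0f) * 0.5f;
    c.t = (tc / ma + 1.0f) * 0.5f;
    return c;
}
```

With six separate 2D textures, the face index selects the texture and (s, t) is the coordinate to sample there. Filtering across face edges is not handled by this sketch.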

I'm not really familiar with CUDA itself, but I do have some experience with OpenGL and DirectX, and I am familiar with 3D graphics rendering APIs, libraries and pipelines and with setting up and using those APIs.
When I look at your question(s):
How to deal with OpenGL cube map textures in CUDA?
And you proceed to explain it by this:
When one wants to use OpenGL textures in a CUDA kernel, one of the steps is to retrieve a CUDA array from the registered and mapped resource, in this case a texture. In the driver API this is done with the cuGraphicsSubResourceGetMappedArray call, which is not a problem for a 2D texture. But for the aforementioned cube map, the third parameter of this function takes a face enum (such as CU_CUBEMAP_FACE_POSITIVE_X). So some questions arise: when one passes such an enum, the returned texture array will contain only the data of that particular face, right? How, then, can the cube texture be used as a whole to perform cube mapping, like this:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in a CUDA kernel, so that one needs to use 2D textures with the proper calculations and sampling in user code?
I went to NVIDIA's website for the CUDA API and programming documentation and found the function in question, cuGraphicsSubResourceGetMappedArray():
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
Get an array through which to access a subresource of a mapped graphics resource.
pArray - Returned array through which a subresource of resource may be accessed
resource - Mapped resource to access
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
mipLevel - Mipmap level for the subresource to access
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
If resource is not a texture then it cannot be accessed via an array and CUDA_ERROR_NOT_MAPPED_AS_ARRAY is returned. If arrayIndex is not a valid array index for resource then CUDA_ERROR_INVALID_VALUE is returned. If mipLevel is not a valid mipmap level for resource then CUDA_ERROR_INVALID_VALUE is returned. If resource is not mapped then CUDA_ERROR_NOT_MAPPED is returned.
Note that this function may also return error codes from previous, asynchronous launches.
Read more at:
Follow us: #GPUComputing on Twitter | NVIDIA on Facebook
This function is found in NVIDIA CUDA's driver API and not in the runtime API. When working with CUDA-capable hardware, it helps to understand that there is a difference between the host and device programmable pipelines, which is described here:
2. Heterogeneous Computing
CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.
While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.
However, the device is based on a distinctly different design from the host system, and it's important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively.
2.1. Differences between Host and Device
The primary differences are in threading model and in separate physical memories:
Threading resources - Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
Threads - Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.
RAM - The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory or vice versa as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.
Now, knowing that CUDA has two different APIs, we have to understand the difference between the two, described here: Difference Between the driver and runtime APIs
1. Difference between the driver and runtime APIs
The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.
Complexity vs. control
The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.
In comparison, the driver API offers more fine-grained control, especially over contexts and module loading. Kernel launches are much more complex to implement, as the execution configuration and kernel parameters must be specified with explicit function calls. However, unlike the runtime, where all the kernels are automatically loaded during initialization and stay loaded for as long as the program runs, with the driver API it is possible to only keep the modules that are currently needed loaded, or even dynamically reload modules. The driver API is also language-independent as it only deals with cubin objects.
Context management
Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().
Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.
Since this function is found in the driver API, it gives the programmer more control but also more responsibility, whereas the runtime API does things more automatically but gives you less control.
This matters since you mentioned that you are working with kernels. Looking again at the declaration of the function:
CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
CUgraphicsResource resource,
unsigned int arrayIndex,
unsigned int mipLevel )
The documentation tells me that the first parameter is a returned array through which a subresource of the resource may be accessed. The second parameter is the mapped graphics resource itself. The third parameter, which I believe is the one you asked about, is an enumerated type identifying a face, and you asked: when one passes such an enum, the returned texture array will contain only the data of that particular face, right? From what I gather from the documentation, this is an index value into the array of faces of your cube map resource.
Which can be seen from their documentation:
arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access
That parameter is an unsigned int, that is, an index into the textures that make up the cube map. A cube map has 6 faces. So if we look at the relationship between a cube map and its textures in pseudo code, we can see that:
// Texture
struct Texture {
    unsigned pixelsWidth;
    unsigned pixelsHeight;
    // Other Texture member variables or fields here.
};
// Only interested in the actual size of the texture `width by height`
// where these would be used to map this texture to one of the 6 faces
// of a cube:
struct CubeMap {
    Texture face[6];
    // face[0] = frontFace
    // face[1] = backFace
    // face[2] = leftFace
    // face[3] = rightFace
    // face[4] = topFace
    // face[5] = bottomFace
};
The cube map object has an array of textures that make up its faces, and according to the documentation the third parameter of the function in question asks for an index into this texture array. The function as a whole returns this:
Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.
I hope this helps answer your question about the third parameter of the function you are trying to use from their API.
The OP asked whether passing the enum CU_CUBEMAP_FACE_POSITIVE_X as the third parameter of the above function call returns only that face of the cube map, which is a texture. Here is the documentation for this enumerated type: enum CUarray_cubemap_face
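enum CUarray_cubemap_face - Array indices for cube faces
CU_CUBEMAP_FACE_POSITIVE_X = 0x00 - Positive X face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_X = 0x01 - Negative X face of cubemap
CU_CUBEMAP_FACE_POSITIVE_Y = 0x02 - Positive Y face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_Y = 0x03 - Negative Y face of cubemap
CU_CUBEMAP_FACE_POSITIVE_Z = 0x04 - Positive Z face of cubemap
CU_CUBEMAP_FACE_NEGATIVE_Z = 0x05 - Negative Z face of cubemap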
It appears to me that, when using this function to get texture data stored in a cube map, the enumerated value required for the third parameter is nothing more than an index into the cube map's faces, with CU_CUBEMAP_FACE_POSITIVE_X being index 0. Going by the documentation quoted above, the returned array gives access to the subresource that corresponds to arrayIndex, so passing CU_CUBEMAP_FACE_POSITIVE_X gives you back that particular face only, not the whole cube map. To reach all six faces you would call the function once per face index.


How do we display pixel data calculated in an OpenCL kernel to the screen using OpenGL?

I am interested in writing a real-time ray tracing application in C++, and I heard that using OpenCL-OpenGL interoperability is a good way to do this (to make good use of the GPU), so I have started writing a C++ project using this interoperability, with GLFW for window management. I should mention that although I have some coding experience, I do not have much in C++ and had not worked with OpenCL or OpenGL before attempting this project, so I would appreciate answers given with this in mind (that is, beginner-friendly terminology is preferred).
So far I have been able to get OpenCL-OpenGL interoperability working with an example using a vertex buffer object. I have also demonstrated that I can create image data with an RGBA array (at least on the CPU), send this to an OpenGL texture with glTexImage2D() and display it using glBlitFramebuffer().
My problem is that I don't know how to create an OpenCL kernel that is able to calculate pixel data such that it can be given as the data parameter in glTexImage2D(). I understand that to use the interoperability, we must first create OpenGL objects and then create OpenCL objects from these to write the data on as these objects share memory, so I am assuming I must first create an empty OpenGL array object then create an OpenCL array object from this to apply an appropriate kernel to which would write the pixel data before using the OpenGL array object as the data parameter in glTexImage2D(), but I am not sure what kind of object to use and have not seen any examples demonstrating this. A simple example showing how OpenCL can create pixel data for an OpenGL texture image (assuming a valid OpenCL-OpenGL context) would be much appreciated. Please do not leave any line out as I might not be able to fill in the blanks!
It's also very possible that the method I described above for implementing a ray tracer is not possible or at least not recommended, so if this is the case please outline an advised alternate method for sending OpenCL kernel calculated pixel data to OpenGL and subsequently drawing this to the screen. The answer to this similar question does not go into enough detail for me and the CL/GL interop link is not working. The answer mentions that this can be achieved using a renderbuffer rather than a texture, but it says at the bottom of the Khronos OpenGL wiki for Renderbuffer Objects that the only way to send pixel data to them is via pixel transfer operations but I can not find any straightforward explanation for how to initialize data this way.
Note that I am using OpenCL C (no C++ bindings).
From your second paragraph, you are creating an OpenCL context with a platform-specific combination of GLX_DISPLAY / WGL_HDC and GL_CONTEXT properties to interoperate with OpenGL, and you can create a vertex buffer object that can be read/written as necessary by both OpenGL and OpenCL.
That's most of the work. In OpenGL a buffer object is just memory, so you can bind the same buffer to GL_PIXEL_UNPACK_BUFFER and copy it into a texture with
glTexSubImage2D(GL_TEXTURE_2D, level, x, y, width, height, format, type, NULL);
with the NULL at the end meaning offset 0 into the buffer bound to GL_PIXEL_UNPACK_BUFFER (GPU memory) rather than a pointer to CPU memory.
As with copying from regular CPU memory, you might also need to change the pixel alignment (GL_UNPACK_ALIGNMENT, set with glPixelStorei) if your rows are not a multiple of 4 bytes.
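To illustrate the alignment remark, here is a small sketch of how many bytes OpenGL expects per row for a given unpack alignment. It assumes GL_UNPACK_ROW_LENGTH is left at 0 and that the row size in bytes is simply width times bytes per pixel; the function name unpack_row_stride is hypothetical and not part of any API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes from the start of one row to the start of the next,
   given the unpack alignment (1, 2, 4 or 8; OpenGL's default is 4). */
size_t unpack_row_stride(size_t width, size_t bytes_per_pixel, size_t alignment)
{
    size_t row_bytes = width * bytes_per_pixel;
    assert(alignment == 1 || alignment == 2 || alignment == 4 || alignment == 8);
    /* Round the row size up to the next multiple of the alignment. */
    return (row_bytes + alignment - 1) / alignment * alignment;
}
```

A 3-pixel-wide RGB8 image has 9 bytes per row, but with the default alignment of 4 OpenGL reads rows 12 bytes apart. So a kernel that writes tightly packed RGB rows would need the alignment set to 1, while RGBA8 rows are always a multiple of 4 bytes and need no change.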

Advice for Vulkan needed - how to efficiently switch texture per object/mesh in a game/app engine with dynamic content [duplicate]

I am in the middle of rendering different textures on multiple meshes of a model, but I do not have much of a clue about the procedure. Someone suggested that, for each mesh, I create its own descriptor set and call vkCmdBindDescriptorSets() and vkCmdDrawIndexed() for rendering, like this:
// Pipeline with descriptor set layout that matches the shared descriptor sets
// Mesh A
vkCmdBindDescriptorSets(...&meshA.descriptorSet... );
// Mesh B
vkCmdBindDescriptorSets(...&meshB.descriptorSet... );
However, the above approach is quite different from the chopper sample and Vulkan's samples, so I have no idea where to start making the change. I would really appreciate any help guiding me in the right direction.
You have a conceptual object which is made of multiple meshes which have different texturing needs. The general ways to deal with this are:
1. Change descriptor sets between parts of the object. Painful, but it works on all Vulkan-capable hardware.
2. Employ array textures. Each individual mesh fetches its data from a particular layer in the array texture. Of course, this restricts you to having each sub-mesh use textures of the same size. But it works on all Vulkan-capable hardware (up to 128 array elements, minimum). The array layer for a particular mesh can be provided as a push-constant, or a base instance if that's available.
   Note that if you manage to be able to do it by base instance, then you can render the entire object with a multi-draw indirect command. Though it's not clear that a short multi-draw indirect would be faster than just baking a short sequence of drawing commands into a command buffer.
3. Employ sampler arrays, as Sascha Willems suggests. Presumably, the array index for the sub-mesh is provided as a push-constant or a multi-draw's draw index. The problem is that, regardless of how that array index is provided, it will have to be a dynamically uniform expression. And Vulkan implementations are not required to allow you to index a sampler array with a dynamically uniform expression. The base requirement is just a constant expression.
   This limits you to hardware that supports the shaderSampledImageArrayDynamicIndexing feature. So you have to ask for that, and if it's not available, then you've got to work around that with #1 or #2. Or just don't run on that hardware. But the last one means that you can't run on any mobile hardware, since most of them don't support this feature as of yet.
Note that I am not saying you shouldn't use this method. I just want you to be aware that there are costs. There's a lot of hardware out there that can't do this. So you need to plan for that.
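The fallback planning described above could be sketched roughly as follows. This is plain C with no Vulkan calls: DeviceCaps is a hypothetical struct that you would fill from the shaderSampledImageArrayDynamicIndexing feature and the maxImageArrayLayers limit after querying the physical device, and the preference order (sampler array first, then array texture, then per-mesh descriptor sets) is only an assumption for illustration, not a recommendation from this answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    USE_SAMPLER_ARRAY,            /* option 3 above */
    USE_ARRAY_TEXTURE,            /* option 2 above */
    USE_DESCRIPTOR_SET_PER_MESH   /* option 1 above */
} TextureStrategy;

/* Hypothetical summary of what the device reported. */
typedef struct {
    bool     dynamic_sampler_indexing; /* shaderSampledImageArrayDynamicIndexing */
    uint32_t max_array_layers;         /* maxImageArrayLayers */
} DeviceCaps;

/* Pick a texturing strategy for one object made of several sub-meshes. */
TextureStrategy choose_strategy(DeviceCaps caps, uint32_t texture_count,
                                bool same_size_and_format)
{
    assert(texture_count > 0);
    if (caps.dynamic_sampler_indexing)
        return USE_SAMPLER_ARRAY;
    /* Array textures need every layer to share one size and format. */
    if (same_size_and_format && texture_count <= caps.max_array_layers)
        return USE_ARRAY_TEXTURE;
    /* Always available: rebind descriptor sets between sub-meshes. */
    return USE_DESCRIPTOR_SET_PER_MESH;
}
```

The useful part is that the decision is made once, at load time, so the rest of the renderer only has to implement the paths you actually intend to ship.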
The person that suggested the above code fragment was me I guess ;)
This is only one way of doing it. You don't necessarily have to create one descriptor set per mesh or per texture. If your mesh e.g. uses 4 different textures, you could bind all of them at once to different binding points and select them in the shader.
And if you take a look at NVIDIA's chopper sample, they do it pretty much the same way, only with some more abstraction.
The example also sets up descriptor sets for the textures used:
VkDescriptorSet *textureDescriptors = m_renderer->getTextureDescriptorSets();
binds them a few lines later:
VkDescriptorSet sets[3] = { sceneDescriptor, textureDescriptors[0], m_transform_descriptor_set };
vkCmdBindDescriptorSets(m_draw_command[inCommandIndex], VK_PIPELINE_BIND_POINT_GRAPHICS, layout, 0, 3, sets, 0, NULL);
and then renders the mesh with the bound descriptor sets:
vkCmdDrawIndexedIndirect(m_draw_command[inCommandIndex], sceneIndirectBuffer, 0, inCount, sizeof(VkDrawIndexedIndirectCommand));
vkCmdDraw(m_draw_command[inCommandIndex], 1, 1, 0, 0);
If you take a look at initDescriptorSets you can see that they also create separate descriptor sets for the cubemap, the terrain, etc.
The LunarG examples should work similarly, though if I'm not mistaken they never use more than one texture?

How do fragments get generated by rasterizer in OpenGL

I came across a description of rasterization. It basically says that when an object gets projected onto the screen, a scan takes place over all the pixels of the window/screen and decides whether each pixel/fragment is within the triangle; if it is, further processing of that pixel/fragment follows, such as colouring.
Now, since I am studying OpenGL, and I do know that OpenGL probably has its own implementations of this process, I was wondering whether this also takes place with OpenGL because of the "Scan-Conversion" process of vertices that I have read about in an OpenGL tutorial.
Another related question: I know that the image/screen/window of pixels is a 2D array of pixels, also known as the default framebuffer, that is linear.
So what I am wondering is, if that is the case, how would projecting the 3 vertices of a triangle define which pixels are covered inside it?
Does the rasterizer draw the edges of a triangle first and then scan through each pixel of the 2D array (the default framebuffer) and see if the points are between the lines using some mathematical method, or does some other, simpler process happen?
and I do know that OpenGL probably has its own implementations of this process
OpenGL is just a specification document. What runs on a computer is an OpenGL implementation, most of the time as part of a GPU driver. The actual workload is carried out by a GPU…
this also takes place with OpenGL because of the "Scan-Conversion" process of vertices that I have read about in an OpenGL tutorial
Most likely not. As a matter of fact, last weekend I was attending a Khronos (the group that specifies OpenGL) event hosted by AMD, and one of AMD's GPU engineers was lamenting that newbies have the scanline algorithm in mind with OpenGL, Direct3D, Mantle, Vulkan, etc., while GPUs do something entirely different.
2D array of pixels, also known as the default framebuffer, that is linear
Actually, the memory layout of pixels as used internally by the GPU is not linear (i.e. row-by-row) but follows a pattern that gives efficient localized access. For linear access, GPUs have extremely efficient copy engines that allow for practically zero-overhead conversion between the internal and linear format.
The exact layout used internally is a detail only the GPU engineers have insight into, though. But the fact that memory is not organized linearly but in a localized fashion is also one reason that the traditional scanline algorithm is not used by GPUs.
So what I am wondering is, if that is the case, how would projecting the 3 vertices of a triangle define which pixels are covered inside it?
Any method that satisfies the requirements of the OpenGL specification is allowed. The details are part of the OpenGL implementation, i.e. usually the combination of particular GPU model and driver version.
The scanline algorithm is what people did in software back in the 1990s, before modern GPUs. GPU developers figured out rather quickly that the algorithms you use for software rendering are vastly different from the algorithms you would implement in a VLSI implementation with billions of transistors. Algorithms optimized for hardware implementation tend to look fairly alien to anyone who comes from a software background anyway.
Another thing I'd like to clear up is that OpenGL doesn't say anything about "how" you render, it's just "what" you render. OpenGL implementations are free to do it however they please. We can find out "what" by reading the OpenGL standard, but "how" is buried in secrets kept by the GPU vendors.
Finally, before we start, the articles you linked are unrelated. They are about how ultrasonic scans work.
What do we know about scan conversion?
Scan conversion has as input a number of primitives. For our purposes, let's assume that they're all triangles (which is increasingly true these days).
Every triangle must be clipped by the clipping planes. This can add up to three additional sides to the triangle, in the worst case (turning it into a hexagon). This has to happen before perspective projection.
Every primitive must go through perspective projection. This process takes each vertex with homogeneous coordinates (X, Y, Z, W) and converts it to (X/W, Y/W, Z/W).
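As a tiny illustration of that last step, the divide by W can be written as below. This is only a sketch: the struct and function names are made up, and it assumes clipping has already removed vertices with W equal to zero.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Vec4; /* homogeneous clip-space position */
typedef struct { float x, y, z; } Vec3;    /* position after the divide */

/* Convert (X, Y, Z, W) to (X/W, Y/W, Z/W). */
Vec3 perspective_divide(Vec4 clip)
{
    Vec3 ndc;
    assert(clip.w != 0.0f); /* clipping is expected to have ruled this out */
    ndc.x = clip.x / clip.w;
    ndc.y = clip.y / clip.w;
    ndc.z = clip.z / clip.w;
    return ndc;
}
```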
The framebuffer is usually organized hierarchically into tiles, not linearly like the way you do in software. Furthermore, the processing might be done at more than one hierarchical level. The reason why we use linear organization in software is because it takes extra cycles to compute memory addresses in a hierarchical layout. However, VLSI implementations do not suffer from this problem, they can simply wire up the bits in a register how they want to make an address from it.
So you can see that in software, tiles are "complicated and slow" but in hardware they're "easy and fast".
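To show what the extra address arithmetic looks like in software, here is a sketch comparing a linear layout with a simple one-level tiled layout. The 8x8 tile size and the layout itself are assumptions chosen for illustration; real GPU layouts are not public and are likely more elaborate.

```c
#include <assert.h>

enum { TILE = 8 }; /* assumed tile edge length in pixels */

/* Row-by-row layout, as typically used in software renderers. */
unsigned linear_offset(unsigned x, unsigned y, unsigned width)
{
    return y * width + x;
}

/* Tiled layout: all 64 pixels of a tile are stored contiguously,
   and tiles themselves are stored row by row. */
unsigned tiled_offset(unsigned x, unsigned y, unsigned width)
{
    unsigned tiles_per_row, tile_index, in_tile;
    assert(width % TILE == 0);
    tiles_per_row = width / TILE;
    tile_index = (y / TILE) * tiles_per_row + (x / TILE);
    in_tile = (y % TILE) * TILE + (x % TILE);
    return tile_index * TILE * TILE + in_tile;
}
```

Because TILE is a power of two, every division and modulo here is just a selection of bits from x and y. In software that still costs a few instructions per access, whereas hardware can route those bits directly into the address.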
Some notes looking at the R5xx manual:
The R5xx series is positively ancient (2005) but the documentation is available online (search for "R5xx_Acceleration_v1.5.pdf"). It mentions two scan converters, so the pipeline looks something like this:
primitive output -> coarse scan converter -> quad scan converter -> fragment shader
The coarse scan converter appears to operate on larger tiles of configurable size (8x8 to 32x32), and has multiple selectable modes, an "intercept based" and a "bounding box based" mode.
The quad scan converter then takes the output of the coarse scan converter and outputs individual quads, which are groups of four samples. The depth values for each quad may be represented as four discrete values or as a plane equation. The plane equation allows the entire quad to be discarded quickly if the corresponding quad in the depth buffer also is specified as a plane equation. This is called "early Z" and it is a common optimization.
The fragment shader then works on one quad at a time. The quad might contain samples outside the triangle, which will then get discarded.
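As a rough software analogue of a quad with some samples outside the triangle, here is a sketch that tests the four pixel centres of a 2x2 quad with edge functions, a well-known textbook technique. It is not taken from the R5xx manual and is not a claim about what any GPU does internally; all names are hypothetical, and points exactly on an edge are counted as covered for simplicity (real rasterizers use a tie-breaking rule).

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;

/* Signed area test: the sign tells on which side of edge a->b point p lies. */
static float edge(Vec2 a, Vec2 b, Vec2 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* A point is inside if it is on the same side of all three edges. */
int covers(Vec2 a, Vec2 b, Vec2 c, Vec2 p)
{
    float e0 = edge(a, b, p), e1 = edge(b, c, p), e2 = edge(c, a, p);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

/* 4-bit coverage mask for the 2x2 quad whose top-left pixel is (qx, qy).
   Bit i is set if the centre of pixel (qx + i%2, qy + i/2) is covered. */
unsigned quad_coverage(Vec2 a, Vec2 b, Vec2 c, int qx, int qy)
{
    unsigned mask = 0;
    for (int i = 0; i < 4; ++i) {
        Vec2 p = { qx + (i & 1) + 0.5f, qy + (i >> 1) + 0.5f };
        if (covers(a, b, c, p))
            mask |= 1u << i;
    }
    return mask;
}
```

A mask that is neither 0 nor 0xF is the partially covered case: the quad is still processed as a unit, and the uncovered samples are discarded afterwards.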
It's worth mentioning again that this is an old graphics card. Modern graphics cards are more complicated. For example, the R5xx doesn't even let you sample textures from the vertex shaders.
If you want an even more radically different picture, look up the PowerVR GPU implementations which use something called "tile-based deferred rendering". These modern and powerful GPUs are optimized for low cost and low power consumption, and they challenge a lot of your assumptions about how renderers work.
Quoting from GPU Gems: Parallel Prefix Sum (Scan) with CUDA, it describes how OpenGL does its scan
conversion and compares it with CUDA which I think suffices as the answer of my question:
Prior to the introduction of CUDA, several researchers implemented
scan using graphics APIs such as OpenGL and Direct3D (see Section
39.3.4 for more). To demonstrate the advantages CUDA has over these APIs for computations like scan, in this section we briefly describe
the work-efficient OpenGL inclusive-scan implementation of Sengupta et
al. (2006). Their implementation is a hybrid algorithm that performs a
configurable number of reduce steps as shown in Algorithm 5. It then
runs the double-buffered version of the sum scan algorithm previously
shown in Algorithm 2 on the result of the reduce step. Finally it
performs the down-sweep as shown in Algorithm 6.
Example 5. The Reduce Step of the OpenGL Scan Algorithm
1: for d = 1 to log2 n do
2:     for all k = 1 to n/2^d - 1 in parallel do
3:         a[d][k] = a[d - 1][2k] + a[d - 1][2k + 1]
Example 6. The Down-Sweep Step of the OpenGL Scan Algorithm
1: for d = log2 n - 1 down to 0 do
2:     for all k = 0 to n/2^d - 1 in parallel do
3:         if i > 0 then
4:             if k mod 2 ≠ 0 then
5:                 a[d][k] = a[d + 1][k/2]
6:             else
7:                 a[d][i] = a[d + 1][k/2 - 1]
The OpenGL scan computation is implemented using pixel shaders, and
each a[d] array is a two-dimensional texture on the GPU. Writing to
these arrays is performed using render-to-texture in OpenGL. Thus,
each loop iteration in Algorithm 5 and Algorithm 2 requires reading
from one texture and writing to another.
The main advantages CUDA has over OpenGL are its on-chip shared
memory, thread synchronization functionality, and scatter writes to
memory, which are not exposed to OpenGL pixel shaders. CUDA divides
the work of a large scan into many blocks, and each block is processed
entirely on-chip by a single multiprocessor before any data is written
to off-chip memory. In OpenGL, all memory updates are off-chip memory
updates. Thus, the bandwidth used by the OpenGL implementation is much
higher and therefore performance is lower, as shown previously in
Figure 39-7.
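To make the quoted reduce and down-sweep steps concrete, here is a small CPU-side sketch in Python. It is an interpretation, not the implementation of Sengupta et al. It reads the `i` in the quoted down-sweep as `k` and lets `k` start at 0 in the reduce step. It also assumes the even-`k` branch adds the element's own partial sum, which an inclusive scan needs. The hybrid middle step (the double-buffered scan of Algorithm 2) is skipped by reducing all the way to a single element. The function name and the power-of-two restriction are choices made for this sketch.

```python
def inclusive_scan(values):
    n = len(values)
    levels = n.bit_length() - 1
    if n == 0 or (1 << levels) != n:
        raise ValueError("this sketch assumes a non-empty power-of-two length")

    # Reduce step: a[d][k] = a[d-1][2k] + a[d-1][2k+1].
    # Each level is a new list, like rendering into a new texture.
    a = [list(values)]
    for d in range(1, levels + 1):
        prev = a[d - 1]
        a.append([prev[2 * k] + prev[2 * k + 1] for k in range(n >> d)])

    # Down-sweep step: turn each level into its own inclusive scan,
    # reading the already-scanned level above it.
    for d in range(levels - 1, -1, -1):
        above, cur = a[d + 1], a[d]
        out = []
        for k in range(n >> d):
            if k == 0:
                out.append(cur[0])                       # first element is unchanged
            elif k % 2 != 0:
                out.append(above[k // 2])                # odd k: parent already holds the prefix
            else:
                out.append(cur[k] + above[k // 2 - 1])   # even k: own value + prefix before it
        a[d] = out                                       # "write to another texture"
    return a[0]
```

Every pass builds a fresh list from the previous ones, which mirrors the point of the quote: in the shader version each of these passes is a full read from one texture and a full write to another in off-chip memory.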

Reducing RAM usage with regard to Textures

Currently, my app uses a large amount of memory after loading textures (~200 MB).
I am loading the textures into a char buffer, passing it along to OpenGL and then killing the buffer.
It would seem that this memory is used by OpenGL, which is doing its own texture management internally.
What measures could I take to reduce this?
Is it possible to prevent OpenGL from managing textures internally?
One typical solution is to keep track of which textures you need at a given camera position or time frame, and to load only those when you need them (as opposed to loading every single texture when the app starts). You will need a "manager" that controls the loading, unloading, and binding of the respective textures (e.g. a container that associates a string, the name of the texture, with the integer texture name generated by glGenTextures and bound with glBindTexture).
Another option is to reduce the overall quality/size of the textures you are using.
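A "manager" of the kind described in this answer could look roughly like the Python sketch below. It is only an outline under stated assumptions. The `load` and `unload` callbacks are hypothetical stand-ins for the code that would upload a texture to OpenGL and delete it again. The byte budget and the least-recently-used eviction policy are choices made for this sketch, not something the answer prescribes.

```python
from collections import OrderedDict

class TextureManager:
    def __init__(self, load, unload, budget_bytes):
        self._load = load        # hypothetical: name -> (texture_id, size_in_bytes)
        self._unload = unload    # hypothetical: texture_id -> None
        self._budget = budget_bytes
        self._resident = OrderedDict()   # name -> (texture_id, size), oldest first
        self.used_bytes = 0

    def get(self, name):
        # Return the texture id for `name`, loading it on first use.
        if name in self._resident:
            self._resident.move_to_end(name)      # mark as most recently used
            return self._resident[name][0]
        tex_id, size = self._load(name)
        self._resident[name] = (tex_id, size)
        self.used_bytes += size
        self._evict()
        return tex_id

    def _evict(self):
        # Drop least recently used textures until we fit the budget,
        # but never the one that was just requested (it is last in order).
        while self.used_bytes > self._budget and len(self._resident) > 1:
            _, (tex_id, size) = self._resident.popitem(last=False)
            self._unload(tex_id)
            self.used_bytes -= size
```

The drawing code would call `get("some_name")` right before binding. Textures that have not been requested for a while are the first to be unloaded when the budget is exceeded.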
It would seem that this memory is used by OpenGL, which is doing its own texture management internally.
No, not texture management. It just needs to keep the data somewhere. On modern systems the GPU is shared by several processes running simultaneously, and not all of the data may fit into fast GPU memory, so the OpenGL implementation must be able to swap data out. The GPU's fast memory is not storage; it's just another cache level, just like system memory is a cache for system storage.
Also GPUs may crash and modern drivers reset them in situ, without the user noticing. For this they need a full copy of the data as well.
Is it possible to prevent OpenGL from managing textures internally?
No, because this would either be tedious to do or break things. But what you can do is load only the textures you really need for drawing a given scene.
If you look through my writings about OpenGL, you'll notice that for years I have been telling people not to write silly things like "initGL" functions. Put everything into your drawing code. You'll go through a drawing scheduling phase anyway (you must sort translucent objects far-to-near, do frustum culling, etc.). That gives you the opportunity to check which textures you need, and to load them. You can even go as far as loading only the lower-resolution mipmap levels, so that a scene initially shows with low detail, and load the higher-resolution mipmaps in the background; this of course requires setting the minimum and maximum mip levels appropriately, as either a texture or a sampler parameter.
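To get a feeling for how much loading only the lower-resolution mipmap levels can save, the arithmetic can be sketched in a few lines of Python. The numbers assume an uncompressed texture with 4 bytes per texel and no driver-side padding or duplication, so treat them as a rough estimate. The function names are made up for this example. `base_level` plays the role of the lowest mip level that has been uploaded so far (in OpenGL that would typically be communicated through `GL_TEXTURE_BASE_LEVEL`).

```python
def mip_chain_bytes(width, height, bytes_per_texel=4):
    # Size in bytes of every mip level, from level 0 down to 1x1.
    sizes = []
    w, h = width, height
    while True:
        sizes.append(w * h * bytes_per_texel)
        if w == 1 and h == 1:
            break
        w, h = max(1, w // 2), max(1, h // 2)   # each level halves both dimensions
    return sizes

def resident_bytes(width, height, base_level, bytes_per_texel=4):
    # Bytes needed when only levels >= base_level are loaded.
    return sum(mip_chain_bytes(width, height, bytes_per_texel)[base_level:])
```

Under these assumptions a full 1024x1024 chain takes `resident_bytes(1024, 1024, 0)` = 5592404 bytes, while starting at level 2 (256x256) takes `resident_bytes(1024, 1024, 2)` = 349524 bytes, about one sixteenth of the full chain.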

Per-line texture processing accelerated with OpenGL/OpenCL

I have a rendering step which I would like to perform on a dynamically-generated texture.
The algorithm can operate on rows independently in parallel. For each row, the algorithm will visit each pixel in left-to-right order and modify it in situ (no distinct output buffer is needed, if that helps). Each pass uses state variables which must be reset at the beginning of each row and persist as we traverse the columns.
Can I set up OpenGL shaders, or OpenCL, or whatever, to do this? Please provide a minimal example with code.
If you have access to GL 4.x-class hardware that implements EXT_shader_image_load_store or ARB_shader_image_load_store, I imagine you could pull it off. Otherwise, in-situ read/write of an image is generally not possible (though there are ways with NV_texture_barrier).
That being said, once you start wanting pixels to share state the way you do, you kill off most of your potential gains from parallelism. If the value you compute for a pixel is dependent on the computations of the pixel to its left, then you cannot actually execute each pixel in parallel. Which means that the only parallelism your algorithm actually has is per-row.
That's not going to buy you much.
If you really want to do this, use OpenCL. It's much friendlier to this kind of thing.
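The per-row parallelism this answer talks about can be sketched on the CPU in Python: one task per row, and inside each task a strictly sequential left-to-right pass with state that is reset per row. This is only an illustration of the structure. The smoothing update is a made-up stand-in for the asker's unspecified algorithm, and the thread pool stands in for what would be one OpenCL work-item per row.

```python
from concurrent.futures import ThreadPoolExecutor

def process_row(row):
    # Sequential pass over one row; modifies the row in place.
    state = 0.0                                   # reset at the start of each row
    for x in range(len(row)):
        state = 0.5 * state + 0.5 * row[x]        # hypothetical update that depends on the left neighbour
        row[x] = state
    return row

def process_image(image, workers=4):
    # Rows are independent of each other, so they can be processed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_row, image))        # consume the iterator so errors surface here
    return image
```

For an image with H rows, at most H tasks can run at once, which is the limit on parallelism that the answer points out.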
Yes, you can do it. No, you don't need 4.X hardware for that, you need fragment shaders (with flow control), framebuffer objects and floating point texture support.
You need to encode your data into a 2D texture.
Store the "state variable" in the first pixel of each row, and encode the rest of the data into the remaining pixels. It goes without saying that a floating point texture format is recommended.
Use two framebuffers, and render them onto each other in a loop, using a fragment shader that updates the "state variable" in the first column and performs whatever operation you need on another column, which is the "current" one. To reduce the amount of wasted resources, you can limit rendering to the columns you want to process. The NVidia OpenGL SDK examples had "game of life", "GPGPU fluid", and "GPU particles" demos that work in a similar fashion: by encoding data into a texture and then using shaders to update it.
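The two-framebuffer loop can be mimicked on the CPU to show the data flow. In the Python sketch below, plain lists stand in for the two textures and `pass_column` stands in for one render pass. The smoothing update is a hypothetical placeholder for the real per-pixel operation. A real fragment shader would evaluate each output pixel independently rather than looping over rows.

```python
def pass_column(src, col):
    # One "render pass": read only from src, write only to a new buffer.
    dst = [row[:] for row in src]                 # untouched pixels are copied through
    for y, row in enumerate(src):                 # every row is independent
        state = 0.5 * row[0] + 0.5 * row[col]     # hypothetical update using the stored state
        dst[y][0] = state                         # column 0 carries the state to the next pass
        dst[y][col] = state                       # result for the "current" column
    return dst

def process(image):
    if not image:
        return []
    # Prepend a state column (initial state 0.0) to every row.
    front = [[0.0] + list(row) for row in image]
    for col in range(1, len(front[0])):
        back = pass_column(front, col)
        front = back                              # swap: this pass's output is the next pass's input
    return [row[1:] for row in front]             # strip the state column again
```

Note that this needs one pass per column, so a W-pixel-wide image costs W passes.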
However, just because you can do it doesn't mean you should, and it doesn't mean that it is guaranteed to be fast. Some GPUs might have very high texture memory read speed but relatively slow computation speed (and vice versa), and not all GPUs have many pipelines for processing things in parallel.
Also, depending on your app, CUDA or OpenCL might be more suitable.