I wonder what the best practice is concerning vertex cache management.
I have read numerous articles on this topic, but I'm still not convinced about which choice is best.
I'm coding a small 3D rendering engine and my goal is to optimize the rendering by limiting the number of draw calls and, of course, the number of buffer bindings!
So far, I gather into a single batch all the vertices of the objects sharing the same material properties (same lighting model properties and texture). If a VBO reallocation fails (GL_OUT_OF_MEMORY), I create a new VBO to store the vertices of my object. Finally, I attach each batch to a VAO.
Pseudo-code:
for_each vbo in vbo_list
{
    vbo->Bind();
    for_each batch in vbo->getAttachedBatchList()
    {
        batch->BindVAO();
        {
            glDrawXXX(batch->GetOffset(), batch->GetLength());
        }
    }
}
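To illustrate the allocation fallback I described above, here is a rough C++ sketch (the names are illustrative only, and Vertex is the interleaved struct shown further down; checking glGetError() after glBufferData is just one way to detect the failure):

// Hedged sketch: try to (re)allocate the current VBO; on GL_OUT_OF_MEMORY,
// fall back to a brand-new VBO for this batch.
// Assumes a GL loader header and <vector> are included.
GLuint UploadBatchVertices(GLuint currentVbo, const std::vector<Vertex>& vertices)
{
    glBindBuffer(GL_ARRAY_BUFFER, currentVbo);
    glBufferData(GL_ARRAY_BUFFER,
                 vertices.size() * sizeof(Vertex),
                 vertices.data(),
                 GL_STATIC_DRAW);

    if (glGetError() == GL_OUT_OF_MEMORY)
    {
        // Allocation failed: create a new VBO and store this batch there instead.
        GLuint newVbo = 0;
        glGenBuffers(1, &newVbo);
        glBindBuffer(GL_ARRAY_BUFFER, newVbo);
        glBufferData(GL_ARRAY_BUFFER,
                     vertices.size() * sizeof(Vertex),
                     vertices.data(),
                     GL_STATIC_DRAW);
        return newVbo;
    }
    return currentVbo;
}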
Everything works well, but is the technique I use the most efficient one?
I took a look at the following article (OpenGL ES, Apple):
https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html
This article advises storing interleaved vertex data, like below (for 3 given vertices):
VNTVNTVNT
In my case I use the following pattern:
VVVNNNTTT
Another article on the subject: https://www.opengl.org/wiki/Vertex_Specification_Best_Practices
In your opinion, what is the best choice in my case?
And finally, I have another question about vertex data alignment (the topic is covered in the first article). It says "Avoid Misaligned Vertex Data". Apparently, this advice is only relevant to the VNTVNTVNT case.
For example, if this layout is the best choice, the following structure declaration should be correct:
struct Vertex
{
float x, y, z; //12 bytes
float nx, ny, nz; //12 bytes
float s, t; //8 bytes
};
In this case sizeof(Vertex) = 32 bytes, which is a multiple of 4 bytes!
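For reference, here is roughly how I would set up the interleaved layout in the VAO, with stride = sizeof(Vertex) and per-attribute offsets (a sketch only; attribute locations 0/1/2 are just an example):

// Interleaved VNTVNT... layout: one buffer, stride = sizeof(Vertex).
// Needs <cstddef> for offsetof; locations 0 (position), 1 (normal), 2 (texcoord) are examples.
glBindBuffer(GL_ARRAY_BUFFER, vbo);

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const void*)offsetof(Vertex, x));

glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const void*)offsetof(Vertex, nx));

glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (const void*)offsetof(Vertex, s));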
If I add the color components r, g, b:
struct Vertex
{
float x, y, z; //12 bytes
float nx, ny, nz; //12 bytes
float r, g, b; //12 bytes
float s, t; //8 bytes
};
We now have 44 bytes, which is also a multiple of 4 bytes!
So according to the article, if I store my data this way I shouldn't have this misaligned-data problem.
In conclusion: is my pseudo-code correct? Is it correct to wait for a GL_OUT_OF_MEMORY error before creating a new VBO, or is it better to respect a maximum allocation size? And finally, which is the best way to store the data (interleaved or not), and is my data alignment proposal correct?
UPDATE:
The second article says: "1MB to 4MB is a nice size according to one nVidia document". That seems too small! I wonder if there is a mistake, because I know it's possible to store a much larger amount of data (well over 100 MB without any problem on current hardware). Moreover, some famous models like the Dragon could not be stored in a single VBO (it would mean the geometry of that mesh has to be split across 4 to 5 VBOs in the best case, so 5 draw calls to render it; I can't imagine, in that case, the number of VBOs allocated in a real video-game scene!). What do you think of that?
Related
I'm trying to find ways to copy multidimensional arrays from host to device in OpenCL, and thought an approach was to use an image... which can be a 1, 2, or 3 dimensional object. However, I'm confused because when reading a pixel from an array, they are using vector datatypes. Normally I would think double pointer, but it doesn't sound like that is what is meant by vector datatypes. Anyway, here are my questions:
1) What is actually meant by a vector datatype? Why wouldn't we just specify 2 or 3 indices when denoting pixel coordinates? It looks like a single value such as float2 is being used to denote coordinates, but that makes no sense to me. I'm looking at the functions read_imageui and read_image.
2) Can the input image just be a subset of the entire image, and the sampler be the subset of the input image? I don't understand how the coordinates are actually specified here either, since read_image() only seems to take a single value for input and a single value for sampler.
3) If doing linear algebra, should I just bite the bullet and translate 1-D array data from the buffer into multi-dim arrays in OpenCL?
4) I'm still interested in images, so even if what I want to do is not best for images, could you still explain questions 1 and 2?
Thanks!
EDIT
I wanted to refine my question and ask: in the following Khronos documentation they define...
int4 read_imagei (
image2d_t image,
sampler_t sampler,
int2 coord)
But nowhere can I find what image2d_t's definition or structure is supposed to be. The same thing goes for sampler_t and int2 coord. They seem like structs to me, or pointers to structs, since OpenCL is supposed to be based on ANSI C, but what are the fields of these structs, and how do I write the coord with what looks like a scalar?! I've seen the notation (int2)(x,y), but that's not ANSI C; that looks like Scala, haha. Things seem conflicting to me. Thanks again!
In general you can read from images in three different ways:
direct pixel access, no sampling
sampling, normalized coordinates
sampling, integer coordinates
The first one is what you want, that is, you pass integer pixel coordinates like (10, 43) and it will return the contents of the image at that point, with no filtering whatsoever, as if it were a memory buffer. You can use the read_image*() family of functions which take no sampler_t param.
The second one is what most people want from images, you specify normalized image coords between 0 and 1, and the return value is the interpolated image color at the specified point (so if your coordinates specify a point in between pixels, the color is interpolated based on surrounding pixel colors). The interpolation, and the way out-of-bounds coordinates are handled, are defined by the configuration of the sampler_t parameter you pass to the function.
The third one is the same as the second one, except the texture coordinates are not normalized, and the sampler needs to be configured accordingly. In some sense the third way is closer to the first, and the only additional feature it provides is the ability to handle out-of-bounds pixel coordinates (for instance, by wrapping or clamping them) instead of you doing it manually.
Finally, the different versions of each function, e.g. read_imagef, read_imagei, read_imageui are to be used depending on the pixel format of your image. If it contains floats (in each channel), use read_imagef, if it contains signed integers (in each channel), use read_imagei, etc...
Writing to an image on the other hand is straightforward, there are write_image{f,i,ui}() functions that take an image object, integer pixel coordinates and a pixel color, all very easy.
Note that you cannot read and write to the same image in the same kernel! (I don't know if recent OpenCL versions have changed that). In general I would recommend using a buffer if you are not going to be using images as actual images (i.e. input textures that you sample or output textures that you write to only once at the end of your kernel).
About the image2d_t, sampler_t types, they are OpenCL "pseudo-objects" that you can pass into a kernel from C (they are reserved types). You send your image or your sampler from the C side into clSetKernelArg, and the kernel gets back a sampler_t or an image2d_t in the kernel's parameter list (just like you pass in a buffer object and it gets a pointer). The objects themselves cannot be meaningfully manipulated inside the kernel, they are just handles that you can send into the read_image/write_image functions, along with a few others.
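To make the host side concrete, here is a hedged C++ sketch (the variable names are placeholders, and the older clCreateImage2D / clCreateSampler entry points are used for brevity; error checks are omitted):

// Hedged sketch (host side): create a 2D image and a sampler, then bind both
// as kernel arguments. context, kernel, width, height and hostPixels are
// assumed to exist already.
cl_image_format fmt;
fmt.image_channel_order     = CL_RGBA;
fmt.image_channel_data_type = CL_UNSIGNED_INT8;

cl_int err = CL_SUCCESS;
cl_mem image = clCreateImage2D(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               &fmt, width, height, 0 /* row pitch */, hostPixels, &err);

cl_sampler sampler = clCreateSampler(context,
                                     CL_FALSE,                 /* non-normalized coords */
                                     CL_ADDRESS_CLAMP_TO_EDGE,
                                     CL_FILTER_NEAREST,
                                     &err);

// In the kernel, these arrive as image2d_t and sampler_t parameters.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &image);
clSetKernelArg(kernel, 1, sizeof(cl_sampler), &sampler);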
As for the "actual" low-level difference between images and buffers, GPUs often have specially reserved texture memory that is highly optimized for "read often, write once" access patterns, with special texture sampling hardware and texture caches to optimize scattered reads, mipmaps, etc.
On the CPU there is probably no underlying difference between an image and a buffer, and your runtime likely implements both as memory arrays while enforcing image semantics.
In an attempt to improve display performance of an object which is very large (and fills up GPU RAM), after some reasonably light maths I discovered I have an opportunity to compress my vertex data from 16-byte vertices down to 4-byte vertices (since the data can conceptually be thought of as merely a transformed height map, implying the x and y location from the vertex id), where I can tightly pack the Z coordinate into, say, 30 bits, leaving 2 bits for a colour palette index. That's the idea anyway. My question isn't with the coordinate packing, it's with the colour packing.
The colour palette will be chosen by the C++ code that loads the model. Since it also loads the shader, I'm currently trying to write the colour lookup code as a switch statement, i.e.:
int colourIndex = (compressedVertex & Mask) >> bitOffset;
switch (colourIndex)
{
case 0: return vec4(....);
case 1: return vec4(....);
case 2: return vec4(....);
case 3: return vec4(....);
}
Where the model has more colours than 4, I'm comfortable sacrificing bits of height precision in order to fit more bits of colour palette in (up to a point, anyway). My measurements show that using a switch statement for a 4-colour palette is no slower than binding a 4-pixel 1D texture and using a sampler to read from it.
I've scaled this up to 32 colours so far, and it seems at least as fast as using a texture.
When is a good line in the sand to stop using a switch and start using a texture for a lookup table? If it helps, the application I'm developing for already has an enforced minimum requirement of OpenGL 3.3. Once the data is on the card it'll never be changed. Can I crank it up to 256 case statements? 1024? 32768? Where's the limit?
(Pre-emptive response: Yes I could continue experimenting and pick a value that works for me on my single, modern card using trial and error and some interpolating; but I'm interested in a more general idea of what is best practice and whether anyone else has tried something similar and knows it to work out in the wild?)
I avoid branching as much as possible in shaders. My advice is to use a texture to do the lookup.
You ask:
Can I crank it up to 256 case statements? 1024? 32768? Where's the limit?
and you say:
I've scaled this up to 32 colours so far, and it seems at least as fast as using a texture.
OpenGL thrives at looking up textures. It's designed to do that. It's not designed for a gigantic switch case statement. And as the commenters say it won't perform well across the board. A 64x64 pixel texture can give you 4096 lookups and in the long run, in my opinion, it's going to be faster over a larger number of lookups.
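For illustration, uploading the palette as a small 1D lookup texture from C++ could look roughly like this (a sketch; paletteSize and paletteData are placeholders, and nearest filtering is assumed so you always get exact palette entries):

// Hedged sketch: upload an N-entry RGBA palette as a 1D lookup texture.
// In the shader, fetch it with texelFetch(palette, colourIndex, 0) (GLSL 3.30+).
GLuint palette = 0;
glGenTextures(1, &palette);
glBindTexture(GL_TEXTURE_1D, palette);
glTexImage1D(GL_TEXTURE_1D, 0, GL_RGBA8, paletteSize, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, paletteData);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_1D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);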
I'd like to import obj models into my OpenGL program. I have a class / data format that I use to pass attribute data into shaders:
class CustomVertex : public IVtxFmt
{
public:
float m_Position[3]; // x, y, z offset 0, size = 3*sizeof(float)
float m_Normal[3]; // nx, ny, nz; offset 3
float m_TexCoords[2]; // u, v offset 6
float m_Colour[4]; // r, g, b, a offset 8
float m_Tangent[3]; // r, g, b offset 12
float m_Bitangent[3]; // r, g, b offset 15
};
So I'm working with a model of a log cabin I downloaded from the Internet.
The log cabin has several vertices, normals, and texture coord definitions, followed by a list of face definitions.
So my first instinct was to parse the obj file and end up with
vector<vertex>
vector<Normal>
vector<TexCoord>
That's not straightforward to translate into my CustomVertex format, since there might be 210 vertices, 100 tex coords and 80 normals defined in the file.
After a list of ~390 faces in this format:
f 83/42/1 67/46/1 210/42/1
I encounter the following in the file:
#
# object tile00
#
followed by more vertex definitions.
So from this, I have inferred that a model might consist of several sub objects, each defined by a number of faces; each face defined by 3 x vertex / normal / texcoord index values.
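For what it's worth, pulling the three index triples out of such a face line could be sketched like this in C++ (assuming triangulated faces; the helper is purely illustrative, and the exact meaning of the second and third index depends on the OBJ spec):

// Hedged sketch: parse one triangulated face line such as "f 83/42/1 67/46/1 210/42/1".
// Each corner carries three 1-based indices; subtract 1 before indexing the vectors.
#include <cstdio>

struct FaceRef { int idx[3][3]; };   // [corner][which index of the triple]

bool ParseFaceLine(const char* line, FaceRef& f)
{
    return std::sscanf(line, "f %d/%d/%d %d/%d/%d %d/%d/%d",
                       &f.idx[0][0], &f.idx[0][1], &f.idx[0][2],
                       &f.idx[1][0], &f.idx[1][1], &f.idx[1][2],
                       &f.idx[2][0], &f.idx[2][1], &f.idx[2][2]) == 9;
}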
So in order to arrive at a vector of CustomVertex, I'm thinking that I need to do the following:
create and populate:
vector <vertex>
vector <normal>
vector <texcoord>
vector <indices>
I need to create a CustomVertex for each unique v/vn/vt triple in the face definitions.
So I thought about creating a map:
std::vector<CustomVertex> and
std::map< nHashId, CustomVertex_index >
So my idea is that for each v/vn/vt I encounter, I create a hash of this string e.g. nHashId = hash("80/50/1")* and search the map for the hash. If none exists, I create a CustomVertex and add it to the vector, then I add the newly created hash and the CustomVertex_index into the map.
*: By creating a hash of the v/vn/vt string, I'm creating a unique numeric value that corresponds to that string, which I'm hoping is faster to search/compare in the map than the equivalent text.
If I come across a match to the hash, I consider that the CustomVertex already exists, and instead of creating a new CustomVertex, I just add the CustomVertex_index entry to the indices vector and move on.
Since this seems like a computationally expensive exercise, I guess I'll be dumping my CustomVertex arrays (and corresponding indices arrays) to disk for later retrieval, rather than parse the obj file every time.
Before I ask my questions, may I point out that due to time constraints, and not wanting to have to redesign my Vbo class (a non-trivial task), I'm stuck with the CustomVertex format - I know it's possible to supply attributes in separate arrays to my shaders, but I had read that interleaving the data as I have with CustomVertex can enhance performance.
So to my questions:
1. Does my method seem sound or crazy? If crazy, please point out where I'm going wrong.
2. Can you spot any potential issues?
3. Has anyone done this before and can recommend a simpler way to achieve what I'm trying to do?
Can you spot any potential issues?
You mean besides hash collisions? Because I don't see the part of your algorithm that handles that.
Has anyone done this before and can recommend a simpler way to achieve what I'm trying to do?
There's a much simpler way: just compare the indices and not use hashes.
Instead of creating a string hash of "v/vn/vt", the idea is to hash only v as an integer. After that you get a bucket that contains all the "v/vn/vt" combinations that share the same v index.
If a hash collision happens (same v encountered), you compare the colliding combination with those already in the bucket to see if it is really a duplicate. If not, remember to add the colliding combination to the bucket.
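A rough C++ sketch of the compare-the-indices idea, using std::map keyed directly on the index triple so that indices are compared rather than hashed (BuildCustomVertex is a hypothetical helper that gathers position/normal/uv from the parsed obj arrays):

// Hedged sketch: deduplicate v/vn/vt triples by comparing the indices directly.
// std::map orders keys with operator<, so no hash function is involved.
#include <array>
#include <cstdint>
#include <map>
#include <vector>

CustomVertex BuildCustomVertex(const std::array<int, 3>& triple);   // hypothetical helper

std::vector<CustomVertex>                   vertices;   // unique vertices
std::vector<std::uint32_t>                  indices;    // index buffer
std::map<std::array<int, 3>, std::uint32_t> remap;      // triple -> position in `vertices`

std::uint32_t AddCorner(const std::array<int, 3>& triple)
{
    auto it = remap.find(triple);
    if (it == remap.end())
    {
        std::uint32_t newIndex = static_cast<std::uint32_t>(vertices.size());
        vertices.push_back(BuildCustomVertex(triple));
        it = remap.emplace(triple, newIndex).first;
    }
    indices.push_back(it->second);
    return it->second;
}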
While trying to parse a Wavefront .obj file, I thought of two approaches:
Create a 2D array the size of the number of vertices. When a face uses a vertex, get its coordinates from the array.
Get the starting position of the vertex list and then when a face uses a vertex, scan the lines until you reach the vertex.
IMO, option 1 will be very memory intensive, but much faster.
Since option 2 involves extensive file reading (and because the number of vertices in most objects becomes very large), it will be much slower, but less memory intensive.
The question is: Comparing the tradeoff between memory and speed, which option would be better suited to an average computer?
And, is there an alternative method?
I plan to use OpenGL along with GLFW to render the object.
IMO, Option 1 will be very memory intensive, but much faster.
You must get those vertices into memory anyway. But there's no need for a 2D array, which BTW would cause two pointer indirections and thus a major performance hit. Just use a simple std::vector&lt;Vertex&gt; for your data; the vector index is the index used by the accompanying face list.
EDIT due to comment
class Vertex
{
public:
    // Anonymous structs inside unions are a common compiler extension.
    union { struct { float x, y, z; };    float pos[3]; };
    union { struct { float nx, ny, nz; }; float normal[3]; };
    union { struct { float s, t; };       float texcoord[2]; };
};

std::vector<Vertex> vertices;
Generally you read the list of vertices into an array. Parsing ASCII text is extremely slow; do it only once when loading the file and then store everything in arrays in memory.
Same goes with the triangles / faces. Each triangle generally is composed of a list of three vertex indexes. That should also be stored in an array.
You may find the OBJ reader in the VTK open source library to be useful: http://www.vtk.org/doc/nightly/html/classvtkOBJReader.html. We use it and have had no reason to write our own... Use VTK directly, or you may find studying the source code to be good for further inspiration of your own reader.
In my opinion, one of the major shortcomings with OBJ files is the use of ASCII. 3D ASCII files (be it STL, PLY, OBJ, etc.) are very slow to load if they are ASCII due to the string parsing. Binary format files are much faster and should always be used if performance is an issue: the load time for a good binary format is instantaneous.
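As a rough illustration of the binary route, caching the parsed arrays to disk could be sketched like this in C++ (hedged: a trivial raw dump with no versioning or endianness handling, and Vertex stands in for whatever interleaved struct you use):

// Hedged sketch: dump the parsed vertex/index arrays as raw binary, so the
// ASCII .obj only has to be parsed once instead of on every run.
#include <cstdint>
#include <fstream>
#include <vector>

void SaveCache(const char* path,
               const std::vector<Vertex>& vertices,
               const std::vector<std::uint32_t>& indices)
{
    std::ofstream out(path, std::ios::binary);
    std::uint64_t counts[2] = { vertices.size(), indices.size() };
    out.write(reinterpret_cast<const char*>(counts), sizeof(counts));
    out.write(reinterpret_cast<const char*>(vertices.data()),
              vertices.size() * sizeof(Vertex));
    out.write(reinterpret_cast<const char*>(indices.data()),
              indices.size() * sizeof(std::uint32_t));
}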
Just load them into arrays. Memory should not be an issue. Your system (usually) has way more memory than your GPU. If you are running into memory problems, you are probably loading a model that is too detailed. (I am semi-assuming that you are going to make a game in OpenGL. If you have a specific need for such large model files, you will still have to work out a way to load the appropriate chunks.)
You shouldn't need a 2-dimensional array. Your models should be triangulated, and then you can simply load the obj file using GLUT's obj loader. Simply store points, faces and normals in 3 separate arrays/buffers. There is an example of how you can do it here, but if you want to do it fast you should go for a binary format.
This is a pretty decent solution for prototyping: run a script that generates the arrays for use in OpenGL or your preferred rendering API. obj2opengl.pl is a Perl script (you'll need Perl installed), which you can find here. The GitHub link is here.
While running the Perl script you may get a runtime error on line 154 concerning if(defined(@center)). Replace it with if(@center).
From the example, once the header file is generated with the data, you can use it as shown:
/*
created with obj2opengl.pl
source file    : ./banana.obj
vertices       : 4032
faces          : 8056
normals        : 4032
texture coords : 4420
*/

// include generated arrays
#import "./banana.h"

// set input data to arrays
glVertexPointer(3, GL_FLOAT, 0, bananaVerts);
glNormalPointer(GL_FLOAT, 0, bananaNormals);
glTexCoordPointer(2, GL_FLOAT, 0, bananaTexCoords);

// draw data
glDrawArrays(GL_TRIANGLES, 0, bananaNumVerts);
So what I need is simple: each time we run our shader (meaning on each pixel), I need to calculate a random matrix of 1s and 0s with resolution == originalImageResolution. How can I do such a thing?
For now I have created one for Shadertoy. The random matrix resolution is set to 15 by 15 here, because the GPU makes Chrome crash often when I try something like 200 by 200, while I really need the full image resolution.
#ifdef GL_ES
precision highp float;
#endif
uniform vec2 resolution;
uniform float time;
uniform sampler2D tex0;
// Cheap hash-style pseudo-random value in [0, 1), seeded by a 2D coord and time.
float rand(vec2 co){
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * (43758.5453 + time));
}
vec3 getOne(){
    vec2 p = gl_FragCoord.xy / resolution.xy;
    vec3 one = vec3(0.0); // accumulator; locals must be initialised explicitly
    for(int i = 0; i < 15; i++){
        for(int j = 0; j < 15; j++){
            // Randomly decide whether to mix in the texel at (j, i).
            if(rand(p) <= 0.5)
                one = (one.xyz + texture2D(tex0, vec2(j, i)).xyz) / 2.0;
        }
    }
    return one;
}
void main(void)
{
gl_FragColor = vec4(getOne(),1.0);
}
And one for Adobe Pixel Bender:
<languageVersion: 1.0;>
kernel random
< namespace : "Random";
vendor : "Kabumbus";
version : 3;
description : "not as random as needed, not as fast as needed"; >
{
    input image4 src;
    output float4 outputColor;

    float rand(float2 co, float2 co2){
        return fract(sin(dot(co.xy, float2(12.9898, 78.233))) * (43758.5453 + (co2.x + co2.y)));
    }

    float4 getOne(){
        float4 one;
        float2 r = outCoord();
        for(int i = 0; i < 200; i++){
            for(int j = 0; j < 200; j++){
                if(rand(r, float2(i, j)) >= 1.0)
                    one = (one + sampleLinear(src, float2(j, i))) / 2.0;
            }
        }
        return one;
    }

    void evaluatePixel()
    {
        float4 oc = getOne();
        outputColor = oc;
    }
}
So my real problem is: my shaders make my GPU driver crash. How can I use GLSL for the same purpose as I do now, but without failing, and if possible faster?
Update:
What I want to create is called a single-pixel camera (google Compressive Imaging or Compressive Sensing); I want to create a GPU-based software implementation.
The idea is simple:
we have an image of size NxM.
for each pixel in the image we want the GPU to perform the following operations (a plain C++ reference sketch follows the list):
generate an NxM matrix of random values: 0s and 1s.
compute the arithmetic mean of all pixels of the original image whose coordinates correspond to the coordinates of the 1s in our random NxM matrix.
output the result of the arithmetic mean as the pixel colour.
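A plain C++ reference of that per-pixel operation, ignoring performance entirely (the helper name and the grayscale input are just for illustration):

// Hedged sketch: one output pixel of the single-pixel-camera simulation.
// Draw a fresh random 0/1 mask over the whole N x M input and average the selected pixels.
#include <cstdlib>
#include <vector>

float MaskedMean(const std::vector<float>& image, int N, int M)
{
    float sum = 0.0f;
    int count = 0;
    for (int y = 0; y < N; ++y)
    {
        for (int x = 0; x < M; ++x)
        {
            if (std::rand() & 1)            // random 0/1 entry of the mask
            {
                sum += image[y * M + x];
                ++count;
            }
        }
    }
    return count > 0 ? sum / count : 0.0f;
}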
What I tried to implement in my shaders was a simulation of that very process.
What is really stupid about trying to do this on the GPU:
Compressive Sensing does not require us to compute an NxM matrix of such arithmetic mean values; it needs just a fraction of it (for example 1/3). So I am putting more load on the GPU than I need to. However, testing on more data is not always a bad idea.
Thanks for adding more detail to clarify your question. My comments are getting too long so I'm going to an answer. Moving comments into here to keep them together:
Sorry to be slow, but I am trying to understand the problem and the goal. In your GLSL sample, I don't see a matrix being generated. I see a single vec3 being generated by summing a random selection (varying over time) of cells from a 15 x 15 texture (matrix). And that vec3 is recomputed for each pixel. Then the vec3 is used as the pixel color.
So I'm not clear whether you really want to create a matrix, or just want to compute a value for every pixel. The latter is in some sense a 'matrix', but computing a simple random value for 200 x 200 pixels would not strain your graphics driver. Also you said you wanted to use the matrix. So I don't think that's what you mean.
I'm trying to understand why you want a matrix - to preserve a consistent random basis for all the pixels? If so, you can either precompute a random texture, or use a consistent pseudorandom function like you have in rand() except not use time. You clearly know about that so I guess I still don't understand the goal. Why are you summing a random selection of cells from the texture, for each pixel?
I believe the reason your shader is crashing is that your main() function is exceeding its time limit - either for a single pixel, or for the whole set of pixels. Calling rand() 40,000 times per pixel (in a 200 * 200 nested loop) could certainly explain that!
If you had 200 x 200 pixels, and are calling sin() 40k times for each one, that's 1,600,000,000 (1.6 billion) calls per frame. Poor GPU!
I'm hopeful that if we understand the goal better, we'll be able to recommend a more efficient way to get the effect you want.
Update.
(Deleted this part, since it was mistaken. Even though many cells in the source matrix may each contribute less than a visually detectable amount of color to the result, the total of the many cells can contribute a visually detectable amount of color.)
New update based on updated question.
OK, (thinking "out loud" here so you can check whether I'm understanding correctly...) Since you need each of the random NxM values only once, there is no actual requirement to store them in a matrix; the values can simply be computed on demand and then thrown away. That's why your example code above does not actually generate a matrix.
This means we cannot get away from generating (NxM)^2 random values per frame, that is, NxM random values per pixel, and there are NxM pixels. So for N = M = 200, that's 1.6 billion random values per frame.
However, we can still optimize some things.
First, since your random values only need to be one bit each (you only need a boolean answer to decide whether to include each cell from the source texture into the mix), you can probably use a cheaper pseudo random number generator. The one you're using outputs much more random data per call than one bit. For example, you could call the same PRNG function as you're using now, but store the value and extract 32 random bits out of it. Or at least several, depending on how many are random enough. In addition, instead of using a sin() function, if you have extension GL_EXT_gpu_shader4 (for bitwise operators), you could use something like this:
int LFSR_Rand_Gen(in int n)
{
// <<, ^ and & require GL_EXT_gpu_shader4.
n = (n << 13) ^ n;
return (n * (n*n*15731+789221) + 1376312589) & 0x7fffffff;
}
Second, you are currently performing one divide operation per included cell (/2.0), which is probably relatively expensive, unless the compiler and GPU are able to optimize it into a bit shift (is that possible for floating point?). This also will not give the arithmetic mean of the input values... it puts much more weight on the later values and very little on the earlier ones. As a solution, keep a count of how many values are being included, and divide by that count once, after the loop is finished.
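A tiny plain-C++ illustration of the difference (the same logic applies inside the shader loop):

#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> samples = {1.0f, 2.0f, 3.0f, 4.0f};

    // Running "halving" average, like `one = (one + value) / 2.0` in the shader:
    // each new sample gets weight 1/2 and older samples decay geometrically.
    float halving = 0.0f;
    for (float s : samples) halving = (halving + s) * 0.5f;

    // True arithmetic mean: accumulate, count, divide once after the loop.
    float sum = 0.0f;
    for (float s : samples) sum += s;
    float mean = sum / static_cast<float>(samples.size());

    std::printf("halving = %g, mean = %g\n", halving, mean);   // 3.0625 vs 2.5
    return 0;
}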
Whether these optimizations will be enough for your GPU driver to handle 200x200 samples for each of 200x200 pixels per frame, I don't know. They should definitely let you increase your resolution substantially.
Those are the ideas that occur to me off the top of my head. I am far from being a GPU expert though. It would be great if someone more qualified can chime in with suggestions.
P.S. In your comment, you jokingly (?) mentioned the option of precomputing N*M NxM random matrices. Maybe that's not a bad idea?? 40,000x40,000 is a big texture (around 200 MB even at one bit per cell), but if you store 32 bits of random data per cell, that comes down to 1250 x 40,000 cells. Too bad vanilla GLSL doesn't help you with bitwise operators to extract the data, but even if you don't have the GL_EXT_gpu_shader4 extension you can still fake it. (Maybe you would also need a special extension then for non-square textures?)