I've been having a weird bug in an OpenGL/GLES application I'm developing. On a certain device (Samsung Galaxy S8), glVertexAttribPointer seems to produce gibberish if the stride parameter is set to 18 bytes. The gibberish disappears if I add two bytes of padding to each vertex (20 bytes in total).
Note that glGetError reports nothing in either case.
This bug does not occur on any other mobile device I've tested, nor does it occur on my Windows computer running regular OpenGL.
My guess is that the stride is required to be a multiple of four bytes, but I cannot find any documentation verifying this.
Does anyone know if there are device specific requirements for the stride parameter?
(The 18 bytes consist of three float32 values followed by three int16_t values: 3*4 bytes + 3*2 bytes.)
Is the stride required by the specification to be aligned to 4 bytes? No.
Is there hardware that effectively has that requirement anyway? Yes, as evidenced by the fact that Vulkan has this requirement. So you should avoid misaligned data.
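For reference, a minimal sketch of the padded layout described in the question (the struct, function, and attribute locations 0 and 1 are illustrative, and a bound VBO with enabled attrib arrays is assumed):

#include <GLES2/gl2.h>   // or your platform's GL header
#include <cstddef>       // offsetof
#include <cstdint>

// 18 bytes of payload padded to 20 so the stride is a multiple of 4.
struct Vertex {
    float   position[3];  // 12 bytes at offset 0
    int16_t extra[3];     // 6 bytes at offset 12
    int16_t pad;          // 2 bytes of padding -> sizeof(Vertex) == 20
};
static_assert(sizeof(Vertex) == 20, "stride should stay 4-byte aligned");

void setupAttribs() {
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const void *)offsetof(Vertex, position));
    glVertexAttribPointer(1, 3, GL_SHORT, GL_FALSE, sizeof(Vertex),
                          (const void *)offsetof(Vertex, extra));
}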
Related
I'm trying to upload RGB (no alpha channel) pixel data to the GPU using:
GLInternalFormat = GL_RGB8;
GLFormat = GL_RGB;
GLType = GL_UNSIGNED_BYTE;
Here's how my data is structured (uint8_t* pixels):
RGBRGBRGB -> row 1
RGBRGBRGB -> row 2
RGBRGBRGB -> row 3
The texture width is not a multiple of 4 for this particular texture.
I was able to make it work using glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
I understand now how GL_UNPACK_ALIGNMENT works. My question now is: why is 1 not the default in OpenGL? (Source: https://docs.gl/gl3/glPixelStore)
pname                Type     Initial value  Value range
GL_UNPACK_ALIGNMENT  integer  4              1, 2, 4, or 8
Is it safe to always set it to 1? Otherwise, uploads break for textures whose width is not a multiple of 4.
Why is the default value 4? It seems arbitrary...
A texture or other image is a sizable chunk of memory that you want copied from CPU memory to the GPU or vice versa. (Back in the early 1990s when OpenGL first appeared, these were fixed-function graphics accelerators, not truly general purpose.) Just like copying a block of memory to/from your disk drive or network card, the copy should run as fast as possible.
In the early 1990s there were many more types of CPU around than today. Most of the ones that might be running OpenGL were 32 bit, and 64 bit RISCs were appearing.
Many of the RISC systems, including those sold by SGI, could only read and write 32-bit/4-byte values that were aligned on a 4-byte boundary. The M680x0 family, which was also popular for workstations, needed 2-byte alignment. Only the Intel x86 could read 32 bits from any boundary, and even it ran faster if 32-bit values were 4-byte aligned.
(On a system with a DMA controller that could run in parallel with the CPU, it would most likely have the same alignment and performance requirements as the CPU itself.)
So defaulting to 4-byte alignment gave the best performance on the most systems. Specifying alignment 1 would force the implementation to drop down to reading/writing a byte at a time: not necessarily 4x slower, but slower on most systems.
These early 1990s systems also had enough RAM and disk space that going from 24 bits per pixel to 32 bits wasn't too bad. You can still find image file/memory format definitions that have 8 unused bits just to get 4 byte alignment with 24 bit RGB.
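For completeness, a minimal sketch of the tightly packed upload from the question above (width, height, and pixels are assumed to be supplied by the application):

// Rows are tightly packed 3-byte RGB texels, so tell GL not to expect
// 4-byte-aligned row starts before uploading.
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB8, width, height, 0,
             GL_RGB, GL_UNSIGNED_BYTE, pixels);
// Optionally restore the default so later uploads behave as before.
glPixelStorei(GL_UNPACK_ALIGNMENT, 4);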
I use memcpy() to write data to a device, with a logic analyzer/PCIe analyzer, I can see the actual stores.
My device gets more stores than expected.
For example,
auto *data = new uint8_t[1024]();        // zero-initialized source buffer
for (int i = 0; i < 50; i++) {
    // addr points at the memory-mapped device region (set up elsewhere)
    memcpy((void *)(addr), data, i);     // copy i bytes to the device
}
For i=9, I see these stores:
4B from byte 0 to 3
4B from byte 4 to 7
3B from byte 5 to 7 (only 1-byte aligned), re-writing the same data -> an inefficient and apparently useless store
1B to byte 8
In the end, all 9 bytes are written, but memcpy issues an extra 3-byte store that rewrites data it has already written and nothing more.
Is this the expected behavior? The question applies to both C and C++; I'm interested in knowing why this happens, since it seems very inefficient.
Is it the expected behavior?
The expected behavior is that it can do anything it feels like (including writing past the end, especially in a "read 8 bytes into a register, modify the first byte in the register, then write 8 bytes" way) as long as the result works as if the rules for the C abstract machine were followed.
Using a logic analyzer/PCIe analyzer to see the actual stores is so far beyond the scope of "works as if the rules for the abstract machine were followed" that it's unreasonable to have any expectations.
Specifically; you can't assume the writes will happen in any specific order, can't assume anything about the size of any individual write, can't assume writes won't overlap, can't assume there won't be writes past the end of the area, can't assume writes will actually occur at all (without volatile), and can't even assume that CHAR_BIT isn't larger than 8 (or that memcpy(dest, source, 10); isn't asking to write 20 octets/"8 bit bytes").
If you need guarantees about writes, then you need to enforce those guarantees yourself (e.g. maybe create a structure of volatile fields to force the compiler to ensure writes happen in a specific order, maybe use inline assembly with explicit fences/barriers, etc).
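As a rough illustration (not a guarantee from the standard), a byte-wise copy through a volatile pointer forces the compiler to emit one 1-byte store per byte, in order; dev_addr is an assumed name for the memory-mapped device region, and the bus or CPU may still combine stores unless you also use the appropriate fences:

#include <cstddef>
#include <cstdint>

// Copy n bytes to a device region one byte at a time. volatile prevents the
// compiler from merging, widening, reordering, or eliding these stores.
void copy_to_device(volatile uint8_t *dev_addr, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dev_addr[i] = src[i];
}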
The following illustrates why memcpy may be implemented this way.
To copy 9 bytes, starting at a 4-byte aligned address, memcpy issues these instructions (described as pseudo code):
Load four bytes from source+0 and store four bytes to destination+0.
Load four bytes from source+4 and store four bytes to destination+4.
Load four bytes from source+5 and store four bytes to destination+5.
The processor implements those store instructions with these data transfers in the hardware:
Since destination+0 is aligned, store 4 bytes to destination+0.
Since destination+4 is aligned, store 4 bytes to destination+4.
Since destination+5 is not aligned, store 3 bytes to destination+5 and store 1 byte to destination+8.
This is an easy and efficient way to write memcpy:
If length is less than four bytes, jump to separate code for that.
Loop copying four bytes until fewer than four bytes are left.
If length is not a multiple of four, copy four bytes from source+length−4 to destination+length−4.
That single step to copy the last few bytes may be more efficient than branching to separate code for each of the possible remainders.
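A sketch of that strategy in plain C++ (this illustrates the idea only; it is not any particular library's memcpy):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy in 4-byte chunks, then cover any remainder with one overlapping
// 4-byte copy that ends exactly at the last byte.
void copy_like_memcpy(uint8_t *dst, const uint8_t *src, size_t n)
{
    if (n < 4) {                            // separate code for tiny copies
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
        return;
    }
    size_t i = 0;
    for (; i + 4 <= n; i += 4)              // whole 4-byte chunks
        std::memcpy(dst + i, src + i, 4);
    if (i != n)                             // e.g. n == 9: this re-copies bytes 5..8
        std::memcpy(dst + n - 4, src + n - 4, 4);
}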
In C++,
Why is a boolean 1 byte and not 1 bit of size?
Why aren't there types like a 4-bit or 2-bit integers?
I find myself missing these things when writing an emulator for a CPU.
Because the CPU can't address anything smaller than a byte.
From Wikipedia:
Historically, a byte was the number of bits used to encode a single character of text in a computer, and it is for this reason the basic addressable element in many computer architectures.
So the byte is the basic addressable unit, below which a computer architecture cannot address. And since there (probably) don't exist computers that support a 4-bit byte, you don't have a 4-bit bool, etc.
However, if you designed an architecture whose basic addressable unit was 4 bits, then you would have a 4-bit bool, on that computer only!
Back in the old days when I had to walk to school in a raging blizzard, uphill both ways, and lunch was whatever animal we could track down in the woods behind the school and kill with our bare hands, computers had much less memory available than today. The first computer I ever used had 6K of RAM. Not 6 megabytes, not 6 gigabytes, 6 kilobytes. In that environment, it made a lot of sense to pack as many booleans into an int as you could, and so we would regularly use operations to take them out and put them in.
Today, when people will mock you for having only 1 GB of RAM, and the only place you could find a hard drive with less than 200 GB is at an antique shop, it's just not worth the trouble to pack bits.
The easiest answer is: it's because the CPU addresses memory in bytes and not in bits, and operating on individual bits requires extra masking and shifting.
However, it is possible to use bit-sized allocation in C++. There's the std::vector<bool> specialization that packs bits, and structs can contain bit-field members.
Because a byte is the smallest addressable unit in the language.
But you can make each bool take 1 bit if you have a bunch of them grouped together,
e.g. in a struct, like this:
struct A
{
    bool a:1, b:1, c:1, d:1, e:1;
};
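For instance, the five packed flags above typically fit in a single byte (the exact size is implementation-defined):

#include <cstdio>

struct A
{
    bool a:1, b:1, c:1, d:1, e:1;
};

int main()
{
    A flags{};                                    // all five flags start out false
    flags.c = true;
    std::printf("sizeof(A) = %zu\n", sizeof(A));  // commonly prints 1
    return 0;
}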
You could have 1-bit bools and 4- and 2-bit ints. But that would make for a weird instruction set for no performance gain, because it's an unnatural way to look at the architecture. It actually makes sense to "waste" the better part of a byte rather than trying to reclaim that unused data.
The only app that bothers to pack several bools into a single byte, in my experience, is SQL Server.
You can use bit-fields to get sub-sized integers.
struct X
{
    int val:4; // 4-bit int.
};
Though they are usually used to map structures to exact hardware-expected bit patterns:
// 8 bits of fields packed together (the struct's total size is implementation-defined)
struct SomeThing
{
    unsigned int p1:4; // 4-bit field
    unsigned int p2:3; // 3-bit field
    unsigned int p3:1; // 1-bit field
};
bool can be one byte, the smallest addressable unit of the CPU, or it can be bigger. It's not unusual for bool to be the size of int for performance purposes. If for specific purposes (say hardware simulation) you need a type with N bits, you can find a library for that (e.g. the GBL library has a BitSet<N> class). If you are concerned with the size of bool (you probably have a big container), then you can pack bits yourself, or use std::vector<bool>, which will do it for you (be careful with the latter, as it doesn't satisfy the container requirements).
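A small sketch of the standard-library options and manual packing (std::bitset is shown in place of a third-party BitSet; the sizes are illustrative):

#include <bitset>
#include <cstdint>
#include <vector>

int main()
{
    // Fixed number of flags: std::bitset packs them one bit each.
    std::bitset<64> fixed_flags;
    fixed_flags.set(3);                        // flag 3 -> true

    // Dynamic number of flags: vector<bool> also packs one bit per element,
    // but it is not a regular container of bool (no bool& to an element).
    std::vector<bool> dynamic_flags(1000, false);
    dynamic_flags[42] = true;

    // Packing bits yourself: one byte holds eight flags.
    uint8_t packed = 0;
    packed |= 1u << 5;                         // set flag 5
    bool flag5 = (packed >> 5) & 1u;           // read flag 5
    (void)flag5;
    return 0;
}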
Think about how you would implement this at your emulator level...
bool a[10] = {false};
bool &rbool = a[3];
bool *pbool = a + 3;
assert(pbool == &rbool);
rbool = true;
assert(*pbool);
*pbool = false;
assert(!rbool);
Because, in general, a CPU addresses memory with 1 byte as the basic unit, even though some CPUs, such as MIPS, work with 4-byte words.
However, std::vector handles bool in a special fashion: with vector<bool>, one bit is allocated for each bool.
The byte is the smallest unit of digital data storage in a computer. In a computer the RAM has millions of bytes, and each of them has an address. If it had an address for every bit, a computer could manage 8 times less RAM than it can.
More info: Wikipedia
Even though the minimum possible size is 1 byte, you can pack 8 bits of boolean information into 1 byte:
http://en.wikipedia.org/wiki/Bit_array
The Julia language has BitArray, for example, and I have read about C++ implementations.
Bitwise operations are not 'slow'.
And/Or operations tend to be fast.
The problem is alignment, and the lack of a simple way to solve it.
As other answers have partially and correctly noted, CPUs generally read memory in byte-aligned units, and RAM is designed the same way.
So packing data to use less memory space has to be done explicitly.
As one answer suggested, you can declare a specific number of bits per value in a struct. But what do the CPU and memory do afterwards if the result is not aligned? Memory cannot be addressed at +1.5 the way it can at +1, +2, or +4, so the hardware pads out the remaining space and simply reads the next aligned unit, which is 1 byte at minimum and usually 4 bytes (32-bit) or 8 bytes (64-bit) overall (see the sketch below). The CPU then grabs the byte or int value that contains your flags, and you mask out or set the ones you need.
So you must still declare storage as int, short, byte, or another properly sized type, but when accessing and setting the value you can explicitly pack several flags into it to save space. Many people are unaware of how this works, or skip the step whenever they have on/off or flag-style values, even though saving space in sent/received data is quite useful on mobile and in other constrained environments. Splitting an int into bytes has little value, since you can just define the bytes individually (e.g. int fourBytes; vs. byte byte1; byte byte2; byte byte3; byte byte4;), which makes the int redundant. However, in managed environments like Java, where most values end up stored as int (numbers, booleans, etc.), you can take advantage of this by dividing an int up and using its bytes or bits, so that an efficient app sends fewer 4-byte-aligned integers of data.
Managing bits can fairly be called redundant; it is one of many optimizations where bitwise operations are superior but not always needed. Still, many programs waste memory by storing booleans as full integers, inflating them by 500-1000%. Used together with other optimizations, it makes a real difference for data streams that only carry bytes or a few KB: reducing the bytes sent can decide whether something loads at all, or loads fast. It is worth doing when designing an app for mobile users, and it is something even big corporate apps get wrong today, with bloat and load requirements that could be halved or better. The difference between piling on packages and plugins that need hundreds of KB or a full MB before anything loads, and a design built for speed that needs only a few KB, is felt by every user with data constraints, even if wasting megabytes of unneeded traffic feels fast on your own connection.
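A small sketch of that rounding-up in practice: the struct below declares only 12 bits of fields, yet it still occupies a whole aligned unit (typically 4 bytes here, since the underlying type is a 32-bit integer; the exact layout is implementation-defined):

#include <cstdint>
#include <cstdio>

// 4 + 4 + 1 + 3 = 12 bits of payload, but storage is padded out to the
// alignment of the underlying 32-bit type; there is no "half address".
struct Flags
{
    uint32_t mode     : 4;
    uint32_t priority : 4;
    uint32_t enabled  : 1;
    uint32_t retries  : 3;
};

int main()
{
    std::printf("sizeof(Flags) = %zu\n", sizeof(Flags)); // typically prints 4
    return 0;
}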
Imagine we have two machines, Alice and Bob. Alice supports operations on 64-bit unsigned integers, while Bob operates on 32-bit unsigned integers only.
Bob sends Alice a request to create a task. For each task, Alice assigns an ID that is a random but unique 64-bit unsigned integer. Bob can create up to 2^32 tasks.
I need to add the ability for Bob to delete tasks by ID. Therefore I need to set up a proxy that substitutes 64-bit uints with 32-bit uints when a message goes from Alice to Bob, and restores the 64-bit uint from the 32-bit uint when a message goes in the opposite direction.
The problem is that I need to make the conversion very efficient; I only have ~10 MB of RAM to do this.
Is there any container that already solves that issue?
Update
The community asked for a clarification, and the only way to clarify it is to describe the real-world situation.
So, I'm working on the OpenGL translator libraries that are part of the AOSP. In summary, they allow the rendering of an Android system (e.g. one running inside a VM) to be moved to the host system for acceleration reasons.
This is done by streaming all OpenGL commands (back and forth) between the Target (Android) and the Host (e.g. 64-bit Windows 8).
OpenGL objects are represented as handles of type GLuint, or unsigned int. Therefore, the size of a handle and its allowed values depend on whether the system is 32-bit or 64-bit.
Since most Android systems are 32-bit and most host systems are 64-bit, a problem arises: when Android requests the creation of an OpenGL object, the Host can create a handle whose value cannot be represented as a 32-bit value. However, Android cannot ask for more than 2^32 - 1 objects, for obvious reasons.
The only solution that came to my mind is to set up a proxy that maps 64-bit handles to 32-bit ones and vice versa.
The concrete piece of code that creates problem: https://android.googlesource.com/platform/sdk/+/master/emulator/opengl/host/libs/Translator/include/GLcommon/GLutils.h line 47.
Update 2
After exploring the problem a little further, I've found that it's not an issue with GLuint (as noted by @KillianDS). However, it's still an issue with OpenGL.
There are functions that return pointers, not GLuint handles, e.g. eglCreateContext.
I need to find a way to exchange pointers between the 64-bit Host and the 32-bit Target.
Update 3
Finally, I figured out that this particular crash is not related to passing handles between 32-bit and 64-bit machines. It is a bug in the Target part of the translator, which calls the wrong function (glVertexAttribPointerData) with a wrong argument.
According to table 2.2 in the latest core OpenGL spec, a uint in OpenGL should always be 32 bits in width (the ES spec says about the same). All OpenGL names/handles are, as far as I know (and as you also say in your question), uints. So they should be 32 bits on both your host and target.
Note that it is exactly because the actual bit width of unsigned int might differ between platforms that OpenGL has its own types that should conform to the spec.
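If you want to make that assumption explicit in the translator build, a compile-time check works (the header path shown is just one possibility):

#include <climits>
#include <GLES2/gl2.h>   // or the desktop <GL/gl.h>, depending on the build

// The spec fixes GLuint at 32 bits regardless of how wide the platform's
// unsigned int happens to be.
static_assert(sizeof(GLuint) * CHAR_BIT == 32,
              "GLuint is expected to be exactly 32 bits wide");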
Update
If the remaining handles are really only contexts and other window-system objects, I'd keep it simple, because we're not talking about frequent operations, nor a huge number of handles. These kinds of operations are usually not done more than once per OpenGL application per GPU, which is likely 1 on any mobile phone. I think the easiest solution of all would be to use an array. Pseudocode:
// Pseudocode: the EGL calls are shown without their display/config arguments.
class context_creator
{
    std::array<EGLContext, 1000> context_map; // ~8 KB of pointers on a 64-bit host
public:
    context_creator() : context_map{} {}

    // Find a free slot, create the context there, and hand out the index
    // as the 32-bit handle.
    uint32_t allocate(...) {
        for (uint32_t i = 0; i < context_map.size(); i++) {
            if (!context_map[i]) {
                context_map[i] = eglCreateContext(...);
                return i;
            }
        }
        return UINT32_MAX; // no free slot left
    }

    void deallocate(uint32_t handle) {
        eglDeleteContext(context_map[handle]);
        context_map[handle] = 0;
    }

    // Has to be called in every function where a context is a parameter.
    EGLContext translate(uint32_t handle) const {
        return context_map[handle];
    }
};
One note: this won't work if 0 (a null pointer) is a valid handle for a context. I really don't know for WGL, but it probably isn't. The advantage of this approach is that while allocate isn't the fastest algorithm ever, translate is O(1), and that's what will most likely be called most often.
Of course, variations exist:
You can use a more dynamic container (e.g. vector) instead of a fixed-size one.
You can use an associative container (like std::map or std::unordered_map) and just generate a unique index per call, as sketched below. This consumes more memory because you have to store the index as well (it's implicit in an array), but it solves the problem if 0 is a valid context handle.
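A minimal sketch of that second variation, assuming EGLContext is the opaque handle type from <EGL/egl.h> (the class and member names are illustrative):

#include <cstdint>
#include <unordered_map>
#include <EGL/egl.h>

// Hand out small 32-bit indices to the target and remember which real
// EGLContext each one stands for on the host.
class context_table
{
    std::unordered_map<uint32_t, EGLContext> contexts_;
    uint32_t next_handle_ = 1;                 // 0 is reserved to mean "no context"
public:
    uint32_t insert(EGLContext ctx)
    {
        uint32_t handle = next_handle_++;
        contexts_[handle] = ctx;
        return handle;                         // this value goes to the 32-bit side
    }
    EGLContext translate(uint32_t handle) const
    {
        auto it = contexts_.find(handle);
        return it != contexts_.end() ? it->second : EGL_NO_CONTEXT;
    }
    void erase(uint32_t handle) { contexts_.erase(handle); }
};

Compared to the array, this costs a little extra memory per live handle but never confuses a valid null context with an empty slot.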
The uint in OpenGL should always be 4 bytes, i.e. 32 bits in width, so handles/names should be 32-bit on both your target and host.
To begin with, there are 8 types of Buffer Objects in OpenGL:
GL_ARRAY_BUFFER
GL_ELEMENT_ARRAY_BUFFER
GL_COPY_READ_BUFFER
...
They are enums, or more specifically GLenums, where GLenum is an unsigned 32-bit integer whose values can go up to 4,294,967,295 (2^32 - 1).
Most uses of buffer objects involve binding them to a certain target, e.g.:
glBindBuffer (GL_ARRAY_BUFFER, Buffers [size]);
[void glBindBuffer (GLenum target, GLuint buffer)] documentation
My question is: if it's an enum, its only values must be 0, 1, 2, 3, 4 .. 7 respectively, so why go all the way and make it a 32-bit integer if it only has values up to 7? Pardon my knowledge of CS and OpenGL; it just seems wasteful.
Enums aren't used just for the buffers, but everywhere a symbolic constant is needed. Currently, several thousand enum values are assigned (look into your GL.h and the latest glext.h). Note that vendors get allocated their own official enum ranges so they can implement vendor-specific extensions without interfering with others - so a 32-bit enum space is not a bad idea. Furthermore, on modern CPU architectures, using less than 32 bits won't be any more efficient, so this is not a problem performance-wise.
UPDATE:
As Andon M. Coleman pointed out, currently only 16-bit enumerant ranges are being allocated. It might be useful to link to the OpenGL Enumerant Allocation Policies, which also include the following remark:
Historically, enumerant values for some single-vendor extensions were allocated in blocks of 1000, beginning with the block [102000,102999] and progressing upward. Values in this range cannot be represented as 16-bit unsigned integers. This imposes a significant and unnecessary performance penalty on some implementations. Such blocks that have already been allocated to vendors will remain allocated unless and until the vendor voluntarily releases the entire block, but no further blocks in this range will be allocated.
Most of these seem to have been removed in favor of 16-bit values, but 32-bit values have been in use. In the current glext.h, one can still find some (obsolete) enumerants above 0xffff, like:
#ifndef GL_PGI_misc_hints
#define GL_PGI_misc_hints 1
#define GL_PREFER_DOUBLEBUFFER_HINT_PGI 0x1A1F8
#define GL_CONSERVE_MEMORY_HINT_PGI 0x1A1FD
#define GL_RECLAIM_MEMORY_HINT_PGI 0x1A1FE
...
Why would you use a short anyway? What situation would you ever be in where you would even save more than 8 KB of RAM (if the reports of nearly a thousand GLenums are correct) by using a short or uint8_t instead of GLuint for enums and const declarations? Considering the trouble of potential hardware incompatibilities and potential cross-platform bugs you would introduce, it's kind of odd to try to save something like 8 KB of RAM even in the context of the original 2 MB Voodoo 3D graphics hardware, much less the SGI supercomputers OpenGL was created for.
Besides, modern x86 and GPU hardware aligns on 32 or 64 bits at a time; you would actually stall the CPU/GPU, because 24 or 56 bits of the register would have to be zeroed out and THEN read/written, whereas it could operate on a standard int as soon as it was copied in. From the start of OpenGL, compute resources have tended to be more valuable than memory: even though you might do billions of state changes during a program's life, you'd be saving about 10 KB of RAM at most if you replaced every 32-bit GLuint enum with a uint8_t one. I'm trying so hard not to be extra-cynical right now, heh.
For example, one valid reason for things like uint8_t and the like is for large data buffers/algorithms where the data fits in that bit depth. 1024 ints vs. 1024 uint8_t variables on the stack differ by a few kilobytes; are we going to split hairs over that? Now consider a raw 4K bitmap image of 4000*2500 pixels at 32 bits per pixel: that's about 40 MB, and it would be 8 times the size with 64-bit-per-channel RGBA buffers in place of standard 8-bit-per-channel RGBA8, or quadruple with 32-bit-per-channel encoding, quickly reaching a few hundred megabytes. Multiply that by the number of textures open or pictures saved, and trading a few extra CPU operations for all that saved memory makes sense, especially in the context of that type of work.
That is where using a non-standard integer type makes sense. Unless you're on a 64 KB machine or something like it (an old-school beeper, say; good luck running OpenGL on that), trying to save a few bits of memory on something like a const declaration or a reference counter is just wasting everyone's time.