Why is GL_UNPACK_ALIGNMENT default 4? - opengl

I'm trying to upload RGB (no alpha channel) pixel data to the GPU using:
GLInternalFormat = GL_RGB8;
GLFormat = GL_RGB;
GLType = GL_UNSIGNED_BYTE;
Here's how my data is structured (u_int8* pixels):
RGBRGBRGB -> row 1
RGBRGBRGB -> row 2
RGBRGBRGB -> row 3
The texture width is not a multiple of 4 for this particular texture.
I was able to make it work using glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
I understand now how GL_UNPACK_ALIGNMENT works. My question now is: why isn't 1 the default in OpenGL? (source: https://docs.gl/gl3/glPixelStore)
pname                 Type      Initial value   Value range
GL_UNPACK_ALIGNMENT   integer   4               1, 2, 4, or 8
Is it safe to set it to 1? Otherwise, it'll break on textures whose width is not divisible by 4.
Why is the default value 4? Seems arbitrary...

A texture or other image is a sizable chunk of memory that you want copied from CPU memory to the GPU or vice versa. (Back in the early 1990s when OpenGL first appeared, GPUs were fixed-function graphics accelerators, not truly general purpose.) Just like copying a block of memory to/from your disk drive or network card, the copy should run as fast as possible.
In the early 1990s there were many more types of CPU around than today. Most of the ones that might be running OpenGL were 32 bit, and 64 bit RISCs were appearing.
Many of the RISC systems, including those sold by SGI, could only read and write 32 bit/4 byte values that were aligned on a 4 byte boundary. The M680x0 family, which was also popular for workstations, needed 2 byte alignment. Only the Intel x86 could read 32 bits from any boundary, and even it ran faster if 32 bit values were 4 byte aligned.
(On a system with a DMA controller that could run in parallel with the CPU, it would most likely have the same alignment and performance requirements as the CPU itself.)
So defaulting to 4 byte alignment gave the best performance on the most systems. Specifying alignment 1 would have to drop down to reading/writing a byte at a time. Not necessarily 4x slower but slower on most.
These early 1990s systems also had enough RAM and disk space that going from 24 bits per pixel to 32 bits wasn't too bad. You can still find image file/memory format definitions that have 8 unused bits just to get 4 byte alignment with 24 bit RGB.
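To make the effect of the alignment setting concrete, here is a minimal sketch of how the row stride follows from GL_UNPACK_ALIGNMENT. The function name is hypothetical, and the sketch assumes one-byte components (GL_UNSIGNED_BYTE) and GL_UNPACK_ROW_LENGTH left at 0.

```cpp
#include <cstddef>

// Bytes OpenGL assumes between the starts of consecutive rows of client
// pixel data: the row's byte length rounded up to the next multiple of
// GL_UNPACK_ALIGNMENT (1, 2, 4, or 8).
// Assumption: one-byte components and GL_UNPACK_ROW_LENGTH == 0.
constexpr std::size_t unpack_row_stride(std::size_t width_px,
                                        std::size_t bytes_per_pixel,
                                        std::size_t alignment) {
    return (width_px * bytes_per_pixel + alignment - 1) / alignment * alignment;
}

// A 3-pixel-wide RGB row is 9 bytes of data.
static_assert(unpack_row_stride(3, 3, 4) == 12, "default alignment expects 3 padding bytes per row");
static_assert(unpack_row_stride(3, 3, 1) == 9, "alignment 1 matches tightly packed rows");
static_assert(unpack_row_stride(4, 3, 4) == 12, "widths divisible by 4 need no padding");
```

With tightly packed rows, the mismatch between 9 and 12 bytes is what skews the texture when the width is not a multiple of 4 and the alignment is left at its default.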

Related

Is it normal memcpy overwrites data it just wrote?

I use memcpy() to write data to a device. With a logic analyzer/PCIe analyzer, I can see the actual stores.
My device gets more stores than expected.
For example,
auto *data = new uint8_t[1024]();
for (int i=0; i<50; i++){
    memcpy((void *)(addr), data, i);
}
For i=9, I see these stores:
4B from byte 0 to 3
4B from byte 4 to 7
3B from byte 5 to 7 (1B-aligned only, re-writing the same data -> inefficient and useless store)
1B the byte 8
In the end, all the 9 Bytes are written but memcpy creates an extra store of 3B re-writing what it has already written and nothing more.
Is it the expected behavior? The question is for C and C++, I'm interested in knowing why this happens, it seems very inefficient.
Is it the expected behavior?
The expected behavior is that it can do anything it feels like (including writing past the end, especially in a "read 8 bytes into a register, modify the first byte in the register, then write 8 bytes" way) as long as the result works as if the rules for the C abstract machine were followed.
Using a logic analyzer/PCIe analyzer to see the actual stores is so far beyond the scope of "works as if the rules for the abstract machine were followed" that it's unreasonable to have any expectations.
Specifically, you can't assume the writes will happen in any specific order, can't assume anything about the size of any individual write, can't assume writes won't overlap, can't assume there won't be writes past the end of the area, can't assume writes will actually occur at all (without volatile), and can't even assume that CHAR_BIT isn't larger than 8 (or that memcpy(dest, source, 10); isn't asking to write 20 octets/"8 bit bytes").
If you need guarantees about writes, then you need to enforce those guarantees yourself (e.g. maybe create a structure of volatile fields to force the compiler to ensure writes happen in a specific order, maybe use inline assembly with explicit fences/barriers, etc).
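As a rough illustration of the volatile approach, the sketch below copies through a volatile pointer so that the compiler has to emit one byte-sized store per byte, in program order. The helper name is hypothetical. Note the assumption: volatile only constrains the compiler, so what actually appears on the bus still depends on how the memory is mapped (caching, write combining) and may need barriers on top.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies n bytes to dst using one volatile byte store
// per byte. The compiler may not drop, merge, or reorder these volatile
// accesses relative to each other.
inline void copy_bytes_volatile(volatile std::uint8_t* dst,
                                const std::uint8_t* src,
                                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];  // exactly one store to dst[i], never re-written
    }
}
```

This trades speed for predictability: a byte-at-a-time loop is typically much slower than memcpy, so a device that accepts 4-byte stores would probably want a word-sized variant of the same idea.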
The following illustrates why memcpy may be implemented this way.
To copy 9 bytes, starting at a 4-byte aligned address, memcpy issues these instructions (described as pseudo code):
Load four bytes from source+0 and store four bytes to destination+0.
Load four bytes from source+4 and store four bytes to destination+4.
Load four bytes from source+5 and store four bytes to destination+5.
The processor implements the store instructions with these data transfers in the hardware:
Since destination+0 is aligned, store 4 bytes to destination+0.
Since destination+4 is aligned, store 4 bytes to destination+4.
Since destination+5 is not aligned, store 3 bytes to destination+5 and store 1 byte to destination+8.
This is an easy and efficient way to write memcpy:
If length is less than four bytes, jump to separate code for that.
Loop copying four bytes until fewer than four bytes are left.
If length is not a multiple of four, copy four bytes from source+length−4 to destination+length−4.
That single step to copy the last few bytes may be more efficient than branching into three separate cases, one for each possible number of leftover bytes.
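The three steps above can be sketched in portable C++ as follows. This is only an illustration of the overlapping-tail technique, not the code of any actual memcpy implementation; the function name is made up, and the buffers are assumed not to overlap.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative copy using 4-byte chunks plus one overlapping 4-byte tail.
// Assumption: src and dst do not overlap.
inline void copy_with_overlapping_tail(unsigned char* dst,
                                       const unsigned char* src,
                                       std::size_t n) {
    if (n < 4) {  // step 1: short lengths take a separate path
        for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
        return;
    }
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // step 2: whole 4-byte chunks
        std::uint32_t word;
        std::memcpy(&word, src + i, 4);
        std::memcpy(dst + i, &word, 4);
    }
    if (i != n) {  // step 3: one chunk ending exactly at the last byte
        std::uint32_t word;
        std::memcpy(&word, src + n - 4, 4);
        std::memcpy(dst + n - 4, &word, 4);  // re-writes up to 3 bytes
    }
}
```

For n = 9 this issues chunks at offsets 0, 4, and 5, which is the pattern seen on the analyzer: the last chunk re-writes bytes 5 to 7 and adds byte 8.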

glVertexAttribPointer and stride parameter requirements

I've been having a weird bug on an OpenGL/GLES application I'm developing. On a certain device (Samsung Galaxy S8), it seems glVertexAttribPointer results in gibberish if the stride parameter is set to 18 bytes. The gibberish disappears if I add two bytes of padding to each component (20 bytes in total).
Note that no glGetError is triggered regardless.
This bug does not occur on any other mobile device I've tested on, neither does it occur on my Windows computer running regular OpenGL.
My guess is that the stride is required to be a multiple of four bytes, but I cannot seem to find any documentation verifying this.
Does anyone know if there are device specific requirements for the stride parameter?
(The 18 bytes consists of three float32 followed by three int16_t = 3*4bytes + 3*2 bytes)
Is the stride required by the specification to be aligned to 4 bytes? No.
Is there hardware that effectively has that requirement anyway? Yes, as evidenced by the fact that Vulkan has this requirement. So you should avoid misaligned data.
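One way to keep the stride a multiple of four is to make the padding explicit in the vertex type and let the compiler check the layout. This is a sketch with hypothetical field names, assuming a 4-byte float; the attribute layout is taken from the question (three float32 followed by three int16_t).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical vertex layout: 18 bytes of payload plus 2 bytes of padding.
struct PaddedVertex {
    float        position[3];  // 12 bytes
    std::int16_t extra[3];     //  6 bytes
    std::int16_t padding;      //  2 bytes, unused; brings the stride to 20
};

static_assert(sizeof(PaddedVertex) == 20, "stride should be 20 bytes");
static_assert(sizeof(PaddedVertex) % 4 == 0, "stride should be a multiple of 4");
static_assert(offsetof(PaddedVertex, extra) == 12, "int16 attributes start at byte 12");
```

sizeof(PaddedVertex) can then be passed as the stride, and the offsetof values as the attribute offsets, so the numbers given to glVertexAttribPointer cannot drift away from the actual layout.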

ARM V-8 with Scalable Vector Extension (SVE)

I came across the point that ARMv8 now supports variable-length vector registers from 128 bits to 2048 bits (Scalable Vector Extension, SVE).
It is always good to have wider registers to achieve data-level parallelism. But on what basis do we select the register size, from 128 bits to 2048 bits, to achieve maximum performance?
For example, I want to do Sobel filtering with a 3x3 mask on a 1920x1080 image. What register width do I need to select?
The Scalable Vector Extension is a module for the aarch64 execution state that extends the A64 Instruction Set and is focused on High-Performance Computing rather than media; for media you have NEON.
The register width is decided by the hardware designer/manufacturer, depending on what that implementation is trying to solve/do. The possible vector lengths are: 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048.
From the programmer's point of view, the programming model is Vector Length Agnostic, meaning that the same application will work on implementations with different register widths (vector lengths).
The specification is out; however, at the time of writing, there is no hardware available with SVE implemented. For the time being, you can use the ARM Instruction Emulator (armie) to run your programs.
So answering your question, unless you are manufacturing hardware, you need not select any specific vector length, as that would vary from one implementation to another.
Now if you are testing using armie, then you can select whichever you want.
SVE essentially implicitly increments loop indices for you based on the hardware defined vector width, so you don't have to worry about it.
Check out the worked out Daxpy example at: https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf to understand what that means in more detail, and this minimal runnable example with an assertion.
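To give a feel for what "vector length agnostic" means without requiring SVE hardware or intrinsics, here is a plain C++ emulation of a daxpy loop. It is not SVE code: the `lanes` parameter is a stand-in for the hardware vector length (2 doubles for 128 bits, 32 for 2048 bits), and the `active` count plays the role of the predicate that masks off lanes past the end of the array.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Scalar emulation of a vector-length-agnostic daxpy: y = a * x + y.
// `lanes` models how many doubles the hardware processes per step; the loop
// body never hard-codes it, so the same code works for any width.
inline void daxpy_vla(double a, const std::vector<double>& x,
                      std::vector<double>& y, std::size_t lanes) {
    if (lanes == 0) return;  // guard for this emulation only
    const std::size_t n = std::min(x.size(), y.size());
    for (std::size_t i = 0; i < n; i += lanes) {      // step by vector length
        const std::size_t active = std::min(lanes, n - i);  // "predicate"
        for (std::size_t lane = 0; lane < active; ++lane) {
            y[i + lane] += a * x[i + lane];
        }
    }
}
```

Calling it with lanes = 2 or lanes = 32 gives the same result, which is the property that lets one binary run on implementations with different vector lengths.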

Why booleans take a whole byte? [duplicate]

In C++,
Why is a boolean 1 byte and not 1 bit of size?
Why aren't there types like a 4-bit or 2-bit integers?
I'm missing these things when writing an emulator for a CPU.
Because the CPU can't address anything smaller than a byte.
From Wikipedia:
Historically, a byte was the number of bits used to encode a single character of text in a computer and it is for this reason the basic addressable element in many computer architectures.
So the byte is the basic addressable unit, below which a computer architecture cannot address. And since there (probably) don't exist computers that support a 4-bit byte, you don't have a 4-bit bool etc.
However, if you can design an architecture that uses 4 bits as its basic addressable unit, then you will have a bool of size 4 bits, on that computer only!
Back in the old days when I had to walk to school in a raging blizzard, uphill both ways, and lunch was whatever animal we could track down in the woods behind the school and kill with our bare hands, computers had much less memory available than today. The first computer I ever used had 6K of RAM. Not 6 megabytes, not 6 gigabytes, 6 kilobytes. In that environment, it made a lot of sense to pack as many booleans into an int as you could, and so we would regularly use operations to take them out and put them in.
Today, when people will mock you for having only 1 GB of RAM, and the only place you could find a hard drive with less than 200 GB is at an antique shop, it's just not worth the trouble to pack bits.
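For reference, the kind of "take them out and put them in" operations meant here look roughly like this. The helper names are made up for illustration, and `bit` is assumed to be in the range 0 to 31.

```cpp
#include <cstdint>

// Pack up to 32 booleans into one 32-bit word. Assumption: bit < 32.
constexpr std::uint32_t set_flag(std::uint32_t word, unsigned bit) {
    return word | (std::uint32_t{1} << bit);   // put a 1 in
}
constexpr std::uint32_t clear_flag(std::uint32_t word, unsigned bit) {
    return word & ~(std::uint32_t{1} << bit);  // put a 0 in
}
constexpr bool test_flag(std::uint32_t word, unsigned bit) {
    return ((word >> bit) & 1u) != 0;          // take one out
}

static_assert(test_flag(set_flag(0, 5), 5), "bit 5 is set");
static_assert(!test_flag(clear_flag(set_flag(0, 5), 5), 5), "bit 5 is cleared again");
```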
The easiest answer is that the CPU addresses memory in bytes and not in bits, and bitwise operations are very slow.
However it's possible to use bit-size allocation in C++. There's std::vector specialization for bit vectors, and also structs taking bit sized entries.
Because a byte is the smallest addressable unit in the language.
But you can make bool take 1 bit, for example if you have a bunch of them, e.g. in a struct, like this:
struct A
{
    bool a:1, b:1, c:1, d:1, e:1;
};
You could have 1-bit bools and 4 and 2-bit ints. But that would make for a weird instruction set for no performance gain because it's an unnatural way to look at the architecture. It actually makes sense to "waste" a better part of a byte rather than trying to reclaim that unused data.
The only app that bothers to pack several bools into a single byte, in my experience, is SQL Server.
You can use bit fields to get integers smaller than a byte.
struct X
{
    int val:4; // 4 bit int.
};
Though it is usually used to map structures to exact hardware expected bit patterns:
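// 8 bits of fields. With an unsigned char base type this is typically a
// 1 byte value (on a system where 8 bits is a byte); with int fields, most
// compilers pad the struct to sizeof(int).
struct SomThing
{
    unsigned char p1:4; // 4 bit field
    unsigned char p2:3; // 3 bit field
    unsigned char p3:1; // 1 bit
};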
bool can be one byte, the smallest addressable size of the CPU, or it can be bigger. It's not unusual for bool to be the size of int for performance purposes. If for specific purposes (say hardware simulation) you need a type with N bits, you can find a library for that (e.g. the GBL library has a BitSet<N> class). If you are concerned with the size of bool (you probably have a big container), then you can pack bits yourself, or use std::vector<bool>, which will do it for you (be careful with the latter, as it doesn't satisfy the container requirements).
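A short sketch of why std::vector<bool> is the odd one out: because a single bit has no address, indexing it returns a proxy object rather than a bool&. The helper function names below are hypothetical; std::bitset is shown alongside as the fixed-size standard alternative.

```cpp
#include <bitset>
#include <cstddef>
#include <type_traits>
#include <vector>

// vector<int> hands out real references; vector<bool> hands out a proxy class.
static_assert(std::is_same<std::vector<int>::reference, int&>::value,
              "ordinary vectors return T&");
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool> cannot return bool&");

// Hypothetical helper: flips element i through the proxy, returns the new value.
inline bool toggle_through_proxy(std::vector<bool>& v, std::size_t i) {
    auto bit = v[i];  // a proxy bound to the packed bit, not a bool copy
    bit = !bit;       // writes through to the vector's storage
    return v[i];
}

// Hypothetical helper: number of set flags in a fixed-size packed set.
inline std::size_t count_set(const std::bitset<16>& flags) {
    return flags.count();
}
```

The `auto bit = v[i]` line is the usual surprise: with any other element type it would copy the value, while here it aliases the element.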
Think about how you would implement this at your emulator level...
bool a[10] = {false};
bool &rbool = a[3];
bool *pbool = a + 3;
assert(pbool == &rbool);
rbool = true;
assert(*pbool);
*pbool = false;
assert(!rbool);
Because in general, the CPU allocates memory with 1 byte as the basic unit, although some CPUs like MIPS use a 4-byte word.
However, vector treats bool in a special fashion: with vector<bool>, one bit is allocated for each bool.
The byte is the smallest unit of digital data storage in a computer. The RAM in a computer has millions of bytes, and each of them has an address. If there were an address for every bit, a computer could manage 8 times less RAM than it can now.
More info: Wikipedia
Even when the minimum size possible is 1 Byte, you can have 8 bits of boolean information on 1 Byte:
http://en.wikipedia.org/wiki/Bit_array
Julia language has BitArray for example, and I read about C++ implementations.
Bitwise operations are not 'slow'.
And/Or operations tend to be fast.
The problem is alignment and the simple problem of solving it.
As the other answers partially explained, CPUs are generally designed to read aligned bytes, and RAM/memory is designed the same way.
So data compression to use less memory space would have to be explicitly requested.
As one answer suggested, you could request a specific number of bits per value in a struct. But what do the CPU and memory do with it afterward if it's not aligned? Addresses advance by +1, +2, or +4; there is no +1.5 if you want a value half the size. So the remaining space has to be filled in or left blank anyway, and the next read comes from the next aligned location, which is aligned by 1 at minimum and usually by 4 (32-bit) or 8 (64-bit) by default. The CPU will generally grab the byte or int value that contains your flags, and then you check or set the ones you need. So you must still define memory as int, short, byte, or another proper size, but when accessing and setting the value you can explicitly compress the data and store those flags in it to save space. Many people are unaware of how this works, or skip the step whenever they have on/off or flag-present values, even though saving space in sent/received memory is quite useful in mobile and other constrained environments.
Splitting an int into bytes has little value, as you can just define the bytes individually (e.g. int 4Bytes; vs byte Byte1; byte Byte2; byte Byte3; byte Byte4;), so using int there is redundant. However, in easier virtual environments like Java, most types (numbers, boolean, etc.) might be defined as int, and in that case you could take advantage of an int by dividing it up and using its bytes/bits, for an ultra-efficient app that has to send fewer integers of data (aligned by 4). Managing bits could be called redundant, but it is one of many optimizations where bitwise operations are superior though not always needed; many times people take advantage of generous memory limits by just storing booleans as integers, wasting 500%-1000% or so of memory space anyway.
It still easily has its uses. Combined with other optimizations, for on-the-go and other data streams that carry only bytes or a few KB of data, it can make the difference in whether something loads fast, or loads at all, so reducing the bytes sent could ultimately benefit you a lot, even if you could get away with sending tons of unneeded data over an everyday internet connection or app. It is definitely something you should do when designing an app for mobile users, and something even big corporate apps fail at nowadays, using too much space and load time that could be half or lower. An app that piles on unknown packages/plugins requiring at minimum many hundreds of KB or 1 MB before it loads will load and act slower than one designed for speed that requires, say, 1 KB or only a few KB. Users with data constraints will experience that difference, even if for you loading wasteful MB or thousands of KB of unneeded data is fast.

OpenGL - How is GLenum a unsigned 32 bit Integer?

To begin there are 8 types of Buffer Objects in OpenGL:
GL_ARRAY_BUFFER
GL_ELEMENT_ARRAY_BUFFER
GL_COPY_READ_BUFFER
...
They are enums, or more specifically GLenums, where GLenum is an unsigned 32 bit integer that has values up to 4,294,967,295, so to say.
Most of the uses of buffer objects involve binding them to a certain target like this: e.g.
glBindBuffer (GL_ARRAY_BUFFER, Buffers [size]);
[void glBindBuffer (GLenum target, GLuint buffer)] documentation
My question is: if it's an enum, its only values must be 0, 1, 2, 3, 4 ... 7 respectively, so why go all the way and make it a 32 bit integer if it only has values up to 7? Pardon my knowledge of CS and OpenGL, it just seems unethical.
Enums aren't used just for the buffers, but everywhere a symbolic constant is needed. Currently, several thousand enum values are assigned (look into your GL.h and the latest glext.h). Note that vendors get allocated their official enum ranges so they can implement vendor-specific extensions without interfering with others, so a 32-bit enum space is not a bad idea. Furthermore, on modern CPU architectures, using less than 32 bits won't be any more efficient, so this is not a problem performance-wise.
UPDATE:
As Andon M. Coleman pointed out, currently only 16-bit enumerant ranges are being allocated. It might be useful to link to the OpenGL Enumerant Allocation Policies, which also has the following remark:
Historically, enumerant values for some single-vendor extensions were allocated in blocks of 1000, beginning with the block [102000,102999] and progressing upward. Values in this range cannot be represented as 16-bit unsigned integers. This imposes a significant and unnecessary performance penalty on some implementations. Such blocks that have already been allocated to vendors will remain allocated unless and until the vendor voluntarily releases the entire block, but no further blocks in this range will be allocated.
Most of these seem to have been removed in favor of 16 Bit values, but 32 Bit values have been in use. In the current glext.h, one still can find some (obsolete) enumerants above 0xffff, like
#ifndef GL_PGI_misc_hints
#define GL_PGI_misc_hints 1
#define GL_PREFER_DOUBLEBUFFER_HINT_PGI 0x1A1F8
#define GL_CONSERVE_MEMORY_HINT_PGI 0x1A1FD
#define GL_RECLAIM_MEMORY_HINT_PGI 0x1A1FE
...
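A quick compile-time check makes both points concrete: the buffer targets are not numbered 0 to 7, and the PGI hint really does fall outside the 16-bit range. The constants are redefined locally (under made-up names) so the sketch compiles without any GL headers; the values themselves are the ones from the standard headers and the listing above.

```cpp
#include <cstdint>

// Local copies of header values, so no OpenGL header is needed here.
constexpr std::uint32_t kArrayBuffer = 0x8892;                 // GL_ARRAY_BUFFER
constexpr std::uint32_t kPreferDoubleBufferHintPGI = 0x1A1F8;  // GL_PREFER_DOUBLEBUFFER_HINT_PGI

constexpr bool fits_in_16_bits(std::uint32_t value) {
    return value <= 0xFFFFu;
}

static_assert(fits_in_16_bits(kArrayBuffer), "typical enumerants fit in 16 bits");
static_assert(!fits_in_16_bits(kPreferDoubleBufferHintPGI), "this one needs more than 16 bits");
static_assert(kPreferDoubleBufferHintPGI == 107000, "0x1A1F8 sits in a historical block of 1000");
```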
Why would you use a short anyway? In what situation would you ever save more than 8k of RAM (if the reports of nearly a thousand GLenums are correct) by using a short or uint8_t instead of GLuint for enums and const declarations? Considering the potential hardware incompatibilities and cross-platform bugs you would introduce, it's kind of odd to try to save something like 8k of RAM even in the context of the original 2mb Voodoo3d graphics hardware, much less the SGI super-computer farms OpenGL was created for.
Besides, modern x86 and GPU hardware aligns on 32 or 64 bits at a time; you would actually stall the operation of the CPU/GPU, as 24 or 56 bits of the register would have to be zeroed out and THEN read/written to, whereas it could operate on the standard int as soon as it was copied in. From the start of OpenGL, compute resources have tended to be more valuable than memory: while you might do billions of state changes during a program's life, you'd be saving about 10kb (kilobytes) of RAM max if you replaced every 32 bit GLuint enum with a uint8_t one. I'm trying so hard not to be extra-cynical right now, heh.
For example, one valid reason for things like uint8_t and the like is for large data buffers/algorithms where data fits in that bit-depth. 1024 ints vs 1024 uint8_t variables on the stack differ by about 3k (with 4-byte ints); are we going to split hairs over that? Now consider a 4k raw bitmap image of 4000*2500*32 bits: we're talking about 40 MB, and it would be 8 times the size if we used 64 bits per channel in place of standard 8 bit RGBA8 buffers, or quadruple the size if we used 32 bits per channel. Multiply that by the number of textures open or pictures saved, and swapping a bit of CPU operations for all that extra memory makes sense, especially in the context of that type of work.
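Working the image arithmetic through, under the assumption of four channels (RGBA) and no row padding (the function name is made up for this sketch):

```cpp
#include <cstdint>

// Bytes needed for an uncompressed RGBA image with the given channel depth.
// Assumption: four channels, no padding between rows.
constexpr std::uint64_t rgba_image_bytes(std::uint64_t width,
                                         std::uint64_t height,
                                         std::uint64_t bits_per_channel) {
    return width * height * 4 * bits_per_channel / 8;
}

static_assert(rgba_image_bytes(4000, 2500, 8) == 40000000, "RGBA8: 40 MB");
static_assert(rgba_image_bytes(4000, 2500, 32) == 160000000, "32 bits per channel: 4x");
static_assert(rgba_image_bytes(4000, 2500, 64) == 320000000, "64 bits per channel: 8x");
```

At this scale the element type decides between tens and hundreds of megabytes per image, which is a different order of concern from the few kilobytes at stake for enum constants.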
That is where using a non-standard integer type makes sense. Unless you're on a 64k machine or something (like an old-school beeper; good luck running OpenGL on that), trying to save a few bits of memory on something like a const declaration or reference counter is just wasting everyone's time.