C++: Is bool a 1-bit variable?

I was wondering if bools in C++ are actually 1-bit variables.
I am working on a PMM for my kernel and using (maybe multidimensional) bool arrays would be quite nice. But I don't want to waste space if a bool in C++ is 8 bits long...
EDIT: Is a bool[8] then 1 byte long? Or 8 bytes? Could I maybe declare something like bool bByte[8] __attribute__((packed)); when using gcc?
And as I said: I am coding a kernel, so I can't include the standard libraries.

No, there's no such thing as a 1-bit variable.
The smallest unit that can be addressed in C++ is an unsigned char.
Is a bool[8] then 1 byte long?
No.
Or 8 bytes?
Not necessarily. It depends on the target machine's number of bits in an unsigned char.
But I don't want to waste space if a bool in C++ is 8 bits long...
You can avoid wasting space when dealing with bits by using std::bitset, or boost::dynamic_bitset if you need dynamic sizing.
As pointed out by @zett42 in their comment, you can also address single bits with a bitfield struct (but for reasons of cache alignment this will probably use even more space):
struct S {
    // will usually occupy 4 bytes:
    unsigned b1 : 1,
             b2 : 1,
             b3 : 1;
};
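Since the question mentions a freestanding kernel where the standard library isn't available, a minimal hand-rolled packed bit array along the same lines might look like the following sketch (BitArray and its members are made-up names, not a standard facility):

#include <stdint.h>   // freestanding headers, normally usable even in a kernel
#include <stddef.h>

// Minimal packed bit array: N bits stored in ceil(N / 8) bytes.
template <size_t N>
struct BitArray {
    uint8_t data[(N + 7) / 8] = {};

    void set(size_t i)        { data[i / 8] |= (uint8_t)(1u << (i % 8)); }
    void clear(size_t i)      { data[i / 8] &= (uint8_t)~(1u << (i % 8)); }
    bool test(size_t i) const { return (data[i / 8] >> (i % 8)) & 1u; }
};

// Hypothetical PMM usage: one bit per physical page frame.
// BitArray<32768> frame_used;   // 32768 frames tracked in 4096 bytes
// frame_used.set(42);           // mark frame 42 as allocated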

A bool uses at least one (and maybe more) byte of storage, so yes, at least 8 bits.
A vector<bool>, however, normally stores a bool in only one bit, with some cleverness in the form of proxy iterators and such to (mostly) imitate access to actual bool objects, even though that's not what they store. The original C++ standard required this. More recent ones have relaxed the requirements to allow a vector<bool> to actually be what you'd normally expect (i.e., just a bunch of bool objects). Despite the relaxed requirements, however, a fair number of implementations continue to store them in packed form in a vector<bool>.
Note, however, that the same is not true of other container types--for example, a list<bool> or deque<bool> cannot use a bit-packed representation.
Also note that due to the requirement for a proxy iterator (and such) a vector<bool> that uses a bit-packed representation for storage can't meet the requirements imposed on normal containers, so you need to be careful in what you expect from them.
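As a small illustration of that proxy behavior, assuming a bit-packed vector<bool> (which is what mainstream implementations ship), note that operator[] does not hand back a bool&:

#include <vector>
#include <type_traits>

int main() {
    std::vector<bool> v(10);
    auto r = v[3];            // r is vector<bool>::reference, a proxy object, not bool&
    r = true;                 // writing through the proxy flips the packed bit in v
    static_assert(!std::is_same<decltype(v[3]), bool&>::value,
                  "operator[] yields a proxy, not a reference to a real bool object");
    bool copy = v[3];         // converting the proxy to a plain bool works as expected
    return copy ? 0 : 1;
}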

The smallest unit of addressable memory is a char. A bool[N] or std::array<bool, N> will use as much space as a char[N] or std::array<char, N>.
It is permitted by the standard (although not required) that implementations of std::vector<bool> may be specialized to pack bits together.
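A quick compile-time check of that point, assuming the common (but not guaranteed) case of sizeof(bool) == 1:

#include <array>

static_assert(sizeof(bool) == 1, "typical, but implementation-defined");
static_assert(sizeof(bool[8]) == sizeof(char[8]),
              "bool[8] occupies 8 chars' worth of storage, not 8 bits");
static_assert(sizeof(std::array<bool, 8>) == sizeof(std::array<char, 8>),
              "holds on common implementations");

int main() { return 0; }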

Related

How the size of a struct containing bitset fields is calculated [duplicate]


How do I organize members in a struct to waste the least space on alignment?

[Not a duplicate of Structure padding and packing. That question is about how and when padding occurs. This one is about how to deal with it.]
I have just realized how much memory is wasted as a result of alignment in C++. Consider the following simple example:
struct X
{
int a;
double b;
int c;
};
int main()
{
cout << "sizeof(int) = " << sizeof(int) << '\n';
cout << "sizeof(double) = " << sizeof(double) << '\n';
cout << "2 * sizeof(int) + sizeof(double) = " << 2 * sizeof(int) + sizeof(double) << '\n';
cout << "but sizeof(X) = " << sizeof(X) << '\n';
}
When using g++ the program gives the following output:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 24
That's 50% memory overhead! In a 3-gigabyte array of 134'217'728 Xs, 1 gigabyte would be pure padding.
Fortunately, the solution to the problem is very simple - we simply have to swap double b and int c around:
struct X
{
int a;
int c;
double b;
};
Now the result is much more satisfying:
sizeof(int) = 4
sizeof(double) = 8
2 * sizeof(int) + sizeof(double) = 16
but sizeof(X) = 16
There is however a problem: this isn't cross-compatible. Yes, under g++ an int is 4 bytes and a double is 8 bytes, but that's not necessarily always true (their alignment doesn't have to be the same either), so under a different environment this "fix" could not only be useless, but it could also potentially make things worse by increasing the amount of padding needed.
Is there a reliable cross-platform way to solve this problem (minimize the amount of needed padding without suffering from decreased performance caused by misalignment)? Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Clarification
Due to misunderstanding and confusion, I'd like to emphasize that I don't want to "pack" my struct. That is, I don't want its members to be unaligned and thus slower to access. Instead, I still want all members to be self-aligned, but in a way that uses the least memory on padding. This could be solved by using, for example, manual rearrangement as described here and in The Lost Art of Packing by Eric Raymond. I am looking for an automated and as much cross-platform as possible way to do this, similar to what is described in proposal P1112 for the upcoming C++20 standard.
(Don't apply these rules without thinking. See ESR's point about cache locality for members you use together. And in multi-threaded programs, beware false sharing of members written by different threads. Generally you don't want per-thread data in a single struct at all for this reason, unless you're doing it to control the separation with a large alignas(128). This applies to atomic and non-atomic vars; what matters is threads writing to cache lines regardless of how they do it.)
Rule of thumb: largest to smallest alignof(). There's nothing you can do that's perfect everywhere, but by far the most common case these days is a sane "normal" C++ implementation for a normal 32 or 64-bit CPU. All primitive types have power-of-2 sizes.
Most types have alignof(T) = sizeof(T), or alignof(T) capped at the register width of the implementation. So larger types are usually more-aligned than smaller types.
Struct-packing rules in most ABIs give struct members their absolute alignof(T) alignment relative to the start of the struct, and the struct itself inherits the largest alignof() of any of its members.
Put always-64-bit members first (like double, long long, and int64_t). ISO C++ of course doesn't fix these types at 64 bits / 8 bytes, but in practice on all CPUs you care about they are. People porting your code to exotic CPUs can tweak struct layouts to optimize if necessary.
then pointers and pointer-width integers: size_t, intptr_t, and ptrdiff_t (which may be 32 or 64-bit). These are all the same width on normal modern C++ implementations for CPUs with a flat memory model.
Consider putting linked-list and tree left/right pointers first if you care about x86 and Intel CPUs. Pointer-chasing through nodes in a tree or linked list has penalties when the struct start address is in a different 4k page than the member you're accessing. Putting them first guarantees that can't be the case.
then long (which is sometimes 32-bit even when pointers are 64-bit, in LLP64 ABIs like Windows x64). But it's guaranteed at least as wide as int.
then 32-bit int32_t, int, float, enum. (Optionally separate int32_t and float ahead of int if you care about possible 8 / 16-bit systems that still pad those types to 32-bit, or do better with them naturally aligned. Most such systems don't have wider loads (FPU or SIMD) so wider types have to be handled as multiple separate chunks all the time anyway).
ISO C++ allows int to be as narrow as 16 bits, or arbitrarily wide, but in practice it's a 32-bit type even on 64-bit CPUs. ABI designers found that programs designed to work with 32-bit int just waste memory (and cache footprint) if int was wider. Don't make assumptions that would cause correctness problems, but for "portable performance" you just have to be right in the normal case.
People tuning your code for exotic platforms can tweak if necessary. If a certain struct layout is perf-critical, perhaps comment on your assumptions and reasoning in the header.
then short / int16_t
then char / int8_t / bool
(for multiple bool flags, especially if read-mostly or if they're all modified together, consider packing them with 1-bit bitfields.)
(For unsigned integer types, find the corresponding signed type in my list.)
A multiple-of-8 byte array of narrower types can go earlier if you want it to. But if you don't know the exact sizes of types, you can't guarantee that int i + char buf[4] will fill an 8-byte aligned slot between two doubles. But it's not a bad assumption, so I'd do it anyway if there was some reason (like spatial locality of members accessed together) for putting them together instead of at the end.
Exotic types: x86-64 System V has alignof(long double) = 16, but i386 System V has only alignof(long double) = 4, sizeof(long double) = 12. It's the x87 80-bit type, which is actually 10 bytes but padded to 12 or 16 so it's a multiple of its alignof, making arrays possible without violating the alignment guarantee.
And in general it gets trickier when your struct members themselves are aggregates (struct or union) with a sizeof(x) != alignof(x).
Another twist is that in some ABIs (e.g. 32-bit Windows if I recall correctly) struct members are aligned to their size (up to 8 bytes) relative to the start of the struct, even though alignof(T) is still only 4 for double and int64_t.
This is to optimize for the common case of separate allocation of 8-byte aligned memory for a single struct, without giving an alignment guarantee. i386 System V also has the same alignof(T) = 4 for most primitive types (but malloc still gives you 8-byte aligned memory because alignof(max_align_t) = 8). But anyway, i386 System V doesn't have that struct-packing rule, so (if you don't arrange your struct from largest to smallest) you can end up with 8-byte members under-aligned relative to the start of the struct.
Most CPUs have addressing modes that, given a pointer in a register, allow access to any byte offset. The max offset is usually very large, but on x86 it saves code size if the byte offset fits in a signed byte ([-128 .. +127]). So if you have a large array of any kind, prefer putting it later in the struct after the frequently used members. Even if this costs a bit of padding.
Your compiler will pretty much always make code that has the struct address in a register, not some address in the middle of the struct to take advantage of short negative displacements.
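Putting those rules of thumb together, here is a hedged sketch (Node and its members are made-up names; the sizes in the comments are typical x86-64 values, not guarantees):

#include <cstdint>

struct Node {
    Node*   next;     // 8 bytes on a typical 64-bit ABI (pointer first, as noted above)
    double  weight;   // 8
    int32_t id;       // 4
    int32_t count;    // 4
    float   ratio;    // 4
    int16_t flags;    // 2
    char    tag;      // 1
    bool    dirty;    // 1
};                    // 32 bytes with no padding on that ABI

static_assert(sizeof(Node) == 32,
              "expected on a typical 64-bit ABI; adjust or remove on other targets");

Declaring the same members in a careless order (bool first, then double, then int16_t, then the pointer, and so on) can easily pad the struct out to 48 bytes on the same ABI.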
Eric S. Raymond wrote an article The Lost Art of Structure Packing. Specifically the section on Structure reordering is basically an answer to this question.
He also makes another important point:
9. Readability and cache locality
While reordering by size is the simplest way to eliminate slop, it’s not necessarily the right thing. There are two more issues: readability and cache locality.
In a large struct that can easily be split across a cache-line boundary, it makes sense to put two things near each other if they're always used together. Or even contiguous to allow load/store coalescing, e.g. copying 8 or 16 bytes with one (unaligned) integer or SIMD load/store instead of separately loading smaller members.
Cache lines are typically 32 or 64 bytes on modern CPUs. (On modern x86, always 64 bytes. And Sandybridge-family has an adjacent-line spatial prefetcher in L2 cache that tries to complete 128-byte pairs of lines, separate from the main L2 streamer HW prefetch pattern detector and L1d prefetching).
Fun fact: Rust allows the compiler to reorder structs for better packing, or other reasons. IDK if any compilers actually do that, though. Probably only possible with link-time whole-program optimization if you want the choice to be based on how the struct is actually used. Otherwise separately-compiled parts of the program couldn't agree on a layout.
(@alexis posted a link-only answer linking to ESR's article, so thanks for that starting point.)
gcc has the -Wpadded warning that warns when padding is added to a structure:
https://godbolt.org/z/iwO5Q3:
<source>:4:12: warning: padding struct to align 'X::b' [-Wpadded]
4 | double b;
| ^
<source>:1:8: warning: padding struct size to alignment boundary [-Wpadded]
1 | struct X
| ^
And you can manually rearrange members so that there is less / no padding. But this is not a cross platform solution, as different types can have different sizes / alignments on different system (Most notably pointers being 4 or 8 bytes on different architectures). The general rule of thumb is go from largest to smallest alignment when declaring members, and if you're still worried, compile your code with -Wpadded once (But I wouldn't keep it on generally, because padding is necessary sometimes).
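If a particular layout is important, one hedged way to lock in the expectation (alongside an occasional -Wpadded build) is a compile-time check, so a port or refactor that reintroduces padding fails loudly instead of silently:

struct X
{
    double b;
    int a;
    int c;
};

// Fails to compile if some target or ABI pads X differently than expected;
// the expression uses sizeof, so it adapts to the actual int/double widths.
static_assert(sizeof(X) == sizeof(double) + 2 * sizeof(int),
              "unexpected padding in X; revisit the member order for this target");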
As for why the compiler can't do it automatically: the standard ([class.mem]/19) guarantees that, because this is a simple struct with only public members, &x.a < &x.c (for some X x;), so the members can't be rearranged.
There really isn't a portable solution in the generic case. Barring the minimal requirements the standard imposes, types can be any size the implementation wants to make them.
To go along with that, the compiler is not allowed to reorder class members to make the layout more efficient. The standard mandates that the objects must be laid out in their declared order (per access modifier), so that's out as well.
You can use fixed width types like
struct foo
{
    int64_t a;
    int16_t b;
    int8_t  c;
    int8_t  d;
};
and the member sizes will be the same on all platforms that supply those types, but it only works with integer types. There are no fixed-width floating-point types, and many standard objects/containers can be different sizes on different platforms.
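One hedged caveat worth adding to that example: the member sizes are pinned down, but the struct can still carry trailing padding wherever alignof(int64_t) is 8, so the total size is not automatically identical everywhere. A quick compile-time sanity check:

#include <cstdint>

struct foo
{
    int64_t a;
    int16_t b;
    int8_t  c;
    int8_t  d;
};

static_assert(sizeof(foo::a) + sizeof(foo::b) + sizeof(foo::c) + sizeof(foo::d) == 12,
              "the members themselves always total 12 bytes");
// sizeof(foo) is typically 16 where alignof(int64_t) == 8 (4 bytes of tail padding),
// but can be 12 where int64_t is only 4-byte aligned (e.g. i386 System V).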
Mate, if you have 3 GB of data, you should probably approach the issue another way than by swapping data members around.
Instead of using 'array of struct', 'struct of arrays' could be used.
So say
struct X
{
    int a;
    double b;
    int c;
};
constexpr size_t ArraySize = 1'000'000;
X my_data[ArraySize];
is going to become
constexpr size_t ArraySize = 1'000'000;
struct X
{
    int a[ArraySize];
    double b[ArraySize];
    int c[ArraySize];
};
X my_data;
Each element is still easily accessible: my_data.a[i] = 5; my_data.b[i] = 1.5f; ....
There is no padding (except a few bytes between the arrays). The memory layout is cache-friendly. The prefetcher handles reading sequential memory blocks from a few separate memory regions.
That's not as unorthodox as it might look at first glance. This approach is widely used for SIMD and GPU programming.
Array of Structures (AoS) vs. Structure of Arrays (SoA)
This is a textbook memory-vs-speed problem. The padding is to trade memory for speed. You can't say:
I don't want to "pack" my struct.
because pragma pack is the tool invented exactly to make this trade the other way: speed for memory.
Is there a reliable cross-platform way
No, there can't be any. Alignment is a strictly platform-dependent issue. The sizes of different types are a platform-dependent issue. Avoiding padding by reorganizing is platform-dependent squared.
Speed, memory, and cross-platform - you can have only two.
Why doesn't the compiler perform such optimizations (swap struct/class members around to decrease padding)?
Because the C++ specifications specifically guarantee that the compiler won't mess up your meticulously organized structs. Imagine you have four floats in a row. Sometimes you use them by name, and sometimes you pass them to a method that takes a float[3] parameter.
You're proposing that the compiler should shuffle them around, potentially breaking all the code written since the 1970s. And for what reason? Can you guarantee that every programmer ever will actually want to save your 8 bytes per struct? I, for one, am sure that if I have a 3 GB array, I have bigger problems than a gigabyte more or less.
Although the Standard grants implementations broad discretion to insert arbitrary amounts of space between structure members, that's because the authors didn't want to try to guess all the situations where padding might be useful, and the principle "don't waste space for no reason" was considered self-evident.
In practice, almost every commonplace implementation for commonplace hardware will use primitive objects whose size is a power of two, and whose required alignment is a power of two that is no larger than the size. Further, almost every such implementation will place each member of a struct at the first available multiple of its alignment that completely follows the previous member.
Some pedants will squawk that code which exploits that behavior is "non-portable". To them I would reply
C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine specific code is one of the strengths of C.
As a slight extension to that principle, the ability of code which need only run on 90% of machines to exploit features common to that 90% of machines--even though such code wouldn't exactly be "machine-specific"--is one of the strengths of C. The notion that C programmers shouldn't be expected to bend over backward to accommodate limitations of architectures which for decades have only been used in museums should be self-evident, but apparently isn't.
You can use #pragma pack(1), but the very reason padding exists is that the compiler optimizes for speed: accessing a variable at its natural alignment through a full register is faster than extracting it bit by bit.
Specific packing is only useful for serialization, inter-compiler compatibility, and the like.
As NathanOliver correctly added, this might even fail on some platforms.
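For completeness, a minimal sketch of what that trade looks like with the MSVC/GCC-style #pragma pack (the sizes in the comments are typical, not guaranteed):

#include <cstdint>

struct Padded {
    char     c;
    uint32_t v;
};               // typically 8 bytes: 3 padding bytes keep v 4-byte aligned

#pragma pack(push, 1)
struct Packed {
    char     c;
    uint32_t v;
};               // typically 5 bytes: no padding, but v may be misaligned
#pragma pack(pop)

// Access to Packed::v can be slower, and on some architectures taking its address
// and dereferencing it elsewhere can even fault, as noted above.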

Range of pointer values on 64 bit systems

Recently I was reading about the small string optimization (SSO): What are the mechanics of short string optimization in libc++?. As we know, a string typically consists of 3 pointers, which is 24 bytes on a 64 bit system. The linked answer says that in libc++'s implementation, the very first bit of the first pointer is used to indicate whether the string is in "long" or "short" mode, i.e. heap allocation and external storage vs internal storage of up to some 22 characters.
This assumes, however, that the first bit of the first pointer cannot ever meaningfully be part of the address, because whenever the string is in "long" mode, that bit will always be set (or unset, depending on which convention was chosen). This seems reasonable on its face, since 64-bit pointers allow 2^64 addresses, which is more than 1 followed by 18 zeroes in bytes, or more than a billion gigabytes.
So this is reasonable, though not certain. My question is: is this guaranteed somewhere? And if it is guaranteed, where is it guaranteed? By the architecture spec, or by something else? To take it a step further: how many bits is it safe to do this with? I have a vague recollection of reading somewhere that only 48 bits are used, but I don't recall where.
If there are some number of bits, e.g. 8 or 16 that are guaranteed to be untouched, that is certainly something that could be leveraged in some interesting ways. It would be nice to exploit this, but not at the cost of having code mysteriously failure on some machine.
As we know, a string typically consists of 3 pointers, which is 24 bytes on a 64 bit system.
This is not true with libc++. The __long structure, for "long strings" is defined as:
struct __long
{
size_type __cap_;
size_type __size_;
pointer __data_;
};
The short flag therefore goes into the capacity field, making the whole thing moot.
As for pointer tagging, there is no universal guarantee about the usable size of a pointer. On x86_64, the data structures that the CPU uses for virtual address translation only use 48 bits of the virtual address (or 57 with 5-level paging), so virtual addresses never use the upper 16 (or 7) bits. Additionally, most operating systems map their kernel into every process and reserve some amount of the high end of the address space for it, so in practice, user-mode pointers are even more restricted. On Windows, the most significant hardware-usable bit of a pointer tells whether it belongs to kernel-space or user-space.
These limits can change in the future and will vary across platforms, so it would be bad form to use them in a platform-independent standard library. In general, it's much better practice to use the least-significant bits for pointer tagging, since your application is in control of these.
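A hedged sketch of that suggestion: tag the least-significant bit of a pointer whose pointee is at least 2-byte aligned, so bit 0 of any valid pointer is known to be zero (tag_ptr/untag_ptr are made-up names for illustration):

#include <cstdint>
#include <cassert>

// Assumes the pointee type has alignment of at least 2, so bit 0 of any valid pointer is 0.
template <typename T>
uintptr_t tag_ptr(T* p, bool flag) {
    uintptr_t bits = reinterpret_cast<uintptr_t>(p);
    assert((bits & 1u) == 0 && "pointer must be at least 2-byte aligned");
    return bits | static_cast<uintptr_t>(flag);
}

template <typename T>
T* untag_ptr(uintptr_t bits) {
    return reinterpret_cast<T*>(bits & ~uintptr_t{1});
}

inline bool tag_of(uintptr_t bits) { return bits & 1u; }

// int x = 42;
// uintptr_t t = tag_ptr(&x, true);
// assert(untag_ptr<int>(t) == &x && tag_of(t));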
The "long-bit" isn't part of a pointer, but of the capacity:
struct __long
{
size_type __cap_;
size_type __size_;
pointer __data_;
};
The "trick" is that if you always allocate an even number of characters and reserve one for the nul terminator, the resulting capacity will always be an odd number. And you get the 1-bit for free!

C++ BOOL (typedef int) vs bool for performance

I read somewhere that using BOOL (typedef int) is better than using the standard C++ type bool because the size of BOOL is 4 bytes (i.e. a multiple of 4) and it saves alignment operations of variables into registers, or something along those lines...
Is there any truth to this? I imagine that the compiler would pad the stack frames in order to keep alignments at multiples of 4 even if you use bool (1 byte)?
I'm by no means an expert on the underlying workings of alignments, registers, etc so I apologize in advance if I've got this completely wrong. I hope to be corrected. :)
Cheers!
First of all, sizeof(bool) is not necessarily 1. It is implementation-defined, giving the compiler writer freedom to choose a size that's suitable for the target platform.
Also, sizeof(int) is not necessarily 4.
There are multiple issues that could affect performance:
alignment;
memory bandwidth;
CPU's ability to efficiently load values that are narrower than the machine word.
What -- if any -- difference that makes to a particular piece of code can only be established by profiling that piece of code.
The only guaranteed size you can get in C++ is with char, unsigned char, and signed char 2), which are always exactly one byte and defined for every platform. 0) 1)
0) Though a byte does not have a defined size: sizeof(char) is always 1 byte, but that byte might in fact be 40 binary bits.
1) Yes, there are uint32_t and friends, but no, their definition is optional for actual C++ implementations. Use them, but you may get compile-time errors if they are not available (compile-time errors are always good).
2) char, unsigned char, and signed char are distinct types, and it is not defined whether char is signed or not. Keep this in mind when overloading functions and writing templates.
There are three commonly accepted performance-driven practices with regard to booleans:
In if-statements, the order in which expressions are checked matters, and one needs to be careful about it.
If a check of a boolean expression causes a lot of branch mispredictions, it should (if possible) be replaced with a bit-twiddling hack (see the sketch below).
Since bool is the smallest data type, boolean variables should be declared last in structures and classes, so that padding does not add noticeable holes in the structure's memory layout.
I've never heard of any performance gain from substituting a boolean with an (unsigned?) integer, however.
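As a hedged illustration of point 2 above, one common branch-free substitution replaces a data-dependent if with arithmetic on the bool (the function names and data are made up for illustration):

#include <cstdint>

// Branchy version: one potentially mispredicted branch per element.
int64_t sum_if_branchy(const int32_t* v, const bool* keep, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i)
        if (keep[i]) sum += v[i];
    return sum;
}

// Branch-free version: bool converts to 0 or 1, so the multiply selects the value.
int64_t sum_if_branchless(const int32_t* v, const bool* keep, int n) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += static_cast<int64_t>(v[i]) * keep[i];
    return sum;
}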

Why is std::bitset<8> 4 bytes big?

It seems for std::bitset<1 to 32>, the size is set to 4 bytes. For sizes 33 to 64, it jumps straight up to 8 bytes. There can't be any overhead because std::bitset<32> is an even 4 bytes.
I can see aligning to byte length when dealing with bits, but why would a bitset need to align to word length, especially for a container most likely to be used in situations with a tight memory budget?
This is under VS2010.
The most likely explanation is that bitset is using a whole number of machine words to store the array.
This is probably done for memory bandwidth reasons: it is typically relatively cheap to read/write a word that's aligned at a word boundary. On the other hand, reading (and especially writing!) an arbitrarily-aligned byte can be expensive on some architectures.
Since we're talking about a fixed-sized penalty of a few bytes per bitset, this sounds like a reasonable tradeoff for a general-purpose library.
I assume that indexing into the bitset is done by grabbing a 32-bit word and then isolating the relevant bit, because this is fastest in terms of processor instructions (working with smaller-sized values is slower on x86). The two indices needed for this can also be calculated very quickly:
int wordIndex = (index & 0xffffffe0) >> 5;  // which 32-bit word
int bitIndex = index & 0x1f;                // which bit within that word
And then you can do this, which is also very fast:
int word = m_pStorage[wordIndex];
bool bit = ((word & (1 << bitIndex)) >> bitIndex) == 1;
Also, a maximum waste of 3 bytes per bitset is not exactly a memory concern IMHO. Consider that a bitset is already the most efficient data structure to store this type of information, so you would have to evaluate the waste as a percentage of the total structure size.
For 1025 bits this approach uses 132 bytes instead of 129, for 2.3% overhead (and this goes down as the bitset size goes up). That sounds reasonable considering the likely performance benefits.
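A quick way to see this on your own implementation (the output varies: MSVC historically rounds up to 32-bit words as described above, while libstdc++ on 64-bit Linux uses unsigned long words, so even bitset<1> is 8 bytes there):

#include <bitset>
#include <cstdio>

int main() {
    std::printf("bitset<1>  : %zu bytes\n", sizeof(std::bitset<1>));
    std::printf("bitset<8>  : %zu bytes\n", sizeof(std::bitset<8>));
    std::printf("bitset<32> : %zu bytes\n", sizeof(std::bitset<32>));
    std::printf("bitset<33> : %zu bytes\n", sizeof(std::bitset<33>));
    std::printf("bitset<64> : %zu bytes\n", sizeof(std::bitset<64>));
    return 0;
}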
The memory system on modern machines cannot fetch anything but whole words from memory, apart from some legacy operations that extract the desired bits. Hence, having the bitsets aligned to words makes them a lot faster to handle, because you do not need to mask out the bits you don't need when accessing them. If the implementation did not mask, doing something like
bitset<4> foo = 0;
if (foo.any()) {
    // ...
}
would most likely fail, because garbage in the unused bits would be counted. Apart from that, I remember reading some time ago that there was a way to cram several bitsets together, but I don't remember exactly. I think it was that when you have several bitsets together in a structure, they can take up "shared" memory, which is not applicable to most use cases of bitfields.
I observed the same behavior in the AIX and Linux implementations. In AIX, the internal bitset storage is char-based:
typedef unsigned char _Ty;
....
_Ty _A[_Nw + 1];
In Linux, the internal storage is long-based:
typedef unsigned long _WordT;
....
_WordT _M_w[_Nw];
For compatibility reasons, we modified the Linux version to use char-based storage.
Check which implementation you are using inside bitset.h.
Because a 32-bit Intel-compatible processor cannot access bytes individually (or rather, it can, by implicitly applying some bit masks and shifts), but only 32-bit words at a time.
If you declare
bitset<4> a, b, c;
then even if the library implements it as char, a, b and c will be 32-bit aligned, so the same wasted space exists. But the processor would be forced to pre-mask the bytes before letting the bitset code do its own masking.
For this reason MS used an int[1 + (N-1)/32] as the container for the bits.
Maybe because it's using int by default, and switches to long long if it overflows? (Just a guess...)
If your std::bitset< 8 > was a member of a structure, you might have this:
struct A
{
    std::bitset< 8 > mask;
    void * pointerToSomething;
};
If bitset<8> was stored in one byte (and the structure packed on 1-byte boundaries) then the pointer following it in the structure would be unaligned, which would be A Bad Thing. The only time when it would be safe and useful to have a bitset<8> stored in one byte would be if it was in a packed structure and followed by some other one-byte fields with which it could be packed together. I guess this is too narrow a use case for it to be worthwhile providing a library implementation.
Basically, in your octree, a single byte bitset would only be useful if it was followed in a packed structure by another one to three single-byte members. Otherwise, it would have to be padded to four bytes anyway (on a 32-bit machine) to ensure that the following variable was word-aligned.