Bit fields in C and C++: where are they used?

I have been working with C and C++ for some time.
While learning the basics you come across an interesting feature: bit fields.
The use of bit fields in practice seems somewhat controversial.
In which kinds of situations does this low-level feature provide a real benefit, and are there concrete examples of using bit fields properly?

When working with embedded systems and microcontrollers, individual bits in a register may be associated with a processor setting or input/output. Using bit fields allows these individual bits to be worked with by name instead of doing bitwise operations on the entire register.
It's mostly an aesthetic feature but can increase code readability in some applications.
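For illustration, here is a minimal sketch of that idea: overlaying a bit-field struct on a hypothetical 8-bit status register. The register layout is invented for the example, and since bit-field allocation order is implementation-defined, real register overlays are tied to a specific compiler/ABI.

#include <cstdint>
#include <cstdio>

// Hypothetical 8-bit UART status register; the field layout is made up
// for illustration and is not any real device's register map.
struct UartStatus {
    std::uint8_t rx_ready   : 1;  // a byte has been received
    std::uint8_t tx_empty   : 1;  // transmit buffer is empty
    std::uint8_t overrun    : 1;  // receive overrun occurred
    std::uint8_t parity_err : 1;  // parity error
    std::uint8_t            : 4;  // reserved bits
};

int main()
{
    // On real hardware this would typically be something like
    //   auto *status = reinterpret_cast<volatile UartStatus*>(0x40001000);
    // (address made up); here we just use a local copy.
    UartStatus status{};
    status.tx_empty = 1;
    if (status.tx_empty)               // instead of: if (reg & 0x02)
        std::printf("ready to transmit\n");
}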

There are several use cases for bit fields, even on modern machines.
The first would be when you are handling register-level logic. This is common when you are setting modes and configuring how certain pieces of hardware work. It is even more common on embedded devices. On Arduino devices, for example, the pinMode() logic basically sets individual bits high or low to indicate whether a digital I/O pin is in "input" or "output" mode.
http://arduino.cc/en/Reference/pinMode
Secondly, when writing optimized, in-line assembly code in a C/C++ program. There are times where you want to take advantage of hardware-optimized instructions to speed up your program's execution as much as possible:
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
A final common example is when writing packet drivers or implementing specific protocols. I recently posted a question on this, where it turned out I was using a 32-bit variable instead of an 8-bit variable composed of bit fields, which was causing my code to break:
Basic NTP Client in Windows in Visual C++
So, in short: when talking directly to hardware, or in networking code.
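As a sketch of the networking case, here is what a simplified protocol header might look like with bit fields. The widths follow the first 32 bits of an IPv4 header, but whether this matches the on-wire layout depends on the compiler's bit-field ordering and the host's endianness, which is exactly why portable packet code often falls back to explicit shifts and masks.

#include <cstdint>
#include <cstdio>

// Simplified sketch of the first 32 bits of an IPv4-style header.
// Bit-field order and padding are implementation-defined, so this is
// illustrative rather than a guaranteed wire-compatible layout.
struct IPv4FirstWord {
    std::uint32_t version       : 4;
    std::uint32_t header_length : 4;   // in 32-bit words
    std::uint32_t dscp_ecn      : 8;
    std::uint32_t total_length  : 16;  // in bytes
};

int main()
{
    IPv4FirstWord w{};
    w.version       = 4;
    w.header_length = 5;   // 5 * 4 = 20-byte header, no options
    w.total_length  = 60;
    std::printf("IPv%u, header %u bytes, total %u bytes\n",
                static_cast<unsigned>(w.version),
                static_cast<unsigned>(w.header_length) * 4,
                static_cast<unsigned>(w.total_length));
}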

There's probably not much use for bit fields on a modern, high-performance machine, but for smaller machines, they can be very useful to save memory, if you have large arrays of the structures. Other than saving memory, however, there's no use for them.

In addition to the other answers, in some scenarios, using bit fields can improve both memory usage and performance.
Save memory by packing together properties that need only a few bits to express their range of possible values. Why store 8 boolean properties as 8 separate bool members, when a single byte gives you the ability to store 8 boolean values, one per bit? Instead of 8 bytes you use only 1; saving 7 bytes per object is quite significant. Naturally, you would typically use a 32-bit, 64-bit or wider bit field.
I have a similar scenario with lots of objects that have many properties which can be expressed in one or a few bits, and for cases with a high object count (millions) the memory savings are indeed significant.
Increase performance - although bit fields come with a small performance penalty (accessing the actual value requires shifting and masking), those operations are really fast. Packing more data and wasting fewer bits can give you better cache efficiency, which can result in a performance gain greater than the bit-field access penalty.
Not only may it be faster to access individual bits than to fetch another line from the cache, and more likely that the data is found in the cache when it is packed, but packed data also pollutes the cache less, leaving more space available for other data.
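A minimal sketch of the memory-saving point, with invented property names; the exact sizes are implementation-dependent, but on common ABIs the packed version is one byte instead of eight:

#include <cstdint>
#include <cstdio>

// Eight boolean properties stored two ways.
struct Unpacked {
    bool visible, dirty, selected, locked,
         cached, shared, pinned, deleted;   // typically 8 bytes
};

struct Packed {
    std::uint8_t visible : 1, dirty  : 1, selected : 1, locked  : 1,
                 cached  : 1, shared : 1, pinned   : 1, deleted : 1;  // typically 1 byte
};

int main()
{
    std::printf("Unpacked: %zu bytes, Packed: %zu bytes\n",
                sizeof(Unpacked), sizeof(Packed));
    // With millions of such objects in an array, the packed form also
    // fits roughly 8x more objects per cache line.
}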

Related

Can modern x86 hardware not store a single byte to memory?

Speaking of the memory model of C++ for concurrency, Stroustrup's C++ Programming Language, 4th ed., sect. 41.2.1, says:
... (like most modern hardware) the machine could not load or store anything smaller than a word.
However, my x86 processor, a few years old, can and does store objects smaller than a word. For example:
#include <iostream>

int main()
{
    char a = 5;
    char b = 25;
    a = b;
    std::cout << int(a) << "\n";
    return 0;
}
Without optimization, GCC compiles this as:
[...]
movb $5, -1(%rbp) # a = 5, one byte
movb $25, -2(%rbp) # b = 25, one byte
movzbl -2(%rbp), %eax # load b, one byte, not extending the sign
movb %al, -1(%rbp) # a = b, one byte
[...]
The comments are by me but the assembly is by GCC. It runs fine, of course.
Obviously, I do not understand what Stroustrup is talking about when he explains that hardware can load and store nothing smaller than a word. As far as I can tell, my program does nothing but load and store objects smaller than a word.
The thoroughgoing focus of C++ on zero-cost, hardware-friendly abstractions sets C++ apart from other programming languages that are easier to master. Therefore, if Stroustrup has an interesting mental model of signals on a bus, or has something else of this kind, then I would like to understand Stroustrup's model.
What is Stroustrup talking about, please?
LONGER QUOTE WITH CONTEXT
Here is Stroustrup's quote in fuller context:
Consider what might happen if a linker allocated [variables of char type like] c and b in the same word in memory and (like most modern hardware) the machine could not load or store anything smaller than a word.... Without a well-defined and reasonable memory model, thread 1 might read the word containing b and c, change c, and write the word back into memory. At the same time, thread 2 could do the same with b. Then, whichever thread managed to read the word first and whichever thread managed to write its result back into memory last would determine the result....
ADDITIONAL REMARKS
I do not believe that Stroustrup is talking about cache lines. Even if he were, as far as I know, cache coherency protocols would transparently handle that problem except maybe during hardware I/O.
I have checked my processor's hardware datasheet. Electrically, my processor (an Intel Ivy Bridge) seems to address DDR3L memory by some sort of 16-bit multiplexing scheme, so I don't know what that's about. It is not clear to me that that has much to do with Stroustrup's point, though.
Stroustrup is a smart man and an eminent scientist, so I do not doubt that he is talking about something sensible. I am confused.
See also this question. My question resembles the linked question in several ways, and the answers to the linked question are also helpful here. However, my question goes also to the hardware/bus model that motivates C++ to be the way it is and that causes Stroustrup to write what he writes. I do not seek an answer merely regarding that which the C++ standard formally guarantees, but also wish to understand why the C++ standard would guarantee it. What is the underlying thought? This is part of my question, too.
TL:DR: On every modern ISA that has byte-store instructions (including x86), they're atomic and don't disturb surrounding bytes. (I'm not aware of any older ISAs where byte-store instructions could "invent writes" to neighbouring bytes either.)
The actual implementation mechanism (in non-x86 CPUs) is sometimes an internal RMW cycle to modify a whole word in a cache line, but that's done "invisibly" inside a core while it has exclusive ownership of the cache line so it's only ever a performance problem, not correctness. (And merging in the store buffer can sometimes turn byte-store instructions into an efficient full-word commit to L1d cache.)
About Stroustrup's phrasing
I don't think it's a very accurate, clear or useful statement. It would be more accurate to say that modern CPUs can't load or store anything smaller than a cache line. (Although that's not true for uncacheable memory regions, e.g. for MMIO.)
It probably would have been better just to make a hypothetical example to talk about memory models, rather than implying that real hardware is like this. But if we try, we can maybe find an interpretation that isn't as obviously or totally wrong, which might have been what Stroustrup was thinking when he wrote this to introduce the topic of memory models. (Sorry this answer is so long; I ended up writing a lot while guessing what he might have meant and about related topics...)
Or maybe this is another case of high-level language designers not being hardware experts, or at least occasionally making mis-statements.
I think Stroustrup is talking about how CPUs work internally to implement byte-store instructions. He's suggesting that a CPU without a well-defined and reasonable memory model might implement a byte-store with a non-atomic RMW of the containing word in a cache line, or in memory for a CPU without cache.
Even this weaker claim about internal (not externally visible) behaviour is not true for high-performance x86 CPUs. Modern Intel CPUs have no throughput penalty for byte stores, or even unaligned word or vector stores that don't cross a cache-line boundary. AMD is similar.
If byte or unaligned stores had to do a RMW cycle as the store committed to L1D cache, it would interfere with store and/or load instruction/uop throughput in a way we could measure with performance counters. (In a carefully designed experiment that avoids the possibility of store coalescing in the store buffer before commit to L1d cache hiding the cost, because the store execution unit(s) can only run 1 store per clock on current CPUs.)
However, some high performance designs for non-x86 ISAs do use an atomic RMW cycle to internally commit stores to L1d cache (see Are there any modern CPUs where a cached byte store is actually slower than a word store?). The cache line stays in MESI Exclusive/Modified state the whole time, so it can't introduce any correctness problems, only a small performance hit. This is very different from doing something that could step on stores from other CPUs. (The arguments below about that not happening still apply, but my update may have missed some stuff that still argues that atomic cache-RMW is unlikely.)
(On many non-x86 ISAs, unaligned stores are not supported at all, or are used more rarely than in x86 software. And weakly-ordered ISAs allow more coalescing in store buffers, so not as many byte store instructions actually result in single-byte commit to L1d. Without these motivations for fancy (power hungry) cache-access hardware, word RMW for scattered byte stores is an acceptable tradeoff in some designs.)
Alpha AXP, a high-performance RISC design from 1992, famously (and uniquely among modern non-DSP ISAs) omitted byte load/store instructions until Alpha 21164A (EV56) in 1996. Apparently they didn't consider word-RMW a viable option for implementing byte stores, because one of the cited advantages for implementing only 32-bit and 64-bit aligned stores was more efficient ECC for the L1D cache. "Traditional SECDED ECC would require 7 extra bits over 32-bit granules (22% overhead) versus 4 extra bits over 8-bit granules (50% overhead)." (@Paul A. Clayton's answer about word vs. byte addressing has some other interesting computer-architecture stuff.) If byte stores were implemented with word-RMW, you could still do error detection/correction with word-granularity.
Current Intel CPUs only use parity (not ECC) in L1D for this reason. (At least some older Xeons could run with L1d in ECC mode at half capacity instead of the normal 32KiB, as discussed on RWT. It's not clear if anything's changed, e.g. in terms of Intel now using ECC for L1d). See also this Q&A about hardware (not) eliminating "silent stores": checking the old contents of cache before the write to avoid marking the line dirty if it matched would require a RMW instead of just a store, and that's a major obstacle.
It turns out some high-perf pipelined designs do use atomic word-RMW to commit to L1d, despite it stalling the memory pipeline, but (as I argue below) it's much less likely that any do an externally-visible RMW to RAM.
Word-RMW isn't a useful option for MMIO byte stores either, so unless you have an architecture that doesn't need sub-word stores for IO, you'd need some kind of special handling for IO (like Alpha's sparse I/O space where word load/stores were mapped to byte load/stores so it could use commodity PCI cards instead of needing special hardware with no byte IO registers).
As @Margaret points out, DDR3 memory controllers can do byte stores by setting control signals that mask out the other bytes of a burst. The same mechanisms that get this information to the memory controller (for uncached stores) could also get that information passed along with a load or store to MMIO space. So there are hardware mechanisms for really doing a byte store even on burst-oriented memory systems, and it's highly likely that modern CPUs will use that instead of implementing an RMW, because it's probably simpler and is much better for MMIO correctness.
How many and what size cycles will be needed to perform longword transferred to the CPU shows how a ColdFire microcontroller signals the transfer size (byte/word/longword/16-byte line) with external signal lines, letting it do byte loads/stores even if 32-bit-wide memory was hooked up to its 32-bit data bus. Something like this is presumably typical for most memory bus setups (but I don't know). The ColdFire example is complicated by also being configurable to use 16 or 8-bit memory, taking extra cycles for wider transfers. But nevermind that, the important point is that it has external signaling for the transfer size, to tell the memory HW which byte it's actually writing.
Stroustrup's next paragraph is
"The C++ memory model guarantees that two threads of execution can update and access separate memory locations without interfering with each other. This is exactly what we would naively expect. It is the compiler’s job to protect us from the sometimes very strange and subtle behaviors of modern hardware. How a compiler and hardware combination achieves that is up to the compiler. ..."
So apparently he thinks that real modern hardware may not provide "safe" byte load/store. The people who design hardware memory models agree with the C/C++ people, and realize that byte store instructions would not be very useful to programmers / compilers if they could step on neighbouring bytes.
All modern (non-DSP) architectures except early Alpha AXP have byte store and load instructions, and AFAIK these are all architecturally defined to not affect neighbouring bytes. However they accomplish that in hardware, software doesn't need to care about correctness. Even the very first version of MIPS (in 1983) had byte and half-word loads/stores, and it's a very word-oriented ISA.
However, he doesn't actually claim that most modern hardware needs any special compiler support to implement this part of the C++ memory model, just that some might. Maybe he really is only talking about word-addressable DSPs in that 2nd paragraph (where C and C++ implementations often use 16 or 32-bit char as exactly the kind of compiler workaround Stroustrup was talking about.)
Most "modern" CPUs (including all x86) have an L1D cache. They will fetch whole cache lines (typically 64 bytes) and track dirty / not-dirty on a per-cache-line basis. So two adjacent bytes are pretty much exactly the same as two adjacent words, if they're both in the same cache line. Writing one byte or word will result in a fetch of the whole line, and eventually a write-back of the whole line. See Ulrich Drepper's What Every Programmer Should Know About Memory. You're correct that MESI (or a derivative like MESIF/MOESI) makes sure this isn't a problem. (But again, this is because hardware implements a sane memory model.)
A store can only commit to L1D cache while the line is in the Modified state (of MESI). So even if the internal hardware implementation is slow for bytes and takes extra time to merge the byte into the containing word in the cache line, it's effectively an atomic read modify write as long as it doesn't allow the line to be invalidated and re-acquired between the read and the write. (While this cache has the line in Modified state, no other cache can have a valid copy). See @old_timer's comment making the same point (but also for RMW in a memory controller).
This is easier than e.g. an atomic xchg or add from a register that also needs an ALU and register access, since all the HW involved is in the same pipeline stage, which can simply stall for an extra cycle or two. That's obviously bad for performance and takes extra hardware to allow that pipeline stage to signal that it's stalling. This doesn't necessarily conflict with Stroustrup's first claim, because he was talking about a hypothetical ISA without a memory model, but it's still a stretch.
On a single-core microcontroller, internal word-RMW for cached byte stores would be more plausible, since there won't be Invalidate requests coming in from other cores that they'd have to delay responding to during an atomic RMW cache-word update. But that doesn't help for I/O to uncacheable regions. I say microcontroller because other single-core CPU designs typically support some kind of multi-socket SMP.
Many RISC ISAs don't support unaligned-word loads/stores with a single instruction, but that's a separate issue (the difficulty is handling the case when a load spans two cache lines or even pages, which can't happen with bytes or aligned half-words). More and more ISAs are adding guaranteed support for unaligned load/store in recent versions, though. (e.g. MIPS32/64 Release 6 in 2014, and I think AArch64 and recent 32-bit ARM).
The 4th edition of Stroustrup's book was published in 2013, when Alpha had been dead for years. The first edition was published in 1985, when RISC was the new big idea (e.g. Stanford MIPS in 1983, according to Wikipedia's timeline of computing HW), but "modern" CPUs at that time were byte-addressable with byte stores. The Cyber CDC 6600 was word-addressable and probably still around, but couldn't be called modern.
Even very word-oriented RISC machines like MIPS and SPARC have byte store and byte load (with sign or zero extension) instructions. They don't support unaligned word loads, simplifying the cache (or memory access if there is no cache) and load ports, but you can load any single byte with one instruction, and more importantly store a byte without any architecturally-visible non-atomic rewrite of the surrounding bytes. (Although cached stores can ...)
I suppose C++11 (which introduces a thread-aware memory model to the language) on Alpha would need to use 32-bit char if targeting a version of the Alpha ISA without byte stores. Or it would have to use software atomic-RMW with LL/SC when it couldn't prove that no other threads could have a pointer that would let them write neighbouring bytes.
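As a rough illustration of that software read-modify-write idea, here is what it might look like in portable C++11, using a compare-exchange retry loop on the containing 32-bit word (the portable cousin of an LL/SC loop). It assumes the containing word is reachable as a std::atomic<uint32_t> and a little-endian byte layout; it is a sketch, not what any real Alpha compiler emitted.

#include <atomic>
#include <cstdint>

// Emulate an atomic byte store on hardware that can only store whole
// words, by retrying a CAS on the 32-bit word that contains the byte.
void store_byte(std::atomic<std::uint32_t>& word,
                unsigned byte_index,        // 0..3 within the word
                std::uint8_t value)
{
    const unsigned shift = byte_index * 8;  // assumes little-endian layout
    const std::uint32_t mask = std::uint32_t{0xff} << shift;

    std::uint32_t old = word.load(std::memory_order_relaxed);
    std::uint32_t desired;
    do {
        // Rebuild the word with only our byte replaced.
        desired = (old & ~mask) | (std::uint32_t{value} << shift);
    } while (!word.compare_exchange_weak(old, desired,
                                         std::memory_order_relaxed));
    // On failure compare_exchange_weak reloads `old`, so the retry uses
    // the latest value and never clobbers the neighbouring bytes.
}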
IDK how slow byte load/store instructions are in any CPUs where they're implemented in hardware but not as cheap as word loads/stores. Byte loads are cheap on x86 as long as you use movzx/movsx to avoid partial-register false dependencies or merging stalls. (On AMD pre-Ryzen, movsx/movzx needs an extra ALU uop, but otherwise zero/sign extension is handled right in the load port on Intel and AMD CPUs.) The main x86 downside is that you need a separate load instruction instead of using a memory operand as a source for an ALU instruction (if you're adding a zero-extended byte to a 32-bit integer), costing front-end uop throughput and code-size. Or if you're just adding a byte to a byte register, there's basically no downside on x86. RISC load-store ISAs always need separate load and store instructions anyway. x86 byte stores are no more expensive than 32-bit stores.
As a performance issue, a good C++ implementation for hardware with slow byte stores might put each char in its own word and use word loads/stores whenever possible (e.g. for globals outside structs, and for locals on the stack). IDK if any real implementations of MIPS / ARM / whatever have slow byte load/store, but if so maybe gcc has -mtune= options to control it.
That doesn't help for char[], or dereferencing a char * when you don't know where it might be pointing. (This includes volatile char* which you'd use for MMIO.) So having the compiler+linker put char variables in separate words isn't a complete solution, just a performance hack if true byte stores are slow.
PS: More about Alpha:
Alpha is interesting for a lot of reasons: one of the few clean-slate 64-bit ISAs, not an extension to an existing 32-bit ISA. And one of the more recent clean-slate ISAs, Itanium being another from several years later which attempted some neat CPU-architecture ideas.
From the Linux Alpha HOWTO.
When the Alpha architecture was introduced, it was unique amongst RISC architectures for eschewing 8-bit and 16-bit loads and stores. It supported 32-bit and 64-bit loads and stores (longword and quadword, in Digital's nomenclature). The co-architects (Dick Sites, Rich Witek) justified this decision by citing the advantages:
Byte support in the cache and memory sub-system tends to slow down accesses for 32-bit and 64-bit quantities.
Byte support makes it hard to build high-speed error-correction circuitry into the cache/memory sub-system.
Alpha compensates by providing powerful instructions for manipulating bytes and byte groups within 64-bit registers. Standard benchmarks for string operations (e.g., some of the Byte benchmarks) show that Alpha performs very well on byte manipulation.
Not only are x86 CPUs capable of reading and writing a single byte, all modern general purpose CPUs are capable of it. More importantly most modern CPUs (including x86, ARM, MIPS, PowerPC, and SPARC) are capable of atomically reading and writing single bytes.
I'm not sure what Stroustrup was referring to. There used to be a few word addressable machines that weren't capable of 8-bit byte addressing, like the Cray, and as Peter Cordes mentioned early Alpha CPUs didn't support byte loads and stores, but today the only CPUs incapable of byte loads and stores are certain DSPs used in niche applications. Even if we assume he means most modern CPUs don't have atomic byte load and stores this isn't true of most CPUs.
However, simple atomic loads and stores aren't of much use in multithreaded programming. You also typically need ordering guarantees and a way to make read-modify-write operations atomic. Another consideration is that while the CPU may have byte load and store instructions, the compiler isn't required to use them. A compiler, for example, could still generate the code Stroustrup describes, loading both b and c using a single word load instruction as an optimization.
So while you do need a well defined memory model, if only so the compiler is forced to generate the code you expect, the problem isn't that modern CPUs aren't capable of loading or storing anything smaller than a word.
The author seems to be concerned about thread 1 and thread 2 getting into a situation where the read-modify-writes (not in software; the software issues two separate byte-sized instructions, but somewhere down the line the logic has to do a read-modify-write), instead of the ideal read modify write, read modify write, become a read read modify modify write write or some other interleaving where both read the pre-modified version and the last one to write wins: read read modify modify write write, or read modify read modify write write, or read modify read write modify write.
The concern is: start with 0x1122, where one thread wants to make it 0x33XX and the other wants to make it 0xXX44. With, for example, a read read modify modify write write you end up with 0x1144 or 0x3322, but not 0x3344.
A sane (system/logic) design just doesn't have that problem, certainly not for a general-purpose processor like this. I have worked on designs with timing issues like this, but that is not what we are talking about here; those were completely different system designs for different purposes. The read-modify-write does not span a long enough distance in a sane design, and x86s are sane designs.
The read-modify-write would happen very near the first SRAM involved (ideally L1, when running an x86 in the typical fashion with an operating system capable of running compiled multi-threaded C++ programs) and would complete within a few clock cycles, since that RAM is ideally running at the speed of the bus. And as Peter pointed out, it is the whole cache line that experiences this, within the cache, not a read-modify-write between the processor core and the cache.
The notion of "at the same time", even with multi-core systems, isn't necessarily at the same time; eventually things get serialized, because performance isn't based on operations being parallel from beginning to end, it is based on keeping the busses loaded.
The quote is talking about variables allocated to the same word in memory, so that is the same program. Two separate programs are not going to share an address space like that. So:
You are welcome to try this: make a multithreaded program where one thread writes to, say, address 0xnnn00000 and the other writes to address 0xnnnn00001. Each does a write, then a read (or better, several writes of the same value then one read), checks that the read returned the byte it wrote, then repeats with a different value. Let that run for a while, hours/days/weeks/months. See if you trip up the system. Use assembly for the actual write instructions to make sure it is doing what you asked (not C++, or any compiler that does or claims it will not put these items in the same word). You can add delays to allow for more cache evictions, but that reduces your odds of "at the same time" collisions.
For your example, as long as you ensure you are not sitting on two sides of a boundary (cache or otherwise) like 0xNNNNFFFFF and 0xNNNN00000, isolate the two byte writes to addresses like 0xNNNN00000 and 0xNNNN00001, put the instructions back to back, and see if you get a read read modify modify write write. Wrap a test around it so that the two values are different each loop; read back the word as a whole at whatever delay later you desire and check the two values. Repeat for days/weeks/months/years to see if it fails. Read up on your processor's execution and microcode features to see what it does with this instruction sequence and, as needed, create a different instruction sequence that tries to get the transactions initiated within a handful or so of clock cycles on the far side of the processor core. A rough sketch of such a test follows.
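A portable-C++ sketch of that experiment is below. It relies on the compiler actually emitting byte stores for the volatile accesses (worth verifying in the disassembly, or replacing with the assembly suggested above), and of course a clean run proves nothing; it only ever fails if the hardware really does invent writes to the neighbouring byte.

#include <atomic>
#include <cstdio>
#include <thread>

// Two threads each hammer their own byte of the same word and verify
// only their own byte. The iteration count is arbitrary.
alignas(4) volatile unsigned char buf[4];  // adjacent bytes in one word
std::atomic<bool> failed{false};

void hammer(int idx)
{
    for (unsigned long i = 0; i < 100000000UL && !failed; ++i) {
        unsigned char v = static_cast<unsigned char>(i);
        buf[idx] = v;              // byte store
        if (buf[idx] != v) {       // byte load; check only our byte
            failed = true;
            std::printf("byte %d corrupted at iteration %lu\n", idx, i);
        }
    }
}

int main()
{
    std::thread t0(hammer, 0), t1(hammer, 1);
    t0.join();
    t1.join();
    std::printf(failed ? "FAILED\n" : "no corruption observed\n");
}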
EDIT
The problem with the quotes is that this is all about language and its use. "Like most modern hardware" puts the whole topic/text in a touchy position; it is too vague. One side can argue that all it has to do is find one case that is true to make all the rest true; likewise, the other side could argue that finding one counter-example makes the rest untrue. Using the word "like" kind of messes with that, as a possible get-out-of-jail-free card.
The reality is that a significant percentage of our data is stored in DRAM built from 8-bit-wide memories; we just don't normally access them 8 bits at a time, we access 8 of them at once, 64 bits wide. In some number of weeks/months/years/decades this statement will be incorrect.
The larger quote says "at the same time" and then says read ... first, write ... last. "First" and "last" and "at the same time" don't make sense together; is it parallel or serial? The context as a whole is concerned with the read read modify modify write write variations above, where one thread writes last and whichever thread read first determines whether both modifications survive or not. It is not really about "at the same time": things that start off actually parallel, in separate cores/modules, eventually get serialized if they are aiming at the same flip-flop/transistor in a memory; one eventually has to wait for the other to go first. Being physics based, I don't see this being incorrect in the coming weeks/months/years.
This is correct. An x86_64 CPU, just like an original x86 CPU, is not able to read or write anything smaller than an (in this case 64-bit) word from, respectively to, memory. And it will not typically read or write less than a whole cache line, though there are ways to bypass the cache, especially in writing (see below).
In this context, though, Stroustrup refers to potential data races (lack of atomicity on an observable level). This correctness issue is irrelevant on x86_64, because of the cache coherency protocol, which you mentioned. In other words, yes, the CPU is limited to whole word transfers, but this is transparently handled, and you as a programmer generally do not have to worry about it. In fact, the C++ language, starting from C++11, guarantees that concurrent operations on distinct memory locations have well-defined behavior, i.e. the one you'd expect. Even if the hardware did not guarantee this, the implementation would have to find a way by generating possibly more complex code.
That said, it can still be a good idea to keep the fact that whole words or even cache lines are always involved at the machine level in the back of your head, for two reasons.
First, and this is only relevant for people who write device drivers, or design devices, memory-mapped I/O may be sensitive to the way it is accessed. As an example, think of a device that exposes a 64-bit write-only command register in the physical address space. It may then be necessary to:
Disable caching. It is not valid to read a cache line, change a single word, and write back the cache line. Also, even if it were valid, there would still be a great risk that commands might be lost because the CPU cache is not written back soon enough. At the very least, the page needs to be configured as "write-through", which means writes take immediate effect. Therefore, an x86_64 page table entry contains flags that control the CPU's caching behavior for this page.
Ensure that the whole word is always written, on the assembly level. E.g. consider a case where you write the value 1 into the register, followed by a 2. A compiler, especially when optimizing for space, might decide to overwrite only the least significant byte because the others are already supposed to be zero (that is, for ordinary RAM), or it might instead remove the first write because this value appears to be immediately overwritten anyway. However, neither is supposed to happen here. In C/C++, the volatile keyword is vital to prevent such unsuitable optimizations.
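A minimal sketch of this second point, with a made-up physical address and an invented 64-bit command register; the volatile qualifier is what keeps the compiler from narrowing, merging or dropping the stores:

#include <cstdint>

// Hypothetical memory-mapped 64-bit command register. The address is
// invented; on a real system it would come from the device, and the
// page would be mapped uncached or write-through as described above.
volatile std::uint64_t* const command_reg =
    reinterpret_cast<volatile std::uint64_t*>(0xFEDC0000ULL);

void send_commands()
{
    // Both stores are emitted as full-width writes, in program order.
    *command_reg = 1;
    *command_reg = 2;
}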
Second, and this is relevant for almost any developer writing multi-threaded programs, the cache coherency protocol, while neatly averting disaster, can have a huge performance cost if it is "abused".
Here's a – somewhat contrived – example of a very bad data structure. Assume you have 16 threads parsing some text from a file. Each thread has an id from 0 to 15.
#include <cstdio>

// shared state
char c[16];          // one "current character" slot per thread
std::FILE *file[16]; // one already-opened input stream per thread

void threadFunc(int id)
{
    int ch;
    // getc() returns an int so that EOF can be distinguished from data.
    while ((ch = std::getc(file[id])) != EOF)
    {
        c[id] = static_cast<char>(ch); // each thread writes only its own slot
        // ...
    }
}
This is safe because each thread operates on a different memory location. However, these memory locations would typically reside on the same cache line, or at most are split over two cache lines. The cache coherency protocol is then used to properly synchronize the accesses to c[id]. And herein lies the problem, because this forces every other thread to wait until the cache line becomes exclusively available before doing anything with c[id], unless it is already running on the core that "owns" the cache line. Assuming several, e.g. 16, cores, cache coherency will typically transfer the cache line from one core to another all the time. For obvious reasons, this effect is known as "cache line ping-pong". It creates a horrible performance bottleneck. It is the result of a very bad case of false sharing, i.e. threads sharing a physical cache line without actually accessing the same logical memory locations.
In contrast to this, especially if one took the extra step of ensuring that the file array resides on its own cache line, using it would be completely harmless (on x86_64) from a performance perspective, because the pointers are only read from most of the time. In this case, multiple cores can "share" the cache line as read-only. Only when a core tries to write to the cache line does it have to tell the other cores that it is going to "seize" the cache line for exclusive access.
(This is greatly simplified, as there are different levels of CPU caches, and several cores might share the same L2 or L3 cache, but it should give you a basic idea of the problem.)
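One common fix for the false sharing above is to give each thread's slot its own cache line, for example by over-aligning the element type. A minimal sketch, assuming a 64-byte line (C++17's std::hardware_destructive_interference_size can replace the hard-coded 64 where it is implemented):

#include <cstdio>

// Each element is padded/aligned to a full cache line, so writes by
// different threads no longer touch the same line. 64 is an assumption
// about the line size.
struct alignas(64) PaddedChar {
    char c;
};

PaddedChar c[16];  // one line per thread instead of 16 chars in one line

int main()
{
    std::printf("sizeof(PaddedChar) = %zu\n", sizeof(PaddedChar));  // typically 64
}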
Not sure what Stroustrup meant by "WORD".
Maybe it is the minimum size of memory storage of the machine?
Anyway, not all machines were created with 8-bit (byte) resolution.
In fact I recommend this awesome article by Eric S. Raymond describing some of the history of computers:
http://www.catb.org/esr/faqs/things-every-hacker-once-knew/
"... It used also to be generally known that 36-bit architectures
explained some unfortunate features of the C language. The original
Unix machine, the PDP-7, featured 18-bit words corresponding to
half-words on larger 36-bit computers. These were more naturally
represented as six octal (3-bit) digits."
Stroustrup is not saying that no machine can perform loads and stores smaller than its native word size; he is saying that a given machine might not be able to.
While this seems surprising at first, it's nothing esoteric.
For starters, we will ignore the cache hierarchy; we will take that into account later.
Assume there are no caches between the CPU and the memory.
The big problem with memory is density: trying to put as many bits as possible into the smallest area.
In order to achieve that, it is convenient, from an electrical design point of view, to expose a bus that is as wide as possible (this favours the reuse of some electrical signals, though I haven't looked at the specific details).
So, in architectures where big memories are needed (like x86) or where a simple low-cost design is favourable (for example, RISC machines), the memory bus is wider than the smallest addressable unit (typically the byte).
Depending on the budget and legacy of the project, the memory can expose the wider bus alone, or along with some sideband signals to select a particular unit within it.
What does this mean practically?
If you take a look at the datasheet of a DDR3 DIMM, you'll see that there are 64 pins, DQ0–DQ63, used to read/write the data.
This is the data bus, 64-bit wide, 8 bytes at a time.
This 8-byte granularity is so well established in the x86 architecture that Intel refers to it in the WC section of its optimisation manual, where it says that data are transferred from the 64-byte fill buffer (remember: we are ignoring the caches for now, but this is similar to how a cache line gets written back) in bursts of 8 bytes (hopefully, continuously).
Does this mean that the x86 can only write QWORDS (64-bit)?
No, the same datasheet shows that each DIMM has the DM0–DM7, DQ0–DQ7 and DQS0–DQS7 signals to mask, direct and strobe each of the 8 bytes in the 64-bit data bus.
So x86 can read and write bytes natively and atomically.
However, it's now easy to see that this might not be the case for every architecture.
For instance, the VGA video memory was DWORD (32-bit) addressable, and making it fit in the byte-addressable world of the 8086 led to the messy bit planes.
In general, special-purpose architectures, like DSPs, may not have byte-addressable memory at the hardware level.
There is a twist: we have just talked about the memory data bus, which is the lowest layer possible.
Some CPUs can have instructions that build a byte addressable memory on top of a word addressable memory.
What does that mean?
It's easy to load a smaller part of a word: just discard the rest of the bytes!
Unfortunately, I can't recall the name of the architecture (if it even existed at all!) where the processor simulated a load of an unaligned byte by reading the aligned word containing it and rotating the result before saving it in a register.
With stores, the matter is more complex: if we can't simply write the part of the word that we just updated we need to write the unchanged remaining part too.
The CPU, or the programmer, must read the old content, update it and write it back.
This is a Read-Modify-Write operation and it is a core concept when discussing atomicity.
Consider:
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
foo[0] = 1; foo[1] = 2;
Is there a data race?
This is safe on x86 because they can write bytes, but what if the architecture cannot?
Both threads would have to read the whole foo array, modify it and write it back.
In pseudo-C this would be
/* Assume unsigned char is 1 byte and a word is 4 bytes */
unsigned char foo[4] = {};
/* Thread 0 Thread 1 */
/* What a CPU would do (IS) What a CPU would do (IS) */
int tmp0 = *((int*)foo) int tmp1 = *((int*)foo)
/* Assume little endian Assume little endian */
tmp0 = (tmp0 & ~0xff) | 1; tmp1 = (tmp1 & ~0xff00) | 0x200;
/* Store it back Store it back */
*((int*)foo) = tmp0; *((int*)foo) = tmp1;
We can now see what Stroustrup was talking about: the two stores *((int*)foo) = tmpX interfere with each other; to see this, consider this possible execution sequence:
int tmp0 = *((int*)foo) /* T0 */
tmp0 = (tmp0 & ~0xff) | 1; /* T0 */
int tmp1 = *((int*)foo) /* T1 */
tmp1 = (tmp1 & ~0xff00) | 0x200; /* T1 */
*((int*)foo) = tmp1; /* T1 */
*((int*)foo) = tmp0; /* T0, Whooopsy: thread 1's update is lost */
If C++ didn't have a memory model, these kinds of nuisances would be implementation-specific details, leaving C++ a useless programming language in a multithreaded environment.
Considering how common the situation depicted in the toy example is, Stroustrup stressed the importance of a well-defined memory model.
Formalizing a memory model is hard work; it's an exhausting, error-prone and abstract process, so I also see a bit of pride in Stroustrup's words.
I have not brushed up on the C++ memory model, but updating different array elements is fine.
That's a very strong guarantee.
We have left out the caches but that doesn't really change anything, at least for the x86 case.
The x86 writes to memory through the caches, the caches are evicted in lines of 64 bytes.
Internally each core can update a line at any position atomically unless a load/store crosses a line boundary (e.g. by writing near the end of it).
This can be avoided by naturally aligning data (can you prove that?).
In a multi-core/socket environment, the cache coherency protocol ensures that only one CPU at a time is allowed to freely write to a cached line of memory (the CPU that has it in the Exclusive or Modified state).
Basically, the MESI family of protocols uses a concept similar to the locking found in DBMSs.
This has the effect, for writing purposes, of "assigning" different memory regions to different CPUs.
So it doesn't really affect the discussion above.

Performance benefit of replacing multiple bools with one int and using bit masking?

I have a C++ application where I use multiple bools throughout to check conditions in if statements. Using cachegrind, my branch misprediction rate is about 4%, so not too bad. However, I do need to try and increase the performance.
Would it be worthwhile to replace 12 bools with a single int? I am on 64-bit Red Hat and I believe bools are represented using 4-byte ints. Therefore I am using 48 bytes, rather than 12 bits.
If I were to use bit masking, I think I would still need to store bit patterns for accessing specific bits in the overall int. Would the need to store these bit patterns offset the bytes saved by reducing the number of bools, and therefore make this idea pointless?
Although the only way to find out for sure is to try it out, there are several considerations that may influence your decision.
First, the amount of storage would go down: you would not have to "store bit patterns for accessing specific bits in the overall int", because these patterns would become constants inside your program "baked into" the binary code.
Second, you should look at the use pattern of your flags. If you often check combinations of several flags, you may be able to replace some of these checks with a single masking operation.
Third, you should consider the aspect of writing the data back: with separate bool values each write goes to its own location, while a solution with flags would be writing to the same byte or two each time that you need to modify your flags. On the other hand, modifying several flags at once can be done in a single write.
Finally, you should consider the question of readability: your program is bound to become more complex after this change. The gains in performance may be too small in comparison to the loss of readability, because the code will run faster when the hardware becomes faster in a few years, but less readable code will remain less readable forever.
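For concreteness, a minimal sketch of what the flag version might look like; the names are invented, and note that the masks are compile-time constants, so nothing extra needs to be stored at run time to "remember the bit patterns":

#include <cstdint>
#include <cstdio>

// Twelve separate bools collapsed into one 16-bit flags word.
enum Flag : std::uint16_t {
    kConnected = 1u << 0,
    kDirty     = 1u << 1,
    kVerbose   = 1u << 2,
    // ... up to 1u << 11 for twelve flags
};

int main()
{
    std::uint16_t flags = 0;
    flags |= kConnected | kDirty;     // set several flags with one write
    if ((flags & (kConnected | kDirty)) == (kConnected | kDirty))
        std::printf("connected and dirty\n");   // combined check, one mask
    flags &= static_cast<std::uint16_t>(~kVerbose);  // clear one flag
}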
The only way to know for sure is to profile. Using ints may even be slower on some architectures, since accessing single bits may involve some bit shifting and masking.

Making a program portable between machines that have different number of bits in a "machine byte"

We are all fans of portable C/C++ programs.
We know that sizeof(char) or sizeof(unsigned char) is always 1 "byte". But that 1 "byte" doesn't mean a byte with 8 bits. It just means a "machine byte", and the number of bits in it can differ from machine to machine. See this question.
Suppose you write out the ASCII letter 'A' into a file foo.txt. On any normal machine these days, which has an 8-bit machine byte, these bits would get written out:
01000001
But if you were to run the same code on a machine with a 9-bit machine byte, I suppose these bits would get written out:
001000001
More to the point, the latter machine could write out these 9 bits as one machine byte:
100000000
But if we were to read this data on the former machine, we wouldn't be able to do it properly, since there isn't enough room. We would have to first read one machine byte (8 bits), and then somehow transform the final 1 bit into 8 bits (a machine byte).
How can programmers properly reconcile these things?
The reason I ask is that I have a program that writes and reads files, and I want to make sure that it doesn't break 5, 10, 50 years from now.
How can programmers properly reconcile these things?
By doing nothing. You've presented a filesystem problem.
Imagine that dreadful day when the first of many 9-bit machines is booted up, ready to recompile your code and process that ASCII letter A that you wrote to a file last year.
To ensure that a C/C++ compiler can reasonably exist for this machine, this new computer's OS follows the same standards that C and C++ assume, where files have a size measured in bytes.
...There's already a little problem with your 8-bit source code. There's only about a 1-in-9 chance each source file is a size that can even exist on this system.
Or maybe not. As is often the case for me, Johannes Schaub - litb has pre-emptively cited the standard regarding valid formats for C++ source code.
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)
"In an implementation-defined manner." That's good news...as long as some method exists to convert your source code to any 1:1 format that can be represented on this machine, you can compile it and run your program.
So here's where your real problem lies. If the creators of this computer were kind enough to provide a utility to bit-extend 8-bit ASCII files so they may be actually stored on this new machine, there's already no problem with the ASCII letter A you wrote long ago. And if there is no such utility, then your program already needs maintenance and there's nothing you could have done to prevent it.
Edit: The shorter answer (addressing comments that have since been deleted)
The question asks how to deal with a specific 9-bit computer...
With hardware that has no backwards-compatible 8-bit instructions
With an operating system that doesn't use "8-bit files".
With a C/C++ compiler that breaks how C/C++ programs have historically written text files.
Damian Conway has an often-repeated quote comparing C++ to C:
"C++ tries to guard against Murphy, not Machiavelli."
He was describing other software engineers, not hardware engineers, but the intention is still sound because the reasoning is the same.
Both C and C++ are standardized in a way that requires you to presume that other engineers want to play nice. Your Machiavellian computer is not a threat to your program because it's a threat to C/C++ entirely.
Returning to your question:
How can programmers properly reconcile these things?
You really have two options.
Accept that the computer you describe would not be appropriate in the world of C/C++
Accept that C/C++ would not be appropriate for a program that might run on the computer you describe
The only way to be sure is to store data in text files, with numbers as strings of digit characters, not as some number of bits. XML using UTF-8 and base 10 should be a pretty good overall choice for portability and readability, as it is well defined. If you want to be paranoid, keep the XML simple enough that, in a pinch, it can be parsed with a simple custom parser, in case a real XML parser is not readily available for your hypothetical computer.
When parsing a number that is bigger than what fits in your numeric data type, well, that's an error situation you need to handle as you see fit in the context. Or use a "big int" library, which can then handle arbitrarily large numbers (with an order-of-magnitude performance hit compared to "native" numeric data types, of course).
If you need to store bit fields, then store bit fields: that is, the number of bits and then the bit values, in whatever format.
If you have a specific numeric range, then store the range, so you can explicitly check whether the values fit in the available numeric data types.
The byte is a pretty fundamental data unit, so you cannot really transfer binary data between storage with different numbers of bits per byte; you have to convert, and to convert you need to know how the data is formatted, otherwise you simply cannot convert multi-byte values correctly.
Adding actual answer:
In your C code, do not handle byte buffers except in isolated functions, which you will then modify as appropriate for the CPU architecture. For example, JPEG-handling functions would take either a struct wrapping the image data in an unspecified way, or a file name to read the image from, but never a raw char* to a byte buffer.
Wrap strings in a container which does not assume an encoding (presumably it will use UTF-8 or UTF-16 on an 8-bit-byte machine, possibly the currently non-standard UTF-9 or UTF-18 on a 9-bit-byte machine, etc.).
Wrap all reads from external sources (network, disk files, etc) into functions which return native data.
Create code where no integer overflows happen, and do not rely on overflow behavior in any algorithm.
Define all-ones bitmasks using ~0 (instead of 0xFFFFFFFF or something); see the sketch after this list.
Prefer IEEE floating point numbers for most numeric storage, where integer is not required, as those are independent of CPU architecture.
Do not store persistent data in binary files, which you may have to convert. Instead use XML in UTF-8 (which can be converted to UTF-X without breaking anything, for native handling), and store numbers as text in the XML.
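As a tiny illustration of the ~0 point (a sketch; the printed values depend on the platform's CHAR_BIT and type widths):

#include <climits>
#include <cstdio>

int main()
{
    // ~0 adapts to however many bits the type actually has (8-bit, 9-bit,
    // 36-bit...), whereas a hard-coded 0xFF or 0xFFFFFFFF silently assumes
    // a particular byte and word size.
    unsigned char byte_mask = static_cast<unsigned char>(~0u);
    unsigned int  word_mask = ~0u;
    std::printf("CHAR_BIT = %d, byte mask = %u, word mask = %u\n",
                CHAR_BIT,
                static_cast<unsigned>(byte_mask),
                word_mask);
}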
Same as with different byte orders, except much more so: the only way to be sure is to port your program to an actual machine with a different number of bits and run comprehensive tests. If this is really important, then you may have to first implement such a virtual machine, and port a C compiler and the needed libraries for it, if you can't find one otherwise. Even a careful (= expensive) code review will only take you part of the way.
If you're planning to write programs for quantum computers (which will be available for us to buy in the near future), then start learning quantum physics and take a class on programming them.
Unless you're planning for Boolean computer logic in the near future, then... my question is: how will you make sure that the filesystem available today will be the same tomorrow? Or how will a file stored as 8-bit binary remain portable in the filesystems of tomorrow?
If you want to keep your programs running through the generations, my suggestion is to create your own computing machine, with your own filesystem and your own operating system, and change the interface as the needs of tomorrow change.
My problem is that the computer system I programmed a few years ago (a Motorola 68000) no longer exists for the general public, and the program heavily relied on the machine's byte order and assembly language. Not portable anymore :-(
If you're talking about writing and reading binary data, don't bother. There is no portability guarantee today, other than that data you write from your program can be read by the same program compiled with the same compiler (including command-line settings). If you're talking about writing and reading textual data, don't worry. It works.
First: The original practical goal of portability is to reduce work; therefore if portability requires more effort than non-portability to achieve the same end result, then writing portable code in such case is no longer advantageous. Do not target 'portability' simply out of principle. In your case, a non-portable version with well-documented notes regarding the disk format is a more efficient means of future-proofing. Trying to write code that somehow caters to any possible generic underlying storage format will probably render your code nearly incomprehensible, or so annoying to maintain that it will fall out of favor for that reason (no need to worry about future-proofing if no one wants to use it anyway 20 yrs from now).
Second: I don't think you have to worry about this, because the only realistic solution to running 8-bit programs on a 9-bit machine (or similar) is via Virtual Machines.
It is extremely likely that anyone in the near or distant future using some 9+ bit machine will be able to start up a legacy x86/ARM virtual machine and run your program that way. Hardware 25-50 years from now should have no problem whatsoever running entire virtual machines just for the sake of executing a single program; and that program will probably still load, execute, and shut down faster than it does today on current native 8-bit hardware. (In fact, some cloud services today already trend toward starting entire VMs just to service individual tasks.)
I strongly suspect this is the only means by which any 8-bit program would be run on 9/other-bit machines, due to the points made in other answers regarding the fundamental challenges inherent to simply loading and parsing 8-bit source code or 8-bit binary executables.
It may not be remotely resembling "efficient" but it would work. This also assumes, of course, that the VM will have some mechanism by which 8-bit text files can be imported and exported from the virtual disk onto the host disk.
As you can see, though, this is a huge problem that extends well beyond your source code. The bottom line is that, most likely, it will be much cheaper and easier to update/modify or even re-implement-from-scratch your program on the new hardware, rather than to bother trying to account for such obscure portability issues up-front. The act of accounting for it almost certainly requires more effort than just converting the disk formats.
8-bit bytes will remain until the end of time, so don't sweat it. There will be new types, but this basic type will never change.
I think the likelihood of non-8-bit bytes in future computers is low. It would require rewriting so much, and for so little benefit. But if it happens...
You'll save yourself a lot of trouble by doing all calculations in native data types and just rewriting inputs. I'm picturing something like:
template<int OUTPUTBITS, typename CALLABLE>
class converter {
    converter(int inputbits, CALLABLE datasource);
    smallestTypeWithAtLeast<OUTPUTBITS> get();
};
Note that this can be written in the future, when such a machine exists, so you need do nothing now. Or, if you're really paranoid, make sure get just calls datasource when OUTPUTBITS == inputbits.
Kind of late but I can't resist this one. Predicting the future is tough. Predicting the future of computers can be more hazardous to your code than premature optimization.
Short Answer
While I end this post with how 9-bit systems handled portability with 8-bit bytes, this experience also makes me believe 9-bit-byte systems will never arise again in general-purpose computers.
My expectation is that future portability issues will be with hardware having a minimum of 16 or 32 bit access making CHAR_BIT at least 16.
Careful design here may help with any unexpected 9-bit bytes.
QUESTION to /. readers: is anyone out there aware of general purpose CPUs in production today using 9-bit bytes or one's complement arithmetic? I can see where embedded controllers may exist, but not much else.
Long Answer
Back in the 1990s, the globalization of computers and Unicode made me expect UTF-16, or larger, to drive an expansion of bits-per-character: CHAR_BIT in C. But as legacy outlives everything, I also expect 8-bit bytes to remain an industry standard that survives at least as long as computers use binary.
BYTE_BIT: bits-per-byte (popular, but not a standard I know of)
BYTE_CHAR: bytes-per-character
The C standard does not address a char consuming multiple bytes. It allows for it, but does not address it.
3.6 byte: (final draft C11 standard ISO/IEC 9899:201x)
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.
NOTE 1: It is possible to express the address of each individual byte of an object uniquely.
NOTE 2: A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.
Until the C standard defines how to handle BYTE_CHAR values greater than one, and I'm not talking about "wide characters", this is the primary factor portable code must address, not larger bytes. Existing environments where CHAR_BIT is 16 or 32 are what to study. ARM processors are one example. I see two basic modes for reading external byte streams that developers need to choose from:
Unpacked: one BYTE_BIT character into a local character. Beware of sign extensions.
Packed: read BYTE_CHAR bytes into a local character.
Portable programs may need an API layer that addresses the byte issue. To create one on the fly, here is an idea I reserve the right to attack in the future:
#include <limits.h> // CHAR_BIT
#include <stdio.h>  // size_t, FILE

#define BYTE_BIT 8                     // bits-per-byte
#define BYTE_CHAR (CHAR_BIT/BYTE_BIT)  // bytes-per-char

size_t byread(void *ptr,
              size_t size,  // number of BYTE_BIT bytes
              int packing,  // bytes to read per char
                            // (negative for sign extension)
              FILE *stream);

size_t bywrite(void *ptr,
               size_t size,
               int packing,
               FILE *stream);
size: the number of BYTE_BIT bytes to transfer.
packing: bytes to transfer per char. While typically 1 or BYTE_CHAR, it could indicate the BYTE_CHAR of the external system, which can be smaller or larger than that of the current system.
Never forget endianness clashes.
Good Riddance To 9-Bit Systems:
My prior experience with writing programs for 9-bit environments leads me to believe we will not see such again, unless you happen to need a program to run on a real old legacy system somewhere, likely in a 9-bit VM on a 32/64-bit system. Since the year 2000 I have sometimes made a quick search for, but have not seen, references to current descendants of the old 9-bit systems.
Any future general-purpose 9-bit computers, highly unexpected in my view, would likely either have an 8-bit mode, or an 8-bit VM (@jstine), to run programs under. The only exception would be special-purpose embedded processors, which general-purpose code would not be likely to run on anyway.
In days of yore, one 9-bit machine was the PDP/15. A decade of wrestling with a clone of this beast makes me never expect to see 9-bit systems arise again. My top picks for why follow:
The extra data bit came from robbing the parity bit in core memory. Old 8-bit core carried a hidden parity bit with it. Every manufacturer did it. Once core got reliable enough, some system designers switched the already existing parity to a data bit in a quick ploy to gain a little more numeric power and memory addresses during times of weak, non-MMU machines. Current memory technology does not have such parity bits, machines are not so weak, and 64-bit memory is so big. All of which should make the design changes less cost effective than the changes were back then.
Transferring data between 8-bit and 9-bit architectures, including off-the-shelf local I/O devices, and not just other systems, was a continuous pain. Different controllers on the same system used incompatible techniques:
Use the low-order 16 bits of 18-bit words.
Use the low-order 8 bits of 9-bit bytes, where the extra high-order bit might be set to the parity from bytes read from parity-sensitive devices.
Combine the low-order 6 bits of three 8-bit bytes to make 18-bit binary words.
Some controllers allowed selecting between 18-bit and 16-bit data transfers at run time. What future hardware, and supporting system calls, your programs would encounter simply could not be predicted in advance.
Connecting to the 8-bit Internet will be horrid enough by itself to kill any 9-bit dreams someone has. They got away with it back then because machines were less interconnected in those times.
Having something other than a power of two bits per byte in byte-addressed storage brings up all sorts of trouble. Example: if you want an array of thousands of bits packed into 8-bit bytes, you can write unsigned char bits[1024] = { 0 }; bits[n>>3] |= 1 << (n&7);. To fully pack 9-bit bytes you must do actual divides, which brings horrid performance penalties. This also applies to bytes-per-word.
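To make the arithmetic difference concrete, here is a sketch with hypothetical helper functions; the second variant is only meaningful on a machine where CHAR_BIT is 9:

// 8-bit bytes: bit n is reached with single shift and mask instructions
inline void set_bit_pow2(unsigned char *bits, unsigned n)
{
    bits[n >> 3] |= 1u << (n & 7);    // same as n / 8 and n % 8, but cheaper
}

// 9-bit bytes (CHAR_BIT == 9): no shift/mask equivalent exists, so the
// compiler must emit real division and remainder operations
inline void set_bit_9(unsigned char *bits, unsigned n)
{
    bits[n / 9] |= 1u << (n % 9);
}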
Any code not actually tested on 9-bit byte hardware may well fail on its first actual venture into the land of unexpected 9-bit bytes, unless the code is so simple that refactoring it in the future for 9 bits is only a minor issue. The prior byread()/bywrite() may help here, but it would likely need an additional CHAR_BIT mode setting to set the transfer mode, returning how the current controller arranges the requested bytes.
To be complete, anyone who wants to worry about 9-bit bytes for the educational experience may also need to worry about one's complement systems coming back; something else that seems to have died a well deserved death (two zeros, +0 and -0, are a source of ongoing nightmares... trust me). Back then, 9-bit systems often seemed to be paired with one's complement operations.
In a programming language, a byte is always 8 bits. So, if a byte representation has 9 bits on some machine, for whatever reason, it's up to the C compiler to reconcile that. As long as you write text using char (say, if you write/read 'A' to a file), you would be writing/reading only 8 bits to the file. So, you should not have any problem.

Using 64-bit integers in 64-bit compilers and OSes

I have a question about when to use 64-bit integers when targeting 64-bit OSes.
Has anyone done conclusive studies focused on the speed of the generated code?
Is it better to use 64-bit integers as params for functions or methods? (e.g., uint64 myFunc(uint64 myVar))
If we use 64-bit integers as params it takes more memory, but maybe it will be more efficient.
What if we know that some value will always be less than, say, 10? Should we still use a 64-bit integer for that param?
Is it better to use 64-bit integers as return types?
Is there some penalty for using 32-bit as a return value?
Is it better to use 64-bit integers for loops? (for(size_t i=0; i<...)) In this case, I suppose so.
Is there some penalty for using 32-bit variables for loops?
Is it better to use 64-bit integers as indexes for pointers? (e.g., myMemory[index]) In this case, I suppose so.
Is there some penalty for using 32-bit variables for indexes?
Is it better to use 64-bit integers to store data in classes or structs? (data that we won't want to save to disk or anything like that)
Is it better to use 64 bits for a bool type?
What about conversions between 64-bit integers and floats? Will it be better to use doubles now?
Until now, doubles have been slower than floats.
Is there some penalty every time we access a 32-bit variable?
Regards!
I agree with #MarkB but want to provide more detail on some topics.
On x64, there are more registers available (twice as many). The standard calling conventions have therefore been designed to take more parameters in registers by default. So as long as the number of parameters is not excessive (typically 4 or fewer), their types will make no difference. They will be promoted to 64 bit and passed in registers anyway.
Space will still be allocated on the stack for those parameters even though they are passed in registers. This is by design, to keep their storage locations simple and contiguous with those of any surplus parameters. The surplus parameters will be placed on the stack regardless, so size may matter in those cases.
This issue is particularly important for in-memory data structures. Using 64-bit where 32-bit is sufficient will waste memory and, more importantly, occupy space in cache lines. The cache impact is not simple, though. If your data access pattern is sequential, that is when you will pay for it, by essentially making half of your cache unusable (assuming you only needed half of each 64-bit quantity).
If your access pattern is random, there is no impact on cache performance. This is because every access occupies a full cache line anyway.
There can be a small impact in accessing integers that are smaller than word size. However, pipelining and multiple issue of instructions will make it so that the extra instruction (zero or sign extend) will almost always become completely hidden and go unobserved.
The upshot of all this is simple: choose the integer size that matters for your problem. For parameters, the compiler can promote them as needed. For memory structures, smaller is typically better.
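As a quick sketch of the memory-structure point (the struct and field names here are made up), compare two layouts that hold the same logical data:

#include <cstdint>
#include <cstdio>

struct RecordWide  { std::uint64_t id; std::uint64_t flags; std::uint64_t count; };
struct RecordTight { std::uint32_t id; std::uint16_t flags; std::uint16_t count; };

int main()
{
    // On a typical 64-bit platform this prints 24 and 8: three times as many
    // tight records fit in each cache line, with no arithmetic penalty.
    std::printf("wide: %zu bytes, tight: %zu bytes\n",
                sizeof(RecordWide), sizeof(RecordTight));
}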
You have managed to cram a ton of questions into one question here. It looks to me like all your questions basically concern micro-optimizations. As such I'm going to make a two-part answer:
Don't worry about size from a performance perspective but instead use types that are indicative of the data that they will contain and trust the compiler's optimizer to sort it out.
If performance becomes a concern at some point during development, profile your code. Then you can make algorithmic adjustments as appropriate, and if the profiler shows that integer operations are causing a problem, you can compare different sizes side by side.
Use int and trust the platform and compiler authors that they have done their job and chosen the most efficient representation for it. On most 64-bit platforms int is 32 bits, which means it is no less efficient than the 64-bit types.

What is the optimal size for a boolean variable

I have come to believe that the optimal size for a boolean variable is the natural word width of the data, i.e. in C/C++ it is int. So for modern processors this is normally 32 bits. At the machine level, declaring it as a byte, for example, requires a 32-bit fetch and then a mask.
However I have seen that a BOOL in iOS is 8 bits. I had assumed that people who used bytes were using left-over ideas from 8-bit processors.
I realise this question depends on the use and for most of the time the language defined boolean is the best bet, but there are times when you need to define your own, such as when you are converting code arriving from an external source or you want to write cross platform code.
It is also significant that if a boolean value is going to be packed into a serial stream, for sending over a link such as Ethernet, or for storage, it may be optimal to pack the boolean into fewer bits. But I feel it is likely optimal to pack and unpack from a processor-optimal size.
So my question is: am I correct in thinking that the optimal size for a boolean on a 32-bit processor is 32 bits, and if so, why does iOS use 8 bits?
Yup, you are right, it depends. The big advantage of using an 8-bit type is that you can pack more of them into a struct nicely.
Of course you'd be best off using flags in such a case.
The big issue, though, is that with a C/C++ "bool" you don't necessarily know how big it is. This means that you can't make assumptions about a struct's layout (such as when writing it in binary to disk) without the possibility of it breaking on another platform. In such a case, using a variable of known size can be very useful, and you may as well use as little space as possible if you are going to dump the structure to disk.
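For instance (a hypothetical on-disk header; the names are made up), a fixed-width integer keeps the layout predictable where sizeof(bool) can vary between compilers:

#include <cstdint>

struct FileHeaderPortable {
    std::uint32_t version;
    std::uint8_t  compressed;   // 0 or 1, instead of bool
    std::uint8_t  encrypted;    // 0 or 1
    std::uint8_t  reserved[2];  // explicit padding rather than implied padding
};
static_assert(sizeof(FileHeaderPortable) == 8, "unexpected padding");

The static_assert documents the layout assumption so a platform that pads differently fails to compile rather than silently writing a different format.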
The notion of an 8-bit quantity involving a 32-bit fetch followed by hardware masking is mostly obsolete. In reality, a fetch from memory (on a modern processor) will normally be one L2 cache line (typically around 64-128 bytes). That being the case, essentially every size of item you deal with involves fetching a big chunk of data, and then using only some subset of what you fetched (but, assuming your data is more or less contiguous, probably using more of that data subsequently).
C++ attempts (not necessarily successfully) to optimize this a bit for you. An individual bool can be anywhere from one byte on up, though on most typical implementations it is either one byte or four bytes. The (much reviled) std::vector<bool> uses some tricks to give a (sort of) vector-like interface but still stores each bool in one bit. In the process it loses the ability to be treated as a generic sequence container -- but when you're storing a lot of bools, and can live with the restrictions of using it in an array-like manner, it can actually be a lot more useful than many people believe.
When/if you want to retain normal container semantics and don't mind the extra storage space to keep them at their native size, you can use another container (e.g., std::deque<bool>) instead. Especially if you only need to store a small collection of bools, this can often be a superior alternative.
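A small sketch of the trade-off, purely illustrative:

#include <vector>
#include <deque>
#include <cstdio>

int main()
{
    // Packed: roughly one bit per element, but operator[] returns a proxy,
    // so it is not a fully conforming sequence container.
    std::vector<bool> packed(1000000, false);
    packed[42] = true;

    // Unpacked: one bool object per element, normal container semantics,
    // and real references to elements.
    std::deque<bool> plain(1000000, false);
    bool &ref = plain[42];
    ref = true;

    std::printf("%d %d\n", int(packed[42]), int(plain[42]));
}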
It is architecture dependent, but on many 32-bit architectures 8-bit addressing is no less efficient than 32-bit; the "fetching and masking" as such is performed in hardware logic.
The optimal size in terms of storage space is of course 1 bit. You might, for example, use bit fields or bit masking to pack multiple booleans into a single word. Some architectures, such as the 8051, have bit-addressable memory. The more modern ARM Cortex-M architecture employs a technique called bit-banding that allows memory and hardware registers to be bit-addressable.
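A brief sketch of both packing approaches (the flag names are made up):

#include <cstdint>

// Packing independent flags into one byte with masks...
enum : std::uint8_t {
    FLAG_READY   = 1u << 0,
    FLAG_ERROR   = 1u << 1,
    FLAG_VERBOSE = 1u << 2,
};

// ...or with a bit-field struct (bit layout is implementation-defined).
struct Flags {
    unsigned ready   : 1;
    unsigned error   : 1;
    unsigned verbose : 1;
};

int main()
{
    std::uint8_t flags = 0;
    flags |= FLAG_READY;                                 // set a flag
    flags &= static_cast<std::uint8_t>(~FLAG_VERBOSE);   // clear a flag
    bool error = (flags & FLAG_ERROR) != 0;              // test a flag

    Flags f{};                                           // all fields zeroed
    f.ready = 1;
    return f.ready && !error ? 0 : 1;
}

Because bit-field layout is implementation-defined, explicit masks are generally the safer choice when the packed value has to match an external format.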