Is Byte Really The Minimum Addressable Unit? - c++

Section 3.6 of C11 standard defines "byte" as "addressable unit of data storage ... to hold ... character".
Section 1.7 of C++11 standard defines "byte" as "the fundamental storage unit in the C++ memory model ... to contain ... character".
Neither definition says that a "byte" is the minimum addressable unit. Is this because the standards intentionally want to abstract away from a specific machine? Can you provide a real example of a machine where the C/C++ implementation chose a "byte" longer or shorter than the minimum addressable unit?

A byte is the smallest addressable unit in strictly conforming C code. Whether the machine on which the C implementation executes a program supports addressing smaller units is irrelevant; the C implementation must present a view in which bytes are the smallest addressable unit.
A C implementation may support addressing smaller units as an extension, such as simply by defining the results of certain pointer operations that are otherwise undefined by the C standard.

One example of a real machine and its compiler where the minimal addressable unit is smaller than a byte is the 8051 family. One compiler I used is Keil C51.
The minimal addressable unit is a bit. You can define a variable of this type, and you can read and write it. However, the syntax to define such a variable is non-standard; C51 needs several extensions to support all of this. BTW, pointers to bits are not allowed.
For example:
unsigned char bdata bitsAddressable;    /* a byte placed in bit-addressable RAM */
sbit bitAddressed = bitsAddressable^5;  /* Keil extension: alias for bit 5 of that byte */

void f(void) {
    bitAddressed = 1;                   /* sets bit 5 of bitsAddressable */
}

bit singleBit;                          /* Keil extension: a standalone 1-bit variable */

void g(bit value) {
    singleBit = value;
}

Neither definition says that "byte" is the minimum addressable unit.
That's because they don't need to. Byte-wise types (char, unsigned char, std::byte, etc.) have sufficient restrictions to enforce this requirement.
The size of byte-wise types is explicitly defined to be precisely 1:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
The alignment of byte-wise types is the smallest alignment possible:
Furthermore, the narrow character types (6.9.1) shall have the weakest alignment requirement
This doesn't have to be an alignment of 1, of course. Except... it does.
See, if the alignment were higher than 1, that would mean that a simple byte array wouldn't work. Array indexing is based on pointer arithmetic, and pointer arithmetic determines the next address based on sizeof(T). But if alignof(T) is greater than sizeof(T), then the second element in any array of T would be misaligned. That's not allowed.
So even though the standard doesn't explicitly say that the alignment of bytewise types is 1, other requirements ensure that it must be.
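As a quick illustration (my own sketch, not quoted from the standard), these compile-time checks should pass on any conforming implementation; the std::byte line additionally assumes C++17:

#include <cstddef>   // std::byte (C++17)

// Byte-wise types have size 1 and the weakest possible alignment, and for any
// complete object type the alignment cannot exceed the size (otherwise the
// second element of an array would be misaligned).
static_assert(sizeof(char) == 1 && alignof(char) == 1, "char is one byte, alignment 1");
static_assert(sizeof(unsigned char) == 1 && alignof(unsigned char) == 1, "same for unsigned char");
static_assert(sizeof(std::byte) == 1 && alignof(std::byte) == 1, "same for std::byte");
static_assert(alignof(int) <= sizeof(int), "alignment never exceeds size");
static_assert(alignof(double) <= sizeof(double), "alignment never exceeds size");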
Overall, this means that every pointer to an object has an alignment at least as restrictive as that of a byte-wise type. So no object pointer can be misaligned relative to the alignment of byte-wise types. All valid, non-null pointers (pointers to a live object, or one past the end of an array) must therefore be at least aligned enough to point to a char.
Similarly, the difference between two pointers is defined in C++ as the difference between the array indices of the elements pointed to by those pointers (pointer arithmetic in C++ requires that the two pointers point into the same array). Additive pointer arithmetic is, as previously stated, based on the sizeof of the type being pointed to.
Given all of these facts, even if an implementation has pointers whose addresses can address values smaller than char, it is functionally impossible for the C++ abstract model to generate a pointer to such a sub-byte address and still have that pointer count as valid (pointing to an object or function, pointing one past the end of an array, or being null). You could create such a pointer value with a cast from an integer, but you would be creating an invalid pointer value.
So while technically there could be smaller addresses on the machine, you could never actually use them in a valid, well-formed C++ program.
Obviously compiler extensions could do anything. But as far as conforming programs are concerned, it simply isn't possible to generate valid pointers that are misaligned for byte-wise types.
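A small sketch of the pointer-arithmetic point (my own example): distances between pointers are counted in elements and, at their finest, in bytes; nothing smaller is expressible.

#include <cstddef>

int main() {
    int a[10] = {};
    int *first = &a[0];
    int *past_end = a + 10;               // one past the end: still a valid pointer value
    std::ptrdiff_t n = past_end - first;  // difference in array elements, i.e. 10
    char *lo = reinterpret_cast<char *>(first);
    char *hi = reinterpret_cast<char *>(past_end);
    // The same distance in bytes is n * sizeof(int); no finer granularity exists.
    return (hi - lo == n * static_cast<std::ptrdiff_t>(sizeof(int))) ? 0 : 1;
}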

I programmed both the TMS34010 and its successor TMS34020 graphics chips back in the early 1990s. They had a flat address space and were bit-addressable, i.e. addresses indexed individual bits. This was very useful for computer graphics of the time, back when memory was a lot more precious.
The embedded C compiler didn't really have a way to access individual bits directly, since from a (standard) C language point of view the byte was still the smallest unit, as pointed out in a previous post.
Thus, if you want to read or write a stream of bits in C, you need to read or write (at least) a byte at a time and buffer the rest (for example when writing an arithmetic or Huffman coder).
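For example, a minimal MSB-first bit writer along those lines (my own sketch, not the original TMS340x0 code) buffers bits and emits them a byte at a time:

#include <cstdint>
#include <vector>

// Minimal MSB-first bit writer: bits are buffered and flushed a byte at a time,
// because a byte is the smallest unit standard C/C++ lets us write.
class BitWriter {
public:
    void put_bit(bool bit) {
        acc_ = static_cast<std::uint8_t>((acc_ << 1) | (bit ? 1u : 0u));
        if (++count_ == 8) {             // a full byte accumulated: emit it
            out_.push_back(acc_);
            acc_ = 0;
            count_ = 0;
        }
    }
    void flush() {                       // pad the last partial byte with zero bits
        while (count_ != 0) put_bit(false);
    }
    const std::vector<std::uint8_t> &bytes() const { return out_; }
private:
    std::vector<std::uint8_t> out_;
    std::uint8_t acc_ = 0;
    int count_ = 0;
};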

(Thank you everyone who commented and answered, every word helps)
The memory model of a programming language and the memory model of the target machine are different things.
Yes, a byte is the minimum addressable unit in the context of the programming language's memory model.
No, a byte is not the minimum addressable unit in the context of the machine's memory model. For example, there are machines where the programming language's "byte" is longer or shorter than the machine's minimum addressable unit:
longer: HP Saturn - 4-bit unit vs 8-bit byte with gcc (thanks Nate)
shorter: IBM 360 - 36-bit unit vs 6-bit byte (thanks Antti)
longer: Intel 8051 - 1-bit unit vs 8-bit byte (thanks Busybee)
longer: Ti TMS34010 - 1-bit unit vs 8-bit byte (thanks Wcochran)

Related

What is the difference in defining the 'byte' in terms of computer memory and in terms of C++?

This is with reference to the text from C++ Primer Plus by Stephen Prata:
A byte means an 8-bit unit of memory in the sense of a unit of measurement that describes the amount of memory in a computer.
However, C++ defines byte differently. The C++ byte consists of at least enough adjacent bits to accommodate the basic character set for the implementation.
Can you please explain: if a C++ compiler has a 16-bit byte whereas the system has an 8-bit byte, then how will the program run on such a system?
What the author wants to say about the size of a byte is that, quoting from Wikipedia:
The popularity of major commercial computing architectures has aided in the ubiquitous acceptance of the 8-bit size.
On the other hand, the unit of memory in C++ is given by the built-in type char; under some implementation, a char may not be an 8-bit memory chunk; though, in your C++ program every sizeof(T) will be expressed in multiples of sizeof(char), that is equal to 1 by definition.
The number of bits in a byte for a particular implementation is recorded in the macro CHAR_BIT, defined inside the standard header <climits>. It is guaranteed that a char is at least 8 bits.
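For instance (a trivial sketch), both facts can be checked directly:

#include <climits>   // CHAR_BIT
#include <cstdio>

int main() {
    static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
    static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
    std::printf("bits per byte on this implementation: %d\n", CHAR_BIT);
    return 0;
}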
Finally, this is the definition of byte given by the C++ Standard (§1.7, intro.memory):
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
A byte means an 8-bit unit of memory.
That is incorrect.
However, C++ defines byte differently.
That is also incorrect.
In both C++ terminology and general parlance, a byte is the minimum unit of memory. An 8-bit byte is known as an octet.
Can you please explain: if a C++ compiler has a 16-bit byte whereas the system has an 8-bit byte, then how will the program run on such a system?
It won't. If you compile a program for an architecture whose bytes are 16-bit, it will not run on a computer with an architecture whose bytes are 8-bit.
You have to compile for the processor you're using.
There used to be machines that had either a variable byte size or a byte size smaller than 8 bits. The spec leaves it open to the implementation on the given hardware.
The DEC PDP-10 had a 36-bit word size, and you could specify the size of a byte (usually five 7-bit bytes to the word...)
http://pdp10.nocrew.org/docs/instruction-set/Byte.html

Fortran storage_size intrinsic function

I am looking at the storage_size intrinsic function introduced in Fortran 2008 to obtain the storage size of a user-defined type (see man storage_size). It returns the size in bits, not bytes. I am wondering what the rationale is behind returning the size in bits instead of bytes.
Since I need the size in bytes, I am simply going to divide the result by 8. Is it safe to assume that the size returned will always be divisible by 8?
It is not even safe to expect that a byte is always 8 bits (see CHARACTER_STORAGE_SIZE in module iso_fortran_env)! For the rationale behind storage_size(), contact someone from SC22/WG5 or X3J3, but one of the former members always says (on comp.lang.fortran) that these questions don't have much sense or a single clear answer. There was often just someone pushing one variant and not the other.
My guess would be that symmetry with the older function bit_size() is one of the reasons. And why is there bit_size() and not byte_size()? I would guess so that you do not have to multiply by the byte size (and check how large one byte is), and so that you can apply the bit-manipulation procedures immediately.
To your last question: yes, on a machine with 8-bit bytes (other machines do not have Fortran 2008 compilers, AFAIK) the bit size will always be divisible by 8, as one byte is the smallest addressable piece of memory and structures cannot use just part of a byte.

Is size_t the word size?

Is size_t the word size of the machine that compiled the code?
Parsing with g++, my compiler views size_t as a long unsigned int. Does the compiler internally choose the size of size_t, or is size_t actually typedef'd inside some pre-processor macro in stddef.h to the word size before the compiler gets invoked?
Or am I way off track?
In the C++ standard, [support.types] (18.2) /6: "The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object."
This may or may not be the same as a "word size", whatever that means.
No; size_t is not necessarily whatever you mean by 'the word size' of the machine that will run the code (in the case of cross-compilation) or that compiled the code (in the normal case where the code will run on the same type of machine that compiled the code). It is an unsigned integer type big enough to hold the size (in bytes) of the largest object that the implementation can allocate.
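For example (my own sketch), sizeof yields a size_t whatever width the implementation chose for it, and the value can be printed portably with %zu:

#include <cstddef>
#include <cstdio>

int main() {
    double buffer[256];
    std::size_t n = sizeof buffer;                  // sizeof yields std::size_t
    std::printf("sizeof buffer  = %zu bytes\n", n);
    std::printf("sizeof(size_t) = %zu bytes\n", sizeof(std::size_t));
    return 0;
}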
Some history of sizeof and size_t
I don't know when size_t was introduced exactly, but it was between 1979 and 1989. The 1st Edition of K&R The C Programming Language from 1978 has no mention of size_t. The 7th Edition Unix Programmer's Manual has no mention of size_t at all, and that dates from 1979. The book "The UNIX Programming Environment" by Kernighan and Pike from 1984 has no mention of size_t in the index (nor of malloc() or free(), somewhat to my surprise), but that is only indicative, not conclusive. The C89 standard certainly has size_t.
The C99 Rationale documents some information about sizeof() and size_t:
6.5.3.4 The sizeof operator
It is fundamental to the correct usage of functions such as malloc and fread that sizeof(char) be exactly one. In practice, this means that a byte in C terms is the smallest unit of storage, even if this unit is 36 bits wide; and all objects are composed of an integer number of these smallest units. Also applies if memory is bit addressable.
C89, like K&R, defined the result of the sizeof operator to be a constant of an unsigned integer type. Common implementations, and common usage, have often assumed that the resulting type is int. Old code that depends on this behavior has never been portable to implementations that define the result to be a type other than int. The C89 Committee did not feel it was proper to change the language to protect incorrect code.
The type of sizeof, whatever it is, is published (in the library header <stddef.h>) as size_t, since it is useful for the programmer to be able to refer to this type. This requirement implicitly restricts size_t to be a synonym for an existing unsigned integer type. Note also that, although size_t is an unsigned type, sizeof does not involve any arithmetic operations or conversions that would result in modulus behavior if the size is too large to represent as a size_t, thus quashing any notion that the largest declarable object might be too big to span even with an unsigned long in C89 or uintmax_t in C99. This also restricts the maximum number of elements that may be declared in an array, since for any array a of N elements,
N == sizeof(a)/sizeof(a[0])
Thus size_t is also a convenient type for array sizes, and is so used in several library functions. [...]
7.17 Common definitions
<stddef.h> is a header invented to provide definitions of several types and macros used widely in conjunction with the library: ptrdiff_t, size_t, wchar_t, and NULL. Including any header that references one of these macros will also define it, an exception to the usual library rule that each macro or function belongs to exactly one header.
Note that this specifically mentions that the <stddef.h> was invented by the C89 committee. I've not found words that say that size_t was also invented by the C89 committee, but if it was not, it was a codification of a fairly recent development in C.
In a comment to bmargulies answer, vonbrand says that 'it [size_t] is certainly an ANSI-C-ism'. I can very easily believe that it was an innovation with the original ANSI (ISO) C, though it is mildly odd that the rationale doesn't state that.
Not necessarily. The C ISO spec (§7.17/2) defines size_t as
size_t, which is the unsigned integer type of the result of the sizeof operator
In other words, size_t has to be large enough to hold the size of any expression that could be produced from sizeof. This could be the machine word size, but it could be dramatically smaller (if, for example, the compiler limited the maximum size of arrays or objects) or dramatically larger (if the compiler were to let you create objects so huge that a single machine word could not store the size of that object).
Hope this helps!
size_t was, originally, just a typedef in sys/types.h (traditionally on Unix/Linux). It was assumed to be 'big enough' for, say, the maximum size of a file, or the maximum allocation with malloc. However, over time, standards committees grabbed it, and so it wound up copied into many different header files, protected each time with its own #ifdef protection against multiple definition. On the other hand, the emergence of 64-bit systems with very big potential file sizes clouded its role. So it's a bit of a palimpsest.
Language standards now call it out as living in stddef.h. It has no necessary relationship to the hardware word size, and no compiler magic. See other answers with respect to what those standards say about how big it is.
Such definitions are all implementation-defined. I would use sizeof(char *), or maybe sizeof(void *), if I needed a best-guess size. The best this gives is the apparent word size the software uses; what the hardware really has may be different (e.g., a 32-bit system may support 64-bit integers in software).
Also, if you are new to the C languages, see stdint.h for all sorts of material on integer sizes.
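A short sketch of that 'best guess' approach (my own example); none of these is formally the word size, they are just the widths this particular implementation happens to use:

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    std::printf("sizeof(void *)    = %zu\n", sizeof(void *));         // pointer width
    std::printf("sizeof(size_t)    = %zu\n", sizeof(std::size_t));
    std::printf("sizeof(uintptr_t) = %zu\n", sizeof(std::uintptr_t)); // integer wide enough for a pointer
    std::printf("sizeof(long)      = %zu\n", sizeof(long));           // differs between LP64 and LLP64
    return 0;
}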
Although the definition does not directly state what type exactly size_t is, and does not even require a minimum size, it indirectly gives some good hints. A size_t must be able to contain the size in bytes of any object, in other words, it must be able to contain the size of the largest possible object.
The largest possible object is an array (or structure) with a size equal to the entire available address space. It is not possible to reference a larger object in a meaningful manner, and apart from the availability of swap space there is no reason why it should need to be any smaller.
Therefore, by the wording of the definition, size_t must be at least 32 bits on a 32 bit architecture, and at least 64 bits on a 64 bit system. It is of course possible for an implementation to choose a larger size_t, but this is not usually the case.

Why do the sizes of data types change as the Operating System changes?

This question was asked to me in an interview: the size of char is 2 bytes in some OS, but in other operating systems it is 4 bytes, or something else entirely.
Why is that so?
Why is it different from other fundamental types, such as int?
That was probably a trick question. The sizeof(char) is always 1.
If the size differs, it's probably because of a non-conforming compiler, in which case the question should be about the compiler itself, not about the C or C++ language.
5.3.3 Sizeof [expr.sizeof]
1 The sizeof operator yields the number of bytes in the object representation of its operand. The operand is either an expression, which is not evaluated, or a parenthesized type-id. The sizeof operator shall not be applied to an expression that has function or incomplete type, or to an enumeration type before all its enumerators have been declared, or to the parenthesized name of such types, or to an lvalue that designates a bit-field. sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1. The result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. (emphasis mine)
The sizeof of types other than the ones pointed out is implementation-defined, and it varies for various reasons. An int has better range if it's represented in 64 bits instead of 32, but it's also more efficient as 32 bits on a 32-bit architecture.
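As an illustration (my own sketch), only the relative ranges are guaranteed; the exact sizes are left to the implementation:

#include <climits>
#include <cstdio>

int main() {
    static_assert(sizeof(char) == 1, "always 1 by definition");
    static_assert(SHRT_MAX <= INT_MAX && INT_MAX <= LONG_MAX,
                  "only the relative ranges are guaranteed, not exact sizes");
    std::printf("char:%zu short:%zu int:%zu long:%zu long long:%zu\n",
                sizeof(char), sizeof(short), sizeof(int),
                sizeof(long), sizeof(long long));
    return 0;
}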
The physical sizes (in terms of the number of bits) of types are usually dictated by the target hardware.
For example, some CPUs can access memory only in units not smaller than 16-bit. For the best performance, char can then be defined a 16-bit integer. If you want 8-bit chars on this CPU, the compiler has to generate extra code for packing and unpacking of 8-bit values into and from 16-bit memory cells. That extra packing/unpacking code will make your code bigger and slower.
And that's not the end of it. If you subdivide 16-bit memory cells into 8-bit chars, you effectively introduce an extra bit in addresses/pointers. If normal addresses are 16-bit in the CPU, where do you stick this extra, 17th bit? There are two options:
make pointers bigger (32-bit, of which 15 are unused) and waste memory and reduce the speed further
reduce the range of the addressable address space by half, wasting memory and losing speed
The latter option can sometimes be practical. For example, if the entire address space is divided in halves, one of which is used by the kernel and the other by user applications, then application pointers will never use one bit in their addresses. You can use that bit to select an 8-bit byte in a 16-bit memory cell.
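A hedged sketch of that byte-selector idea, written as ordinary portable C++ standing in for the packing/unpacking code such a compiler would have to generate; load8/store8 are hypothetical names for this illustration:

#include <cstdint>

// Simulate 8-bit chars on a machine that can only load/store 16-bit cells.
// The lowest bit of byte_addr selects the half of the cell; the rest indexes the cell.
std::uint8_t load8(const std::uint16_t *mem, std::uint32_t byte_addr) {
    std::uint16_t cell = mem[byte_addr >> 1];                 // one 16-bit memory access
    return (byte_addr & 1u) ? static_cast<std::uint8_t>(cell >> 8)
                            : static_cast<std::uint8_t>(cell & 0xFFu);
}

void store8(std::uint16_t *mem, std::uint32_t byte_addr, std::uint8_t value) {
    std::uint16_t cell = mem[byte_addr >> 1];                 // read-modify-write of the whole cell
    if (byte_addr & 1u)
        cell = static_cast<std::uint16_t>((cell & 0x00FFu) | (value << 8));
    else
        cell = static_cast<std::uint16_t>((cell & 0xFF00u) | value);
    mem[byte_addr >> 1] = cell;
}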
C was designed to run on as many different CPUs as possible. This is why the physical sizes of char, short, int, long, long long, void*, void(*)(), float, double, long double, wchar_t, etc can vary.
Now, when we're talking about different physical sizes in different compilers producing code for the same CPU, this becomes more of an arbitrary choice. However, it may be not that arbitrary as it may seem. For example, many compilers for Windows define int = long = 32 bits. They do that to avoid programmer's confusion when using Windows APIs, which expect INT = LONG = 32 bits. Defining int and long as something else would contribute to bugs due to loss of programmer's attention. So, compilers have to follow suit in this case.
And lastly, the C (and C++) standard operates with chars and bytes. They are the same concept size-wise. But C's bytes aren't your typical 8-bit bytes; they can legally be bigger than that, as explained earlier. To avoid confusion you may use the term octet, whose name implies the number 8. A number of protocols use this word for this very purpose.

Is the byte alignment requirement of a given data type guaranteed to be a power of 2?

Is the byte alignment requirement of a given data type guaranteed to be a power of 2?
Is there something that provides this guarantee other than it "not making sense otherwise" because it wouldn't line up with system page sizes?
(background: C/C++, so feel free to assume data type is a C or C++ type and give C/C++ specific answers.)
Alignment requirements are based on the hardware. Most, if not all, "modern" chips work in units whose width is divisible by 8 bits, not just a power of 2. In the past there were chips whose widths were not divisible by 8 (I know of a 36-bit architecture).
Things you can assume about alignment, per the C standard:
The alignment requirement of any type divides the size of that type (as determined by sizeof).
The character types char, signed char, and unsigned char have no alignment requirement. (This is actually just a special case of the first point.)
In the modern real world, integer and pointer types have sizes that are powers of two, and their alignment requirements are usually equal to their sizes (the only exception being long long on 32-bit machines). Floating point is a bit less uniform. On 32-bit machines, all floating point types typically have an alignment of 4, whereas on 64-bit machines, the alignment requirement of floating point types is typically equal to the size of the type (4, 8, or 16).
The alignment requirement of a struct should be the least common multiple of the alignment requirements of its members, but a compiler is allowed to impose stricter alignment. However, normally each cpu architecture has an ABI standard that includes alignment rules, and compilers which do not adhere to the standard will generate code that cannot be linked with code built by compilers which follow the ABI standard, so it would be very unusual for a compiler to break from the standard except for very special-purpose use.
By the way, a useful macro that will work on any sane pre-C++11 compiler (in C++11 and later, alignof is a keyword) is:
/* the offset of t, placed right after a single char, equals T's alignment requirement */
#define alignof(T) ((char *)&((struct { char x; T t; } *)0)->t - (char *)0)
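Since C++11 (and C11, with _Alignof and <stdalign.h>) no such trick is needed; a short sketch:

#include <cstdio>

struct Example {
    char c;
    double d;
};

int main() {
    // alignof is a built-in operator since C++11; no pointer tricks required.
    std::printf("alignof(char)=%zu alignof(double)=%zu alignof(Example)=%zu\n",
                alignof(char), alignof(double), alignof(Example));
    return 0;
}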
The alignment of a field inside a "struct", optimized for size, could very well be on an odd boundary. Other than that, your "it wouldn't make sense otherwise" would probably apply, but I think there is NO guarantee, especially if the program was built small-model, optimized for size. - Joe
The standard doesn't require alignment, but allows struct/unions/bit fields to silently add padding bytes to get a correct alignment. The compiler is also free to align all your data types on even addresses should it desire.
That being said, this is CPU dependent, and I don't believe there exists a CPU that has an alignment requirement on odd addresses. There are plenty of CPUs with no alignment requirements however, and the compiler may then place variables at any address.
In short, no. It depends on the hardware.
However, most modern CPUs either do byte alignment (e.g., Intel x86 CPUs), or word alignment (e.g., Motorola, IBM/390, RISC, etc.).
Even with word alignment, it can be complicated. For example, a 16-bit word would be aligned on a 2-byte (even) address, a 32-bit word on a 4-byte boundary, but a 64-bit value may only require 4-byte alignment instead of an 8-byte aligned address.
For byte-aligned CPUs, it's also a function of the compiler options. The default alignment for struct members can usually be specified (usually also with a compiler-specific #pragma).
For basic data types (ints, floats, doubles), the alignment usually matches the size of the type. For classes/structs, the alignment is at least the lowest common multiple of the alignments of all its members (that's the standard).
In Visual Studio you can set your own alignment for a type, but it has to be a power of 2, between 1 and 8192.
In GCC there is a similar mechanism, but it has no such requirement (at least in theory)
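As a portable sketch of the same idea (alignas is the standard mechanism that both compilers also support, and it likewise accepts only powers of two):

#include <cstdio>

// Over-align a struct to a 64-byte boundary (e.g. a cache line).
struct alignas(64) CacheLinePadded {
    int counter;
};

int main() {
    static_assert(alignof(CacheLinePadded) == 64, "requested alignment applies");
    static_assert((alignof(CacheLinePadded) & (alignof(CacheLinePadded) - 1)) == 0,
                  "alignment values are powers of two");
    std::printf("sizeof=%zu alignof=%zu\n", sizeof(CacheLinePadded), alignof(CacheLinePadded));
    return 0;
}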