Is size_t the word size? - c++

Is size_t the word size of the machine that compiled the code?
Parsing with g++, my compiler views size_t as a long unsigned int. Does the compiler internally choose the size of size_t, or is size_t actually typedefed inside some pre-processor macro in stddef.h to the word size before the compiler gets invoked?
Or am I way off track?

In the C++ standard, [support.types] (18.2) /6: "The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object."
This may or may not be the same as a "word size", whatever that means.

No; size_t is not necessarily whatever you mean by 'the word size' of the machine that will run the code (in the case of cross-compilation) or that compiled the code (in the normal case where the code will run on the same type of machine that compiled the code). It is an unsigned integer type big enough to hold the size (in bytes) of the largest object that the implementation can allocate.
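A quick way to see what your own implementation chose is a throwaway program like the sketch below; nothing obliges the two sizes it prints to match each other, or to match any hardware word size.

#include <climits>
#include <cstddef>
#include <cstdio>

int main() {
    // size_t is whichever unsigned type the implementation picked; its width
    // need not match the pointer size or any notion of "machine word".
    std::printf("size_t : %zu bytes (%zu bits)\n",
                sizeof(std::size_t), sizeof(std::size_t) * CHAR_BIT);
    std::printf("void*  : %zu bytes\n", sizeof(void *));
    return 0;
}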
Some history of sizeof and size_t
I don't know when size_t was introduced exactly, but it was between 1979 and 1989. The 1st Edition of K&R The C Programming Language from 1978 has no mention of size_t. The 7th Edition Unix Programmer's Manual has no mention of size_t at all, and that dates from 1979. The book "The UNIX Programming Environment" by Kernighan and Pike from 1984 has no mention of size_t in the index (nor of malloc() or free(), somewhat to my surprise), but that is only indicative, not conclusive. The C89 standard certainly has size_t.
The C99 Rationale documents some information about sizeof() and size_t:
6.5.3.4 The sizeof operator
It is fundamental to the correct usage of functions such as malloc and fread that
sizeof(char) be exactly one. In practice, this means that a byte in C terms is the smallest
unit of storage, even if this unit is 36 bits wide; and all objects are composed of an integer
number of these smallest units. Also applies if memory is bit addressable.
C89, like K&R, defined the result of the sizeof operator to be a constant of an unsigned integer type. Common implementations, and common usage, have often assumed that the
resulting type is int. Old code that depends on this behavior has never been portable to
implementations that define the result to be a type other than int. The C89 Committee did not
feel it was proper to change the language to protect incorrect code.
The type of sizeof, whatever it is, is published (in the library header <stddef.h>) as
size_t, since it is useful for the programmer to be able to refer to this type. This requirement
implicitly restricts size_t to be a synonym for an existing unsigned integer type. Note also
that, although size_t is an unsigned type, sizeof does not involve any arithmetic operations
or conversions that would result in modulus behavior if the size is too large to represent as a
size_t, thus quashing any notion that the largest declarable object might be too big to span even with an unsigned long in C89 or uintmax_t in C99. This also restricts the
maximum number of elements that may be declared in an array, since for any array a of N
elements,
N == sizeof(a)/sizeof(a[0])
Thus size_t is also a convenient type for array sizes, and is so used in several library functions. [...]
7.17 Common definitions
<stddef.h> is a header invented to provide definitions of several types and macros used widely in conjunction with the library: ptrdiff_t, size_t, wchar_t, and NULL.
Including any header that references one of these macros will also define it, an exception to the
usual library rule that each macro or function belongs to exactly one header.
Note that this specifically mentions that <stddef.h> was invented by the C89 committee. I've not found words that say that size_t was also invented by the C89 committee, but if it was not, it was a codification of a fairly recent development in C.
In a comment to bmargulies' answer, vonbrand says that 'it [size_t] is certainly an ANSI-C-ism'. I can very easily believe that it was an innovation with the original ANSI (ISO) C, though it is mildly odd that the rationale doesn't state that.

Not necessarily. The C ISO spec (§17.1/2) defines size_t as
size_t, which is the unsigned integer type of the result of the sizeof operator
In other words, size_t has to be large enough to hold the size of any expression that could be produced from sizeof. This could be the machine word size, but it could be dramatically smaller (if, for example, the compiler limited the maximum size of arrays or objects) or dramatically larger (if the compiler were to let you create objects so huge that a single machine word could not store the size of that object).
Hope this helps!

size_t was, originally, just a typedef in sys/types.h (traditionally on Unix/Linux). It was assumed to be 'big enough' for, say, the maximum size of a file, or the maximum allocation with malloc. However, over time, standards committees grabbed it, and so it wound up copied into many different header files, protected each time with its own #ifdef protection from multiple definition. On the other hand, the emergence of 64-bit systems with very big potential file sizes clouded its role. So it's a bit of a palimpsest.
Language standards now call it out as living in stddef.h. It has no necessary relationship to the hardware word size, and no compiler magic. See other answers with respect to what those standards say about how big it is.

Such definitions are all implementation defined. I would use sizeof(char *), or maybe sizeof(void *), if I needed a best-guess size. The best this gives is the apparent word size that the software uses... what the hardware really has may be different (e.g., a 32-bit system may support 64-bit integers by software).
Also, if you are new to the C languages, see stdint.h for all sorts of material on integer sizes.

Although the definition does not directly state what type exactly size_t is, and does not even require a minimum size, it indirectly gives some good hints. A size_t must be able to contain the size in bytes of any object, in other words, it must be able to contain the size of the largest possible object.
The largest possible object is an array (or structure) with a size equal to the entire available address space. It is not possible to reference a larger object in a meaningful manner, and apart from the availability of swap space there is no reason why it should need to be any smaller.
Therefore, by the wording of the definition, size_t must be at least 32 bits on a 32 bit architecture, and at least 64 bits on a 64 bit system. It is of course possible for an implementation to choose a larger size_t, but this is not usually the case.
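If your code relies on that reasoning, you can turn it into a compile-time check. This is a hedge for your own assumption about the target, not something the standard guarantees (the standard itself only requires SIZE_MAX to be at least 65535):

#include <cstdint>   // SIZE_MAX, UINT32_MAX

// Assumption: we build only for targets with at least a 32-bit address space,
// so size_t should be able to hold any 32-bit object size.
static_assert(SIZE_MAX >= UINT32_MAX, "size_t is narrower than 32 bits on this target");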

Related

Is Byte Really The Minimum Addressable Unit?

Section 3.6 of C11 standard defines "byte" as "addressable unit of data storage ... to hold ... character".
Section 1.7 of C++11 standard defines "byte" as "the fundamental storage unit in the C++ memory model ... to contain ... character".
Neither definition says that "byte" is the minimum addressable unit. Is this because the standards intentionally want to abstract from a specific machine? Can you provide a real example of a machine where the C/C++ compiler was designed to have a "byte" longer or shorter than the minimum addressable unit?
A byte is the smallest addressable unit in strictly conforming C code. Whether the machine on which the C implementation executes a program supports addressing smaller units is irrelevant to this; the C implementation must present a view in which bytes are the smallest addressable unit in strictly conforming C code.
A C implementation may support addressing smaller units as an extension, such as simply by defining the results of certain pointer operations that are otherwise undefined by the C standard.
One example of a real machine and its compiler where the minimal addressable unit is smaller than a byte is the 8051 family. One compiler I used is Keil C51.
The minimal addressable unit is a bit. You can define a variable of this type, you can read and write it. However, the syntax to define the variable is non-standard. Of course, C51 needs several extensions to support all of this. BTW, pointers to bits are not allowed.
For example:
unsigned char bdata bitsAdressable;    /* byte placed in bit-addressable on-chip RAM (Keil extension) */
sbit bitAddressed = bitsAdressable^5;  /* names bit 5 of that byte (Keil extension) */

void f(void) {
    bitAddressed = 1;                  /* set a single bit */
}

bit singleBit;                         /* stand-alone single-bit variable (Keil extension) */

void g(bit value) {
    singleBit = value;
}
Neither definition says that "byte" is the minimum addressable unit.
That's because they don't need to. Byte-wise types (char, unsigned char, std::byte, etc) have sufficient restrictions that enforce this requirement.
The size of byte-wise types is explicitly defined to be precisely 1:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
The alignment of byte-wise types is the smallest alignment possible:
Furthermore, the narrow character types (6.9.1) shall have the weakest alignment requirement
This doesn't have to be an alignment of 1, of course. Except... it does.
See, if the alignment were higher than 1, that would mean that a simple byte array wouldn't work. Array indexing is based on pointer arithmetic, and pointer arithmetic determines the next address based on sizeof(T). But if alignof(T) is greater than sizeof(T), then the second element in any array of T would be misaligned. That's not allowed.
So even though the standard doesn't explicitly say that the alignment of bytewise types is 1, other requirements ensure that it must be.
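The consequences can be spelled out as compile-time checks (C++11 or later assumed; alignof(char) == 1 follows from the argument above rather than from an explicit sentence in the standard):

static_assert(sizeof(char) == 1 && sizeof(unsigned char) == 1,
              "byte-wise types have size exactly 1");
static_assert(alignof(char) == 1 && alignof(unsigned char) == 1,
              "...and the weakest alignment, so arrays of them are contiguous bytes");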
Overall, this means that every pointer to an object has an alignment at least as restrictive as a byte-wise type. So no object pointer can be misaligned, relative to the alignment of byte-wise types. All valid, non-NULL pointers (pointers to a live object or to a past-the-end pointer) must therefore be at least aligned enough to point to a char.
Similarly, the difference between two pointers is defined in C++ as the difference between the array indices of the elements pointed to by those pointers (pointer arithmetic in C++ requires that the two pointers point into the same array). Additive pointer arithmetic is as previously stated based on the sizeof the type being pointed to.
Given all of these facts, even if an implementation has pointers whose addresses can address values smaller than char, it is functionally impossible for the C++ abstract model to generate a pointer and still have that pointer count as valid (pointing to an object/function, a past-the-end of an array, or be NULL). You could create such a pointer value with a cast from an integer. But you would be creating an invalid pointer value.
So while technically there could be smaller addresses on the machine, you could never actually use them in a valid, well-formed C++ program.
Obviously compiler extensions could do anything. But as far as conforming programs are concerned, it simply isn't possible to generate valid pointers that are misaligned for byte-wise types.
I programmed both the TMS34010 and its successor TMS34020 graphics chips back in the early 1990's and they had a flat address space and were bit addressable i.e. addresses indexed each bit. This was very useful for computer graphics of the time and back when memory was a lot more precious.
The embedded C compiler didn't really have a way to access individual bits directly, since from a (standard) C language point of view the byte was still the smallest unit, as pointed out in a previous post.
Thus if you want to read/write a stream of bits in C, you need to read/write (at least) a byte at a time and buffer them (for example when writing an arithmetic or Huffman coder).
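A minimal sketch of that kind of byte-buffered bit writer (a hypothetical class, MSB-first bit order assumed):

#include <cstdint>
#include <vector>

// Accumulates individual bits and flushes them to a byte vector, MSB first.
class BitWriter {
public:
    void put_bit(bool bit) {
        current_ = static_cast<std::uint8_t>((current_ << 1) | (bit ? 1 : 0));
        if (++count_ == 8) {            // a full byte is ready
            bytes_.push_back(current_);
            current_ = 0;
            count_ = 0;
        }
    }
    void flush() {                      // pad the last partial byte with zero bits
        while (count_ != 0) put_bit(false);
    }
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }
private:
    std::vector<std::uint8_t> bytes_;
    std::uint8_t current_ = 0;
    int count_ = 0;
};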
(Thank you everyone who commented and answered, every word helps)
Memory model of a programming language and memory model of the target machine are different things.
Yes, byte is the minimum addressable unit in the context of the memory model of the programming language.
No, byte is not the minimum addressable unit in the context of the memory model of the machine. For example, there are machines where the "byte" of the programming language is longer or shorter than the machine's minimum addressable unit:
longer: HP Saturn - 4-bit unit vs 8-bit byte in gcc (thanks Nate)
shorter: IBM 360 - 36-bit unit vs 6-bit byte (thanks Antti)
longer: Intel 8051 - 1-bit unit vs 8-bit byte (thanks Busybee)
longer: TI TMS34010 - 1-bit unit vs 8-bit byte (thanks Wcochran)

Gcc extension or macro to check the bits used for some fundamental types at compile time

At compile time, using some static_asserts, I would like to check the size in bits of some simple type like unsigned int or char; the important thing is that the check is guaranteed to happen at compile time given my usage.
I haven't found anything about this in the gcc manual, nor do I have any knowledge of a similar feature offered by clang; does anyone know how to check the number of bits used by a type?
No sizeof please, my focus is on the bits and compile time.
No sizeof please, my focus is on the bits and compile time.
Keep an open mind ;-P
#include <climits>   // CHAR_BIT is declared here, not in <cstdint>
static_assert(sizeof(X) * CHAR_BIT == 32, "type X must be 32 bits in size");
1. How to find the number of bits in a type without using the CHAR_BIT macro
If the type is a numeric type (like int and char), you can get the number of significant bits using std::numeric_limits<T>::digits, assuming that T is a binary type (that is, that std::numeric_limits<T>::radix == 2). Those are constexpr so they can be used in static_assert.
It is possible that the implementation is not capable of using all of the stored bits in some numeric type (other than char), in which case the number of significant digits may not relate to the physical size in bits. Also, the sign bit doesn't count, so you need to add std::numeric_limits<T>::is_signed to get the number of non-padding bits.
Since char types are not allowed to have padding and char, signed char and unsigned char are required to be exactly the same size, std::numeric_limits<unsigned char>::digits must be the number of bits in a char, otherwise known as the required macro CHAR_BIT. So you could use the two expressions interchangeably, and consequently the bit-size (physical, not meaningful) of any type T will be sizeof(T)*std::numeric_limits<unsigned char>::digits.
I don't believe that the compiler itself needs to know what the bitsize of char is (although most compilers probably do). It does need to know what sizeof(T) is for every primitive type. There is no standard-mandated way of figuring out what the value of std::numeric_limits<unsigned char>::digits is without including some header file.
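Putting the pieces of section 1 together (a sketch; the variable names are just illustrative):

#include <limits>

// Meaningful bits in unsigned int: value bits plus a sign bit if there is one (there isn't here).
constexpr int uint_bits =
    std::numeric_limits<unsigned int>::digits + std::numeric_limits<unsigned int>::is_signed;

// Equivalent to CHAR_BIT, but obtained without <climits>.
constexpr int char_bits = std::numeric_limits<unsigned char>::digits;

// Physical size of unsigned int in bits; may exceed uint_bits if there are padding bits.
constexpr int uint_storage_bits = sizeof(unsigned int) * char_bits;

static_assert(uint_bits >= 16, "unsigned int has at least 16 value bits");
static_assert(uint_storage_bits >= uint_bits, "storage is at least as wide as the value bits");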
2. Why you shouldn't worry about it.
In a freestanding environment, <numeric_limits> is not required, but <climits> still is, so you can count on the CHAR_BIT even in a freestanding environment, while you can only count on std::numeric_limits<unsigned char>::digits in a hosted environment.
In other words, the compiler is obliged to have some way of providing the results of #include <climits>, because that header is required by the standard even in freestanding environments (that is, environments without a standard library or even operating system). That's the "built-in" you are looking for; even if you don't provide <climits> in your standard library implementation, and even if you don't have a standard library handy, the compiler must still arrange for the macro CHAR_BIT to be correctly defined following the occurrence of #include <climits>. How it does that is up to the compiler; <climits> does not have to be an actual file.
Notes
None of the above will work with C, but then neither will static_assert so I am assuming that tagging this question as C was an oversight. As #mafso points out in a comment, C11 does have a static_assert declaration, but it only works with C11-style constant expressions, not C++-style constant expressions. C++ constant expressions can use things like constexpr functions, which might be built-in. C constant expressions, on the other hand, can only involve integer literals. They are still useful (for non-purists) because macro expansion happens first, and the macro can expand to an integer literal (or even an expression involving several integer literals).
According to this document, the gnu compiler will define these macros for you:
__CHAR_BIT__ // bits
__SIZEOF_INT__ // bytes
__SIZEOF_LONG__
__SIZEOF_LONG_LONG__
etc...
You can define your own Bit macros from the Byte macros by just multiplying by 8.
Edit: Since you apparently need to know the "word size" and consider pointers to be the same size as a "word", then use this:
__SIZEOF_POINTER__
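For instance (assuming a GCC-compatible compiler that predefines these macros, and a target where these particular widths are what you expect):

static_assert(__CHAR_BIT__ == 8, "expecting 8-bit bytes");
static_assert(__SIZEOF_INT__ * __CHAR_BIT__ == 32, "expecting a 32-bit int");
static_assert(__SIZEOF_POINTER__ * __CHAR_BIT__ == 64, "expecting 64-bit pointers (the 'word size' here)");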

converting size_t into long, Is there any disadvantage?

Is there any disadvantage of converting size_t to long? I am writing a program that maintains a linked list in a file, so I traverse to another node based on a size_t offset and I also keep track of the total number of lists as a size_t. Hence there is obviously going to be some conversion or addition of long and size_t. Is there any disadvantage to this? If there is, then I will make everything long instead of size_t, even the sizes. Please advise.
The "long" type, unfortunately, doesn't have a good theoretical basis. Originally it was introduced on 32 bit unix ports to differentiate it from the 16 bit "int" assumed by the existing PDP11 software. Then later "int" was changed to 32 bits on those platforms (and "short" was introduced) and "long" and "int" became synonyms, which they were for a very long time.
Now, on 64 bit unix-like platforms (Linux, the BSDs, OS X, iOS and whatever proprietary unixes people might still care about) "long" is a 64 bit quantity. But, sadly, not on windows: there was too much legacy "code" in the existing headers that made the sizeof(int)==sizeof(long) assumption, so they went with an abomination called "LLP64" and left long as 32 bits. Sigh.
But "size_t" isn't like that. It has always meant precisely one thing: it's the unsigned type that stores the native pointer size in the address space. If you have an unsigned (! -- use ssize_t or ptrdiff_t if you need signed arithmetic) pointer that needs an integer representation (i.e. you need to store the memory size of an object), this is what you use.
It's not a problem now, but it may be in the future depending on where you'll port your app. That's because size_t is defined to be large enough to store offsets of pointers, so if you have a 64-bit pointer, size_t will be 64 bits too. Now, long may or may not be 64 bits, because the size rules for fundamental types in C/C++ give room to some variations.
But if you're to write these values to a file, you have to choose a specific size anyway, so there's no option other than convert to long (or long long, if needed). Better yet, use one of the new size-specific types like int32_t.
My advice: somewhere in the header of your file, store the sizeof for the type you converted the size_t to. By doing that, if in the future you decide to use a larger one, you can still support the old size. And for the current version of the program, you can check if the size is supported or not, and issue an error if not.
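A minimal sketch of that advice (the header layout and names are made up for illustration; uint64_t is chosen as the on-disk width):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical file header: record the width used for offsets/counts so a
// future version of the program can still recognise and read old files.
struct FileHeader {
    std::uint8_t  offset_width;   // sizeof of the on-disk offset type, in bytes
    std::uint64_t node_count;     // count stored in a fixed 64-bit field
};

bool write_header(std::FILE *f, std::size_t node_count) {
    FileHeader h;
    h.offset_width = sizeof(std::uint64_t);
    h.node_count   = static_cast<std::uint64_t>(node_count);  // explicit, lossless conversion
    // Note: writing the struct directly also writes padding bytes; a real format
    // would serialise each field separately (and pick one endianness).
    return std::fwrite(&h, sizeof h, 1, f) == 1;
}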
Is there any disadvantage of converting size_t to long?
Theoretically long can be smaller than size_t. Also, long is signed. size_t is unsigned. So if you start using them both in same expression, compiler like g++ will complain about it. A lot. Theoretically it might lead to unexpected errors due to signed-to-unsigned assignments.
obviously there is going to be some conversion or addition of long
I don't see why there's supposed to be some conversion or addition to long. You can keep using size_t for all arithmetic operations. You can typedef it as "ListIndex" or whatever and keep using it throughout the code. If you mix types (long and size_t), g++/mingw will nag you to death about it.
Alternatively, you could select a specific type which has a guaranteed size. Newer compilers have the cstdint header, which includes types like uint64_t (it is extremely unlikely that you will encounter a file larger than 2^64 bytes, for example). If your compiler doesn't have the header, it should be available in boost.

Why is uint_8 etc. used in C/C++?

I've seen some code where they don't use primitive types int, float, double etc. directly.
They usually typedef it and use it or use things like
uint_8 etc.
Is it really necessary even these days? Or is C/C++ standardized enough that it is preferable to use int, float, etc. directly?
Because the types like char, short, int, long, and so forth, are ambiguous: they depend on the underlying hardware. Back in the days when C was basically considered an assembler language for people in a hurry, this was okay. Now, in order to write programs that are portable -- which means "programs that mean the same thing on any machine" -- people have built special libraries of typedefs and #defines that allow them to make machine-independent definitions.
The secret code is really quite straight-forward. Here, you have uint_8, which is interpreted
u for unsigned
int to say it's treated as a number
_8 for the size in bits.
In other words, this is an unsigned integer with 8 bits (minimum) or what we used to call, in the mists of C history, an "unsigned char".
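To illustrate the naming scheme with the standard <cstdint> spellings (note that the standard names are uint8_t, uint32_t and so on, not uint_8, and that uint8_t is optional: it exists only on platforms with 8-bit bytes):

#include <cstdint>
#include <type_traits>

static_assert(std::is_unsigned<std::uint8_t>::value, "u   -> unsigned");
static_assert(std::is_integral<std::uint8_t>::value, "int -> an integer type");
static_assert(sizeof(std::uint8_t) == 1,             "8   -> exactly 8 bits, i.e. one byte where the type exists");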
uint8_t is rather useless, because due to other requirements in the standard, it exists if and only if unsigned char is 8-bit, in which case you could just use unsigned char. The others, however, are extremely useful. int is (and will probably always be) 32-bit on most modern platforms, but on some ancient stuff it's 16-bit, and on a few rare early 64-bit systems, int is 64-bit. It could also of course be various odd sizes on DSPs.
If you want a 32-bit type, use int32_t or uint32_t, and so on. It's a lot cleaner and easier than all the nasty legacy hacks of detecting the sizes of types and trying to use the right one yourself...
Most code I read, and write, uses the fixed-size typedefs only when the size is an important assumption in the code.
For example if you're parsing a binary protocol that has two 32-bit fields, you should use a typedef guaranteed to be 32-bit, if only as documentation.
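A sketch of that binary-protocol case (the field names are invented; it assumes the wire format matches the host's endianness):

#include <cstdint>
#include <cstring>

// Two 32-bit fields, exactly as the (hypothetical) protocol specifies.
struct PacketHeader {
    std::uint32_t sequence;
    std::uint32_t payload_length;
};

PacketHeader parse_header(const unsigned char *buf) {
    PacketHeader h;
    // memcpy sidesteps alignment and aliasing issues when reading from a raw buffer.
    std::memcpy(&h.sequence,       buf,     sizeof h.sequence);
    std::memcpy(&h.payload_length, buf + 4, sizeof h.payload_length);
    return h;
}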
I'd only use int16 or int64 when the size must be that, say for a binary protocol or to avoid overflow or keep a struct small. Otherwise just use int.
If you're just doing "int i" to use i in a for loop, then I would not write "int32" for that. I would never expect any "typical" (meaning "not weird embedded firmware") C/C++ code to see a 16-bit "int," and the vast majority of C/C++ code out there would implode if faced with 16-bit ints. So if you start to care about "int" being 16 bit, either you're writing code that cares about weird embedded firmware stuff, or you're sort of a language pedant. Just assume "int" is the best int for the platform at hand and don't type extra noise in your code.
The sizes of types in C are not particularly well standardized. 64-bit integers are one example: a 64-bit integer could be long long, __int64, or even int on some systems. To get better portability, C99 introduced the <stdint.h> header, which has types like int32_t to get a signed type that is exactly 32 bits; many programs had their own, similar sets of typedefs before that.
C and C++ purposefully don't define the exact size of an int. This is because of a number of reasons, but that's not important in considering this problem.
Since int isn't set to a standard size, those who want a standard size must do a bit of work to guarantee a certain number of bits. The code that defines uint_8 does that work, and without it (or a technique like it) you wouldn't have a means of defining an unsigned 8 bit number.
The width of primitive types often depends on the system, not just the C++ standard or compiler. If you want true consistency across platforms when you're doing scientific computing, for example, you should use the specific uint_8 or whatever so that the same errors (or precision errors for floats) appear on different machines, so that the memory overhead is the same, etc.
C and C++ don't restrict the exact size of the numeric types, the standards only specify a minimum range of values that has to be represented. This means that int can be larger than you expect.
The reason for this is that often a particular architecture will have a size for which arithmetic works faster than other sizes. Allowing the implementor to use this size for int and not forcing it to use a narrower type may make arithmetic with ints faster.
This isn't going to go away any time soon. Even once servers and desktops are all fully transitioned to 64-bit platforms, mobile and embedded platforms may well be operating with a different integer size. Apart from anything else, you don't know what architectures might be released in the future. If you want your code to be portable, you have to use a fixed-size typedef anywhere that the type size is important to you.

Why the sizeof(bool) is not defined to be one, by the Standard itself?

The size of char, signed char and unsigned char is defined to be 1 byte by the C++ Standard itself. I'm wondering why it didn't define sizeof(bool) as well?
C++03 Standard §5.3.3/1 says,
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1; the result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. [Note: in particular, sizeof(bool) and sizeof(wchar_t) are implementation-defined.69)]
I understand the rationale that sizeof(bool) cannot be less than one byte. But is there any rationale why it should be greater than 1 byte either? I'm not saying that implementations define it to be greater than 1, but the Standard left it to be defined by implementation as if it may be greater than 1.
If there is no reason for sizeof(bool) to be greater than 1, then I don't understand why the Standard didn't define it as just 1 byte, as it has done for sizeof(char) and all its variants.
The other likely size for it is that of int, being the "efficient" integer type for the platform.
On architectures where it makes any difference whether the implementation chooses 1 or sizeof(int) there could be a trade-off between size (but if you're happy to waste 7 bits per bool, why shouldn't you be happy to waste 31? Use bitfields when size matters) vs. performance (but when is storing and loading bool values going to be a genuine performance issue? Use int explicitly when speed matters). So implementation flexibility wins - if for some reason 1 would be atrocious in terms of performance or code size, it can avoid it.
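For example, when size genuinely matters you can pack the flags yourself instead of relying on sizeof(bool) (a sketch; bit-field layout is implementation-defined, but this fits in one byte on common ABIs):

#include <cstdint>

// Eight independent boolean flags packed into a single byte, regardless of sizeof(bool).
struct Flags {
    std::uint8_t visible  : 1;
    std::uint8_t dirty    : 1;
    std::uint8_t selected : 1;
    std::uint8_t unused   : 5;
};

static_assert(sizeof(Flags) == 1, "holds on common ABIs; bit-field allocation is implementation-defined");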
As #MSalters pointed out, some platforms work more efficiently with larger data items.
Many "RISC" CPUs (e.g., MIPS, PowerPC, early versions of the Alpha) have/had a considerably more difficult time working with data smaller than one word, so they do the same. IIRC, with at least some compilers on the Alpha a bool actually occupied 64 bits.
gcc for PowerPC Macs defaulted to using 4 bytes for a bool, but had a switch to change that to one byte if you wanted to.
Even for the x86, there's some advantage to using a 32-bit data item. gcc for the x86 has (or at least used to have -- I haven't looked recently at all) a define in one of its configuration files for BOOL_TYPE_SIZE (going from memory, so I could have that name a little wrong) that you could set to 1 or 4, and then re-compile the compiler to get a bool of that size.
Edit: As for the reason behind this, I'd say it's a simple reflection of a basic philosophy of C and C++: leave as much room for the implementation to optimize/customize its behavior as reasonable. Require specific behavior only when/if there's an obvious, tangible benefit, and unlikely to be any major liability, especially if the change would make it substantially more difficult to support C++ on some particular platform (though, of course, if the platform is sufficiently obscure, it might get ignored).
Many platforms cannot effectively load values smaller than 32 bits. They have to load 32 bits, and use a shift-and-mask operation to extract 8 bits. You wouldn't want this for single bools, but it's OK for strings.
The result of sizeof is measured in MADUs (minimal addressable units), not octets. On the Texas Instruments C54x and C55x processor families, 1 MADU = 16 bits (two octets).
For these platforms, sizeof(bool) = sizeof(char) = 1 MADU = 16 bits.
This does not violate the C++ standard, but it clarifies the situation.