Reliably determine the size of char

I was wondering how to reliably determine the size of a character in a portable way. AFAIK sizeof(char) cannot be used because it always yields 1, even on systems where a byte has 16 bits, or more, or fewer.
For example, when dealing with bits, you need to know exactly how big a character is. I was wondering whether this code would give the real size of a character, independent of what the compiler thinks of it. IMO the pointer has to be increased by the compiler to the correct size, so we should have the correct value. Am I right about this, or might there be some hidden problem with pointer arithmetic that would also yield wrong results on some systems?
int sizeOfChar()
{
char *p = 0;
p++;
int size_of_char = (int)p;
return size_of_char;
}

There's a CHAR_BIT macro defined in <limits.h> that evaluates to exactly what its name suggests.
IMO the pointer has to be increased by the compiler to the correct size, so we should have the correct value
No, because pointer arithmetic is defined in terms of sizeof(T) (the pointer target type), and the sizeof operator yields the size in bytes. char is always exactly one byte long, so your code will always yield the NULL pointer plus one (which may not be the numerical value 1, since NULL is not required to be 0).

I think it's not clear what you consider to be "right" (or "reliable", as in the title).
Do you consider "a byte is 8 bits" to be the right answer? If so, for a platform where CHAR_BIT is 16, then you would of course get your answer by just computing:
const int octets_per_char = CHAR_BIT / 8;
No need to do pointer trickery. Also, the trickery is tricky:
On an architecture with 16 bits as the smallest addressable piece of memory, there would be 16 bits at address 0x0000 [1], another 16 bits at address 0x0001, and so on.
So, your example would compute the result 1, since the pointer would likely be incremented from 0x0000 to 0x0001, but that doesn't seem to be what you expect it to compute.
[1] I use a 16-bit address space for brevity; it makes the addresses easier to read.

The size of one char (aka byte) in bits is determined by the macro CHAR_BIT in <limits.h> (or <climits> in C++).
The sizeof operator always returns the size of a type in bytes, not in bits.
So if on some system CHAR_BIT is 16 and sizeof(int) is 4, that means an int has 64 bits on that system.
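As a minimal sketch of that relationship, the storage size of a type in bits can be computed as sizeof(T) * CHAR_BIT. The helper name bits_in is made up for illustration and does not come from the answers above; the result counts every storage bit, including any padding bits.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// Hypothetical helper: storage size of T in bits.
// sizeof counts bytes, and CHAR_BIT is the number of bits in one byte.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}
```

On the hypothetical system from the answer (CHAR_BIT of 16, sizeof(int) of 4), bits_in<int>() would evaluate to 64.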

Related

Typecasting char to long

Say I have a variable, a
char a = 0x01;
and I want to cast this to a long, as in
long b;
b = (long)a;
Will the upper 3 bytes in b be guaranteed to be 0? With my setup they are 0, but I'm not sure if this is compiler-dependent.

Yes, b is guaranteed to have the value 0x1 after this assignment, even without the cast. The assignment operator in C++ is generally semantic, or value driven: it copies the value or state rather than performing a bitwise copy (even if the two are sometimes equivalent, such as for trivial types).
In some cases, especially because of operator overloading, this may not hold. Developers are very strongly encouraged to keep to this concept when they design new types, but a careless programmer could overload the assignment operator for non-fundamental types to do anything they want.

As a long can represent all values for a char (be it signed or unsigned) the conversion is guaranteed to not change the value.
If you initially have a non-negative value, either because char is unsigned on your architecture or because the char value is between 0 and 127 (assuming 8-bit characters), the resulting long is guaranteed to be non-negative and less than 256. So on an architecture where long is 4 bytes large, the 3 high-order bytes are guaranteed to be 0.
If char is signed and the initial value is negative, things will be different! The value will be unchanged and will still be negative. On a common two's complement architecture, each of the 3 high-order bytes will be 0xFF.
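The two cases above can be sketched as follows. The helper names widen and high_bits_zero are hypothetical, and the sketch assumes 8-bit characters, as the answer does.

```cpp
// Hypothetical helpers for illustration; the names are not from the answers.

// Widen a signed char to long: the value is preserved, so negative stays negative.
constexpr long widen(signed char c) { return c; }

// Widen an unsigned char to long: the value is preserved and is never negative.
constexpr long widen(unsigned char c) { return c; }

// True if every bit above the low 8 bits is zero. Converting to unsigned long
// first is well defined (modulo 2^N), so the check does not depend on how
// negative numbers are represented.
constexpr bool high_bits_zero(long v) {
    return (static_cast<unsigned long>(v) & ~0xFFul) == 0;
}
```

For a non-negative input the high bits come out as zero; for a negative input the value is still preserved, but the high bits are not zero.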

The answer already given is right, but I thought I'd add that for C++, it is recommended to use one of the C++-specific casting notations, to make it abundantly clear what you are doing. Here, you would use:
long b;
b = static_cast<long>(a);
This makes it very clear what you are doing (a conversion to long, with the kind of conversion resolved at compile time), and you know that the "right" sort of cast will be performed.

char a = 0x01;
long b;
b = (long)a;
C and C++ are two different (but closely related) languages. Their rules happen to be the same in this case.
The cast (not "typecast") is not necessary. The assignment could, and probably should, be written as:
b = a;
which causes an implicit conversion from char to long. Since the value being converted is within the representable range of type long, the result of the conversion is 1. The result of the conversion is specified in terms of values, not representations.
The representation of the value 1 in type long probably has a 1 in the low-order bit, and 0s in all the other bits. (And the position of the low-order bit can vary; some systems are big-endian, some are little-endian, and there are other possibilities.)
There is no guarantee that type long even has three high-order bytes. Type long is at least 32 bits wide, but a byte can be wider than 8 bits. It's even possible that there are values of type char that exceed LONG_MAX (if plain char is unsigned and long is 1 byte, which implies CHAR_BIT >= 32).
It's also possible that the representation of type long includes padding bits, bits that do not contribute to the value. It's guaranteed that the sign bit is 0, the low-order value bit is 1, and all other value bits are 0, but if there are padding bits their values are not guaranteed. (Some combinations of padding bits can result in a trap representation that does not represent any value, but that can't happen in this particular case.)
Most of these exotic possibilities are very unlikely to occur in real life. C implementations for some DSPs do have bytes wider than 8 bits, but any system you're using almost certainly has 8-bit bytes.
The point is that the result of the conversion is defined in terms of values, not representations, and 99% of the time that's all you need to care about. If you write:
char a = 1; /* same as 0x01 */
long b = a;
printf("b = %ld\n", b);
it will print b = 1, even if you're using some exotic system where the value 1 is represented strangely.

b will be 1; this is always, compiler and endianness-independent, true. Additionally, the following expressions will be true:
b == 1
b == 01
b == 0x1
b == 0x00000001
b == 0x00000000000000000000000000000000000000000000000000001
The right hand side in all cases is an int constant with the value 1; not more, not less. Note that the zeroes do not represent bytes in memory (an int most likely does not have the number of bytes the last expression appears to suggest). The hexadecimal notation is just another way to write down a 1, exactly like 1.
In particular, we don't know where in memory the byte with the value 1 is located, because that is architecture dependent. It may be the one at the address of the int, or it may be the other end, or even in between.
Now comes the sweet thing: C does not care how the memory in an int is laid out. None of the ways to write an integer constant is architecture dependent. That seems self-understood with decimal constants — did we expect that the meaning of int i = 1 is architecture dependent? Certainly not. Nor is int i = 0x00000001;. The same is true for the bit shift operators: << shifts towards more significant bits, >> towards less significant bits. The digits in (decimal or hexadecimal) integer constants are ordered so that the most significant digits are on the left side, aligning with the "direction" indicated by the arrow-like bit shift operators. That may or may not reflect your machine's int representation; on a PC it does not.
Bottom line: If you use the standard C (or C++) means to test the "upper 3 bytes", you are home free, and the following is always true, independent of the implementation or architecture:
char a = 0x01;
long b = a;
(b & 0xFF) == 1 // least significant byte is 1
(b & 0x000000FF) == 1 // exactly the same as above
(b & 0xFFFFFF00) == 0 // more significant three bytes are all 0
It's possible that your long has more bits, but that is implementation dependent. However many more there are, they are all zero, save for the least significant one.

Can I rely on sizeof(uint32_t) == 4?

I know I can rely on sizeof(char) == 1, but what about sizeof(uint32_t) and sizeof(uint8_t)?
Shouldn't it be 32bit (8bit) in size guessing from the name?
Thanks!

The fixed-sized types will always be exactly that size. If you're on some weird platform that doesn't have integer types of that size then they will be undefined.
Note that it doesn't necessarily follow that sizeof(uint32_t) == 4, if CHAR_BIT != 8; again this only occurs on weird platforms.

Absolutely not, at least with regards to sizeof(uint32_t).
sizeof returns the number of bytes, not the number of bits,
and if a byte is 16 bits on the platform, sizeof(uint32_t)
will be 2, not 4; if a byte is 32 bits (and such platforms
actually exist), sizeof(uint32_t) will be 1 (and uint32_t
could be a typedef to unsigned char).
Of course, in such cases, uint8_t won't be defined.

No, it depends entirely on your compiler and architecture (for example, the result differs on platforms that lack integer types of those sizes). If CHAR_BIT == 8, then yes, you can rely on it. Separately, on a 64-bit architecture data tends to be aligned to 64-bit boundaries, just as on a 32-bit architecture it tends to be aligned to 4 bytes.
So if the data size is critical (e.g. a set of bit flags), always use the fixed-size types.
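A minimal sketch of how the answers above fit together, assuming the platform provides uint32_t at all: the type always has exactly 32 bits, so its size in bytes is 32 / CHAR_BIT, which is 4 only when bytes have 8 bits. The function name expected_uint32_bytes is made up for illustration.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint32_t

// Hypothetical helper: the number of bytes a 32-bit type occupies here.
constexpr std::size_t expected_uint32_bytes() {
    return 32 / CHAR_BIT;
}

// uint32_t has exactly 32 bits and no padding bits wherever it exists.
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32,
              "uint32_t is exactly 32 bits wide");

// Assumption for this sketch: the code only targets platforms with 8-bit bytes.
// With this check in place, relying on sizeof(std::uint32_t) == 4 is safe.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
```

Stating the 8-bit assumption with static_assert makes the build fail on an unusual platform instead of silently producing wrong sizes.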

Char vs unsigned char for byte arrays

When storing "byte arrays" (blobs...), is it better to use char or unsigned char for the items (unsigned char a.k.a. uint8_t)? (The standard says that sizeof of both is precisely 1 byte.)
Does it matter at all? Or is one more convenient or prevalent than the other? For instance, what do libraries like Boost use?

If char is signed, then performing arithmetic on a byte value with the high bit set will result in sign extension when promoting to int; so, for example:
char c = '\xf0';
int res = (c << 24) | (c << 16) | (c << 8) | c;
will give 0xfffffff0 instead of 0xf0f0f0f0. This can be avoided by masking with 0xff.
char may still be preferable if you're interfacing with libraries that use it instead of unsigned char.
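A sketch of the unsigned alternative to the example above, assuming 8-bit bytes and a platform that provides uint32_t. The function name replicate_byte is hypothetical.

```cpp
#include <cstdint>  // std::uint32_t

// Hypothetical helper: copy one byte into all four bytes of a 32-bit value.
// Taking an unsigned char and widening it to uint32_t before shifting avoids
// both the sign extension and the shift of a negative value.
constexpr std::uint32_t replicate_byte(unsigned char c) {
    return (static_cast<std::uint32_t>(c) << 24) |
           (static_cast<std::uint32_t>(c) << 16) |
           (static_cast<std::uint32_t>(c) << 8) |
           static_cast<std::uint32_t>(c);
}
```

With the byte 0xf0 this yields 0xf0f0f0f0, the value the signed version fails to produce.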
Note that a cast from char * to/from unsigned char * is always safe (3.9p2). A philosophical reason to favour unsigned char is that 3.9p4 in the standard favours it, at least for representing byte arrays that could hold memory representations of objects:
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).

Theoretically, the size of a byte in C++ is dependent on the compiler settings and target platform, but it is guaranteed to be at least 8 bits, which explains why sizeof(uint8_t) is required to be 1 (where uint8_t exists).
Here's more precisely what the standard has to say about it:
§1.7/1
The fundamental storage unit in the C++ memory model is the byte. A
byte is at least large enough to contain any member of the basic
execution character set (2.3) and the eight-bit code units of the
Unicode UTF-8 encoding form and is composed of a contiguous sequence
of bits, the number of which is implementation-defined. The least
significant bit is called the low-order bit; the most significant bit
is called the high-order bit. The memory available to a C++ program
consists of one or more sequences of contiguous bytes. Every byte has
a unique address.
So, if you are working on some special hardware where bytes are not 8 bits, it may make a practical difference. Otherwise, I'd say that it's a matter of taste and what information you want to communicate via the choice of type.

One of the other problems with potentially using a signed value for blobs is that the value will depend on the sign representation, which is not part of the standard. So, it's easier to invoke undefined behavior.
For example...
signed char x = 0x80;
int y = 0xffff00ff;
y |= (x << 8); // UB
The actual arithmetic value would also strictly depend on two's complement, which may give some people surprises. Using unsigned explicitly avoids these problems.

It makes no practical difference, although from a readability point of view it may be clearer if the type is unsigned char, implying values 0..255.

Is sizeof(int) always equal to sizeof(void*) [duplicate]

I was wondering whether it is guaranteed that, in both 32-bit and 64-bit systems, sizeof(int) is always equal to sizeof(void*) (i.e. 32 and 64 bits respectively).
Additionally, I need to know whether it is always guaranteed that a long int can accommodate the bits of an int and a void* together, e.g.
long int lint = (((int)integer)<<sizeof(int)) | (void*)ptr;

I was wondering whether it is guaranteed that, in both 32-bit and 64-bit systems, sizeof(int) is always equal to sizeof(void*)
No.
I need to know whether it is always guaranteed that a long int can accommodate the bits of an int and a void* together
No. A quick proof is to consider that sizeof(long int) == sizeof(int) on many modern platforms, probably including the one you're using.
The more important question is why you think you "need to know" this; the fact that you're asking such questions makes me concerned that your code is likely to be ... wobbly.

The size of an int is implementation dependent and though it may turn out to be equal to the size of a pointer in many systems, there is no guarantee.
If you decide you need code to depend on this, you can include something such as:
if (sizeof(int) != sizeof(void *))
{
fprintf(stderr, "ERROR: size assumptions are invalid; this program cannot continue.\n");
exit(-1);
}
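The same kind of check can also be made at compile time, so that the build fails instead of the program exiting at run time. A minimal sketch, where the helper name same_size is an assumption for illustration:

```cpp
// Hypothetical helper: true when two types occupy the same number of bytes.
template <typename A, typename B>
constexpr bool same_size() {
    return sizeof(A) == sizeof(B);
}
```

Writing static_assert(same_size<int, void*>(), "size assumptions are invalid"); in the program would then reject any platform where the two sizes differ, which includes typical 64-bit systems where int is 4 bytes and void* is 8.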

I was wondering whether it is guaranteed that, in both 32-bit and 64-bit systems, sizeof(int) is always equal to sizeof(void*) (i.e. 32 and 64 bits respectively).
No.
Additionally, I need to know whether it is always guaranteed that a long int can accommodate the bits of an int and a void* together
No.
What you are looking for is: std::intptr_t
sizeof(std::intptr_t) == sizeof(void*)
std::intptr_t is defined as an integer of a size sufficient to hold a pointer.
Technically it's an optional part of the standard.
But you can usually find it in the header file <cstdint>. See: 18.4.1 Header <cstdint> synopsis [cstdint.syn]
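A minimal sketch of using std::intptr_t to carry a pointer through an integer, assuming the platform provides the type (it is optional). The function name through_integer is hypothetical.

```cpp
#include <cassert>  // assert
#include <cstdint>  // std::intptr_t

// Hypothetical helper: convert a pointer to an integer and back again.
// A valid void* converted to std::intptr_t and back compares equal to the original.
inline void* through_integer(void* p) {
    std::intptr_t as_integer = reinterpret_cast<std::intptr_t>(p);
    void* back = reinterpret_cast<void*>(as_integer);
    assert(back == p);  // the round trip preserves the pointer value
    return back;
}
```

Unlike int or long, this type is chosen by the implementation specifically so that the round trip works.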

The C99 standard provides a <stdint.h> header defining an intptr_t integer type that is able to hold a pointer to void converted to it and back (the type is optional, but widely available).
On my Debian/AMD64/Sid, sizeof(int) is 4 bytes, but sizeof(void*) and sizeof(intptr_t) and sizeof(long) are all 8 bytes.

Understanding sizeof(char) in 32 bit C compilers

sizeof(char) always returns 1 in the 32-bit GCC compiler.
But since the basic block size in a 32-bit compiler is 4, how does char occupy a single byte when the basic size is 4 bytes?
Considering the following :
struct st
{
int a;
char c;
};
sizeof(st) returns 8, which agrees with the default block size of 4 bytes (since 2 blocks are allotted).
I can never understand why sizeof(char) returns 1 when it is allotted a block of size 4.
Can someone pls explain this???
I would be very thankful for any replies explaining it!!!
EDIT: The typo of 'bits' has been changed to 'bytes'. My apologies to the person who made the first edit; I rolled back the edit because I did not notice the change you made.
Thanks to all those who made it a point that It must be changed especially #Mike Burton for downvoting the question and to #jalf who seemed to jump to conclusions over my understanding of concepts!!

sizeof(char) is always 1. Always. The 'block size' you're talking about is just the native word size of the machine - usually the size that will result in most efficient operation. Your computer can still address each byte individually - that's what the sizeof operator is telling you about. When you do sizeof(int), it returns 4 to tell you that an int is 4 bytes on your machine. Likewise, your structure is 8 bytes long. There is no information from sizeof about how many bits there are in a byte.
The reason your structure is 8 bytes long rather than 5 (as you might expect), is that the compiler is adding padding to the structure in order to keep everything nicely aligned to that native word length, again for greater efficiency. Most compilers give you the option to pack a structure, either with a #pragma directive or some other compiler extension, in which case you can force your structure to take minimum size, regardless of your machine's word length.
char is size 1, since that's the smallest access size your computer can handle - for most machines an 8-bit value. The sizeof operator gives you the size of all other quantities in units of how many char objects would be the same size as whatever you asked about. The padding (see link below) is added by the compiler to your data structure for performance reasons, so it is larger in practice than you might think from just looking at the structure definition.
There is a wikipedia article called Data structure alignment which has a good explanation and examples.

This is structure alignment with padding: c uses 1 byte, and 3 bytes are unused.

Sample code demonstrating structure alignment:
struct st
{
int a;
char c;
};
struct stb
{
int a;
char c;
char d;
char e;
char f;
};
struct stc
{
int a;
char c;
char d;
char e;
char f;
char g;
};
std::cout<<sizeof(st) << std::endl; //8
std::cout<<sizeof(stb) << std::endl; //8
std::cout<<sizeof(stc) << std::endl; //12
The size of the struct is bigger than the sum of its individual components, since it was set to be divisible by 4 bytes by the 32 bit compiler. These results may be different on different compilers, especially if they are on a 64 bit compiler.

First of all, sizeof returns a number of bytes, not bits. sizeof(char) == 1 tells you that a char is one byte long, which is eight bits on most platforms. All of the fundamental data types in C are at least one byte long.
Your structure returns a size of 8. This is a sum of three things: the size of the int, the size of the char (which we know is 1), and the size of any extra padding that the compiler added to the structure. Since many implementations use a 4-byte int, this would imply that your compiler is adding 3 bytes of padding to your structure. Most likely this is added after the char in order to make the size of the structure a multiple of 4 (a 32-bit CPU accesses data most efficiently in 32-bit chunks, and 32 bits is four bytes).
Edit: Just because the block size is four bytes doesn't mean that a data type can't be smaller than four bytes. When the CPU loads a one-byte char into a 32-bit register, the value will be sign-extended automatically (by the hardware) to make it fill the register. The CPU is smart enough to handle data in N-byte increments (where N is a power of 2), as long as it isn't larger than the register. When storing the data on disk or in memory, there is no reason to store every char as four bytes. The char in your structure happened to look like it was four bytes long because of the padding added after it. If you changed your structure to have two char variables instead of one, you should see that the size of the structure is the same (you added an extra byte of data, and the compiler added one fewer byte of padding).

All object sizes in C and C++ are defined in terms of bytes, not bits. A byte is the smallest addressable unit of memory on the computer. A bit is a single binary digit, a 0 or a 1.
On most computers, a byte is 8 bits (so a byte can store 256 distinct values, for example 0 to 255), although computers exist with other byte sizes.
A memory address identifies a byte, even on 32-bit machines. Addresses N and N+1 point to two subsequent bytes.
An int, which is typically 32 bits, covers 4 bytes, meaning that 4 different memory addresses exist that each point to part of the int.
In a 32-bit machine, all the 32 actually means is that the CPU is designed to work efficiently with 32-bit values, and that an address is 32 bits long. It doesn't mean that memory can only be addressed in blocks of 32 bits.
The CPU can still address individual bytes, which is useful when dealing with chars, for example.
As for your example:
struct st
{
int a;
char c;
};
sizeof(st) returns 8 not because all structs have a size divisible by 4, but because of alignment. For the CPU to efficiently read an integer, it must be located at an address that is divisible by the size of the integer (4 bytes). So an int can be placed at address 8, 12 or 16, but not at address 11.
A char only requires its address to be divisible by the size of a char (1), so it can be placed on any address.
So in theory, the compiler could have given your struct a size of 5 bytes... Except that this wouldn't work if you created an array of st objects.
In an array, each object is placed immediately after the previous one, with no padding. So if the first object in the array is placed at an address divisible by 4, then the next object would be placed at a 5 bytes higher address, which would not be divisible by 4, and so the second struct in the array would not be properly aligned.
To solve this, the compiler inserts padding inside the struct, so its size becomes a multiple of its alignment requirement.
Not because it is impossible to create objects that don't have a size that is a multiple of 4, but because one of the members of your st struct requires 4-byte alignment, and so every time the compiler places an int in memory, it has to make sure it is placed at an address that is divisible by 4.
If you create a struct of two chars, it won't get a size of 4. It will usually get a size of 2, because when it contains only chars, the object can be placed at any address, and so alignment is not an issue.
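The alignment argument above can be checked with alignof. This is a sketch rather than a statement about any particular compiler: the struct two_chars and the constant st_padding are names introduced here for illustration, and the concrete numbers in the comments assume a 4-byte int.

```cpp
#include <cstddef>  // std::size_t

struct st { int a; char c; };          // the struct from the question
struct two_chars { char x; char y; };  // hypothetical struct with only char members

// Total padding the compiler added to st (typically 3 when int is 4 bytes).
constexpr std::size_t st_padding = sizeof(st) - sizeof(int) - sizeof(char);

// The size is a multiple of the alignment, so the second element of an
// array of st is aligned just like the first one.
static_assert(sizeof(st) % alignof(st) == 0, "size is a multiple of alignment");

// st must be at least as strictly aligned as its int member.
static_assert(alignof(st) >= alignof(int), "st inherits the alignment of int");
```

On a typical platform sizeof(two_chars) is 2, because nothing in that struct needs more than 1-byte alignment.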

sizeof returns the value in bytes. You were talking about bits. 32-bit architectures are word aligned and byte referenced. It is irrelevant how the architecture stores a char; to the compiler, you must reference chars 1 byte at a time, even if they use up less than 1 byte.
This is why sizeof(char) is 1.
ints are 32 bits there, hence sizeof(int) == 4; doubles are 64 bits, hence sizeof(double) == 8, etc.

Padding is added as an optimisation, so the size of an object is 1, 2 or n*4 bytes (or something like that, speaking of x86). That is why padding is added to a 5-byte object but not to a 1-byte one. A single char doesn't have to be padded; it can occupy 1 byte, and we can store it in space allocated with malloc(1). An st cannot be stored in space allocated with malloc(5), because when an st struct is copied, the whole 8 bytes are copied.

It works the same way as using half a piece of paper. You use one part for a char and the other part for something else. The compiler will hide this from you since loading and storing a char into a 32bit processor register depends on the processor.
Some processors have instructions to load and store only parts of the 32bit others have to use binary operations to extract the value of a char.
Addressing a char works because a char is, AFAIR, by definition the smallest addressable unit of memory. On a 32-bit system, pointers to two different ints will be at least 4 addresses apart, while char addresses can be only 1 apart.
