On my machine, the following program writes 1234 to its output.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const char str[] = "1234";
    printf("%c%c%c%c\n",
           (int) (0xff & (*(uint32_t*) str) >> 0),
           (int) (0xff & (*(uint32_t*) str) >> 8),
           (int) (0xff & (*(uint32_t*) str) >> 16),
           (int) (0xff & (*(uint32_t*) str) >> 24));
}
This implies that the four bytes of str, read as a uint32_t, give the value 0x34333231, and that the first byte str[0] supplies the least significant 8 bits.
Does this mean str is encoded in little endian? And is the output of this program platform-dependent?
Also, is there a convenient way to use 1-, 2-, 4- and 8-character string literals in switch case statements? I can't find any way to convert the strings to integers, as *(const uint32_t* const) "1234" is not a constant expression, and 0x34333231/0x31323334 might be platform-dependent and would have to be written out in hexadecimal.
edit:
In other words, is 0xff & *(uint32_t*) str always equal to str[0]?
Eh, never mind, just realized it is and also why.
You're confusing the endianness of a string (which doesn't exist so long as we're talking about ASCII strings) with the endianness of an integer. The integer on your system is little endian.
To answer your second question: no, you can't switch on strings. If you're really desperate for the speed, you could write one set of integer case constants for little-endian systems and another for big-endian systems.
Endianness refers to the order of bytes in a larger value. Strings are (at least in C and C++) arrays of bytes, so endianness doesn't apply.
You actually can do what you mention in the last paragraph using multicharacter literals, though it's implementation-defined exactly how they work, and the literal must be no longer than sizeof(int) characters.
C++ standard, §2.14.3/1 - Character literals
(...) An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.
For instance, 'abcd' is a value of type int with an implementation-defined value. This value probably would depend on endianness. Since it is an integer, you are allowed to switch on it.
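For instance, a minimal sketch of what such a switch could look like (the tag names here are made up, and the actual integer value of each literal is implementation-defined, so the cases only compare consistently within one compiler; most compilers also warn about multicharacter literals):

#include <cstdio>

// Dispatch on a four-character tag. The switch value and the case labels are all
// multicharacter literals, so they use the same implementation-defined mapping.
void dispatch(int tag) {
    switch (tag) {
    case 'GET ': std::puts("got GET");     break;  // note the padding space: four c-chars
    case 'POST': std::puts("got POST");    break;
    default:     std::puts("unknown tag"); break;
    }
}

int main() {
    dispatch('POST');  // prints "got POST"
}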
The bytes are laid out at increasing memory addresses as 0x31, 0x32, 0x33, 0x34.
If a 32-bit integer read from that address is little endian, you get 0x34333231; if big endian, 0x31323334.
(Also, in general, integers are aligned on even or 4-byte addresses, which is why the char*-to-uint32_t* cast in the question isn't guaranteed to be safe.)
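A small sketch of how you could check this on your own machine, using memcpy to sidestep the alignment and aliasing concerns of the pointer cast:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const char str[] = "1234";       // bytes 0x31 0x32 0x33 0x34 at increasing addresses
    std::uint32_t v;
    std::memcpy(&v, str, sizeof v);  // copy the four bytes into an integer

    if (v == 0x34333231u)
        std::puts("little endian: str[0] is the least significant byte");
    else if (v == 0x31323334u)
        std::puts("big endian: str[0] is the most significant byte");
    else
        std::puts("something more exotic");
}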
Related
I'm a bit confused about a sizeof result.
I have this:
unsigned long part1 = 0x0200000001;
cout << sizeof(part1); // this gives 8 bytes
If I count correctly, part1 is 9 bytes long, right?
Can anybody clarify this for me? Thanks.
If I count correctly, part1 is 9 bytes long, right?
No, you are counting incorrectly. 0x0200000001 can fit into five bytes. One byte is represented by two hex digits. Hence, the bytes are 02 00 00 00 01.
I suppose you misinterpret the meaning of sizeof. sizeof(type) returns the number of bytes that the system reserves to hold any value of the respective type. So sizeof(int) will typically give you 4, sizeof(long) is often 4 on a 32-bit system and 8 on a 64-bit system, sizeof(char[20]) gives 20, and so on.
Note that one can also use identifiers of (typed) variables, e.g. int x; sizeof(x); the type is then deduced from the variable's declaration/definition, so that sizeof(x) is the same as sizeof(int) in this case.
But: sizeof never interprets or analyses the content or value of a variable at runtime, even if its name somehow suggests that it does. So with const char *x = "Hello, world";, sizeof(x) is not the length of the string literal "Hello, world" but the size of the type char*.
So your sizeof(part1) is the same as sizeof(unsigned long), which is 8 on your system regardless of what the actual content of part1 is at runtime.
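A short illustration of that point; the sizes in the comments are what a typical 64-bit platform gives, not guarantees:

#include <cstring>
#include <iostream>

int main() {
    unsigned long part1 = 0x0200000001;
    const char *s = "Hello, world";

    std::cout << sizeof(part1)  << '\n';  // 8: size of unsigned long, independent of the stored value
    std::cout << sizeof(s)      << '\n';  // 8: size of the pointer char*, not of the string
    std::cout << std::strlen(s) << '\n';  // 12: the runtime length of the string it points to
}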
An unsigned long is only guaranteed a minimum range of 0 to 4294967295, which needs just 4 bytes.
Assigning 0x0200000001 (8589934593) to an unsigned long that's not big enough triggers a conversion so that the value fits. For an unsigned target type this conversion is well-defined: the value is reduced modulo 2^N, i.e. the higher bits are discarded.
sizeof will tell you the amount of bytes a type uses. It won't tell you how many bytes are occupied by your value.
sizeof(part1) (I'm assuming "part 1" is a typo) gives the size of an unsigned long (i.e. sizeof(unsigned long)). The size of part1 is therefore the same regardless of what value is stored in it.
On your compiler, sizeof(unsigned long) has a value of 8. The size is implementation-defined for all types (other than the char types, which are defined to have a size of 1), so it may vary between compilers.
The value of 9 you are expecting is the size of the output you would obtain by writing the value of part1 to a file or string as human-readable hex, with no leading zeros or prefix. That has no relationship to the sizeof operator whatsoever. And, when outputting a value, it is possible to format it in different ways (e.g. hex versus decimal versus octal, leading zeros or not) which affect the size of the output.
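For instance, here is roughly where a count of 9 could come from; this is purely about output formatting and has nothing to do with sizeof:

#include <iomanip>
#include <iostream>

int main() {
    unsigned long part1 = 0x0200000001;

    std::cout << std::hex << part1 << '\n';                            // "200000001": 9 characters
    std::cout << std::setw(10) << std::setfill('0') << part1 << '\n';  // "0200000001": 10 characters
    std::cout << std::dec << part1 << '\n';                            // "8589934593": 10 characters
}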
sizeof(part1) returns the size of the data type of variable part1, which you have defined as unsigned long.
For your compiler, unsigned long is always 64 bits, i.e. 8 bytes, which is 8 groups of 8 bits. Hexadecimal is a human-readable form of the binary representation, where each digit stands for 4 bits. We humans often omit leading zeroes for clarity; computers never do.
Let's consider a single byte of data - a char - and the value zero.
- decimal: 0
- hexadecimal: 0x0 (often written as 0x00)
- binary: 0000 0000
For a list of C++ data types and their corresponding bit sizes, check out the documentation (the MSVC documentation is easier to read):
for msvc: https://msdn.microsoft.com/en-us/library/s3f49ktz(v=vs.71).aspx
for gcc: https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Data-Types
All compilers have documentation for their data sizes, since they depend on the hardware and the compiler itself. If you use a different compiler, a Google search for "'your compiler name' data sizes" will help you find the correct sizes for your compiler.
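Alternatively, if you just want to know what your own compiler uses, a tiny program like this prints the sizes directly:

#include <iostream>

int main() {
    std::cout << "char:          " << sizeof(char)          << '\n'   // always 1 by definition
              << "short:         " << sizeof(short)         << '\n'
              << "int:           " << sizeof(int)           << '\n'
              << "long:          " << sizeof(long)          << '\n'
              << "long long:     " << sizeof(long long)     << '\n'
              << "unsigned long: " << sizeof(unsigned long) << '\n'
              << "void*:         " << sizeof(void*)         << '\n';
}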
When using the C++11 standard, is there any guarantee that an ASCII character stored in a char32_t or char16_t codepoint will be properly cast to char?
char32_t and char16_t are both defined to always be unsigned (http://en.cppreference.com/w/cpp/language/types). However, char may be signed or unsigned depending on the system.
I would assume that ASCII characters always work:
char32_t original = U'b';
char value = static_cast<char>(original);
However, what about values that are UTF-8 code units, whose first bit is 1 and which are extracted from the UTF-32 character using a bitmask during conversion, e.g.:
char32_t someUtf32CodeUnit = 0x00001EA9;
// Second (continuation) code unit of the UTF-8 encoding of ẩ (U+1EA9)
char extractedCodeUnit = static_cast<char>(((someUtf32CodeUnit >> 6) & 0x3F) | 0x80);
Is it guaranteed that the conversion on all systems will work the same way (resulting in the same expected bits of said UTF-8 code unit) or will the unsigned<->signed casts potentially make any difference?
EDIT:
As far as I know, C++ (including C++11) is agnostic about the encoding used for the char type. The only requirement (§3.9.1/1) is that a char must be able to store any character of the basic character set defined in §2.3. Therefore even ASCII characters outside that set, like @ or `, are not guaranteed to be stored in a char. Their code point values could obviously be stored, but the machine might interpret them as different glyphs (for functions like isalpha and similar).
Even if you are just interested in storing the values, in your example you try to static_cast an integer expression to a char. If your char is a signed type and the value is bigger than 127, the result of the conversion is implementation-defined; see this answer for details.
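If all you need is for the bit pattern of the code unit to end up in a char, one way to avoid relying on the implementation-defined out-of-range conversion is to copy the byte's object representation instead of converting its value. This is only a sketch, and the helper name is made up:

#include <cstring>

// Hypothetical helper: store the bit pattern of a byte in a char without a value conversion.
char to_char_bits(unsigned char byte) {
    char result;
    std::memcpy(&result, &byte, 1);  // copies the object representation verbatim
    return result;
}

char extract(char32_t cp) {
    // Second (continuation) code unit of a three-byte UTF-8 sequence.
    unsigned char byte = static_cast<unsigned char>(((cp >> 6) & 0x3F) | 0x80);
    return to_char_bits(byte);
}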
It just seems like "not of one mind" in the design here, because integer data and character data of 16 bits is now differentiable but integer and character data of 8 bits is not.
C++ has always offered only one choice for 8-bit values: char. The recognition of wchar_t as an official type, distinct from unsigned short, enables improvements, but only for wide-string users. It seems like this is not coordinated; the language acts differently for 8-bit and 16-bit values.
I think there is clear value in having more distinct types; having a distinct 8-bit char AND an 8-bit "byte" would be much nicer, e.g. for operator overloading. For example:
// This kind of sucks...
BYTE m = 59;  // This is really 'unsigned char' because there is no other option
cout << m;    // Outputs the character data ";" because it assumes 8 bits is char data.
              // This is a consequence of the limited ability to overload.

// But for wide strings, the behavior is different and better...
unsigned short s = 59;
wcout << s;   // Prints the number "59" like we expect
wchar_t w = L'C';
wcout << w;   // Prints out "C" like we expect
The language would be more consistent if there were a new 8-bit integer type introduced, which would enable more intelligent overloads and overloads that behave more similarly irrespective of if you are using narrow or wide strings.
Yes, probably, but using single-byte integers that aren't characters is pretty rare and you can trivially get around your stated problem via integral promotion (try applying a unary + and see what happens).
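Concretely, the promotion trick hinted at above; BYTE here is just an alias for unsigned char, as in the question:

#include <iostream>

using BYTE = unsigned char;  // as in the question: there is no separate 8-bit integer type

int main() {
    BYTE m = 59;
    std::cout << m  << '\n';  // prints ';' because unsigned char is streamed as character data
    std::cout << +m << '\n';  // prints 59: unary + promotes m to int, selecting the numeric overload
}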
It's also worth noting that your premise is flawed: wchar_t and unsigned short have always been distinct types, per paragraph 3.9.1/5 in C++98, C++03, C++11 and C++14.
When storing "byte arrays" (blobs...) is it better to use char or unsigned char for the items (unsigned char a.k.a. uint8_t)? (Standard says that sizeof of both is precisely 1 Byte.)
Does it matter at all? Or one is more convenient or prevalent than the other? Maybe, what libraries like Boost do use?
If char is signed, then performing arithmetic on a byte value with the high bit set will result in sign extension when promoting to int; so, for example:
char c = '\xf0';
int res = (c << 24) | (c << 16) | (c << 8) | c;
will give 0xfffffff0 instead of 0xf0f0f0f0. This can be avoided by masking with 0xff.
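For example, a sketch of the masked version; ANDing each promoted value with 0xff keeps only the low 8 bits, so the result is the same whether char is signed or unsigned:

#include <cstdio>

int main() {
    char c = '\xf0';

    // 0xffu makes the operands unsigned, so the shifts are well-defined as well.
    unsigned res = ((c & 0xffu) << 24) | ((c & 0xffu) << 16)
                 | ((c & 0xffu) << 8)  | (c & 0xffu);

    std::printf("%x\n", res);  // f0f0f0f0
}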
char may still be preferable if you're interfacing with libraries that use it instead of unsigned char.
Note that a cast from char * to/from unsigned char * is always safe (3.9p2). A philosophical reason to favour unsigned char is that 3.9p4 in the standard favours it, at least for representing byte arrays that could hold memory representations of objects:
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
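As a sketch of what that buys you in practice, here is a small helper that inspects an object's bytes through unsigned char (the byte order shown depends on the platform):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Print the object representation of a trivially copyable object, byte by byte.
template <typename T>
void dump_bytes(const T& obj) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&obj);
    for (std::size_t i = 0; i < sizeof(T); ++i)
        std::printf("%02x ", p[i]);
    std::printf("\n");
}

int main() {
    std::uint32_t v = 0x31323334;
    dump_bytes(v);  // e.g. "34 33 32 31" on a little-endian machine
}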
Theoretically, the size of a byte in C++ depends on the compiler settings and target platform, but it is guaranteed to be at least 8 bits, which is why sizeof(uint8_t) is required to be 1.
Here's more precisely what the standard has to say about it:
§1.7/1
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
So, if you are working on some special hardware where bytes are not 8 bits, it may make a practical difference. Otherwise, I'd say that it's a matter of taste and what information you want to communicate via the choice of type.
One of the other problems with potentially using a signed value for blobs is that the resulting value will depend on the sign representation, which the standard leaves up to the implementation. So it's easier to invoke undefined behavior.
For example...
signed char x = 0x80;  // 0x80 doesn't fit in signed char: implementation-defined result
int y = 0xffff00ff;    // the literal doesn't fit in int: implementation-defined conversion
y |= (x << 8);         // left shift of a negative value: UB
The actual arithmetic value would also strictly depend on two's complement, which may surprise some people. Using unsigned explicitly avoids these problems.
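For comparison, a sketch of the same computation written with unsigned types, where every step is well-defined:

#include <cstdint>
#include <iostream>

int main() {
    unsigned char x = 0x80;         // fits: unsigned char holds 0..255
    std::uint32_t y = 0xffff00ffu;  // fits: unsigned literal assigned to an unsigned type

    y |= static_cast<std::uint32_t>(x) << 8;  // no negative values, no undefined shifts

    std::cout << std::hex << y << '\n';  // ffff80ff
}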
It makes no practical difference, although from a readability point of view it is clearer if the type is unsigned char, implying values 0..255.
C++ Standard §3.9.1 Fundamental types
Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types. <...>
I could not make sense of unsigned char.
A number may be +1 or -1.
I cannot think of -A and +A in a similar manner.
What is the historical reason for introducing unsigned char?
A char is actually an integral type. It is just that the type is also used to represent a character too. Since it is an integral type, it is valid to talk about signedness.
(I don't know exactly about the historical reason. Probably to save a keyword for byte by conflating it with char.)
In C (and thus C++), char does not mean character. It means a byte, the smallest addressable unit of memory. This is a historical legacy from the pre-Unicode days when a character could actually fit in a char, but is now a flaw in the language.
Since char is really a small integer, having signed char and unsigned char makes sense. There are actually three distinct char types: char, signed char, and unsigned char. A common convention is that unsigned char represents raw bytes while plain char represents characters (UTF-8 code units).
Computers do not "understand" the concept of alphabets or characters; they only work on numbers. So a bunch of people got together and agreed on what number maps to what letter. The most common one in use is ASCII (although the language does not guarantee that).
In ASCII, the letter A has the code 65. In environments using ASCII, the letter A would be represented by the number 65.
The char datatype also serves as an integral type, meaning that it can hold plain numbers, so unsigned and signed variants were allowed. On most platforms I've seen, char is a single 8-bit byte.
You're reading too much into it. A char is a small integral type that can hold a character. End of story. unsigned char was never specially introduced or intended; it's just how it is, because char is an integral type like int, long or short, only the size differs. The fact is that there's little reason to use unsigned char, but people do if they want one-byte unsigned integral storage.
If you want a small memory footprint and want to store a number, then signed and unsigned char are useful.
unsigned char is needed if you want to store a value between 128 and 255:
unsigned char score = 232;
signed char is useful if you want to store the difference between two characters:
signed char diff = 'D' - 'A';
char is distinct from the other two because you can not assume it is either.
You can use the wrap-around from 255 to 0? (I don't know, just a guess.)
Maybe it is not only about characters but also about numbers between -128 and 127, and 0 to 255.
Think of the ASCII character set.
Historically, all characters used for text in computing were defined by the ASCII character set. Each character was stored in an 8-bit byte, treated as unsigned; ASCII itself only defines codes 0-127, but the byte could hold any value in the range 0-255.
The word character was reduced to char for coding.
An 8-bit char used the same memory as an 8-bit byte, and as such they were interchangeable as far as a compiler was concerned.
The unsigned keyword (all numbers were signed by default, as two's complement is used to represent negative numbers in binary), when applied to a byte or a char, forced it to take values in the range 0 to 255.
If signed, it instead held a value in the range -128 to +127.
Nowadays, with the advent of Unicode and multi-byte character sets, this relationship between byte and char no longer exists.