What's the standard-defined endianness of std::wstring? - c++

I know that UTF-16 has two endiannesses: big-endian and little-endian.
Does the C++ standard define the endianness of std::wstring, or is it implementation-defined?
If it is defined by the standard, where in the C++ standard are the rules on this issue?
If it is implementation-defined, how can I determine it, e.g. under VC++? Does the compiler guarantee that the endianness of std::wstring depends strictly on the processor?
I need to know this because I want to send a UTF-16 string to others, and I must add the correct BOM at the beginning of the UTF-16 string to indicate its endianness.
In short: Given a std::wstring, how should I reliably determine its endianness?

Endianness is MACHINE dependent, not language dependent. Endianness is defined by the processor and how it arranges data going into and out of memory. When dealing with wchar_t (which is wider than a single byte), the processor itself, on a read or write, arranges the multiple bytes as it needs to in order to move them to and from RAM. Code simply sees the 16-bit (or larger) word as it is represented in an internal processor register.
For determining endianness on your own (if that is really what you want to do), you could write a KNOWN 32-bit (unsigned int) value out to RAM, then read it back through a char pointer and look at the ordering that comes back.
It would look something like this:
unsigned int aVal = 0x11223344;
char * myValReadBack = (char *)(&aVal);
if(*myValReadBack == 0x11) printf("Big endian\r\n");
else printf("Little endian\r\n");
I'm sure there are other ways, but something like the above should work; double-check my little versus big, though :-)
Further, until Windows RT, VC++ really only targeted Intel-type processors, which have only one endianness (little-endian).

It is implementation-defined. wstring is just a string of wchar_t, and that can be any byte ordering, or for that matter, any old size.

wchar_t is not required to be UTF-16 internally, and UTF-16 endianness does not affect how wchar_t values are stored in memory; it only matters when saving and reading them.
You have to use an explicit procedure to convert a wstring to a UTF-16 byte stream before sending it anywhere. The internal endianness of wchar_t is architecture-dependent, and it is better to use some opaque conversion interface than to try to convert it manually.
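A minimal sketch of such an opaque conversion, assuming wchar_t holds Unicode code points (true on common platforms); std::codecvt_utf16 is deprecated since C++17 but still widely available, and on platforms with 16-bit wchar_t characters outside the BMP may need extra care:
#include <codecvt>
#include <locale>
#include <string>
// Convert a wide string to an explicit little-endian UTF-16 byte stream.
std::string to_utf16le(const std::wstring& ws) {
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>> conv;
    return conv.to_bytes(ws);
}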

For the purposes of sending the correct BOM, you don't need to know the endianness. Just use the character \uFEFF. It will come out big-endian or little-endian depending on the endianness of your implementation. You don't even need to know whether your implementation is UTF-16 or UTF-32. As long as it is some Unicode encoding, you'll end up with the appropriate BOM.
Unfortunately, neither wchar_t nor wide streams are guaranteed to be Unicode.
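A minimal sketch of that idea, assuming the implementation's wide-character encoding is indeed some Unicode form (which, as just noted, is not guaranteed): prepend the BOM character and let the compiler produce the right byte pattern for the platform.
#include <string>
// Return the payload with a BOM prepended, in the platform's own byte order.
std::wstring with_bom(const std::wstring& payload) {
    return std::wstring(1, L'\uFEFF') + payload;
}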

Related

Converting a uint8_t to its binary representation

I have a variable of type uint8_t which I'd like to serialize and write to a file (which should be quite portable, at least for Windows, which is what I'm aiming at).
Trying to write it to a file in its binary form, I came across this working snippet:
uint8_t m_num = 3;
unsigned int s = (unsigned int)(m_num & 0xFF);
file.write((wchar_t*)&s, 1); // file = std::wofstream
First, let me make sure I understand what this snippet does - it takes my var (which is basically an unsigned char, 1 byte long), converts it into an unsigned int (which is 4 bytes long, and not so portable), and using & 0xFF "extracts" only the least significant byte.
Now, there are two things I don't understand:
Why convert it into unsigned int in the first place, why can't I simply do something like
file.write((wchar_t*)&m_num, 1); or reinterpret_cast<wchar_t *>(&m_num)? (Ref)
How would I serialize a longer type, say a uint64_t (which is 8 bytes long)? unsigned int may or may not be enough here.
uint8_t is 1 byte, same as char
wchar_t is 2 bytes on Windows and 4 bytes on Linux, and its representation also depends on endianness. You should avoid wchar_t if portability is a concern.
You can just use std::ofstream. Windows has an additional std::ofstream constructor which accepts a UTF-16 file name. This way your code is compatible with Windows UTF-16 filenames and you can still use std::fstream. For example:
int i = 123;
std::ofstream file(L"filename_in_unicode.bin", std::ios::binary);
file.write((char*)&i, sizeof(i)); //sizeof(int) is 4
file.close();
...
std::ifstream fin(L"filename_in_unicode.bin", std::ios::binary);
fin.read((char*)&i, 4); // output: i = 123
This is relatively simple because it's only storing integers. It will work across different Windows systems, because Windows is always little-endian and int is always 4 bytes there.
But some systems are big-endian, and you would have to deal with that separately.
If you use formatted text I/O, for example fout << 123456, then the integer will be stored as the text "123456". Text I/O is compatible, but it takes a little more disk space and can be a little slower.
It's compatibility versus performance. If you have large amounts of data (several megabytes or more) and you can deal with compatibility issues in the future, then go ahead and write raw bytes. Otherwise it's easier to use text I/O. The performance difference is usually not measurable.
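A minimal sketch of dealing with the big-endian case separately: write the int in a fixed byte order (little-endian here), so any host reads the file back identically.
#include <cstdint>
#include <fstream>
// Write a 32-bit value as four bytes, least significant byte first.
void write_le32(std::ofstream& out, uint32_t v) {
    char b[4] = {
        static_cast<char>(v & 0xFF),
        static_cast<char>((v >> 8) & 0xFF),
        static_cast<char>((v >> 16) & 0xFF),
        static_cast<char>((v >> 24) & 0xFF)
    };
    out.write(b, 4);
}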
It is impossible to write uint8_t values to a wofstream because a wofstream only writes wide characters and doesn't handle binary values at all.
If what you want to do is to write a wide character representing a code point between 0 and 255, then your code is correct.
If you want to write binary data to a file then your nearest equivalent is ofstream, which will allow you to write bytes.
To answer your questions:
wofstream::write writes wide characters, not bytes. If you reinterpret the address of m_num as the address of a wide character, you will be writing a 16-bit or 32-bit (depending on platform) wide character of which the first byte (that is, the least significant or most significant, depending on platform) is the value of m_num and the remaining bytes are whatever happens to occur in memory after m_num. Depending on the character encoding of the wide characters, this may not even be a valid character. Even if valid, it is largely nonsense. (There are other possible problems if wofstream::write expects a wide-character-aligned rather than a byte-aligned input, or if m_num is immediately followed by unreadable memory).
If you use wofstream then this is a mess, and I shan't address it. If you switch to a byte-oriented ofstream then you have two choices.
1. If you will only ever read the file back on the same system, file.write((char*)&myint64value, sizeof(myint64value)) will work. The sequence in which the bytes of the 64-bit value are written is unspecified, but the same sequence will be used when you read it back, so this doesn't matter. Don't try to do something analogous with wofstream, because it's dangerous!
2. Extract each of the 8 bytes of myint64value separately (shift right by a multiple of 8 bits and then take the bottom 8 bits) and write it. This is fully portable because you control the order in which the bytes are written.
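A minimal sketch of option 2: write the 8 bytes of a uint64_t least significant first, so the on-disk order is entirely under your control.
#include <cstdint>
#include <fstream>
// Serialize a uint64_t byte by byte, least significant byte first.
void write_u64(std::ofstream& out, uint64_t v) {
    for (int i = 0; i < 8; ++i)
        out.put(static_cast<char>((v >> (8 * i)) & 0xFF));
}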

Reversing long read from file?

I'm trying to read a long (signed, 4 bytes) from a binary file in C++.
My main concerns are portability (long has different sizes on different platforms) and byte order: when you read from a binary file with std::ifstream, the byte order of the data may not match my machine's endianness.
I understand that for data types like unsigned int, you can simply use bitwise shift and AND operators on each byte to reverse the byte order after it has been read from a file.
I'm just not sure what I'd do for this:
Currently my code will give a nonsense value:
long value;
in.seekg(0x3c);
in.read(reinterpret_cast<char*>(&value), sizeof(long));
I'm not sure how I can achieve portability (I read something about unions and char*) and also reverse the signed long it reads in.
Thanks.
Rather than using long, use int32_t from <stdint.h> to directly specify a 32-bit integer. (or uint32_t for unsigned).
Use htonl and ntohl as appropriate to get to/from network byte order.
Better:
int32_t value;
in.seekg(0x3c);
in.read(reinterpret_cast<char*>(&value), sizeof(value));
value = ntohl(value); // convert from big endian to native endian
I'd suggest you use functions like htonl, htons, ntohl and ntohs. These are used in network programming to achieve just the same goal: portability and independence from endianness.
Since cross platform support is important to you I'd recommend using cstdint to specify the size of your types. You'll be able to say int32_t x (for example) and know you are getting 32 bits of data.
Regarding the endianness of the data, I'd recommend standardizing on a format (e.g. all data is written in little-endian format), wrapping your I/O operations in a class, and using that class to read/write the data. Then use a #define to decide how to read the data:
#ifdef BIG_ENDIAN
// Read the data that is in little endian format and convert
#else
// We're in little endian mode so no need to convert data
#endif
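A minimal sketch of how that might be wrapped, assuming a hypothetical MY_BIG_ENDIAN macro supplied by your build system and files stored in little-endian format:
#include <cstdint>
// Hypothetical helper that unconditionally reverses the bytes of a 32-bit value.
inline uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
// Convert a value read from a little-endian file into native byte order.
inline uint32_t le32_to_native(uint32_t fileValue) {
#ifdef MY_BIG_ENDIAN
    return swap32(fileValue);  // big-endian host: swap the bytes
#else
    return fileValue;          // little-endian host: already in native order
#endif
}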
Alternatively you could look at using something like Google Protobuf that will take care of all the encoding issues for you.

int8_t and char: converts between pointers to integer types with different sign - but it doesn't

I'm working with some embedded code and I am writing something new from scratch so I am preferring to stick with the uint8_t, int8_t and so on types.
However, when porting a function:
void functionName(char *data)
to:
void functionName(int8_t *data)
I get the compiler warning "converts between pointers to integer types with different sign" when passing a string literal to the function (i.e. when calling functionName("put this text in");).
Now, I understand why this happens, and these lines are only for debugging; however, I wonder what people feel is the most appropriate way of handling this, short of typecasting every string literal. I don't feel that blanket typecasting is in practice any safer than using potentially ambiguous types like "char".
You seem to be doing the wrong thing, here.
Characters are not defined by C as being 8-bit integers, so why would you ever choose to use int8_t or uint8_t to represent character data, unless you are working with UTF-8?
For C's string literals, their type is array of char (which decays to a pointer to char), and char is not at all guaranteed to be 8-bit.
Also it's not defined whether char is signed or unsigned, so just use const char * for string literals.
To answer your addendum (the original question was nicely answered by @unwind): I think it mostly depends on the context. If you are working with text, i.e. string literals, you have to use const char* or char*, because the compiler will convert the characters accordingly. Short of writing your own string implementation, you are probably stuck with whatever the compiler provides. However, the moment you have to interact with someone/something outside of your CPU context, e.g. network, serial, etc., you have to have control over the exact size (which I suppose is where your question stems from). In this case I would suggest writing functions to convert strings, or any data type for that matter, to uint8_t buffers for serialized sending (or receiving).
const char* my_string = "foo bar!";
uint8_t *buffer = string2sendbuffer(my_string);
my_send(buffer, destination);
The string2sendbuffer function would know everything there is to know about putting characters in a buffer. For example, it might know that you have to encode each char into two buffer elements using big-endian byte ordering. This function is most certainly platform-dependent, but it encapsulates all that platform dependence, so you gain a lot of flexibility.
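A minimal sketch of what such a string2sendbuffer function might look like, assuming the simplest policy of one buffer byte per char with no terminator (the two-bytes-per-char, big-endian policy mentioned above would just change how each character is copied); my_send remains the hypothetical send routine from the example.
#include <cstdint>
#include <cstring>
#include <vector>
// Copy each char of the string into a uint8_t buffer for serialized sending.
std::vector<uint8_t> string2sendbuffer(const char* s) {
    std::vector<uint8_t> buffer(std::strlen(s));
    std::memcpy(buffer.data(), s, buffer.size());
    return buffer;
}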
The same goes for every other complex data-type. For everything else (where the compiler does not have that strong an opinion) I would advise on using the (u)intX_t types provided by stdint.h (which should be portable).
It is implementation-defined whether the type char is signed or unsigned. It looks like you are using an environment where it is unsigned.
So, you can either use uint8_t or stick with char, whenever you are dealing with characters.

Data conversion for ARM platform (from x86/x64)

We have developed a Win32 application for the x86 and x64 platforms. We want to use the same application on an ARM platform. Endianness will vary for the ARM platform, i.e. the ARM platform uses the big-endian format in general. So we want to handle this in our application for our device.
For e.g. // In x86/x64, int nIntVal = 0x12345678
In ARM, int nIntVal = 0x78563412
How values will be stored for the following data types in ARM?
double
char array i.e. char chBuffer[256]
int64
Please clarify this.
Regards,
Raphel
Endianness only matters for register <-> memory operations.
In a register there is no endianness. If you put
int nIntVal = 0x12345678
in your code, it will have the same effect on a machine of either endianness.
All IEEE formats (float, double) are identical in all architectures, so this does not matter.
You only have to care about endianness in two cases:
a) You write integers to files that have to be transferable between the two architectures. (A short sketch follows at the end of this answer.)
Solution: use the hton*/ntoh* family of converters, use a non-binary file format (e.g. XML), or use a standardised file format (e.g. SQLite).
b) You cast integer pointers.
int a = 0x12345678;
char b = a;            // least significant byte of a
char c = *(char *)&a;  // byte of a stored at the lowest address
if (b == c) {
    // You are working on a little-endian machine
}
The latter code, by the way, is a handy way of testing your endianness at runtime.
Arrays and the like: if you use the write/fwrite families of calls to transfer them, you will have no problems unless they contain integers; in that case, see above.
int64_t: see above. You only need to care if you have to store them in binary files or cast pointers.
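A minimal sketch of case a), using the hton*/ntoh* converters so the integers stored in the file are always in network (big-endian) order regardless of the host:
#include <arpa/inet.h>  // htonl/ntohl on POSIX; <winsock2.h> provides them on Windows
#include <cstdint>
uint32_t to_file(uint32_t native)   { return htonl(native); }  // convert before writing
uint32_t from_file(uint32_t stored) { return ntohl(stored); }  // convert after reading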
(Sergey L., above, says that you mostly don't have to care about the byte order. He is right, with at least one exception: I assumed you want to convert binary data from one platform to the other ...)
http://en.wikipedia.org/wiki/Endianness has a good overview.
In short:
Little-endian means the least significant byte is stored first (at the lowest address)
Big-endian means the most significant byte is stored first
The order in which array elements are stored is not affected (but the byte order within each array element is, of course)
So
the char array is unchanged
for int64, the byte order is reversed compared to x86
With regard to the floating-point format, see http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness. Generally it seems to obey the same endianness rules as the integer format, but there are exceptions for older ARM platforms (I have no first-hand experience of that).
Generally, I'd suggest testing your conversion of primitive types with controlled experiments first.
Also consider that compilers might use different padding in structs (a topic you haven't addressed yet).
Hope this helps.
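A minimal sketch of such a controlled experiment, dumping the in-memory bytes of an int64_t so the order can be compared across platforms:
#include <cstdint>
#include <cstdio>
#include <cstring>
int main() {
    int64_t value = 0x1122334455667788;
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    for (unsigned char b : bytes)
        std::printf("%02x ", b);  // little-endian x86 prints: 88 77 66 55 44 33 22 11
    std::printf("\n");
}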
In 98% of cases you don't need to care about endianness. Unless you need to transfer data between systems of different endianness, or read/write some endian-sensitive file format, you should not bother with it. And even in those cases, you can write your code so it behaves properly when compiled for either endianness.
From Rob Pike's "The byte order fallacy" post:
Let's say your data stream has a little-endian-encoded 32-bit integer.
Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
If it's big-endian, here's how to extract it:
i = (data[3]<<0) | (data[2]<<8) | (data[1]<<16) | (data[0]<<24);
Both these snippets work on any machine, independent of the machine's
byte order, independent of alignment issues, independent of just about
anything. They are totally portable, given unsigned bytes and 32-bit
integers.
The ARM is little-endian; it has two big-endian variants depending on the architecture, but it is better to just run it natively little-endian: the tools and the volume of code out there are more thoroughly tested in little-endian mode.
Endianness is just one factor in system engineering; if you do your system engineering properly, it all works out, no fears, no worries. Define your interfaces and code to that design. Assuming, for example, that one processor's endianness automatically means you have to byte-swap is a bad assumption and will bite you eventually. You will end up having to swap an even number of times to undo other bad assumptions that caused a swap (ideally swapping zero times, of course, rather than 2 or 4 or 6 times). If you have any endianness concerns at all when writing code, write it endian-independent.
Since some ARMs have BE32 (word-invariant) and the newer ARMs have BE8 (byte-invariant), you would have to do even more work to make something generic that also tries to compensate for little-endian Intel, little-endian ARM, BE32 ARM, and BE8 ARM. XScale tends to run big-endian natively but can be run as little-endian to reduce the headaches. You may be assuming that because one ARM implementation is big-endian, all of them are; that is another bad assumption.

Endianness swap without ntohs

I am writing an ELF analyzer, but I'm having some trouble converting endianness properly. I have functions to determine the endianness of the analyzer and the endianness of the object file.
Basically, there are four possible scenarios:
A big-endian-compiled analyzer run on a big-endian object file:
nothing needs to be converted
A big-endian-compiled analyzer run on a little-endian object file:
the byte order needs to be swapped, but ntohs/l() and htons/l() are both null macros on a big-endian machine, so they won't swap the byte order. This is the problem.
A little-endian-compiled analyzer run on a big-endian object file:
the byte order needs to be swapped, so use htons() to swap the byte order
A little-endian-compiled analyzer run on a little-endian object file:
nothing needs to be converted
Is there a function I can use to explicitly swap byte order/change endianness, since ntohs/l() and htons/l() take the host's endianness into account and sometimes don't convert? Or do I need to find/write my own swap byte order function?
I think it's worth raising "The Byte Order Fallacy" article here, by Rob Pike (one of Go's authors).
If you do things right, i.e. you do not assume anything about your platform's byte order, then it will just work. All you need to care about is whether ELF files are in little-endian or big-endian format.
From the article:
Let's say your data stream has a little-endian-encoded 32-bit integer. Here's how to extract it (assuming unsigned bytes):
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
If it's big-endian, here's how to extract it:
i = (data[3]<<0) | (data[2]<<8) | (data[1]<<16) | (data[0]<<24);
And just let the compiler worry about optimizing the heck out of it.
On Linux there are several conversion functions in endian.h which allow you to convert between host byte order and a specific endianness:
uint16_t htobe16(uint16_t host_16bits);
uint16_t htole16(uint16_t host_16bits);
uint16_t be16toh(uint16_t big_endian_16bits);
uint16_t le16toh(uint16_t little_endian_16bits);
uint32_t htobe32(uint32_t host_32bits);
uint32_t htole32(uint32_t host_32bits);
uint32_t be32toh(uint32_t big_endian_32bits);
uint32_t le32toh(uint32_t little_endian_32bits);
uint64_t htobe64(uint64_t host_64bits);
uint64_t htole64(uint64_t host_64bits);
uint64_t be64toh(uint64_t big_endian_64bits);
uint64_t le64toh(uint64_t little_endian_64bits);
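A minimal usage sketch of these helpers (glibc/Linux), converting to and from an explicit little-endian representation regardless of the host's byte order:
#include <endian.h>
#include <stdint.h>
uint32_t native  = 0x12345678u;
uint32_t on_wire = htole32(native);   // always little-endian bytes
uint32_t back    = le32toh(on_wire);  // back to host order; back == native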
Edit: a less reliable solution. You can use a union to access the bytes in any order. It's quite convenient:
union {
    short number;
    char bytes[sizeof(short)];
} value;
Do I need to find/write my own swap byte order function?
Yes, you do. But, to make it easy, I refer you to this question: How do I convert between big-endian and little-endian values in C++? It gives a list of compiler-specific byte-order swap functions, as well as some implementations of byte-order swap functions.
The ntoh functions can swap between more than just big- and little-endian. Some systems are also 'middle-endian', where the bytes are scrambled rather than simply ordered one way or the other.
Anyway, if all you care about are big- and little-endian, then all you need to know is whether the host's and the object file's endianness differ. You'll have your own function which unconditionally swaps byte order, and you'll call it or not based on whether host_endianness() == objectfile_endianness().
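A minimal sketch of such an unconditional swap for 16-bit values (a 32-bit version follows the same shift-and-mask pattern); host_endianness() and objectfile_endianness() are the hypothetical checks mentioned above:
#include <cstdint>
// Unconditionally reverse the two bytes of a 16-bit value.
inline uint16_t swap16(uint16_t v) {
    return static_cast<uint16_t>((v << 8) | (v >> 8));
}
// Call it only when the host and the object file disagree, e.g.:
// if (host_endianness() != objectfile_endianness()) value = swap16(value);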
If I were thinking about a cross-platform solution that would work on Windows or Linux, I would write something like:
#include <algorithm>
#include <cstddef>
// dataSize is the number of bytes to convert (assumed known at compile time).
constexpr std::size_t dataSize = 4;
char le[dataSize]; // little-endian
char be[dataSize]; // big-endian
// Fill contents in le here...
std::reverse_copy(le, le + dataSize, be);