C++ encoding macro

Is there a macro that tells you what encoding C++ is using for its wchar_t type?
I am currently targeting GCC and Clang. I am guessing UTF-32 because my wchar_t has a size of 4 bytes, although it could still be UTF-16, which also uses 4 bytes (a surrogate pair) for some code points.
And then there is still the question of UCS-4 versus UTF-32LE versus UTF-32BE.
Any help/expertise on this topic?

wchar_t is implementation-specific. It is not bound to any specific encoding. If you are on a platform where wchar_t is 16 bits wide, then it simply cannot hold UTF-32 code units, for example.
Encoding (UTF-8, UTF-32) and storage (wchar_t) are different things.
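A quick way to see what your toolchain actually gives you is to check the size and range of wchar_t; a minimal sketch (the values in the comments are what GCC and Clang typically produce on Linux, not guarantees):
#include <cwchar>
#include <iostream>

int main() {
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';   // typically 4 on Linux, 2 on Windows
    std::cout << "WCHAR_MAX       = " << WCHAR_MAX << '\n';         // 0x7FFFFFFF with a 32-bit signed wchar_t
}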

There is no such macro in C++. In C99, there is the macro __STDC_ISO_10646__ to indicate that wchar_t holds Unicode (ISO 10646) code points. In C++, the encoding of characters stored in wchar_t depends on the locale and is implementation-defined. In other words, you need to consult the documentation of the C++ implementation you use to see what encoding wchar_t is associated with for each locale.
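On implementations that do define it, the C macro can still be tested from C++; a sketch, assuming a compiler such as GCC or Clang that predefines it on Unicode-capable platforms:
#include <iostream>

int main() {
#ifdef __STDC_ISO_10646__
    // wchar_t values are ISO 10646 (Unicode) code points; with a 4-byte wchar_t
    // that effectively means UTF-32 in the machine's native byte order.
    std::cout << "__STDC_ISO_10646__ = " << __STDC_ISO_10646__ << '\n';
#else
    std::cout << "wchar_t is not advertised as holding Unicode code points\n";
#endif
}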

Related

Performance issues with u_snprintf_u from libicu

I'm porting an application from wchar_t C strings to the char16_t offered by C++11.
However, I have an issue. The only library I found that can handle snprintf for char16_t types is ICU, with its UChar type.
The performance of u_snprintf_u (equivalent to swprintf/snprintf, but taking UChar arguments) is abysmal.
Some testing shows u_snprintf_u being 25x slower than snprintf.
Example of what I get in valgrind:
As you can see, the underlying code is doing too much work and instantiating internal objects that I don't want.
Edit: The data I'm working with doesn't need to be interpreted by the underlying ICU code; it's ASCII-oriented. I didn't find any way to tell ICU not to apply locales and such on these function calls.
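If the data really is ASCII-only, one possible workaround (a sketch under that assumption, not a drop-in replacement for u_snprintf_u; format_ascii is a made-up name) is to format with plain snprintf and widen the bytes afterwards, since every ASCII byte maps to the identical char16_t code unit:
#include <cstdio>
#include <string>

// Hypothetical helper: format ASCII-only data with snprintf, then widen to char16_t.
std::u16string format_ascii(int value) {
    char buf[64];
    int n = std::snprintf(buf, sizeof buf, "value=%d", value);
    if (n < 0) return {};
    if (n >= static_cast<int>(sizeof buf)) n = sizeof buf - 1;   // output was truncated
    std::u16string out;
    out.reserve(n);
    for (int i = 0; i < n; ++i)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(buf[i])));
    return out;
}

int main() {
    std::u16string s = format_ascii(42);   // contains u"value=42"
    return s.size() == 8 ? 0 : 1;
}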

C++ small vs all caps datatype

Why does C++ (MSVS) define all-caps datatypes when most of them are the same as the built-in ones?
These are exactly the same. Why are the all-caps versions defined?
double and typedef double DOUBLE
char and typedef char CHAR
bool and BOOL (typedef int BOOL): both the lowercase and the all-caps forms represent Boolean states, so why is int used in the latter?
What extra ability was gained through such additional datatypes?
The ALLCAPS typedefs started in the very first days of Windows programming (1.0 and before). Back then, for example, there was no such thing as a bool type. The Windows APIs and headers were defined for old-school C. C++ didn't even exist back when they were being developed.
So to help document the APIs better, type names like BOOL were introduced. Even though BOOL and INT were both names for the same underlying type (int), this let you look at a function's type signature to see whether an argument or return value was intended as a boolean value (defined as "0 for false, any nonzero value for true") or an arbitrary integer.
As another example, consider LPCSTR. In 16-bit Windows, there were two kinds of pointers: near pointers were 16-bit pointers, and far pointers used both a 16-bit "segment" value and a 16-bit offset into that segment. The actual memory address was calculated in the hardware as ( segment << 4 ) + offset.
There were macros or typedefs for each of these kinds of pointers. NPSTR was a near pointer to a character string, and LPSTR was a far pointer to a character string. If it was a const string, then a C would get added in: NPCSTR or LPCSTR.
You could compile your code in either "small" model (using near pointers by default) or "large" model (using far pointers by default). The various NPxxx and LPxxx "types" would explicitly specify the pointer size, but you could also omit the L or N and just use PSTR or PCSTR to declare a writable or const pointer that matched your current compilation mode.
Most Windows API functions used far pointers, so you would generally see LPxxx pointers there.
BOOL vs. INT was not the only case where two names were synonyms for the same underlying type. Consider a case where you had a pointer to a single character, not a zero-terminated string of characters. There was a name for that too. You would use PCH for a pointer to a character to distinguish it from PSTR which pointed to a zero-terminated string.
Even though the underlying pointer type was exactly the same, this helped document the intent of your code. Of course there were all the same variations: PCCH for a pointer to a constant character, NPCH and LPCH for the explicit near and far, and of course NPCCH and LPCCH for near and far pointers to a constant character. Yes, the use of C in these names to represent both "const" and "char" was confusing!
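To make the naming scheme concrete, the declarations looked roughly like this (a simplified sketch in the spirit of the 16-bit headers, not the literal SDK definitions; NEAR and FAR were themselves macros for the compiler's near/far keywords and expand to nothing here):
#define NEAR                      /* stood for the compiler's 'near' keyword */
#define FAR                       /* stood for the compiler's 'far' keyword  */
typedef char             CHAR;
typedef CHAR NEAR       *NPSTR;   // near pointer to a zero-terminated string
typedef CHAR FAR        *LPSTR;   // far pointer to a zero-terminated string
typedef const CHAR FAR  *LPCSTR;  // far pointer to a const zero-terminated string
typedef CHAR FAR        *LPCH;    // far pointer to a single character, not a string
typedef int              BOOL;    // "0 for false, any nonzero value for true"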
When Windows moved to 32 bits with a "flat" memory model, there were no more near or far pointers, just flat 32-bit pointers for everything. But all of these type names were preserved to make it possible for old code to continue compiling; they were just all collapsed into one. So NPSTR, LPSTR, plain PSTR, and all the other variations mentioned above became synonyms for the same pointer type (with or without a const modifier).
Unicode came along around that same time, and most unfortunately, UTF-8 had not been invented yet. So Unicode support in Windows took the form of 8-bit characters for ANSI and 16-bit characters (UCS-2, later UTF-16) for Unicode. Yes, at that time, people thought 16-bit characters ought to be enough for anyone. How could there possibly be more than 65,536 different characters in the world?! (Famous last words...)
You can guess what happened here. Windows applications could be compiled in either ANSI or Unicode ("wide character") mode, meaning that their default character pointers would be either 8-bit or 16-bit. You could use all of the type names above and they would match the mode your app was compiled in. Almost all Windows APIs that took string or character pointers came in both ANSI and Unicode versions, with an A or W suffix on the actual function name. For example, SetWindowText( HWND hwnd, LPCSTR lpString ) became two functions: SetWindowTextA( HWND hwnd, LPCSTR lpString ) and SetWindowTextW( HWND hwnd, LPCWSTR lpString ). And SetWindowText itself became a macro defined as one or the other of those depending on whether you compiled for ANSI or Unicode.
Back then, you might have actually wanted to write your code so that it could be compiled either in ANSI or Unicode mode. So in addition to the macro-ized function name, there was also the question of whether to use "Howdy" or L"Howdy" for your window title. The TEXT() macro (more commonly known as _T() today) fixed this. You could write:
SetWindowText( hwnd, TEXT("Howdy") );
and it would compile to either of these depending on your compilation mode:
SetWindowTextA( hwnd, "Howdy" );
SetWindowTextW( hwnd, L"Howdy" );
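Under the hood, the mapping works roughly like this (a simplified sketch of the idea, not the exact Windows header definitions):
#ifdef UNICODE
    #define TEXT(s)        L##s             // "Howdy" becomes L"Howdy"
    #define SetWindowText  SetWindowTextW   // calls resolve to the wide version
#else
    #define TEXT(s)        s
    #define SetWindowText  SetWindowTextA   // calls resolve to the ANSI version
#endif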
Of course, most of this is moot today. Nearly everyone compiles their Windows apps in Unicode mode. That is the native mode on all modern versions of Windows, and the ...A versions of the API functions are shims/wrappers around the native Unicode ...W versions. By compiling for Unicode you avoid going through all those shim calls. But you still can compile your app in ANSI (or "multi-byte character set") mode if you want, so all of these macros still exist.
Microsoft decided to create macros or type aliases for all of these types in the Windows headers. It's possible that they were going for "consistency" with the all-caps WinAPI type aliases, like LPCSTR.
But what real benefit does it serve? None.
The case of BOOL is particularly headache-inducing. Although some old-school C code had a convention of doing this (before an actual bool type entered the language), nowadays it really just causes confusion, especially when using the WinAPI from C++.
This convention goes back more than 30 years to the early days of the Windows operating system.
Common practice in the '70s and early '80s was still to use all caps for programming in various assembly languages and the higher-level languages of the day, Fortran, Cobol... as well as in command-line interpreters and file system defaults. A habit probably rooted in the encoding of punch cards, which goes back much further, to the dawn of the 20th century.
When I started programming in 1975, the card punch we used did not even support lowercase letters; it did not even have a shift key.
MS-DOS was written in assembly language, as were most successful PC packages of the early '80s such as Lotus 1-2-3, MS Word, etc. C was invented at Bell Labs for the Unix system and took a long time to gain momentum in the PC world.
In the budding microprocessor world, there were literally two separate schools: the Intel little-endian world with all-caps assembly documentation, and the big-endian Motorola alternative, with lowercase assembly, C, Unix operating systems and clones, and other weird languages such as Lisp.
Windows is the brainchild of the former and this proliferation of all caps types and modifiers did not seem ugly then, it looked consistent and reassuring. Microsoft tried various alternatives for the pointer modifiers: far, _far, __far, FAR and finally got rid of these completely but kept the original allcaps typedefs for compatibility purposes, leading to silly compromises such as 32-bit LONG even on 64-bit systems.
This answer is not unbiased, but it was fun reviving these memories.
Only MS knows.
The only benefit I can think of is for those types (e.g. int) whose size is OS-dependent (see the table here). This would allow using a 16-bit type on a 64-bit OS with a few more typedefs or #defines, and the code would be easier to port to other OS versions.
Now, if this "portability" rationale were the real reason, it would make sense for the rest of the types to follow the same convention for consistency, even though their sizes are the same on all machines.

Does the g++ 4.8.2 compiler support Unicode characters?

Consider the following statements:
cout << "\u222B";
int a = 'A';
cout << a;
The first statement prints an integration sign (the character equivalent to the Unicode code point) whereas the second cout statement prints the ASCII value 65.
So I want to ask two things -
1) If my compiler supports the Unicode character set, then why does it use the ASCII character set and show the ASCII values of the characters?
2) With reference to this question - what is the difference between defining a 'byte' in terms of computer memory and in terms of C++?
Does my compiler implement a 16-bit or 32-bit byte? If so, then why is the value of CHAR_BIT set to 8?
In answer to your first question, the bottom 128 code points of Unicode are ASCII. There's no real distinction between the two.
The reason you're seeing 65 is that the thing you're outputting, a, is an int rather than a char (it may have started as a character but, by putting it into a, you changed how it will be treated from then on).
For your second question, a byte is a char, at least as far as the ISO C and C++ standards are concerned. If CHAR_BIT is defined as 8, that's how wide your char type is.
However, you should keep in mind the difference between Unicode code points and Unicode representations (such as UTF-8). Having CHAR_BIT == 8 will still allow Unicode to work if UTF-8 representation is used.
My advice would be to capture the output of your program with a hex dump utility; you may well find the Unicode character is coming out as e2 88 ab, which is the UTF-8 representation of U+222B. It will then be interpreted by something outside of the program (e.g., the terminal program) to render the correct glyph(s):
#include <iostream>
using namespace std;
int main() { cout << "\u222B\n"; }
Running that program above shows what's being output:
pax> g++ -o testprog testprog.cpp ; ./testprog
∫
pax> ./testprog | hexdump
0000000 e2 88 ab 0a
You could confirm that by generating the same UTF-8 byte sequence in a different way:
pax> printf "\xe2\x88\xab\n"
∫
There are several different questions/issues here:
As paxdiablo pointed out, you're seeing "65" because you're outputting "a" (value 'A' = ASCII 65) as an "int".
Yes, gcc supports Unicode source files: -finput-charset=<charset>
The final issue is whether the C++ compiler treats your "strings" as 8-bit ASCII or n-bit Unicode.
C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8, UTF-16, and UTF-32 (via the u8, u, and U prefixes and the char16_t/char32_t types):
How well is Unicode supported in C++11?
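A minimal sketch of the C++11 literal forms (assuming g++/clang with UTF-8 as the execution character set; note that in C++20 the u8 literal's element type changes to char8_t):
#include <iostream>
#include <string>

int main() {
    std::cout << "\u222B" << '\n';                 // narrow literal, typically the UTF-8 bytes e2 88 ab
    std::u16string u16 = u"\u222B";                // UTF-16: one code unit, 0x222B
    std::u32string u32 = U"\u222B";                // UTF-32: one code unit, 0x0000222B
    std::cout << u16.size() << ' ' << u32.size() << '\n';   // prints "1 1"
    std::cout << sizeof(u8"\u222B") << '\n';       // u8 literal: 4 (three UTF-8 bytes plus the NUL)
}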
PS:
As far as language support for Unicode goes:
Java was designed from the ground up for Unicode.
Unfortunately, at the time that meant only UTF-16. Java 5 supported Unicode 4.0, Java 7 Unicode 6.0, and the current Java 8 supports Unicode 6.2.
.Net is newer. C#, VB.Net and C++/CLI all fully support Unicode 4.0.
Newer versions of .Net support newer versions of Unicode. For example, .Net 4.0 supports Unicode 5.1 (see "What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?").
Python3 also supports Unicode 4.0: http://www.diveintopython3.net/strings.html
First of all, sorry for my English if it has mistakes.
A C++ byte is any defined number of bits large enough to hold every character of the basic set specified by the standard. This required set of characters is a subset of ASCII, and that previously defined "number of bits" must be the memory unit for char, the smallest memory atom of C++. Every other type must be a multiple of sizeof(char) (any C++ value is a bunch of chars contiguously stored in memory).
So sizeof(char) must be 1 by definition, because it is the memory measurement unit of C++. Whether that 1 means 1 physical byte or not is an implementation issue, but it is universally accepted as 1 byte.
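These relationships can be checked directly; a small sketch:
#include <climits>
#include <iostream>

int main() {
    static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
    std::cout << "CHAR_BIT = " << CHAR_BIT << '\n';               // at least 8; exactly 8 on mainstream platforms
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n'; // a multiple of sizeof(char), e.g. 2 or 4
}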
What I don't understand is what you mean by a 16-bit or 32-bit byte.
Another related question is about the encoding your compiler applies to your source text, literal strings included. A compiler, if I'm not wrong, normalizes each translation unit (source code file) to an encoding of its choice to handle the file.
I don't really know what happens under the hood, but perhaps you have read something somewhere about source file/internal encodings and 16-bit/32-bit encodings, and it has all blended together in your head. I'm still a bit confused by it myself, though.

C++ std::wcout: fail and bad bits are set

Why L"&'v\x5\x17\x15-\x1dR\x14]Dv\x1991q-5Xp\x13\x172" value in a container of the type std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > yields setting bad and fail bits in std::wcout when being output?
Not sure what really happens in your code, but on Windows you have to use platform-specific functions to enable UTF-16 support in the console, because it otherwise uses your default ANSI codepage, e.g. cp1251 (see http://www.siao2.com/2008/03/18/8306597.aspx). This solves the problem with the bad bits:
#include <fcntl.h>
#include <io.h>
/*...*/
// Switch stdout to UTF-16 text mode so wide output bypasses the ANSI codepage conversion.
_setmode( _fileno(stdout), _O_U16TEXT );
/*...*/
You also have to pick a suitable font for the console, otherwise some of the characters will look like squares. For example, you may pick Lucida Console or Consolas, but if you look at their character maps you will see that these fonts lack Chinese characters and many others. If you want to change the font to something else, try this: https://superuser.com/questions/5035/how-to-change-the-windows-xp-console-font but beware (http://support.microsoft.com/kb/247815).
Also remember that Unicode support on POSIX-like OSes and Windows is different. Windows uses UTF-16 encoding stored in wchar_t types while most POSIX operating systems use UTF-8 and char for storage.

What's the standard-defined endianness of std::wstring?

I know that UTF-16 has two endiannesses: big endian and little endian.
Does the C++ standard define the endianness of std::wstring, or is it implementation-defined?
If it is standard-defined, where in the C++ standard are the rules on this issue?
If it is implementation-defined, how do I determine it, e.g. under VC++? Does the compiler guarantee that the endianness of std::wstring is strictly dependent on the processor?
I have to know this because I want to send a UTF-16 string to others, and I must add the correct BOM at the beginning of the UTF-16 string to indicate its endianness.
In short: given a std::wstring, how should I reliably determine its endianness?
Endianness is MACHINE-dependent, not language-dependent. Endianness is defined by the processor and how it arranges data going into and out of memory. When dealing with wchar_t (which is wider than a single byte), the processor itself, on a read or write, orders the multiple bytes as it needs to in order to read or write them back to RAM again. Code simply sees the value as the 16-bit (or larger) word represented in a processor-internal register.
For determining endianness on your own (if that is really what you want to do), you could try writing a KNOWN 32-bit (unsigned int) value to memory, then reading it back using a char pointer and looking at the ordering that comes back.
It would look something like this:
#include <cstdio>
int main() {
    unsigned int aVal = 0x11223344;            // known byte pattern
    char *myValReadBack = (char *)(&aVal);     // inspect the first byte in memory
    if (*myValReadBack == 0x11) printf("Big endian\r\n");
    else                        printf("Little endian\r\n");
}
I'm sure there are other ways, but something like the above should work; check my little versus big though :-)
Further, until Windows RT, VC++ really only compiled for Intel-type processors, which really only have one endianness.
It is implementation-defined. wstring is just a string of wchar_t, and that can be any byte ordering, or for that matter, any old size.
wchar_t is not required to be UTF-16 internally, and UTF-16 endianness does not affect how wchar_t values are stored; it's a matter of how you save and read them.
You have to use an explicit procedure for converting the wstring to a UTF-16 byte stream before sending it anywhere. The internal endianness of wchar_t is architecture-dependent, and it's better to use some opaque conversion interface than to try to convert it manually.
For the purposes of sending the correct BOM, you don't need to know the endianness. Just use the code \uFEFF. That will come out big-endian or little-endian depending on the endianness of your implementation. You don't even need to know whether your implementation is UTF-16 or UTF-32; as long as it is some Unicode encoding, you'll end up with the appropriate BOM.
Unfortunately, neither wchar_t nor wide streams are guaranteed to be Unicode.
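Tying those last two answers together, here is a hedged sketch of serializing a std::wstring to a UTF-16LE byte stream with an explicit BOM. It assumes every wchar_t value is a Unicode code point in the BMP (no surrogate-pair handling), and the helper name is made up for illustration:
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize to UTF-16LE bytes, prefixed with a BOM.
// Assumes each wchar_t holds a code point no greater than U+FFFF.
std::vector<unsigned char> to_utf16le_with_bom(const std::wstring& s) {
    std::vector<unsigned char> out;
    out.push_back(0xFF);   // BOM U+FEFF written in little-endian byte order
    out.push_back(0xFE);
    for (wchar_t wc : s) {
        std::uint16_t cu = static_cast<std::uint16_t>(wc);
        out.push_back(static_cast<unsigned char>(cu & 0xFF));         // low byte first
        out.push_back(static_cast<unsigned char>((cu >> 8) & 0xFF));  // then high byte
    }
    return out;
}
Because the byte order is spelled out explicitly, the host machine's endianness never enters the picture, which is exactly the point of preferring an explicit conversion step over writing raw wchar_t memory.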