How can I embed unicode string constants in a source file? - c++

I'm writing some unit tests which are going to verify our handling of various resources that use character sets other than the Latin alphabet: Cyrillic, Hebrew, etc.
The problem I have is that I cannot find a way to embed the expectations in the test source file: here's an example of what I'm trying to do...
///
/// Protected: TestGetHebrewConfigString
///
void CPrIniFileReaderTest::TestGetHebrewConfigString()
{
    prwstring strHebrewTestFilePath = GetTestFilePath( strHebrewTestFileName );
    CPrIniFileReader prIniListReader( strHebrewTestFilePath.c_str() );
    prIniListReader.SetCurrentSection( strHebrewSubSection );
    CPPUNIT_ASSERT( prIniListReader.GetConfigString( L"דונדארןמע" ) == L"דונהשךוק" );
}
This quite simply doesn't work. Previously I worked around this with a macro which calls a routine to transform a narrow string to a wide string (we use towstring all over the place in our applications, so it's existing code):
#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
wstring towstring( LPCSTR lpszValue )
{
    wostringstream os;
    os << lpszValue;
    return os.str();
}
The assertion in the test above then became:
CPPUNIT_ASSERT( prIniListReader.GetConfigString( UNICODE_CONSTANT( "דונדארןמע" ) ) == UNICODE_CONSTANT( "דונהשךוק" ) );
This worked OK on OS X, but now I'm porting to Linux and I'm finding that the tests are all failing; it all feels rather hackish as well. Can anyone tell me if they have a nicer solution to this problem?

A tedious but portable way is to build your strings using numeric escape codes. For example:
wchar_t *string = L"דונדארןמע";
becomes:
wchar_t *string = L"\x05d3\x05d5\x05e0\x05d3\x05d0\x05e8\x05df\x05de\x05e2";
You have to convert all your Unicode characters to numeric escapes. That way your source code becomes encoding-independent.
You can use online tools for conversion, such as this one. It outputs the JavaScript escape format \uXXXX, so just search & replace \u with \x to get the C format.

You have to tell GCC which encoding your source file uses for those characters, with the option -finput-charset=charset, for example -finput-charset=UTF-8. You then need to tell it which encoding to use for those string literals at runtime; that determines the values of the wchar_t items in the strings. You set that encoding with -fwide-exec-charset=charset, for example -fwide-exec-charset=UTF-32. Beware that the size of the encoding (UTF-32 needs 32 bits, UTF-16 needs 16 bits) must not exceed the size of wchar_t that GCC uses.
You can adjust the width of wchar_t with -fshort-wchar, which makes it 16 bits instead of its usual 32 bits for GCC on Linux. That option is mainly useful when compiling programs for Wine, designed to be compatible with Windows.
Those options are described in more detail in man gcc, the GCC manpage.
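As an illustration (a sketch only, not taken from the question's codebase), a wide-literal source file saved as UTF-8 could be compiled with those flags like this:
// Sketch: save this file as UTF-8 and compile with, for example:
//   g++ -finput-charset=UTF-8 -fwide-exec-charset=UTF-32 unicode_literals.cpp
#include <cstdio>
#include <cwchar>

int main()
{
    const wchar_t* text = L"דונדארןמע";
    std::printf("%zu wide characters\n", std::wcslen(text));   // 9 code units with a 32-bit wchar_t
    return 0;
}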

#define UNICODE_CONSTANT( CONSTANT ) towstring( CONSTANT )
wstring towstring( LPCSTR lpszValue ) {
    wostringstream os;
    os << lpszValue;
    return os.str();
}
This does not actually convert between Unicode encodings at all; that requires a dedicated routine. You need to keep your source code and data encodings unified - most people use UTF-8 - and then convert to the OS-specific encoding if necessary (such as UTF-16 on Windows).
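If you do need an actual conversion from UTF-8 narrow strings to wide strings, a minimal sketch using the C++11 <codecvt> facilities (deprecated since C++17 but widely available) could look like this; the function name is just illustrative:
#include <codecvt>
#include <locale>
#include <string>

// Sketch: convert a UTF-8 encoded std::string into a std::wstring.
std::wstring utf8_to_wstring( const std::string& utf8 )
{
    std::wstring_convert< std::codecvt_utf8< wchar_t > > converter;
    return converter.from_bytes( utf8 );
}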

Related

C++ output Unicode in variable

I'm trying to output a string containing Unicode characters, which is received with a curl call. Therefore, I'm looking for something similar to the u8 and L prefixes for string literals, but then applicable to variables. E.g.:
const char *s = u8"\u0444";
However, I already have a string containing Unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong and how can I achieve the correct output? I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying to browser requests.
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded-text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>
int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    // ü : U+00FC
    // ß : U+00DF
    const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
    std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest using the UTF-8 encoding (for example, storing text in std::strings) in the cross-platform portions of your C++ code, and converting to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs, that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
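As a rough sketch of that approach (Windows-only, error handling mostly omitted; the helper name Utf8ToUtf16 is just illustrative):
#include <windows.h>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring using the Win32 API.
std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();
    const int count = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), static_cast<int>(utf8.size()),
                                            nullptr, 0);
    std::wstring utf16(count, L'\0');
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          utf8.data(), static_cast<int>(utf8.size()),
                          &utf16[0], count);
    return utf16;
}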
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;
int main()
{
    string s = "mit freundlichen Grüßen";
    cout << s << endl;
    return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which Unicode encoding the string contains. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately, or else the client cannot know how to interpret the response, for example Content-Type: application/json; charset=utf-8 (a minimal sketch follows below).
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have had support for UTF-8 a long time now.
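As a minimal sketch (the JSON body here is made up), a CGI response that declares UTF-8 could look like this:
#include <iostream>

int main()
{
    // CGI headers are terminated by a blank line; charset tells the browser how to decode the body.
    std::cout << "Content-Type: application/json; charset=utf-8\r\n\r\n";
    std::cout << "{\"closing\": \"mit freundlichen Grüßen\"}\n";   // literal assumed to be UTF-8 in the executable
    return 0;
}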
Answer regarding printing to terminal:
Assuming that
1. UnicodeString indeed contains a UTF-8 encoded string,
2. the terminal uses UTF-8 encoding,
3. and the font that the terminal uses has the graphemes that you use,
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions doesn't hold.
You can verify whether 1. is true by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out which encoding the string actually uses, and either convert the encoding or configure the terminal to use that encoding.
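A quick way to do that inspection (just a sketch; the helper name is made up) is to dump every code unit in hex and compare it against the expected UTF-8 bytes:
#include <cstdio>
#include <string>

// For correct UTF-8, "ü" should appear as c3 bc and "ß" as c3 9f.
void dump_code_units( const std::string& s )
{
    for ( unsigned char c : s )
        std::printf( "%02x ", c );
    std::printf( "\n" );
}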
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UTF-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide Unicode, but even that support is deprecated). You can find the Unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
In case the character encoding of the terminal doesn't have the needed graphemes - or if you don't want to implement encoding conversion - the alternative is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test whether the font itself has the required graphemes simply by typing the characters into the terminal and seeing if they show as they should - although this test will also fail if the terminal encoding doesn't have the graphemes, so check that first. Your terminal's manual should explain how to change the font, should that be necessary. That said, I would expect ü and ß to exist in most fonts.

Printing out Korean in console C++

I am having trouble printing out Korean.
I have tried various methods, to no avail.
I have tried
1.
cout << "한글" << endl;
2.
wcout << "한글" << endl;
3.
wprintf(L"한글\n");
4.
setlocale(LC_ALL, "korean");
wprintf("한글");
and more. But all of those print "한글".
I am using the MinGW compiler, and my OS is Windows 7.
P.S. Strangely, Java prints out Korean fine:
String kor = "한글";
System.out.println(kor);
works.
Set the console code page to UTF-8 before printing the text:
::SetConsoleOutputCP(65001);
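A minimal sketch of that approach (Windows-only; it assumes the string literal really is UTF-8-encoded in the executable, and the console font must have Korean glyphs):
#include <windows.h>
#include <cstdio>

int main()
{
    ::SetConsoleOutputCP(65001);   // 65001 == CP_UTF8
    std::printf("한글\n");
    return 0;
}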
Since you are using Windows 7, you can use WriteConsoleW, which is part of the Windows API. #include <windows.h> and try the following code:
DWORD numCharsToWrite = str.length();
LPDWORD numCharsWritten = NULL;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str.c_str(), numCharsToWrite, numCharsWritten, NULL);
where str is a std::wstring.
More on WriteConsoleW: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687401%28v=vs.85%29.aspx
After having tried other methods this worked for me.
Problem is that there are a lot of places where this could be broken.
Here is an answer I posted some time ago (it covers Korean). The answer is for MSVC, but the same applies to MinGW (the compiler switches are different, and the locale name may be different).
Here are 5 traps which make this hard:
1. Source code encoding. The source has to use an encoding which supports all required characters. Nowadays UTF-8 is recommended. It is best to make sure your editor (IDE) is properly configured to enforce the source encoding.
2. You have to inform the compiler of the encoding of the source file. For gcc it is: -finput-charset=utf-8 (this is the default).
3. Encoding used by the executable. You have to define which encoding string literals should use in the final executable. This encoding should also cover the required characters. Here UTF-8 is also best. The gcc option is -fexec-charset=utf-8.
4. When you run the application you have to inform the standard library what encoding your string literals are defined in, or which encoding is used in the program logic. So somewhere in your code, at the beginning of execution, you need something like this (here UTF-8 is enforced):
std::locale::global(std::locale{".utf-8"});
5. Finally you have to instruct the streams which encoding they should use. So for std::cout and std::cin you should set the locale which is the default for the system:
auto streamLocale = std::locale{""};
// this impacts date/time/floating point formats, so you may want to tweak it
// to use only the specific encoding and keep the C locale for formatting
std::cout.imbue(streamLocale);
std::cin.imbue(streamLocale);
After this everything should work as desired without code which explicitly does conversions.
Since there are 5 places to make a mistake, this is why people have trouble with it and why the internet is full of "hack" solutions.
Note that if the system is not configured to support all needed characters (for example, the wrong code page is set), then with this configuration the characters which cannot be converted will be replaced with question marks.
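Putting those pieces together, a minimal sketch might look like this (the ".utf-8" locale name is the MSVC spelling and is platform-dependent; the source is assumed to be saved as UTF-8 and compiled with the charset switches described above):
#include <iostream>
#include <locale>

int main()
{
    // step 4: tell the standard library which encoding string literals / program logic use
    std::locale::global(std::locale{".utf-8"});   // MSVC-style locale name; adjust for your platform

    // step 5: tell the streams to use the system's default encoding
    auto streamLocale = std::locale{""};
    std::cout.imbue(streamLocale);
    std::cin.imbue(streamLocale);

    // steps 1-3: the source file is saved as UTF-8 and compiled with
    // -finput-charset=utf-8 -fexec-charset=utf-8 (or the MSVC equivalents)
    std::cout << "한글\n";
    return 0;
}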

Size of UTF-8 string in bytes

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures, which look like this:
template < int N >
struct StringWrapper
{
    char theString[N];
};
To convert raw data from the QString, I do it like this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
Now the question. I noticed that if I convert from a normal string, it works fine:
QString str( "abc" );
std::cout << std::string( str.toUtf8().constData() ) << std::endl;
will produce the output:
abc
but if I use some special characters, for example:
QString str( "Schöne Grüße" );
std::cout << std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
ADDITIONAL QUESTION
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.
The first question you need to answer is: what is the encoding of your source files? The QString constructor assumes Latin-1 unless you change it with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:
QString str( "Schöne Grüße" );
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
The next thing to check is what the encoding of your console is. You are trying to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with some *.UTF-8 locale, it should be fine.
To your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the maximum you get with common native-language characters like Latin/Cyrillic/Hebrew/Chinese/Japanese. 4 bytes are probably used for something much more exotic; you can always check the standard if you are really curious.
The first thing that goes wrong is your stated assumption. QString doesn't store UTF-8, it stores Unicode strings. That's why you need to call str1.toUtf8(): it creates a temporary UTF-8 string.
The second part is just how UTF-8 works. It's a multi-byte extension of ASCII. ü and ß aren't ASCII characters, so you should expect both of them to get a multi-byte representation. std::cout apparently doesn't expect UTF-8; this depends on the std::locale used.
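A small sketch (assuming Qt) that makes the multi-byte representation visible:
#include <QByteArray>
#include <QDebug>
#include <QString>

int main()
{
    // QString stores the text as 16-bit QChars; toUtf8() produces the byte-oriented encoding.
    QString str = QString::fromUtf8( "Schöne Grüße" );
    QByteArray utf8 = str.toUtf8();
    qDebug() << str.length() << "QChars but" << utf8.size() << "UTF-8 bytes";
    // "ö", "ü" and "ß" each take two bytes in UTF-8, so utf8.size() is larger than str.length().
    return 0;
}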

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read in lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string- and wstring-based versions of getline - while it reads the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc.. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++.
Since C++11, the standard library provides std::codecvt_utf16, so one can just use that (see the sketch below for a code sample).
However, with older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode.
Alternatively, one can also try this method of reading, though it did not work in my case: the output would be missing lines, which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in a test, so I think that kind of complication is OK there) to execute my task.
Hope this spares others some time, happy to help.
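For reference, here is a sketch of the C++11 route mentioned above: std::codecvt_utf16 from <codecvt> (deprecated since C++17) imbued on a wifstream, matching the file described in the question (UCS-2LE with a BOM); the file name is just an example:
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream srcFile("report.txt", std::ios::binary);
    srcFile.imbue(std::locale(srcFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::wstring line;
    while (std::getline(srcFile, line))
    {
        // line now holds one wchar_t per character, so substr() counts characters, not bytes
        std::wcout << line.substr(0, 12) << L"\n";
    }
    return 0;
}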
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.

_wfopen equivalent under Mac OS X

I'm looking to the equivalent of Windows _wfopen() under Mac OS X. Any idea?
I need this in order to port a Windows library that uses wchar* for its File interface. As this is intended to be a cross-platform library, I am unable to rely on how the client application will get the file path and give it to the library.
The POSIX APIs on Mac OS X are usable with UTF-8 strings. In order to convert a wchar_t string to UTF-8, you can use the CoreFoundation framework from Mac OS X.
Here is a class that will wrap a UTF-8 string generated from a wchar_t string.
class Utf8
{
public:
    Utf8(const wchar_t* wsz): m_utf8(NULL)
    {
        // OS X uses 32-bit wchar
        const int bytes = wcslen(wsz) * sizeof(wchar_t);
        // comp_bLittleEndian is in the lib I use in order to detect PowerPC/Intel
        CFStringEncoding encoding = comp_bLittleEndian ? kCFStringEncodingUTF32LE
                                                       : kCFStringEncodingUTF32BE;
        CFStringRef str = CFStringCreateWithBytesNoCopy(NULL,
                                                        (const UInt8*)wsz, bytes,
                                                        encoding, false,
                                                        kCFAllocatorNull);
        const int bytesUtf8 = CFStringGetMaximumSizeOfFileSystemRepresentation(str);
        m_utf8 = new char[bytesUtf8];
        CFStringGetFileSystemRepresentation(str, m_utf8, bytesUtf8);
        CFRelease(str);
    }

    ~Utf8()
    {
        if( m_utf8 )
        {
            delete[] m_utf8;
        }
    }

public:
    operator const char*() const { return m_utf8; }

private:
    char* m_utf8;
};
Usage:
const wchar_t* wsz = L"Here is some Unicode content: éà€œæ";
const Utf8 utf8 = wsz;
FILE* file = fopen(utf8, "r");
This will work for reading or writing files.
You just want to open a file handle using a path that may contain Unicode characters, right? Just pass the path in filesystem representation to fopen.
If the path came from the stock Mac OS X frameworks (for example, an Open panel whether Carbon or Cocoa), you won't need to do any conversion on it and will be able to use it as-is.
If you're generating part of the path yourself, you should create a CFStringRef from your path and then get that in filesystem representation to pass to POSIX APIs like open or fopen.
Generally speaking, you won't have to do a lot of that for most applications. For example, many applications may have auxiliary data files stored in the user's Application Support directory, but as long as the names of those files are ASCII, and you use standard Mac OS X APIs to locate the user's Application Support directory, you don't need to do a bunch of paranoid conversion of a path constructed from those two components.
Edited to add: I would strongly caution against arbitrarily converting everything to UTF-8 using something like wcstombs because filesystem encoding is not necessarily identical to the generated UTF-8. Mac OS X and Windows both use specific (but different) canonical decomposition rules for the encoding used in filesystem paths.
For example, they need to decide whether "é" will be stored as one or two code units (either LATIN SMALL LETTER E WITH ACUTE or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). These will result in two different — and different-length — byte sequences, and both Mac OS X and Windows work to avoid putting multiple files with the same name (as the user perceives them) in the same directory.
The rules for how to perform this canonical decomposition can get pretty hairy, so rather than try to implement it yourself it's best to leave it to the functions the system frameworks have provided for you to do the heavy lifting.
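A rough sketch of that approach (macOS only; the helper name and buffer size are illustrative):
#include <CoreFoundation/CoreFoundation.h>
#include <cstdio>
#include <limits.h>

// Build a CFString from a UTF-8 path you assembled yourself, and let the framework
// produce the canonical filesystem representation before calling fopen.
FILE* OpenUtf8Path(const char* utf8Path, const char* mode)
{
    CFStringRef str = CFStringCreateWithCString(kCFAllocatorDefault, utf8Path,
                                                kCFStringEncodingUTF8);
    if (!str)
        return NULL;

    char fsRep[PATH_MAX];
    FILE* file = NULL;
    if (CFStringGetFileSystemRepresentation(str, fsRep, sizeof(fsRep)))
        file = std::fopen(fsRep, mode);

    CFRelease(str);
    return file;
}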
#JKP:
Not all functions in MacOS X accept UTF8, but filenames and filepaths may be UTF8, thus all POSIX functions dealing with file access (open, fopen, stat, etc.) accept UTF8.
See here. Quote:
How a file name looks at the API level depends on the API. Current Carbon APIs handle file names as an array of UTF-16 characters; POSIX ones handle them as an array of UTF-8, which is why UTF-8 works well in Terminal. How it's stored on disk depends on the disk format; HFS+ uses UTF-16, but that's not important in most cases.
Some other POSIX functions handle UTF8 as well. E.g. functions dealing with user names, group names or user passwords use UTF8 to store the information (thus a user name can be Japanese and your password can be Chinese, no problem).
But not all of them handle UTF8. E.g. for all string functions, a UTF8 string is just a normal C string, and characters above 126 have no special meaning. They don't understand the concept of multiple bytes (chars in C) forming a single Unicode character. How other APIs handle a char* pointer being passed to them differs from API to API. However, as a rule of thumb you can say:
Either the function only accepts C strings with pure ASCII characters (only in the range 0 to 126), or it accepts UTF8. Usually functions don't accept characters above 126 and interpret them in any encoding other than UTF8; if one really does, it is documented, and then there must be a way to pass the encoding along with the string.
If you're using Cocoa it's fairly easy with NSString. Just load the UTF16 data in using -initWithBytes:length:encoding: (or perhaps -initWithCString:encoding:) and then get a UTF8 version by calling UTF8String on the result. Then, just call fopen with your new UTF8 string as the param.
You can definitely call fopen with a UTF-8 string, regardless of language - can't help with C++ on OSX though - sorry.
I have read a file name from a UTF-8 configuration file through wifstream (it uses a wchar_t buffer).
The Mac implementation is different from Linux and Windows.
wifstream reads each byte from the file into a separate wchar_t cell in the buffer, so we end up with 3 empty bytes per character, even though open requires a char string. The programmer can therefore use the wcstombs function to convert the wide character string to a multi-byte string.
The API supports UTF-8. For a better understanding, use a memory watcher and a hex editor on your file.