Conversion from UTF-8 to ANSI wcstombs fails at one special character - c++

I want to change a wchar_t* like it is displayed to a char*.
No conversions like in the WideCharToMultibyte should be done.
I found the wcstombs function and it looked like it works perfectly, but there is one char which does not get changed correctly.
It is the 'œ', it has the ANSI Number 156, but in UTF-8 it is the number 339.
Of course ASCII does not have that many numbers, but why does it get the wrong one?
Here is part of my source code; I added a loop and an if so that it works:
wchar_t *wc; // source string
char *cc; // destination string
int len = 0; // length of the strings
...
for (int i = 0; i < len; i++) {
    if (wc[i] != 339) {
        cc[i] = wc[i];
    } else {
        cc[i] = 156;
    }
}
This code works, but seriously, is this the best way to solve that problem?
Many thanks in advance!

I want to change a wchar_t* like it is displayed to a char*.
Okay, you want to convert from wchar_t strings to char strings.
No conversions like in the WideCharToMultibyte should be done.
What? I presume you don't mean 'no conversion should be done,' but with only one example I can't tell what you want to avoid. Just WideCharToMultibyte or are there other functions?
I found the wcstombs function and it looked like it works perfectly,
wcstombs seems like WideCharToMultibyte to me, but I guess it's different in some way that's important to you? It'd be good if you could describe what exactly makes wcstombs acceptable and WideCharToMultibyte unacceptable.
but there is one char which does not get changed correctly.
Sounds like it's not working perfectly...
It is the 'œ', it has the ANSI Number 156, but in UTF-8 it is the number 339. Of course ASCII does not have that many numbers, but why does it get the wrong one?
You probably mean that in CP1252 'œ' is encoded as 156 in decimal or 0x9C in hex, and that this character has the Unicode codepoint 339 in decimal, or more conventionally U+0153. I don't see where UTF-8 comes into this at all.
Here is part of my source code; I added a loop and an if so that it works:
As for why you're not getting the results you expect, it's probably because you're not using wcstombs() correctly. It's hard to tell because you're not showing how you're doing the conversion with wcstombs().
wcstombs() converts between wchar_t and char using the encodings specified by the program's current C locale. If you've set the locale to one that uses a Unicode encoding for wchar_t and uses CP1252 for char then it should do what you expect.
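For illustration, here is a minimal sketch of that approach; the locale name "German_Germany.1252" is an assumption (a Windows-style name, locale names differ across platforms), and error handling is reduced to early returns:
#include <clocale>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Pick a locale whose narrow (char) encoding is CP1252; the name is platform-specific.
    if (std::setlocale(LC_ALL, "German_Germany.1252") == NULL)
        return 1;

    const wchar_t *wc = L"\u0153";   // 'œ', Unicode code point U+0153 (339 decimal)
    char cc[8] = {0};

    // wcstombs() converts using the narrow encoding of the current C locale;
    // with a CP1252 locale the result should be the single byte 0x9C (156).
    if (std::wcstombs(cc, wc, sizeof cc) == (std::size_t)-1)
        return 1;                    // the character is not representable in this locale

    std::printf("first byte = %u\n", (unsigned)(unsigned char)cc[0]);
    return 0;
}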
This code works, but seriously, is this the best way to solve that problem?
No.

Please bear with my complete ignorance of C/C++, but you can either use a custom lookup table
or some pre-made function.
Here is an array of 256 integers, where index i contains the Unicode code point for the Windows-1252
code unit i.
So, for instance, index 156 contains 0x0153, which is 339 in decimal.
int windows1252ToUnicodeCodePoints[256] = {
0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000A,0x000B,0x000C,0x000D,0x000E,0x000F
,0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,0x001F
,0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002A,0x002B,0x002C,0x002D,0x002E,0x002F
,0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,0x003F
,0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004A,0x004B,0x004C,0x004D,0x004E,0x004F
,0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005A,0x005B,0x005C,0x005D,0x005E,0x005F
,0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006A,0x006B,0x006C,0x006D,0x006E,0x006F
,0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007A,0x007B,0x007C,0x007D,0x007E,0x007F
,0x20AC,0xFFFD,0x201A,0x0192,0x201E,0x2026,0x2020,0x2021,0x02C6,0x2030,0x0160,0x2039,0x0152,0xFFFD,0x017D,0xFFFD
,0xFFFD,0x2018,0x2019,0x201C,0x201D,0x2022,0x2013,0x2014,0x02DC,0x2122,0x0161,0x203A,0x0153,0xFFFD,0x017E,0x0178
,0x00A0,0x00A1,0x00A2,0x00A3,0x00A4,0x00A5,0x00A6,0x00A7,0x00A8,0x00A9,0x00AA,0x00AB,0x00AC,0x00AD,0x00AE,0x00AF
,0x00B0,0x00B1,0x00B2,0x00B3,0x00B4,0x00B5,0x00B6,0x00B7,0x00B8,0x00B9,0x00BA,0x00BB,0x00BC,0x00BD,0x00BE,0x00BF
,0x00C0,0x00C1,0x00C2,0x00C3,0x00C4,0x00C5,0x00C6,0x00C7,0x00C8,0x00C9,0x00CA,0x00CB,0x00CC,0x00CD,0x00CE,0x00CF
,0x00D0,0x00D1,0x00D2,0x00D3,0x00D4,0x00D5,0x00D6,0x00D7,0x00D8,0x00D9,0x00DA,0x00DB,0x00DC,0x00DD,0x00DE,0x00DF
,0x00E0,0x00E1,0x00E2,0x00E3,0x00E4,0x00E5,0x00E6,0x00E7,0x00E8,0x00E9,0x00EA,0x00EB,0x00EC,0x00ED,0x00EE,0x00EF
,0x00F0,0x00F1,0x00F2,0x00F3,0x00F4,0x00F5,0x00F6,0x00F7,0x00F8,0x00F9,0x00FA,0x00FB,0x00FC,0x00FD,0x00FE,0x00FF
};
What you need is this table inverted (or do a linear scan every time); in any other language I would use a construct like Map<int, int>, which in C++ would be std::map<int, int> or std::unordered_map<int, int>. A possible sketch of that reverse lookup follows below.
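Here is one possible sketch of that inverse lookup in C++ (assuming the table above, windows1252ToUnicodeCodePoints, is in scope); characters without a CP1252 equivalent fall back to '?':
#include <unordered_map>

// Build the reverse map once: Unicode code point -> Windows-1252 byte value.
std::unordered_map<int, unsigned char> buildUnicodeToWindows1252()
{
    std::unordered_map<int, unsigned char> rev;
    for (int i = 0; i < 256; ++i)
        rev[windows1252ToUnicodeCodePoints[i]] = static_cast<unsigned char>(i);
    return rev;
}

// Convert one wide character; '?' is the fallback for unmappable characters.
char toWindows1252(wchar_t wc, const std::unordered_map<int, unsigned char> &rev)
{
    auto it = rev.find(static_cast<int>(wc));
    return it != rev.end() ? static_cast<char>(it->second) : '?';
}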

Related

Rendering font with UTF8 in SDL_ttf

I am trying to render characters using the TTF_RenderUTF8_Blended method provided by the SDL_ttf library. I implemented user input (keyboard), and pressing 'ä' or 'ß', for example, works fine. These are special characters of the German language; in this case they are even in the extended ASCII 8-bit code. But even when I copy and paste some Greek letters, for example, the fonts get rendered correctly using UTF-8. (However, I noticed during testing that I cannot render every Unicode glyph you can find here (http://unicode-table.com/), but I guess that is normal because the Arial font might not have every single glyph. Anyway, most Unicode glyphs work fine.)
My problem is that when passing strings (the parameter is a const char*), the characters beyond ASCII aren't rendered correctly. So entering 'Ä', 'ß', or some other Unicode chars with the keyboard at runtime works, but passing them as a parameter to get - let's say - a title for my game inside the code like this does not work:
font_srf = TTF_RenderUTF8_Blended(font, "Hällö", font_clr);
I don't really understand why this is happening. What I get on the screen is:
H_ll_
And I am using _ to represent the typical vertical rectangle (the missing-glyph box) that the speaker in the following talk used as a funny way of introduction:
https://www.youtube.com/watch?v=MW884pluTw8
Ironically, when I use TTF_RenderText_Blended(font, "Hällö", font_clr); it works, because 'ä' and 'ö' are in the 8-bit extended ASCII encoding, but what I want is Unicode support, so that does not help.
Edit & Semi-Solution
I kind of (though not really well) fixed the problem. Because my input works fine, I just checked what values I get as input when I press 'ä', 'ß', ... on my keyboard, using the following code:
const char* c = input.c_str();
for (int i = 0; i < input.length(); i++)
{
    std::cout << int(c[i]) << " ";
}
Then I printed those characters in the following way:
const char char_array[] = {-61, -74, -61, -97, '\0'};
const char* char_pointer = char_array;
-61, -74 is 'ö' and -61, -97 is 'ß'.
This does fit the UTF-8 encoding, right?
U+00F6 | ö | C3 B6 (from UTF8 data table)
256-61=195 which is C3
and 256-74=182 which is B6
const char char_array[] = {'\xC3', '\xB6', '\0'};
This code works fine as well in case some of you were wondering. And I think this is what I will keep doing for now. Looking up the Hex-code for some Unicode glyphs isn't that hard.
But what I still can't figure out is how to get to the extended ASCII integer value of 246. Plus, isn't there a more human-friendly solution to my problem?
If you have non-ASCII characters in a source file, the character encoding of that source code file matters. So in your text editor or IDE, you need to set the character set (e.g. UTF-8) when you save it.
Alternatively, you can use the \x... or \u.... format to specify non-ASCII characters using only ASCII characters, so source file encoding doesn't matter.
Microsoft doc, but not MS-specific:
https://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx
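For example, a minimal sketch of that escape-based approach for the title from the question; with a C++11 u8 prefix the literal is guaranteed to be encoded as UTF-8 regardless of the source file's encoding (font, font_srf and font_clr are the variables from the question above):
// \u00E4 is 'ä' and \u00F6 is 'ö'; only ASCII characters appear in the source file.
const char *title = u8"H\u00E4ll\u00F6";   // "Hällö" encoded as UTF-8
font_srf = TTF_RenderUTF8_Blended(font, title, font_clr);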

Conversion from char * to wchar* does not work properly

I'm getting a string like "aña!a¡a¿a?a" from the server, so I decode it and then pass it to a function.
What I need to do with the message is something like loading paths depending on the letters.
The header of my function is: void SetInfo(int num, char *descr[4]) so it receives one number and an array of 4 C strings (sentences). To make it easier, let's say I just need to work with descr[0].
When I debug and arrive at SetInfo(), I get the exact message in the debug view: "aña!a¡a¿a?a", so up to here everything is OK.
Initially, the info I was receiving in that function was a std::wstring, so all my code working with that message used wstrings and strings, but now what I receive is a char* as shown in the header. The message arrives here fine, but when I want to work with it I can't, because if I debug and look at each position of descr[0] I get
descr[0][0] = 'a'; //ok
descr[0][1] = 'Ã '; // BAD
so I tried converting char* to wchar_t* with code found here:
size_t size = strlen(descr[0]) + 1;
wchar_t* wa = new wchar_t[size];
mbstowcs(wa,descr[0],size);
But then the debugger shows me that wa has:
wa wchar_t * 0x185d4be8 L"a-\uffffffff刯2e2e牵6365⽳6f73歯6f4c楲6553䈯736f獵6e6f档6946琯7361灭6569湰2e6f琀0067\021ᡰ9740슃b8\020\210=r"
which I suppose is incorrect (I'm supposing that I have to see the same initial message of "aña!a¡a¿a?a"; if this message is fine, then I don't know how to get what I need...)
So my question is: how can I get descr[0][0] = 'a' and descr[0][1] = 'ñ'? I can't pass char to wchar_t (you've already seen what I got). Am I doing it wrong? Or is there any other way? I am really stuck on this, so any idea will be very appreciated.
Before, when I was working with wstrings (and it worked fine), I was doing something like this:
if (word[i]==L'\x00D1' or word[i]==L'\x00F1') // ñ or Ñ
path ="PathOfÑ";
where word[i] is the same as descr[0][1] in that case, but with wstrings. So with that I knew that this word[i] was the letter 'ñ'. Maybe this helps to understand what I'm doing.
(btw...I'm working on eclipse, on linux. )
The mbstowcs function works on C-style strings, and one of the things about C-style strings is that they have a special terminating character, '\0'. You don't seem to be adding this terminator to the string, leading mbstowcs to run past the end of the actual string and giving you undefined behavior.
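As an illustration, here is a minimal sketch of mbstowcs() on a NUL-terminated UTF-8 input; the locale name "en_US.UTF-8" is an assumption (the question mentions Linux), and the narrow encoding of the chosen locale must match the encoding of the incoming bytes:
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <cwchar>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");       // narrow encoding of the locale = UTF-8

    const char *descr0 = "a\xC3\xB1" "a!a";      // "aña!a", NUL-terminated UTF-8
    std::size_t size = std::strlen(descr0) + 1;  // +1 so the terminator is converted too
    wchar_t *wa = new wchar_t[size];

    if (std::mbstowcs(wa, descr0, size) != (std::size_t)-1)
        std::wprintf(L"%ls\n", wa);              // prints "aña!a"

    delete[] wa;
    return 0;
}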

Convert wide CString to char*

There are lots of times this question has been asked, and as many answers - none of which work for me and, it seems, for many others. The question is about wide CStrings and 8-bit chars under MFC. We all want an answer that will work in ALL cases, not in a specific instance.
void Dosomething(CString csFileName)
{
    char cLocFileNamestr[1024];
    char cIntFileNamestr[1024];
    // Convert from whatever version of CString is supplied
    // to an 8 bit char string
    cIntFileNamestr = ConvertCStochar(csFileName);
    sprintf_s(cLocFileNamestr, "%s_%s", cIntFileNamestr, "pling.txt");
    m_KFile = fopen(cLocFileNamestr, "wt");
}
This is an addition to existing code (by somebody else) for debugging.
I don't want to change the function signature, it is used in many places.
I cannot change the signature of sprintf_s, it is a library function.
You are leaving out a lot of details, or ignoring them. If you are building with UNICODE defined (which it seems you are), then the easiest way to convert to MBCS is like this:
CStringA strAIntFileNameStr = csFileName.GetString(); // uses default code page
CStringA is the 8-bit/MBCS version of CString.
However, it will fill in garbage characters if the Unicode string you are translating from contains characters that are not in the default code page.
Instead of using fopen(), you could use _wfopen() which will open a file with a unicode filename. To create your file name, you would use swprintf_s().
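A minimal sketch of that wide-character route, keeping the question's signature (m_KFile is assumed to be the FILE* from the question):
void Dosomething(CString csFileName)
{
    wchar_t wLocFileNameStr[1024];
    // Build the file name entirely in wide characters; nothing is narrowed.
    swprintf_s(wLocFileNameStr, L"%s_%s", csFileName.GetString(), L"pling.txt");
    m_KFile = _wfopen(wLocFileNameStr, L"wt");
}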
an answer that will work in ALL cases, not a specific instance...
There is no such thing.
It's easy to convert "ABCD..." from wchar_t* to char*, but it doesn't work that way with non-Latin languages.
Stick to CString and wchar_t when your project is Unicode.
If you need to upload data to webpage or something, then use CW2A and CA2W for utf-8 and utf-16 conversion.
CStringW unicode = L"Россия";
MessageBoxW(0,unicode,L"Russian",0);//should be okay
CStringA utf8 = CW2A(unicode, CP_UTF8);
::MessageBoxA(0,utf8,"format error",0);//WinApi doesn't get UTF-8
char buf[1024];
strcpy(buf, utf8);
::MessageBoxA(0,buf,"format error",0);//same problem
//send this buf to webpage or other utf-8 systems
//this should be compatible with notepad etc.
//text will appear correctly
ofstream f(L"c:\\stuff\\okay.txt");
f.write(buf, strlen(buf));
//convert utf8 back to utf16
unicode = CA2W(buf, CP_UTF8);
::MessageBoxW(0,unicode,L"okay",0);

Size of UTF-8 string in bytes

I am using QString to store strings, and now I need to store these strings (converted to UTF-8 encoding) in POD structures which look like this:
template <int N>
struct StringWrapper
{
    char theString[N];
};
To convert raw data from the QString, I do it like this:
QString str1( "abc" );
StringWrapper< 20 > str2;
strcpy( str2.theString, str1.toUtf8().constData() );
Now the question. I noticed that if I convert from a normal string, it works fine:
QString str( "abc" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
will produce as the output :
abc
but if I use some special characters, like for example:
QString str( "Schöne Grüße" );
std::cout<< std::string( str.toUtf8().constData() ) << std::endl;
I get garbage like this:
Gr\xC3\x83\xC2\xBC\xC3\x83\xC2\x9F
I am obviously missing something, but what exactly is wrong?
ADDITIONAL QUESTION
What is the maximum size of a UTF-8 encoded character? I read here that it is 4 bytes.
The first question you need to answer is: what is the encoding of your source files? The QString constructor taking a char* assumes it's Latin-1 unless you change that with QTextCodec::setCodecForCStrings(). So if your sources are in anything other than Latin-1 (say, UTF-8), you get a wrong result at this point:
QString str( "Schöne Grüße" );
Now, if your sources are in UTF-8, you need to replace it with:
QString str = QString::fromUtf8( "Schöne Grüße" );
Or, better yet, use QObject::trUtf8() wherever possible, as it gives you i18n capabilities as a free bonus.
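Putting this together with the StringWrapper from the question, a minimal sketch might look like this; qstrncpy() is Qt's bounded copy (it always NUL-terminates), and the temporary QByteArray is kept in a named variable so its data doesn't dangle:
QString str = QString::fromUtf8("Schöne Grüße");   // assumes the source file is saved as UTF-8

StringWrapper<20> wrapped;
QByteArray utf8 = str.toUtf8();
qstrncpy(wrapped.theString, utf8.constData(), sizeof wrapped.theString);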
The next thing to check is the encoding of your console. You try to print a UTF-8 string to it, but does it support UTF-8? If it's a Windows console, it probably doesn't. If it's something xterm-compatible using a Unicode font on a *nix system with some *.UTF-8 locale, it should be fine.
To your edited question:
I don't see any reason not to trust Wikipedia, especially when it refers to a particular standard. It also mentions that UTF-8 used to allow characters of up to 6 bytes, though. In my experience, 3 bytes is the maximum you get with reasonable native-language characters like Latin/Cyrillic/Hebrew/Chinese/Japanese. 4 bytes are probably used for something much more exotic; you can always check the standard if you are really curious.
The first thing that goes wrong is your stated assumption. QString doesn't store UTF-8, it stores Unicode strings (as UTF-16). That's why you need to call str1.toUtf8(): it creates a temporary UTF-8 string.
The second part is just how UTF-8 works. It's a multi-byte extension of ASCII. 'ü' and 'ß' aren't ASCII characters, and you should expect both characters to get a multi-byte representation. std::cout apparently doesn't expect UTF-8; this depends on the std::locale used.

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read in lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string- and wstring-based versions of getline – while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU as well as done numerous Google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while (getline(srcFile, srcBuf))
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11, see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since the <codecvt> conversion facets are now part of C++11, one can just use std::codecvt_utf16 (see the next bullet for code samples, and the sketch after this answer)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output was missing lines, which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in a test, so I think that kind of complication is OK there) to execute my task.
Hope this spares others some time, happy to help.
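For reference, here is a minimal sketch of the C++11 codecvt_utf16 route mentioned above, adapted to the question's code; std::codecvt_utf16 lives in <codecvt> (deprecated since C++17, but that matches the pre-2017 context here), std::little_endian matches UCS-2LE, and std::consume_header skips the 0xFFFE BOM:
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main(int argc, char *argv[])
{
    std::wifstream srcFile(argv[1], std::ios_base::in | std::ios_base::binary);

    // Imbue a facet that decodes the byte stream as little-endian UTF-16 into wchar_t.
    srcFile.imbue(std::locale(srcFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::wstring srcBuf;
    while (std::getline(srcFile, srcBuf))
    {
        std::wstring field1 = srcBuf.substr(12, 12);   // now indexes characters, not bytes
        // ...
    }
    return 0;
}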
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;

int main()
{
    wstring s1 = L"Hello, world";
    wstring s2 = s1.substr(3, 5);
    wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which will cause each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.