Rendering font with UTF8 in SDL_ttf - c++

I am trying to render characters using the TTF_RenderUTF8_Blended method provided by the SDL_ttf library. I implemented keyboard input, and pressing 'ä' or 'ß', for example, works fine. These are special characters of the German language; in this case they even fit into the extended 8-bit ASCII range, but even when I copy and paste some Greek letters, for example, the glyphs get rendered correctly via UTF-8. (However, I noticed during testing that I cannot render every Unicode glyph listed here (http://unicode-table.com/), but I guess that is normal because the Arial font probably does not contain every single glyph. Most Unicode glyphs work fine anyway.)
My problem is that when I pass a string literal (as a const char* parameter), the characters beyond ASCII aren't rendered correctly. So entering 'Ä', 'ß', or other Unicode characters with the keyboard at runtime works, but passing them as a parameter in the code, for instance to render a title for my game like this, does not:
font_srf = TTF_RenderUTF8_Blended(font, "Hällö", font_clr);
I don't really understand why this is happening. What I get on the screen is:
H_ll_
And I am using _ to represent the typical vertical rectangle glyph, the one the speaker in the following talk used as a funny way of introducing himself:
https://www.youtube.com/watch?v=MW884pluTw8
Ironically, when I use TTF_RenderText_Blended(font, "Hällö", font_clr); it works, because 'ä' and 'ö' are encoded in 8-bit extended ASCII, but what I want is Unicode support, so that does not help.
Edit & Semi-Solution
I have fixed the problem, though not in a really good way. Because my input works fine, I simply checked which values I receive when I press 'ä', 'ß', ... on my keyboard, using the following code:
const char* c = input.c_str();
for (int i = 0; i < input.length(); i++)
{
    std::cout << int(c[i]) << " ";
}
Then I printed those characters in the following way:
const char char_array[] = {-61, -74, -61, -97, '\0'};
const char* char_pointer = char_array;
-61, -74 is 'ö' and -61, -97 is 'ß'.
This does fit the UTF-8 encoding, right?
U+00F6 | ö | C3 B6 (from UTF8 data table)
256-61=195 which is C3
and 256-74=182 which is B6
const char char_array[] = {'\xC3', '\xB6', '\0'};
This code works fine as well, in case some of you were wondering, and I think this is what I will keep doing for now. Looking up the hex code for a few Unicode glyphs isn't that hard.
But what I still can't figure out is how to get to the extended ASCII integer value of 246. Plus, isn't there a more human-friendly solution to my problem?
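For what it's worth, getting from those two signed char values back to the familiar value 246 is just a matter of treating the bytes as unsigned and decoding the two-byte UTF-8 sequence. A minimal sketch of that arithmetic (an illustration, not part of the original code):

#include <iostream>

int main()
{
    const char utf8[] = {-61, -74, '\0'};                      // "ö" as printed by the input loop above
    unsigned char b0 = static_cast<unsigned char>(utf8[0]);    // 195 = 0xC3 (this is why -61 appeared)
    unsigned char b1 = static_cast<unsigned char>(utf8[1]);    // 182 = 0xB6
    int codepoint = ((b0 & 0x1F) << 6) | (b1 & 0x3F);          // decode a 2-byte UTF-8 sequence
    std::cout << codepoint << "\n";                            // prints 246, i.e. U+00F6 'ö'
    return 0;
}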

If you have non-ASCII characters in a source file, the character encoding of that source code file matters. So in your text editor or IDE, you need to set the character set (e.g. UTF-8) when you save it.
Alternatively, you can use the \x... or \u... format to specify non-ASCII characters using only ASCII characters, so the source file encoding doesn't matter.
Microsoft doc, but not MS-specific:
https://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx
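Applied to the original SDL_ttf call, that could look like the sketch below, which spells out the UTF-8 bytes of 'ä' (C3 A4) and 'ö' (C3 B6) as \x escapes so the encoding of the source file no longer matters (font, font_srf, and font_clr are the variables from the question):

// "Hällö" written with UTF-8 byte escapes; the literal is pure ASCII,
// yet TTF_RenderUTF8_Blended still receives valid UTF-8.
font_srf = TTF_RenderUTF8_Blended(font, "H\xC3\xA4ll\xC3\xB6", font_clr);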

Related

Can't display 'ä' in GLFW window title

glfwSetWindowTitle(win, "Nämen");
Becomes "N?men", where '?' is in a little black, twisted square, indicating that the character could not be displayed.
How do I display 'ä'?
If you want to use non-ASCII letters in the window title, then the string has to be utf-8 encoded.
GLFW: Window title:
The window title is a regular C string using the UTF-8 encoding. This means for example that, as long as your source file is encoded as UTF-8, you can use any Unicode characters.
If you see a little black, twisted square, this indicates that the ä is encoded with some ISO encoding that is not UTF-8, maybe something like Latin-1. To fix this, open the file in an editor in which you can change the encoding, change it to UTF-8 (without BOM), and fix the ä in the title.
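If re-saving the file is not an option, a sketch of an alternative (the same escape-sequence idea mentioned in the SDL_ttf answer above): write the UTF-8 bytes of 'ä' (C3 A4) as \x escapes, so neither the editor nor the compiler's encoding settings matter:

// The literal is plain ASCII, but GLFW still receives the UTF-8 bytes for "Nämen".
glfwSetWindowTitle(win, "N\xC3\xA4men");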
It seems like the GLFW implementation does not work according to the specification in this case. Probably the function still uses Latin-1 instead of UTF-8.
I had the same problem on GLFW 3.3 Windows 64 bit precompiled binaries and fixed it like this:
SetWindowTextA(glfwGetWin32Window(win),"Nämen")
The issue does not lie within GLFW but within the compiler. Encodings are handled by the major compilers as follows:
Good guy clang assumes that every file is encoded in UTF-8.
Trusty gcc checks the system's settings¹ and falls back on UTF-8 when it fails to determine one.
MSVC checks for a BOM and uses the detected encoding; otherwise it assumes the file is encoded using the user's current code page².
You can determine your current code page by simply running chcp in your (Windows) console or PowerShell. For me, on a fresh install of Windows 11, it yields "850" (language: English; keyboard: German), which stands for Code Page 850.
To fix this issue you have several solutions:
Change your system's code page. Arguably the worst solution.
Prefix your strings with u8, escape all Unicode literals, and convert the string to a wide string before passing it to Win32 functions; e.g.:
const char* title = u8"\u0421\u043b\u0430\u0432\u0430\u0020\u0423\u043a\u0440\u0430\u0457\u043d\u0456\u0021";
// This conversion is actually performed by GLFW; see footnote ^3
const int l = MultiByteToWideChar(CP_UTF8, 0, title, -1, NULL, 0);
wchar_t* buf = static_cast<wchar_t*>(_malloca(l * sizeof(wchar_t)));
MultiByteToWideChar(CP_UTF8, 0, title, -1, buf, l);
SetWindowTextW(hWnd, buf);
_freea(buf);
Save your source files with UTF-8 encoding WITH BOM. This allows you to write your strings without having to escape them. You'll still need to convert the string to a wide char string using the method seen above.
Specify the /utf-8 flag when compiling; this has the same effect as the previous solution, but you don't need the BOM anymore.
The solutions stated above still require you to convert your good and nice string to a big chunky wide string.
Another option would be to provide a manifest⁴ with the activeCodePage set to UTF-8. This way, all Win32 functions with the A suffix (e.g. SetWindowTextA) accept and properly handle UTF-8 strings, provided the running system is Windows version 1903 or newer.
TL;DR
Compile your application with the /utf-8 flag active.
IMPORTANT: This works for the Win32 APIs. This doesn't let you magically write Unicode emojis to the console like a hipster JS developer.
¹ I suppose it reads the LC_ALL setting on Linux. In the last 6 years I have never seen a distribution that does NOT specify UTF-8. However, take this information with a grain of salt; I might be entirely wrong about how gcc handles this now.
² If no byte-order mark is found, it assumes that the source file is encoded in the current user code page [...].
³ GLFW performs the conversion as seen here.
⁴ More about Win32 ANSI APIs, manifests, and UTF-8 can be found here.

QTextBrowser not displaying non-english characters

I'm developing a Qt GUI application to parse out a custom Windows binary file that stores Unicode text using wchar_t (default UTF-16 encoding). I've constructed a QString using QString::fromWCharArray and passed it to QTextBrowser::insertPlainText like this:
wchar_t *p = /* ... */; // pointer to a wchar_t string in the binary file
QString t = QString::fromWCharArray(p);
ui.logBrowser->insertPlainText(t);
The displayed text displays ASCII characters correctly, but non-ASCII characters are displayed as a rectangular box instead. I've followed the code in a debugger and p points to a valid wchar_t string and the constructed QString t is also a valid string matching the wchar_t string. The problem happens when printing it out on a QTextBrowser.
How do I fix this?
First of all, read the documentation. Depending on the system, wchar_t will use a different encoding, UCS-4 or UTF-16! What is the size of wchar_t on your platform?
Secondly, there is an alternative API: try QString::fromUtf16.
Finally, what kind of characters are you using? Hebrew/Cyrillic/Japanese/...? Are you sure those characters are supported by the font you are using?
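A minimal sketch of the fromUtf16 suggestion, assuming the buffer read from the file really is little-endian UTF-16 and is suitably aligned (the Qt 5-style ushort overload; raw and byteCount are hypothetical names, not from the question):

#include <QString>

// Interpret raw little-endian UTF-16 bytes as 16-bit code units,
// independent of the platform's sizeof(wchar_t).
QString decodeUtf16(const char *raw, int byteCount)
{
    const ushort *units = reinterpret_cast<const ushort *>(raw);
    return QString::fromUtf16(units, byteCount / 2);
}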

std::wstring to QString conversion with Hiragana

I'm trying to convert text containing Hiragana from a wstring to a QString, so that it can be used on a label's text property. However my code is not working and I'm not sure why that is.
The output of the following conversion obviously tells me that I did something wrong:
std::wstring myWString = L"Some Hiragana: あ い う え お";
ui->label->setText(QString::fromStdWString(myWString));
Output: Some Hiragana: ゠ㄠㆠ㈠ãŠ
I can print Hiragana on a label if I put them in a string directly:
ui->label->setText("Some Hiragana: あ い う え お");
Output: Some Hiragana: あ い う え お
That means I can avoid this problem by simply using std::string instead of std::wstring, but I'd like to know why this is happening.
VS is interpreting the file as Windows-1252 instead of UTF-8.
As an example, 'あ' in UTF-8 is E3 81 82, but the compiler is reading each byte as a single Windows-1252 char before converting it to the respective UTF-16 code points 00E3 and 201A, which works out as 'ã‚' (81 is either ignored by VS as it is reserved in Windows-1252, or not printed by Qt if VS happens to convert it to the respective C1 control character).
The direct version works because the compiler doesn't perform any conversions and leaves the string as E3 81 82.
To fix your issue you will need to inform VS that the file is UTF-8, according to other posts one way is to ensure the file has a UTF-8 BOM.
The only portable way of fixing this is to use escape sequences instead:
L"Some Hiragana: \u3042 \u3044 \u3046 \u3048 \u304A"

Conversion from UTF-8 to ANSI: wcstombs fails at one special character

I want to change a wchar_t* like it is displayed to a char*.
No conversions like in the WideCharToMultibyte should be done.
I found the wcstombs function and it looked like it works perfectly, but there is one char which does not get changed correctly.
It is the 'œ': it has the ANSI number 156, but in UTF-8 it is the number 339.
Of course ASCII does not have that many numbers, but why does it get the wrong one?
Here is part of my source code; I added a loop and an if so that it works:
wchar_t *wc; // source string
char *cc; // destination string
int len = 0; // length of the strings
...
for(int i = 0; i < len; i++) {
    if(wc[i] != 339) {
        cc[i] = wc[i];
    } else {
        cc[i] = 156;
    }
}
This code is working, but seriously, is this the best way to solve that problem?
Many thanks in advance!
I want to change a wchar_t* like it is displayed to a char*.
Okay, you want to convert from wchar_t strings to char strings.
No conversions like in the WideCharToMultibyte should be done.
What? I presume you don't mean 'no conversion should be done,' but with only one example I can't tell what you want to avoid. Just WideCharToMultibyte or are there other functions?
I found the wcstombs function and it looked like it works perfectly,
wcstombs seems like WideCharToMultibyte to me, but I guess it's different in some way that's important to you? It'd be good if you could describe what exactly makes wcstombs acceptable and WideCharToMultibyte unacceptable.
but there is one char which does not get changed correctly.
Sounds like it's not working perfectly...
It is the 'œ': it has the ANSI number 156, but in UTF-8 it is the number 339. Of course ASCII does not have that many numbers, but why does it get the wrong one?
You probably mean that in CP1252 'œ' is encoded as 156 in decimal or 0x9C in hex, and that this character has the Unicode codepoint 339 in decimal, or more conventionally U+0153. I don't see where UTF-8 comes into this at all.
Here is part of my source code; I added a loop and an if so that it works:
As for why you're not getting the results you expect, it's probably because you're not using wcstombs() correctly. It's hard to tell because you're not showing how you're doing the conversion with wcstombs().
wcstombs() converts between wchar_t and char using the encodings specified by the program's current C locale. If you've set the locale to one that uses a Unicode encoding for wchar_t and uses CP1252 for char then it should do what you expect.
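A minimal sketch of that wcstombs() usage, assuming a Windows CRT where the locale name ".1252" selects CP1252 as the narrow encoding (error handling trimmed):

#include <clocale>
#include <cstdlib>
#include <iostream>

int main()
{
    std::setlocale(LC_ALL, ".1252");             // narrow chars encoded as CP1252 (Windows CRT)

    const wchar_t *wc = L"\u0153";               // U+0153 LATIN SMALL LIGATURE OE, 'œ'
    char cc[8] = {};
    std::size_t n = std::wcstombs(cc, wc, sizeof cc - 1);
    if (n == static_cast<std::size_t>(-1))
        std::cout << "unconvertible character\n"; // (size_t)-1 signals a character the locale can't encode
    else
        std::cout << static_cast<int>(static_cast<unsigned char>(cc[0])) << "\n"; // prints 156
    return 0;
}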
This code is working, but seriously, is this the best way to solve that problem?
No.
Please bear with my complete ignorance of C/C++, but you can either use a custom lookup table or some premade function.
Here is an array of 256 integers, where index i contains the Unicode code point for the Windows-1252 code point i.
So, for instance, index 156 contains 0x0153, which is 339 in decimal.
int windows1252ToUnicodeCodePoints[256] = {
0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000A,0x000B,0x000C,0x000D,0x000E,0x000F
,0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001A,0x001B,0x001C,0x001D,0x001E,0x001F
,0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002A,0x002B,0x002C,0x002D,0x002E,0x002F
,0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003A,0x003B,0x003C,0x003D,0x003E,0x003F
,0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004A,0x004B,0x004C,0x004D,0x004E,0x004F
,0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005A,0x005B,0x005C,0x005D,0x005E,0x005F
,0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006A,0x006B,0x006C,0x006D,0x006E,0x006F
,0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007A,0x007B,0x007C,0x007D,0x007E,0x007F
,0x20AC,0xFFFD,0x201A,0x0192,0x201E,0x2026,0x2020,0x2021,0x02C6,0x2030,0x0160,0x2039,0x0152,0xFFFD,0x017D,0xFFFD
,0xFFFD,0x2018,0x2019,0x201C,0x201D,0x2022,0x2013,0x2014,0x02DC,0x2122,0x0161,0x203A,0x0153,0xFFFD,0x017E,0x0178
,0x00A0,0x00A1,0x00A2,0x00A3,0x00A4,0x00A5,0x00A6,0x00A7,0x00A8,0x00A9,0x00AA,0x00AB,0x00AC,0x00AD,0x00AE,0x00AF
,0x00B0,0x00B1,0x00B2,0x00B3,0x00B4,0x00B5,0x00B6,0x00B7,0x00B8,0x00B9,0x00BA,0x00BB,0x00BC,0x00BD,0x00BE,0x00BF
,0x00C0,0x00C1,0x00C2,0x00C3,0x00C4,0x00C5,0x00C6,0x00C7,0x00C8,0x00C9,0x00CA,0x00CB,0x00CC,0x00CD,0x00CE,0x00CF
,0x00D0,0x00D1,0x00D2,0x00D3,0x00D4,0x00D5,0x00D6,0x00D7,0x00D8,0x00D9,0x00DA,0x00DB,0x00DC,0x00DD,0x00DE,0x00DF
,0x00E0,0x00E1,0x00E2,0x00E3,0x00E4,0x00E5,0x00E6,0x00E7,0x00E8,0x00E9,0x00EA,0x00EB,0x00EC,0x00ED,0x00EE,0x00EF
,0x00F0,0x00F1,0x00F2,0x00F3,0x00F4,0x00F5,0x00F6,0x00F7,0x00F8,0x00F9,0x00FA,0x00FB,0x00FC,0x00FD,0x00FE,0x00FF
};
What you need is this table inverted (or do a linear scan every time); in any other language I would use a construct like Map<int, int>.
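In C++, the inverted table could be a std::unordered_map built once from the array above; a sketch (the array name is the one declared above):

#include <unordered_map>

// Build the reverse mapping: Unicode code point -> Windows-1252 byte.
// Note that 0xFFFD appears several times (undefined CP1252 slots), so
// whichever index is written last wins for that key.
std::unordered_map<int, int> buildUnicodeToWindows1252(const int (&table)[256])
{
    std::unordered_map<int, int> inverse;
    for (int i = 0; i < 256; ++i)
        inverse[table[i]] = i;
    return inverse;
}

// Usage:
//   auto toCp1252 = buildUnicodeToWindows1252(windows1252ToUnicodeCodePoints);
//   int b = toCp1252.at(0x0153);   // 156, the CP1252 code for 'œ'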

Can't read Unicode (Japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on the standard output.
I am using Visual Studio 2008.
int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout<<"While opening a file an error is encountered"<<endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile,line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output and I don't see any japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. That will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but which people still use to this day...).
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all, really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - e.g., if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and not attempt to support any others.
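A sketch of the char-stream-plus-MultiByteToWideChar approach described above, assuming Windows and assuming you already know the file is UTF-8 (the path in the usage line is the one from the question):

#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

std::wstring readUtf8File(const char *path)
{
    // Read the raw bytes without any conversion.
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // Ask how many UTF-16 code units are needed, then convert.
    int len = MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                                  static_cast<int>(bytes.size()), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, bytes.data(), static_cast<int>(bytes.size()),
                        &wide[0], len);
    return wide;
}

// Usage: std::wstring text = readUtf8File("D:\\sample.txt");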
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly, but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded as UTF-16, little-endian. If not, you will have trouble reading it.
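For clarity, a sketch of the question's loop with those two fixes applied (still assuming, as noted, that the file really is little-endian UTF-16; the wide-string path constructor of wifstream is a Microsoft extension available in Visual Studio 2008):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::wifstream myfile(L"D:\\sample.txt");   // escaped backslash, wide-string path
    if (!myfile)
        std::wcout << L"While opening a file an error is encountered" << std::endl;
    else
        std::wcout << L"File is successfully opened" << std::endl;

    std::wstring line;
    while (std::getline(myfile, line))          // only wcout is used for output, never cout
        std::wcout << line << std::endl;

    return 0;
}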
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx