UTF-8 text in SDL2 - C++

I just finished writing a function to cache any loaded font in the small game engine I'm building in SDL2. The function below works flawlessly, and rendering text with it is about 12 times faster than creating a new SDL_Surface each time I need text. However, as you can see, it caches only ASCII characters. That is fine for English, but not if I ever want to translate my game (German umlauts or Cyrillic glyphs, for example, are not covered).
void cacheFonts(){
    for(unsigned int i = 0; i < GlobalFontAssets.fonts.size(); i++){
        SDL_Colour color_font = {255, 255, 255, 255};
        std::vector<SDL_Texture*> tempVector;
        // Cache the printable ASCII range as one texture per glyph.
        for(int j = 32; j < 128; j++){
            char temp[2];
            temp[0] = (char)j;
            temp[1] = '\0';
            SDL_Surface* glyph = TTF_RenderUTF8_Blended(GlobalFontAssets.fonts[i], temp, color_font);
            SDL_Texture* texture = SDL_CreateTextureFromSurface(renderer, glyph);
            tempVector.push_back(texture);
            SDL_FreeSurface(glyph);
        }
        GlobalFontAssets.cache.push_back(tempVector);
    }
    printf("Global Fonts Cached!\n");
}
I have tried using wchar_t and looping from 0 to 256^2, but I cannot get any characters to print, even using printf, wprintf, cout, and wcout. However, if I do:
std::string str = "Привет, öäü";
printf("%s\n", str.c_str());
then it prints the string on the terminal just fine. I should mention that I am on Ubuntu 16.04, so a Windows-only solution won't work for me; ideally I'd like to do this in a portable manner. For those not familiar with SDL, all I need is a way to get every UTF-8 character in a C string. I hope that this is possible.

Addressing only this portion of the question:
all I need is a way to get every UTF8 Character in a C string
Wikipedia has a nice table showing the various encoding rules, the range of codepoints that each covers, and the corresponding UTF-8 length and data bytes.
For covering the first 2000-odd characters, just generate all the one- and two-byte patterns:
char s[3] = { 0 };
unsigned char c0, c1; // plain char is often signed, so loop with unsigned values

for(c0 = 0x00; c0 < 0x80u; ++c0) { // can start at 0x20 to skip control characters
    s[0] = (char)c0;
    // one byte encodings
}
for(c0 = 0xC0u; c0 < 0xE0u; ++c0) { // 0xC0/0xC1 only form overlong encodings; strictly valid lead bytes start at 0xC2
    for(c1 = 0x80u; c1 < 0xC0u; ++c1) {
        s[0] = (char)c0;
        s[1] = (char)c1;
        // two byte encodings
    }
}
It's no coincidence that the values 0x80u and 0xC0u appear more than once in the loop conditions -- the fact that there is no overlap between lead bytes and following bytes is what gives UTF-8 its self-synchronizing property.
I guess you're relying on the following fact (quoted from Wikipedia):
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks.
Because this range contains combining marks, you will have quite a few entries that can't be rendered alone. Whether you skip them or just handle the resulting confusion from the text layout engine is up to you.
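If it is more convenient to key the glyph cache by code point rather than by raw byte pattern, a small encoder covering exactly this one- and two-byte range is easy to write. A minimal sketch (the helper name and the use of std::string are my own, not from the question's engine):
#include <cstdint>
#include <string>

// Minimal sketch: encode one code point below U+0800 as a UTF-8 string,
// suitable for passing to TTF_RenderUTF8_Blended one glyph at a time.
std::string encode_utf8(uint32_t cp)
{
    std::string s;
    if (cp < 0x80) {                                  // one byte: 0xxxxxxx
        s.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {                          // two bytes: 110xxxxx 10xxxxxx
        s.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        s.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return s;                                         // empty for anything larger
}
The caching loop could then run the code point from 0x20 up to 0x800 and, for each font, skip anything the font does not provide (SDL_ttf exposes TTF_GlyphIsProvided for that check).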

Related

How to correctly skip unicode (UTF-8) characters?

I have written a parser that, it turns out, works incorrectly with UTF-8 text.
The parser is very very simple:
while(pos < end) {
    // find some ASCII char
    if (text.at(pos) == '#') {
        // Check some conditions and if the syntax is wrong...
        if (...)
            createDiagnostic(pos);
    }
    pos++;
}
So you can see I am creating a diagnostic at pos. But that pos is wrong if any UTF-8 characters came before it (because a UTF-8 character can consist of more than one char). How do I correctly skip UTF-8 characters so that each one counts as a single character?
I need this because the diagnostics are sent to UTF-8-aware VSCode.
I tried to read some articles on UTF-8 in C++, but all the material I found is huge, and I only need to skip over the UTF-8 sequences.
If the code point is less than 128, UTF-8 encodes it as plain ASCII (highest bit not set). If the code point is 128 or larger, all of the encoded bytes have the highest bit set. So this will work:
unsigned char b = <...>; // b is a byte from a utf-8 string
if (b & 0x80) {
    // ignore it, as b is part of a >=128 codepoint
} else {
    // use b as an ASCII code
}
Note: if you want to calculate the number of UTF-8 codepoints in a string, then you have to count the bytes for which either
!(b&0x80): the byte is an ASCII character, or
(b&0xc0)==0xc0: the byte is the first byte of a multi-byte UTF-8 sequence.
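The same two tests can turn a byte offset into a codepoint index for the diagnostic. A minimal sketch, assuming text is a std::string holding UTF-8 (the helper name is mine):
#include <cstddef>
#include <string>

// Minimal sketch: number of codepoints in text[0..byte_pos), i.e. the
// character index corresponding to a byte offset. Counts every byte that
// is NOT a continuation byte (10xxxxxx).
std::size_t codepoint_index(const std::string& text, std::size_t byte_pos)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_pos && i < text.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(text[i]);
        if ((b & 0xC0) != 0x80)   // lead byte or ASCII, not a continuation byte
            ++count;
    }
    return count;
}
For the parser above, createDiagnostic(codepoint_index(text, pos)) would then report a character-based position instead of a byte-based one.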

Some Unicode big-endian characters not read correctly from a wchar_t array

I am trying to extract the exact "Unicode big endian" character from an array.
The values are taken directly from a file in big-endian encoding. I use VS 2015 with the MFC framework (Unicode support enabled).
Values: 𠀐亙𠀃𠀃亙亙𠀐𠀐Val𪛕𨕥
Reading these values from the file into an array and writing them back out unchanged to another text file in Unicode big-endian format works. But when I try to handle some of the characters individually, I get wrong results.
Written directly in the editor.cpp file:
wchar_t chr[] = {L'𠀐', L'亙', L'𠀃', L'𠀃', L'亙', L'亙', L'𠀐', L'𠀐', L'V', L'a', L'l', L'𪛕', L'𨕥'};

wchar_t chVal = (wchar_t) chr[0]; // getting � or a rectangle mark
if(chVal == L'𠀐')
    MessageBox(_T("Show msg")); // wrong result

chVal = (wchar_t) chr[1]; // getting 亙, the proper element
if(chVal == L'亙')
    MessageBox(_T("Show msg")); // correct result
Similarly, I get correct results for 'V', 'a', and 'l'.
=======================================
Before this I placed the call:
wchar_t* ch = _wsetlocale(LC_ALL, _T("Chinese"));
Is the problem coming from _wsetlocale?
In the editor I can type those characters directly, but during debugging or in the built executable the results are wrong.
Why does the editor not display some of the characters correctly during debugging or execution?
================
updated:
// wcstring is a wchar_t array with Unicode characters
CStringW str;
wchar_t wh;
System::Text::Encoding^ encodingWr = System::Text::Encoding::BigEndianUnicode;
StreamWriter^ writer = gcnew StreamWriter("Converted.txt", true, encodingWr);
//String^ line = reader->ReadLine();
for(int ct = 0; ct < ctTot; ct++)
{
    int ln = wcstring[ct]; // correct number
    wh = /*(wchar_t)*/ wcstring[ct]; // wrong
    str.Format(_T("UNNUM %d %lc"), ln, wh);
    /* https://learn.microsoft.com/en-us/cpp/text/how-to-convert-between-various-string-types?view=vs-2017 */
    // Convert a wide character CStringW to a System::String.
    String^ systemstringw = gcnew String(str);
    //systemstringw += " (System::String)";
    //Console::WriteLine("{0}", systemstringw);
    writer->WriteLine(systemstringw);
    delete systemstringw;
    OutputDebugString(str);
}
But I need the correct Unicode characters printed to the file, and I would also like to know whether this could be a compiler problem.
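For context, 𠀐 (U+20010) lies outside the Basic Multilingual Plane, so with the 16-bit wchar_t used on Windows it occupies two UTF-16 code units (a surrogate pair) rather than one, which is why a single wchar_t element shows up as � while 亙 (U+4E99) works. A minimal standalone sketch (not MFC-specific) illustrating the difference:
#include <cstdio>

int main()
{
    // U+20010 is outside the BMP: with 16-bit wchar_t (Windows) it is stored
    // as the surrogate pair 0xD840 0xDC10, while U+4E99 fits in a single unit.
    const wchar_t outside_bmp[] = L"\U00020010"; // 𠀐
    const wchar_t inside_bmp[]  = L"\u4E99";     // 亙

    // On Windows this prints 3 and 2 (element counts including the null
    // terminator): the first character alone needs two wchar_t elements,
    // so it can never be held in, or compared against, one wchar_t.
    wprintf(L"%u %u\n",
            (unsigned)(sizeof(outside_bmp) / sizeof(wchar_t)),
            (unsigned)(sizeof(inside_bmp) / sizeof(wchar_t)));
    return 0;
}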

SetWindowText with a single dimensional array

Is it possible to display a single-dimensional array of values in a text box using SetWindowText() with the Windows API?
For example: SetWindowText(hwndStatic3, sArray);
******************EDIT************
I have a text box in the Windows API where I use GetWindowText() to retrieve the string typed into it, and then I convert that string to a decimal array. I then convert each decimal value to a hexadecimal value, because I am trying to print those values into another text box using SetWindowText(). However, only the last value of the array is printed. How can I print all the values?
******************EDIT************
code:
GetWindowText(hwndtext1, value, 256);
for (i = 15; i >= 0; i--)
{
    temp[i] = atoll(value);       // converts string to decimal
    ulltoa(temp[i], sArray, 16);  // converts decimal to hexadecimal
    buf[i] = temp[i];
}
SetWindowText(hwndStatic3, sArray);
SetWindowText is just a macro with signature:
BOOL SetWindowText(HWND, const TCHAR*);
Depending on your build settings, it will call one of the following:
BOOL SetWindowTextA(HWND, const char*); //ansi version
BOOL SetWindowTextW(HWND, const wchar_t*); //unicode version
where TCHAR is defined as:
#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
So an array of strings is not compatible with SetWindowText, but an array of characters will work, provided the array is of type TCHAR*, or of a type (char* or wchar_t*) that matches your build settings.
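To make that concrete: the values have to be formatted into a single TCHAR buffer and passed to SetWindowText in one call. A minimal sketch, where the ShowValues helper, its values[] parameter, and the 16-element count are assumptions for illustration rather than code from the question:
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

// Minimal sketch: format several numbers into one TCHAR string and set the
// control's text in a single call.
void ShowValues(HWND hwndStatic3, const unsigned long long values[16])
{
    TCHAR szOut[16 * 20] = _T("");   // room for 16 hex values plus separators
    size_t len = 0;
    for (int i = 0; i < 16; ++i)
    {
        len += _stprintf_s(szOut + len, _countof(szOut) - len,
                           _T("%llX "), values[i]);
    }
    SetWindowText(hwndStatic3, szOut);
}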
First, atoll and ulltoa aren't documented for Microsoft Visual C/C++ (which is what I use for Windows), so I'm working from documentation I found online. Either your versions do more than those I've found documented, or you've left out some significant code from your example.
Based on the loop control, I'm guessing that you expect to always find 16 values in the string you read from the first control. BUT... the atoll and ulltoa functions only operate on one value at a time and do nothing to advance through the input list. So your loop is converting the first number from string to 64-bit int and then converting that into a string 16 times.
Since you say the last value is the only one you see, your functions must actually be parsing the value string in some way that is not apparent in your example. However, ulltoa seems to always place its result into the same place in the same string variable, with each subsequent call in the loop overwriting the previous one. My lazy self would add a bit like this:
int len = 0;
char szOutput[16 * 20]; // enough space for 16 64-bit hex strings
GetWindowText(hwndtext1, value, 256);
for (i = 15; i >= 0; i--)
{
    temp[i] = atoll(value);       // converts string to decimal
    ulltoa(temp[i], sArray, 16);  // converts decimal to hexadecimal
    buf[i] = temp[i];
    len += sprintf(szOutput + len, "%s ", sArray);
}
szOutput[len - 1] = '\0'; // remove the final space
SetWindowText(hwndStatic3, szOutput);
Of course, with the sprintf you could also skip the ulltoa call entirely and change the sprintf line to:
len += sprintf( szOutput+len, "%16.16I64X", temp[i] );
(or whatever flavor/form of hex output you want; see the printf format documentation for details). If you want your list to be one item per line, replace the trailing space with a newline. Oh, and the I64 in %16.16I64X is a Microsoft thing that might be different in other compilers/libraries.
FYI, the sprintf technique I used lets the function keep appending to the end of the buffer by incrementing the offset into the buffer (len) by the length of the string just appended, which is the value returned by sprintf. It is a quick and easy way of assembling string lists such as yours.

How does one manipulate Unicode strings at the character level?

Sometimes manipulating character strings at the character level is unavoidable.
Here I have a function written for ANSI/ASCII based character strings that replaces CR/LF sequences with LF only, and also replaces CR with LF. We use this because incoming text files often have goofy line endings due to various text or email programs that have made a mess of them, and I need them to be in a consistent format to make parsing / processing / output work properly down the road.
Here's a fairly efficient implementation of this compression from various line-endings to LF only, for single byte per character implementations:
// returns the in-place conversion of a Mac or PC style string to a Unix style string (i.e. no CR/LF or CR only, but rather LF only)
char * AnsiToUnix(char * pszAnsi, size_t cchBuffer)
{
    size_t i, j;
    for (i = 0, j = 0; pszAnsi[i]; ++i, ++j)
    {
        // bounds checking
        ASSERT(i < cchBuffer);
        ASSERT(j <= i);

        switch (pszAnsi[i])
        {
        case '\n':
            if (pszAnsi[i + 1] == '\r')
                ++i;
            pszAnsi[j] = '\n'; // still write the LF once i and j have diverged
            break;
        case '\r':
            if (pszAnsi[i + 1] == '\n')
                ++i;
            pszAnsi[j] = '\n';
            break;
        default:
            if (j != i)
                pszAnsi[j] = pszAnsi[i];
        }
    }

    // append null terminator if we changed the length of the string buffer
    if (j != i)
        pszAnsi[j] = '\0';

    // bounds checking
    ASSERT(pszAnsi[j] == 0);

    return pszAnsi;
}
I'm trying to transform this into something that will work correctly with multibyte/Unicode strings, where the next character can be multiple bytes wide.
So:
I need to look at a character only at a valid character-point (not in the middle of a character)
I need to copy over the portion of the character that is part of the rejected piece properly (i.e. copy whole characters, not just bytes)
I understand that _mbsinc() will give me the address of the start of the next real character. But what is the equivalent for Unicode (UTF-16), and are there already primitives for copying a full character (e.g. length_character(wsz))?
One of the beautiful things about UTF-8 is that if you only care about the ASCII subset, your code doesn't need to change at all. The non-ASCII characters get encoded to multi-byte sequences where all of the bytes have the upper bit set, keeping them out of the ASCII range themselves. Your CR/LF replacement should work without modification.
UTF-16 has the same property. Characters that can be encoded as a single 16-bit entity will never conflict with the characters that require multiple entities.
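To address the _mbsinc() question directly: with UTF-16 the only multi-unit case is the surrogate pair, so the "advance to the next character" primitive is a two-line check. A minimal sketch, assuming wchar_t holds UTF-16 code units (as it does on Windows); the helper name is mine:
// Minimal sketch: return a pointer to the start of the next character in a
// UTF-16 wchar_t string (a rough equivalent of _mbsinc for UTF-16).
// A high surrogate (0xD800..0xDBFF) means the character occupies two units.
const wchar_t* utf16_inc(const wchar_t* p)
{
    if (*p >= 0xD800 && *p <= 0xDBFF && p[1] != L'\0')
        return p + 2;   // surrogate pair: two code units, one character
    return p + 1;       // BMP character (or lone unit): one code unit
}
The character length in code units is then utf16_inc(p) - p, and copying a full character means copying that many units together.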
Do not try to keep text internally in a mix of whatever encodings happen to arrive and work with all of them; that is true Hell.
First pick some "internal" encoding. When the target platform is UNIX, UTF-8 is a good candidate; it is slightly easier to display there. When the target platform is Windows, UTF-16 is a good candidate; Windows uses it internally everywhere anyway. Whatever you pick, stick to it and only it.
Then convert all incoming "dirty" text into that encoding. You may also do some re-formatting that looks exactly like your code, except that in the case of wchar_t containing UTF-16 you have to use literals like L'\n', as in the sketch below.
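For illustration, here is what the same in-place normalization looks like for a wchar_t buffer holding UTF-16. This is a sketch, not a drop-in replacement for the original (the ASSERT bounds checks are omitted); since CR and LF are single code units in UTF-16, no surrogate handling is needed here:
#include <cstddef>

// Minimal sketch: collapse CR/LF, LF/CR and bare CR to LF in place,
// for a null-terminated wchar_t string holding UTF-16.
wchar_t* WideToUnix(wchar_t* psz)
{
    std::size_t i = 0, j = 0;
    for (; psz[i]; ++i, ++j)
    {
        if (psz[i] == L'\r')            // CR or CR/LF -> LF
        {
            if (psz[i + 1] == L'\n')
                ++i;
            psz[j] = L'\n';
        }
        else if (psz[i] == L'\n')       // LF or LF/CR -> LF
        {
            if (psz[i + 1] == L'\r')
                ++i;
            psz[j] = L'\n';
        }
        else if (j != i)
        {
            psz[j] = psz[i];            // plain copy when positions diverge
        }
    }
    if (j != i)
        psz[j] = L'\0';
    return psz;
}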

How to convert a (char *) from ISO-8859-1 to UTF-8 in C++, cross-platform?

I'm changing a piece of C++ software, which processes text in ISO Latin 1 format, to store its data in an SQLite database.
The problem is that SQLite works in UTF-8, and so do the Java modules that use the same database.
I want a way to convert the ISO Latin 1 characters to UTF-8 before storing them in the database. I need it to work on Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion (preferably back and forth) between these two charsets.
How would I do that?
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
    uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
    if (ch < 0x80) {
        append(ch);
    } else {
        append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
        append(0x80 | (ch & 0x3f));
    }
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
For C++ I use this:
std::string iso_8859_1_to_utf8(std::string &str)
{
    std::string strOut;
    for (std::string::iterator it = str.begin(); it != str.end(); ++it)
    {
        uint8_t ch = *it;
        if (ch < 0x80) {
            strOut.push_back(ch);
        }
        else {
            strOut.push_back(0xc0 | ch >> 6);
            strOut.push_back(0x80 | (ch & 0x3f));
        }
    }
    return strOut;
}
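Since the question asks for a conversion that works back and forth, the reverse direction can be sketched along the same lines. This assumes the input is well-formed UTF-8 and replaces anything outside Latin-1 with '?' (both the function name and that fallback are my choices):
#include <cstdint>
#include <string>

// Minimal sketch: convert UTF-8 back to ISO-8859-1, substituting '?' for
// code points that do not exist in Latin-1.
std::string utf8_to_iso_8859_1(const std::string &str)
{
    std::string strOut;
    for (std::size_t i = 0; i < str.size(); )
    {
        uint8_t ch = static_cast<uint8_t>(str[i]);
        if (ch < 0x80) {                              // plain ASCII, one byte
            strOut.push_back(static_cast<char>(ch));
            i += 1;
        } else if ((ch & 0xE0) == 0xC0 && i + 1 < str.size()) {
            // two-byte sequence: 110xxxxx 10xxxxxx
            uint32_t cp = ((ch & 0x1F) << 6) |
                          (static_cast<uint8_t>(str[i + 1]) & 0x3F);
            strOut.push_back(cp <= 0xFF ? static_cast<char>(cp) : '?');
            i += 2;
        } else {                                      // longer sequence: not Latin-1
            strOut.push_back('?');
            i += 1;                                   // skip the lead byte...
            while (i < str.size() &&
                   (static_cast<uint8_t>(str[i]) & 0xC0) == 0x80)
                i += 1;                               // ...and its continuation bytes
        }
    }
    return strOut;
}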
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or something else) it will look somewhat different, but the idea is: scroll through the source string and replace each character with a code over 127 with its UTF-8 counterpart string. Since this can potentially increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.
If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.
Of course, it needs to be real ISO Latin 1, not Windows CP-1252.
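A sketch of that widening approach using UTF8-CPP's iterator interface (I'm assuming the utf8.h header and the utf8::utf16to8 overload that takes an output iterator):
#include <iterator>
#include <string>
#include "utf8.h" // UTF8-CPP

std::string latin1_to_utf8_via_utf16(const std::string &latin1)
{
    // Widen: every Latin-1 byte value equals its Unicode code point, so
    // copying each byte into a 16-bit unit yields valid UTF-16.
    std::u16string utf16;
    utf16.reserve(latin1.size());
    for (unsigned char ch : latin1)
        utf16.push_back(static_cast<char16_t>(ch));

    std::string utf8;
    utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(utf8));
    return utf8;
}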