In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++
There is a really nice, concise piece of C++ code that converts ISO-8859-1 strings to UTF-8.
In this answer: https://stackoverflow.com/a/4059934/3426514
I'm still a beginner at C++ and I'm struggling to understand how this works. I have read up on the UTF-8 encoding sequences, and I understand that below 128 the characters are the same, while above 128 the first byte gets a prefix and the remaining bits are spread over a couple of bytes starting with 10xx, but I see no bit shifting in this answer.
If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.
Code, commented.
This relies on the fact that Latin-1 0x00 through 0xff map to the consecutive UTF-8 sequences 0x00-0x7f, 0xc2 0x80-0xbf, and 0xc3 0x80-0xbf. (In the code below, in and out are assumed to be pointers to unsigned char.)
// converting one byte (latin-1 character) of input
while (*in)
{
if ( *in < 0x80 )
{
// just copy
*out++ = *in++;
}
else
{
// first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
// (the condition in () evaluates to true / 1)
*out++ = 0xc2 + ( *in > 0xbf ),
// second byte is the lower six bits of the input byte
// with the highest bit set (and, implicitly, the second-
// highest bit unset)
*out++ = ( *in++ & 0x3f ) + 0x80;
}
}
The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.
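If you do want the single-character decomposition the question asks for, a minimal sketch might look like this (the function name and the convention of returning the number of bytes written are mine, not part of the original answer):

// Encodes one Latin-1 byte as UTF-8 into out[0..1].
// Returns the number of bytes written (1 or 2).
int latin1_char_to_utf8(unsigned char in, unsigned char out[2])
{
    if (in < 0x80)
    {
        out[0] = in;                  // ASCII range: unchanged in UTF-8
        return 1;
    }
    out[0] = 0xc2 + (in > 0xbf);      // 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
    out[1] = (in & 0x3f) + 0x80;      // low six bits with the 10xxxxxx marker
    return 2;
}

The caller now has to look at the return value and advance its output pointer accordingly, which is exactly the awkwardness the whole-string loop above avoids.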
Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€), or any of these characters ŠšŽžŒœŸ, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yea, that sounds about right.")
All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:
#include <unistr.h>
#include <bytestream.h>
// ...
icu::UnicodeString ustr( in, "ISO-8859-1" );
// ...work with a properly Unicode-aware string class...
// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );
That is using the International Components for Unicode (ICU) library. Note how easily this is adapted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.
Related
I have written a parser that, it turns out, works incorrectly with UTF-8 text.
The parser is very simple:
while(pos < end) {
// find some ASCII char
if (text.at(pos) == '#') {
// Check some conditions and if the syntax is wrong...
if (...)
createDiagnostic(pos);
}
pos++;
}
So you can see I am creating a diagnostic at pos. But that pos is wrong if there were any UTF-8 characters before it (because a UTF-8 character can consist of more than one char). How do I correctly skip over UTF-8 characters as if each were one character?
I need this because the diagnostics are sent to UTF-8-aware VSCode.
I tried to read some articles on UTF-8 in C++, but all the material I found is huge. And I only need to skip over the UTF-8 characters.
If the code point is less than 128, UTF-8 encodes it as plain ASCII (high bit not set). If the code point is 128 or larger, all of the encoded bytes have the high bit set. So this will work:
unsigned char b = <...>; // b is a byte from a utf-8 string
if (b&0x80) {
// ignore it, as b is part of a >=128 codepoint
} else {
// use b as an ASCII code
}
Note: if you want to calculate the number of UTF-8 codepoints in a string, then you have to count bytes with:
!(b&0x80): this means that the byte is an ASCII character, or
(b&0xc0)==0xc0: this means that the byte is the first byte of a multi-byte UTF-8 sequence
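For example, a sketch of that counting rule (the function name is mine):

#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is either ASCII
// or the lead byte of a multi-byte sequence (i.e. not a 10xxxxxx byte).
std::size_t count_codepoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        unsigned char b = s[i];
        if (!(b & 0x80) || (b & 0xc0) == 0xc0)
            ++n;
    }
    return n;
}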
std::string is commonly interpreted as UTF-8, hence has a variable-length encoding. In my font renderer I've hit a problem in that I'm not sure how to get a "character" from a std::string and convert it into a FreeType FT_ULong in order to get a glyph with FT_Get_Char_Index. That is to say, I am not sure that what I'm doing is "correct", as I'm just iterating through the std::string and casting the resulting chars over (surely this is incorrect, although it works with my OS defaults).
So is there a "correct" way of doing this and more importantly has someone written a library that implements this "correct" way that I can use off the shelf?
You should first check how UTF-8 is encoded; then you will know which start bits go with how many bytes.
See http://en.wikipedia.org/wiki/UTF8
And then you can write code like this:
if ((byte & 0x80) == 0x00) {
    // 1 byte UTF8 char
}
else if ((byte & 0xE0) == 0xC0) {
    // 2 bytes UTF8 char
}
else if ((byte & 0xF0) == 0xE0) {
    // 3 bytes UTF8 char
}
else if ((byte & 0xF8) == 0xF0) {
    // 4 bytes UTF8 char
}
Then you can iterate over each UTF-8 character in the std::string with the correct number of bytes.
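For example, a sketch of that idea that decodes one code point at a time (the helper name is mine, and it does no validation of malformed input); the result can be passed to FT_Get_Char_Index, since FT_ULong is just an unsigned integer type:

#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence starting at index i of s, returns the code
// point, and advances i past it. Malformed input is not validated.
std::uint32_t decode_utf8(const std::string& s, std::size_t& i)
{
    unsigned char b = s[i++];
    if ((b & 0x80) == 0x00)                                       // 1 byte
        return b;
    std::uint32_t cp = 0;
    int extra = 0;
    if      ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }    // 2 bytes
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }    // 3 bytes
    else                         { cp = b & 0x07; extra = 3; }    // 4 bytes
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

Calling FT_Get_Char_Index(face, decode_utf8(text, pos)) in a loop then looks up one glyph per character instead of one per byte.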
How do I convert a decimal number, 225 for example, to its corresponding Unicode character when it's being output? I can convert ASCII characters from decimal to the character like this:
int a = 97;
char b = a;
cout << b << endl;
And it outputs the letter "a", but it just outputs a question mark when I use the number 225, or any other non-ASCII value.
To start with, it's not your C++ program which converts strings of bytes written to standard output into visible characters; it's your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that's done by setting appropriate locale environment variables.
Like most things which have to do with terminals, the locale configuration system would probably have been done very differently if it hadn't developed with a history of many years of legacy software and hardware, most of which were originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.
Unicode is pretty cool, but it also had to be deployed in the face of the particular history of computer representation of writing systems, which meant making a lot of compromises in the face of the various firmly-held but radically contradictory opinions in the software engineering community, incidentally a community in which head-butting is rather more common than compromise. The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and the perseverance and political skills of its promoters and designers -- particularly Mark Davis -- and I say this despite the fact that it basically took more than two decades to get to this point.
One of the aspects of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these coding systems has its dedicated fans (and consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly-16-bit encoding, UTF-16, while most unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that's beyond the scope of this rant.)
Pre-Unicode, every country/language used their own idiosyncratic 8-bit encoding (or, at least, those countries whose languages are written with an alphabet of less than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and what character separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is more generally specific to the particular software being used than the national idiosyncrasy. But it is, and that's why on my Ubuntu box, the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you're using a linux system, you might find something similar.
In theory, that means that my terminal emulator (konsole, as it happens, but there are various) expects to see UTF-8 sequences. And, indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I'm free to change the encoding (or the locale settings), and confusion is likely to result.
So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now, the C++ program needs to figure out which encoding it's supposed to use, and then transform from whatever internal representation it uses to the external encoding.
Fortunately, the C++ standard library should handle that correctly, if you cooperate by:
Telling the standard library to use the configured locale, instead of the default C (i.e. only unaccented characters, as per English) locale; and
Using strings and iostreams based on wchar_t (or some other wide character format).
If you do that, in theory you don't need to know either what wchar_t means to your standard library, nor what a particular bit pattern means to your terminal emulator. So let's try that:
#include <iostream>
#include <locale>
int main(int argc, char** argv) {
// std::locale() is the "global" locale
// std::locale("") is the locale configured through the locale system
// At startup, the global locale is set to std::locale("C"), so we need
// to change that if we want locale-aware functions to use the configured
// locale.
// This sets the global locale to the default configured locale.
std::locale::global(std::locale(""));
// The various standard io streams were initialized before main started,
// so they are all imbued with the locale that was global at that time, std::locale("C").
// If we want them to behave in a locale-aware manner, including using the
// hopefully correct encoding for output, we need to "imbue" each iostream
// with the default locale.
// We don't have to do all of these in this simple example,
// but it's probably a good idea.
std::cin.imbue(std::locale());
std::cout.imbue(std::locale());
std::cerr.imbue(std::locale());
std::wcin.imbue(std::locale());
std::wcout.imbue(std::locale());
std::wcerr.imbue(std::locale());
// You can't write a wchar_t to cout, because cout only accepts char. wcout, on the
// other hand, accepts both wchar_t and char; it will "widen" char. So it's
// convenient to use wcout:
std::wcout << "a acute: " << wchar_t(225) << std::endl;
std::wcout << "pi: " << wchar_t(960) << std::endl;
return 0;
}
That works on my system. YMMV. Good luck.
Small side-note: I've run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It doesn't. It produces exactly the same encoding as cout. The difference is not what it outputs but what it accepts as input. In fact, it can't really be different from cout because both of them are connected to the same OS stream, which can only have one encoding (at a time).
You might ask why it is necessary to have two different iostreams. Why couldn't cout have just accepted wchar_t and std::wstring values? I don't actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you don't need. Or something like that. If you figure it out, let me know.
If for some reason you want to handle this entirely on your own:
void GetUnicodeChar(unsigned int code, char chars[5]) {
if (code <= 0x7F) {
chars[0] = (code & 0x7F); chars[1] = '\0';
} else if (code <= 0x7FF) {
// one continuation byte
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
} else if (code <= 0xFFFF) {
// two continuation bytes
chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
} else if (code <= 0x10FFFF) {
// three continuation bytes
chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
} else {
        // out of range: emit the Unicode replacement character U+FFFD (EF BF BD)
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
chars[3] = '\0';
}
}
And then to use it:
char chars[5];
GetUnicodeChar(225, chars);
cout << chars << endl; // á
GetUnicodeChar(0x03A6, chars);
cout << chars << endl; // Φ
GetUnicodeChar(0x110000, chars);
cout << chars << endl; // �
Note that this is just a standard UTF-8 encoding algorithm, so if your platform does not assume UTF-8 it might not render correctly. (Thanks, #EmilioGaravaglia)
Sometimes manipulating character strings at the character level is unavoidable.
Here I have a function written for ANSI/ASCII based character strings that replaces CR/LF sequences with LF only, and also replaces CR with LF. We use this because incoming text files often have goofy line endings due to various text or email programs that have made a mess of them, and I need them to be in a consistent format to make parsing / processing / output work properly down the road.
Here's a fairly efficient implementation of this compression from various line-endings to LF only, for single byte per character implementations:
// returns the in-place conversion of a Mac or PC style string to a Unix style string (i.e. no CR/LF or CR only, but rather LF only)
char * AnsiToUnix(char * pszAnsi, size_t cchBuffer)
{
size_t i, j;
for (i = 0, j = 0; pszAnsi[i]; ++i, ++j)
{
// bounds checking
ASSERT(i < cchBuffer);
ASSERT(j <= i);
switch (pszAnsi[i])
{
    case '\n':
        if (pszAnsi[i + 1] == '\r')
            ++i;
        pszAnsi[j] = '\n';
        break;
case '\r':
if (pszAnsi[i + 1] == '\n')
++i;
pszAnsi[j] = '\n';
break;
default:
if (j != i)
pszAnsi[j] = pszAnsi[i];
}
}
// append null terminator if we changed the length of the string buffer
if (j != i)
pszAnsi[j] = '\0';
// bounds checking
ASSERT(pszAnsi[j] == 0);
return pszAnsi;
}
I'm trying to transform this into something that will work correctly with multibyte/Unicode strings, where the next character can be multiple bytes wide.
So:
I need to look at a character only at a valid character-point (not in the middle of a character)
I need to copy over the portion of the character that is part of the rejected piece properly (i.e. copy whole characters, not just bytes)
I understand that _mbsinc() will give me the address of the next start of a real character. But what is the equivalent for Unicode (UTF16), and are there already primitives to be able to copy a full character (e.g. length_character(wsz))?
One of the beautiful things about UTF-8 is that if you only care about the ASCII subset, your code doesn't need to change at all. The non-ASCII characters get encoded to multi-byte sequences where all of the bytes have the upper bit set, keeping them out of the ASCII range themselves. Your CR/LF replacement should work without modification.
UTF-16 has the same property. Characters that can be encoded as a single 16-bit entity will never conflict with the characters that require multiple entities.
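For the question's follow-up about an _mbsinc() equivalent for UTF-16: the only multi-unit case is a surrogate pair. A minimal sketch (assuming a 16-bit wchar_t, as on Windows; the helper name is mine):

// Number of UTF-16 code units (1 or 2) making up the character at p.
// Lead (high) surrogates are 0xD800-0xDBFF; everything else is one unit.
size_t Utf16CharLen(const wchar_t * p)
{
    return (*p >= 0xD800 && *p <= 0xDBFF) ? 2 : 1;
}

Since neither half of a surrogate pair can ever equal L'\r' or L'\n', the line-ending logic itself does not need to change.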
Do not try to keep text internally in a mix of whatever encodings and work with those; that way lies true Hell.
First pick an "internal" encoding. If the target platform is Unix, UTF-8 is a good candidate; it is slightly easier to display there. If the target platform is Windows, UTF-16 is a good candidate; Windows uses it internally everywhere anyway. Whatever you pick, stick to it and only it.
Then convert all incoming "dirty" text into that encoding. You may also do some reformatting that looks exactly like your code, except that in the case of wchar_t containing UTF-16 you have to use literals like L'\n'.
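For illustration, a sketch of what the question's function might look like re-typed for 16-bit wchar_t strings (only the types and literals change; the function name is mine):

// In-place CR/LF normalization for a 16-bit wchar_t (UTF-16) string;
// only the types and literals differ from the char version.
wchar_t * WideToUnix(wchar_t * pwszText)
{
    size_t i, j;
    for (i = 0, j = 0; pwszText[i]; ++i, ++j)
    {
        switch (pwszText[i])
        {
        case L'\n':
            if (pwszText[i + 1] == L'\r')
                ++i;
            pwszText[j] = L'\n';
            break;
        case L'\r':
            if (pwszText[i + 1] == L'\n')
                ++i;
            pwszText[j] = L'\n';
            break;
        default:
            if (j != i)
                pwszText[j] = pwszText[i];
        }
    }
    pwszText[j] = L'\0';
    return pwszText;
}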
I'm changing some software in C++, which processes text in ISO Latin 1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use the same database work in UTF-8.
I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing them in the database. I need it to work on Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion (preferably back and forth) between these two charsets.
How would I do that?
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
if(ch < 0x80) {
append(ch);
} else {
append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
append(0x80 | (ch & 0x3f));
}
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
For C++ I use this:
std::string iso_8859_1_to_utf8(std::string &str)
{
    std::string strOut;
for (std::string::iterator it = str.begin(); it != str.end(); ++it)
{
uint8_t ch = *it;
if (ch < 0x80) {
strOut.push_back(ch);
}
else {
strOut.push_back(0xc0 | ch >> 6);
strOut.push_back(0x80 | (ch & 0x3f));
}
}
return strOut;
}
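The question also asks for the reverse direction; here is a minimal sketch of the way back, under the assumption that the UTF-8 input only contains code points below 0x100 (anything outside Latin-1 becomes '?'; the function name is mine):

std::string utf8_to_iso_8859_1(const std::string &str)
{
    std::string strOut;
    for (size_t i = 0; i < str.size(); )
    {
        uint8_t ch = str[i];
        if (ch < 0x80) {                        // plain ASCII
            strOut.push_back(ch);
            ++i;
        }
        else if ((ch == 0xc2 || ch == 0xc3) && i + 1 < str.size()) {
            // two-byte sequence encoding U+0080..U+00FF
            strOut.push_back(((ch & 0x03) << 6) | (str[i + 1] & 0x3f));
            i += 2;
        }
        else if ((ch & 0xc0) == 0x80) {         // stray continuation byte: skip
            ++i;
        }
        else {                                  // code point outside Latin-1
            strOut.push_back('?');
            ++i;
        }
    }
    return strOut;
}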
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and put together your own translation routine. Depending on what you use for string storage (char buffers, std::string, or whatever) it would look somewhat different, but the idea is: scan through the source string and replace each character with a code over 127 with its UTF-8 counterpart string. Since this can potentially increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.
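A sketch of that approach (all names are mine; the 256-entry table is built once rather than written out by hand, and the two passes are the size pass and the copy pass):

#include <array>
#include <cstddef>
#include <string>

// Build the 256-entry Latin-1 -> UTF-8 table once: entry i holds the one-
// or two-byte UTF-8 sequence for Latin-1 code i.
static std::array<std::string, 256> make_latin1_table()
{
    std::array<std::string, 256> table;
    for (unsigned c = 0; c < 256; ++c) {
        if (c < 0x80)
            table[c] = std::string(1, static_cast<char>(c));
        else
            table[c] = { static_cast<char>(0xc0 | (c >> 6)),
                         static_cast<char>(0x80 | (c & 0x3f)) };
    }
    return table;
}

std::string latin1_to_utf8_table(const std::string &in)
{
    static const std::array<std::string, 256> table = make_latin1_table();

    // Pass one: determine the necessary target string size.
    std::size_t needed = 0;
    for (unsigned char c : in)
        needed += table[c].size();

    // Pass two: perform the translation.
    std::string out;
    out.reserve(needed);
    for (unsigned char c : in)
        out += table[c];
    return out;
}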
If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.
Of course, it needs to be real ISO Latin 1, not Windows CP-1252.
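A sketch of that widening route, assuming the header-only utfcpp library and its iterator-based utf16to8 overload (check the exact header name and API against the version you use):

#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"   // the header-only UTF8-CPP (utfcpp) library

// Widen Latin-1 to UTF-16, then let utfcpp re-encode it as UTF-8.
std::string latin1_to_utf8_via_utf16(const std::string &in)
{
    std::vector<unsigned short> wide;
    wide.reserve(in.size());
    for (unsigned char c : in)
        wide.push_back(c);   // zero-extend: Latin-1 is the first 256 code points

    std::string out;
    utf8::utf16to8(wide.begin(), wide.end(), std::back_inserter(out));
    return out;
}

Since Latin-1 never produces surrogates, the intermediate UTF-16 is always valid.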