How do I convert a decimal number, 225 for example, to its corresponding Unicode character when it's being output? I can convert ASCII characters from decimal to the character like this:
int a = 97;
char b = a;
cout << b << endl;
And it outputs the letter "a", but it just outputs a question mark when I use the number 225, or any other non-ASCII value.
To start with, it's not your C++ program which converts strings of bytes written to standard output into visible characters; it's your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that's done by setting appropriate locale environment variables.
Like most things which have to do with terminals, the locale configuration system would probably have been done very differently if it hadn't developed with a history of many years of legacy software and hardware, most of which were originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.
Unicode is pretty cool, but it also had to be deployed in the face of the particular history of computer representation of writing systems, which meant making a lot of compromises in the face of the various firmly-held but radically contradictory opinions in the software engineering community, which is, incidentally, a community in which head-butting is rather more common than compromise. The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and the perseverance and political skills of its promoters and designers -- particularly Mark Davis --, and I say this despite the fact that it basically took more than two decades to get to this point.
One of the aspects of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these coding systems has its dedicated fans (and consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly-16-bit encoding, UTF-16, while most unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that's beyond the scope of this rant.)
Pre-Unicode, every country/language used their own idiosyncratic 8-bit encoding (or, at least, those countries whose languages are written with an alphabet of fewer than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and which character separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is more generally specific to the particular software being used than to any national idiosyncrasy. But it is, and that's why on my Ubuntu box, the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you're using a Linux system, you might find something similar.
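If you want to check from inside a program what your environment configures, a minimal sketch is to print the name of the environment locale (what it prints depends entirely on your LANG/LC_* settings):
#include <iostream>
#include <locale>
int main() {
    // std::locale("") is constructed from the environment (LANG, LC_ALL, ...)
    std::cout << "Configured locale: " << std::locale("").name() << '\n';
}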
In theory, that means that my terminal emulator (konsole, as it happens, but there are various) expects to see UTF-8 sequences. And, indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I'm free to change the encoding (or the locale settings), and confusion is likely to result.
So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now, the C++ program needs to figure out which encoding it's supposed to use, and then transform from whatever internal representation it uses to the external encoding.
Fortunately, the C++ standard library should handle that correctly, if you cooperate by:
Telling the standard library to use the configured locale, instead of the default C (i.e. only unaccented characters, as per English) locale; and
Using strings and iostreams based on wchar_t (or some other wide character format).
If you do that, in theory you need to know neither what wchar_t means to your standard library nor what a particular bit pattern means to your terminal emulator. So let's try that:
#include <iostream>
#include <locale>
int main(int argc, char** argv) {
// std::locale() is the "global" locale
// std::locale("") is the locale configured through the locale system
// At startup, the global locale is set to std::locale("C"), so we need
// to change that if we want locale-aware functions to use the configured
// locale.
// This sets the global locale to the default locale.
std::locale::global(std::locale(""));
// The various standard io streams were initialized before main started,
// so they are all configured with the default global locale, std::locale("C").
// If we want them to behave in a locale-aware manner, including using the
// hopefully correct encoding for output, we need to "imbue" each iostream
// with the default locale.
// We don't have to do all of these in this simple example,
// but it's probably a good idea.
std::cin.imbue(std::locale());
std::cout.imbue(std::locale());
std::cerr.imbue(std::locale());
std::wcin.imbue(std::locale());
std::wcout.imbue(std::locale());
std::wcerr.imbue(std::locale());
// You can't write a wchar_t to cout, because cout only accepts char. wcout, on the
// other hand, accepts both wchar_t and char; it will "widen" char. So it's
// convenient to use wcout:
std::wcout << "a acute: " << wchar_t(225) << std::endl;
std::wcout << "pi: " << wchar_t(960) << std::endl;
return 0;
}
That works on my system. YMMV. Good luck.
Small side-note: I've run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It doesn't. It produces exactly the same encoding as cout. The difference is not what it outputs but what it accepts as input. In fact, it can't really be different from cout because both of them are connected to the same OS stream, which can only have one encoding (at a time).
You might ask why it is necessary to have two different iostreams. Why couldn't cout have just accepted wchar_t and std::wstring values? I don't actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you don't need. Or something like that. If you figure it out, let me know.
If for some reason you want to handle this entirely on your own:
void GetUnicodeChar(unsigned int code, char chars[5]) {
if (code <= 0x7F) {
chars[0] = (code & 0x7F); chars[1] = '\0';
} else if (code <= 0x7FF) {
// one continuation byte
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
} else if (code <= 0xFFFF) {
// two continuation bytes
chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
} else if (code <= 0x10FFFF) {
// three continuation bytes
chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
} else {
// Unicode replacement character U+FFFD (UTF-8: EF BF BD)
chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
chars[3] = '\0';
}
}
And then to use it:
char chars[5];
GetUnicodeChar(225, chars);
cout << chars << endl; // á
GetUnicodeChar(0x03A6, chars);
cout << chars << endl; // Φ
GetUnicodeChar(0x110000, chars);
cout << chars << endl; // �
Note that this is just a standard UTF-8 encoding algorithm, so if your platform does not assume UTF-8 it might not render correctly. (Thanks, @EmilioGaravaglia)
Related
I try to understand how to handle basic UTF-8 operations in C++.
Let's say we have this scenario: User inputs a name, it's limited to 10 letters (symbols in user's language, not bytes), it's being stored.
It can be done this way in ASCII.
// ASCII
char * input; // user's input
char buf[11]; // 10 letters + zero
snprintf(buf,11,"%s",input); buf[10]=0;
int len= strlen(buf); // returns 10 (correct)
Now, how to do it in UTF-8? Let's assume the charset uses up to 4 bytes per character (like Chinese).
// UTF-8
char * input; // user's input
char buf[41]; // 10 letters * 4 bytes + zero
snprintf(buf,41,"%s",input); //?? makes no sense, it limits by number of bytes, not letters
int len= strlen(buf); // returns the number of bytes, not letters (incorrect)
Can it be done with standard sprintf/strlen? Are there any replacements of those function to use with UTF-8 (in PHP there was mb_ prefix of such functions IIRC)? If not, do I need to write those myself? Or maybe do I need to approach it another way?
Note: I would prefer to avoid wide characters solution...
EDIT: Let's limit it to Basic Multilingual Plane only.
I would prefer to avoid wide characters solution...
Wide characters are just not enough, because if you need 4 bytes for a single glyph, then that glyph is likely to be outside the Basic Multilingual Plane, and it will not be represented by a single 16-bit wchar_t character (assuming wchar_t is 16 bits wide, which is a common size, e.g. on Windows).
You will have to use a true Unicode library to convert the input to a list of Unicode characters in their Normal Form C (canonical composition) or the compatibility equivalent (NFKC)(*), depending on whether, for example, you want to count one or two characters for the ligature ff (U+FB00). AFAIK, your best bet is ICU.
(*) Unicode allows multiple representations of the same glyph, notably the normal composed form (NFC) and the normal decomposed form (NFD). For example, the French é character can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT, also displayed as é).
References and other examples on Unicode equivalence
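As a rough sketch of the ICU route (assuming ICU is installed and linked; note this counts code points after NFC normalization, which is still not the same as user-perceived characters in every case):
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    // "cafe" followed by U+0301 COMBINING ACUTE ACCENT, given as UTF-8 bytes
    icu::UnicodeString input = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
    icu::UnicodeString composed = nfc->normalize(input, status);
    if (U_SUCCESS(status)) {
        // 4 code points after composition: c, a, f, é (U+00E9)
        std::cout << "code points after NFC: " << composed.countChar32() << '\n';
    }
    return 0;
}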
strlen only counts the bytes in the input string, until the terminating NUL.
On the other hand, you seem interested in the glyph count (what you called "symbols in user's language").
The process is complicated by UTF-8 being a variable-length encoding (as is, to a lesser extent, UTF-16), so code points can be encoded using one to four bytes. And there are also Unicode combining characters to consider.
To my knowledge, there's nothing like that in the standard C++ library. However, you may have better luck using third party libraries like ICU.
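That said, if combining characters are not a concern and counting code points is good enough for your 10-letter limit, a minimal hand-rolled sketch is to count every byte that is not a UTF-8 continuation byte:
#include <cstddef>
// Counts code points in a NUL-terminated, well-formed UTF-8 string.
// Continuation bytes have the form 10xxxxxx and are skipped.
std::size_t utf8_codepoint_count(const char* s) {
    std::size_t count = 0;
    for (; *s; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}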
std::strlen indeed counts only single-byte char units. To compute the length of a NUL-terminated wide (wchar_t) string, one can use std::wcslen instead.
Example:
#include <iostream>
#include <cwchar>
#include <clocale>
int main()
{
const wchar_t* str = L"爆ぜろリアル!弾けろシナプス!パニッシュメントディス、ワールド!";
std::setlocale(LC_ALL, "en_US.utf8");
std::wcout.imbue(std::locale("en_US.utf8"));
std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
}
If you do not want to count UTF-8 characters yourself, you can use a temporary conversion to wide characters to cut your input string. You do not need to store the intermediate values:
#include <iostream>
#include <codecvt>
#include <string>
#include <locale>
std::string cutString(const std::string& in, size_t len)
{
std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
auto wstring = cvt.from_bytes(in);
if(len < wstring.length())
{
wstring = wstring.substr(0,len);
return cvt.to_bytes(wstring);
}
return in;
}
int main(){
std::string test = "你好世界這是演示樣本";
std::string res = cutString(test,5);
std::cout << test << '\n' << res << '\n';
return 0;
}
/****************
Output
$ ./test
你好世界這是演示樣本
你好世界這
*/
There seems to be a problem when I'm writing words with foreign characters (French...).
For example, if I ask for input for an std::string or a char[] like this:
std::string s;
std::cin>>s; //if we input the string "café"
std::cout<<s<<std::endl; //outputs "café"
Everything is fine.
Although if the string is hard-coded
std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"
What is going on? What characters are supported by C++ and how do I make it work right? Does it have something to do with my operating system (Windows 10)? My IDE (VS 15)? Or with C++?
In a nutshell, if you want to pass/receive Unicode text to/from the console on Windows 10 (in fact, any version of Windows), you need to use wide strings, i.e. std::wstring. Windows itself doesn't support UTF-8 encoding. This is a fundamental OS limitation.
The entire Win32 API, on which things like console and file system access are based, only works with Unicode characters under the UTF-16 encoding, and the C/C++ runtimes provided in Visual Studio don't offer any kind of translation layer to make this API UTF-8 compatible. This doesn't mean you can't use UTF-8 encoding internally, it just means that when you hit the Win32 API, or a C/C++ runtime feature that uses it, you'll need to convert between UTF-8 and UTF-16 encoding. It sucks, but it's just where we are right now.
Some people might direct you to a series of tricks that purport to make the console work with UTF-8. Don't go this route; you'll run into a lot of problems. Only wide-character strings are properly supported for Unicode console access.
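As a minimal sketch of that wide-character route (assuming a Visual Studio toolchain; _setmode with _O_U16TEXT switches stdout to UTF-16 so the wide output reaches the console intact):
#include <io.h>      // _setmode
#include <fcntl.h>   // _O_U16TEXT
#include <cstdio>    // _fileno, stdout
#include <iostream>
int main() {
    // Switch stdout to UTF-16 mode; after this, only wide output (wcout/wprintf) should be used.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"caf\x00E9 and \x03C0\n";   // é and π, written as wide escapes
    return 0;
}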
Edit: Because UTF-8/UTF-16 string conversion is non-trivial, and there also isn't much help provided for this in C++, here are some conversion functions I prepared earlier:
///////////////////////////////////////////////////////////////////////////////////////////////////
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
// Convert the encoding of the supplied string
std::wstring stringUTF16;
size_t sourceStringPos = 0;
size_t sourceStringSize = stringUTF8.size();
stringUTF16.reserve(sourceStringSize);
while (sourceStringPos < sourceStringSize)
{
// Determine the number of code units required for the next character
static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];
// Ensure that the requested number of code units are left in the source string
if ((sourceStringPos + codeUnitCount) > sourceStringSize)
{
break;
}
// Convert the encoding of this character
switch (codeUnitCount)
{
case 1:
{
stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
break;
}
case 2:
{
unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
stringUTF16.push_back((wchar_t)unicodeCodePoint);
break;
}
case 3:
{
unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
(((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
stringUTF16.push_back((wchar_t)unicodeCodePoint);
break;
}
case 4:
{
unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
(((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
(((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
stringUTF16.push_back(convertedCodeUnit1);
stringUTF16.push_back(convertedCodeUnit2);
break;
}
}
// Advance past the converted code units
sourceStringPos += codeUnitCount;
}
// Return the converted string to the caller
return stringUTF16;
}
///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
// Convert the encoding of the supplied string
std::string stringUTF8;
size_t sourceStringPos = 0;
size_t sourceStringSize = stringUTF16.size();
stringUTF8.reserve(sourceStringSize * 2);
while (sourceStringPos < sourceStringSize)
{
// Check if a surrogate pair is used for this character
bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);
// Ensure that the requested number of code units are left in the source string
if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
{
break;
}
// Decode the character from UTF-16 encoding
unsigned int unicodeCodePoint;
if (usesSurrogatePair)
{
unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) | ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
}
else
{
unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
}
// Encode the character into UTF-8 encoding
if (unicodeCodePoint <= 0x7F)
{
stringUTF8.push_back((char)unicodeCodePoint);
}
else if (unicodeCodePoint <= 0x07FF)
{
char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
stringUTF8.push_back(convertedCodeUnit1);
stringUTF8.push_back(convertedCodeUnit2);
}
else if (unicodeCodePoint <= 0xFFFF)
{
char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
stringUTF8.push_back(convertedCodeUnit1);
stringUTF8.push_back(convertedCodeUnit2);
stringUTF8.push_back(convertedCodeUnit3);
}
else
{
char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
stringUTF8.push_back(convertedCodeUnit1);
stringUTF8.push_back(convertedCodeUnit2);
stringUTF8.push_back(convertedCodeUnit3);
stringUTF8.push_back(convertedCodeUnit4);
}
// Advance past the converted code units
sourceStringPos += (usesSurrogatePair) ? 2 : 1;
}
// Return the converted string to the caller
return stringUTF8;
}
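For completeness, a small usage sketch of the two helpers above (just a round trip; the assert should hold for any well-formed input):
#include <cassert>
#include <string>
int main() {
    std::string original = "caf\xC3\xA9";       // "café" encoded as UTF-8
    std::wstring wide = UTF8ToUTF16(original);   // suitable for the Win32 "W" APIs
    std::string back = UTF16ToUTF8(wide);        // back to UTF-8 for internal storage
    assert(back == original);
    return 0;
}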
I was in charge of the unenviable task of converting a 6 million line legacy Windows app to support Unicode, when it was only written to support ASCII (in fact its development pre-dates Unicode), where we used std::string and char[] internally to store strings. Since changing all the internal string storage buffers was simply not possible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 when hitting the Win32 API. These are the conversion functions we used.
I would strongly recommend sticking with what's supported for new Windows development, which means wide strings. That said, there's no reason you can't base the core of your program on UTF-8 strings, but it will make things more tricky when interacting with Windows and various aspects of the C/C++ runtimes.
Edit 2: I've just re-read the original question, and I can see I didn't answer it very well. Let me give some more info that will specifically answer your question.
What's going on? When developing with C++ on Windows, if you use std::string with std::cin/std::cout, the console I/O is done using MBCS encoding. This is a deprecated mode under which characters are encoded using the currently selected code page on the machine. Values encoded under these code pages are not Unicode, and cannot be shared with other systems that have a different code page selected, or even with the same system if the code page is changed.
It works perfectly in your test because you're capturing the input under the current code page and displaying it back under the same code page. If you try capturing that input and saving it to a file, inspection will show it's not Unicode. Load it back with a different code page selected in your OS, and the text will appear corrupted. You can only interpret text if you know what code page it was encoded in. Since these legacy code pages are regional, and none of them can represent all text characters, it is effectively impossible to share text universally across different machines and computers.
MBCS pre-dates the development of Unicode, and it was specifically because of these kinds of issues that Unicode was invented. Unicode is basically the "one code page to rule them all". You might be wondering why UTF-8 isn't a selectable "legacy" code page on Windows. A lot of us are wondering the same thing. Suffice to say, it isn't. As such, you shouldn't rely on MBCS encoding, because you can't get Unicode support when using it. Your only option for Unicode support on Windows is using std::wstring and calling the UTF-16 Win32 APIs.
As for your example about the string being hard-coded, first of all understand that encoding non-ASCII text into your source file puts you into the realm of compiler-specific behaviour. In Visual Studio, you can actually specify the encoding of the source file (Under File->Advanced Save Options). In your case, the text is coming out different to what you'd expect because it's being encoded (most likely) in UTF-8, but as mentioned, the console output is being done using MBCS encoding on your currently selected code page, which isn't UTF-8. Historically, you would have been advised to avoid any non-ASCII characters in source files, and escape any using the \x notation. Today, there are C++11 string prefixes and suffixes that guarantee various encoding forms. You could try using these if you need this ability. I have no practical experience using them, so I can't advise if there are any issues with this approach.
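As a small illustration of those C++11 prefixes (a sketch; under C++11/14/17 a u8 literal has type const char[] and its bytes are guaranteed to be UTF-8 regardless of the source file encoding, provided the compiler reads the source correctly; in C++20 the element type changes to char8_t):
#include <string>
#include <cassert>
int main() {
    std::string s = u8"caf\u00E9";   // é given as a universal character name, stored as UTF-8
    assert(s.size() == 5);           // 'c' 'a' 'f' plus the two bytes 0xC3 0xA9 for é
    assert(static_cast<unsigned char>(s[3]) == 0xC3);
    assert(static_cast<unsigned char>(s[4]) == 0xA9);
    return 0;
}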
The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) for default file encoding, and yet another (Code Page 850 in your case) for console I/O. Your source file is encoded in Windows-1252, where é equates to the single byte '\xe9'. When you display this same code in Code Page 850, it becomes Ú. Using u8"é" produces a two byte sequence "\xc3\xa9", which prints on the console as ├®.
Probably the easiest solution is to avoid putting non-ASCII literals in your code altogether and use the hex code for the character you require. This won't be a pretty or portable solution though.
std::string s="caf\x82";
A better solution would be to use UTF-16 (wide) strings and encode them to UTF-8 using WideCharToMultiByte.
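A minimal sketch of that conversion (assuming <windows.h> is available; error handling omitted for brevity):
#include <windows.h>
#include <string>
// Converts a UTF-16 wide string to UTF-8 via the Win32 API.
std::string WideToUtf8(const std::wstring& wide)
{
    if (wide.empty()) return std::string();
    // First call computes the required buffer size, second call does the conversion.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                    nullptr, 0, nullptr, nullptr);
    std::string utf8(bytes, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &utf8[0], bytes, nullptr, nullptr);
    return utf8;
}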
What characters are supported by C++
The C++ standard does not specify which characters are supported. It is implementation-specific.
Does it have something to do with...
... C++?
No.
... My IDE?
No, although an IDE might have an option to save a source file in a particular encoding.
... my operating system?
This may have an influence.
This is influenced by several things.
What is the encoding of the source file.
What is the encoding that the compiler uses to interpret the source file.
Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly).
The native encoding of your operating system probably influences what character encoding your compiler expects by default.
What encoding does the terminal that runs the program support.
Is it the same as the encoding of the file, or different (it should be the same or it might not work correctly without conversion).
Is the character encoding used wide? By wide, I mean whether the width of a code unit is more than CHAR_BIT. A wide source / compiler encoding will cause a conversion into another, narrow encoding, since you use a narrow string literal and a narrow stream operator. In this case, you'll need to figure out both the native narrow and the native wide character encoding expected by the compiler. The compiler will convert the input string into the narrow encoding. If the narrow encoding has no representation for a character in the input encoding, it might not work correctly.
An example:
Source file is encoded in UTF-8. Compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.
The trick here is setlocale:
#include <clocale>
#include <string>
#include <iostream>
int main() {
std::setlocale(LC_ALL, "");
std::string const s("café");
std::cout << s << '\n';
}
The output for me with the Windows 10 Command Prompt is correct, even without changing the terminal codepage.
In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++
There is a really nice, concise piece of C++ code that converts ISO-8859-1 strings to UTF-8.
In this answer: https://stackoverflow.com/a/4059934/3426514
I'm still a beginner at C++ and I'm struggling to understand how this works. I have read up on the encoding sequences of UTF-8, and I understand that below 128 the chars are the same, and above 127 the first byte gets a prefix and the rest of the bits are spread over a couple of bytes starting with 10xx, but I see no bit shifting in this answer.
If someone could help me to decompose it into a function that only processes 1 character, it would really help me understand.
Code, commented.
This relies on the fact that Latin-1 characters 0x00 through 0xff map to the consecutive UTF-8 code sequences 0x00-0x7f, 0xc2 0x80-0xbf, 0xc3 0x80-0xbf.
// converting one byte (Latin-1 character) of input per iteration;
// note: 'in' and 'out' should be pointers to unsigned char for the comparisons below to work
while (*in)
{
if ( *in < 0x80 )
{
// just copy
*out++ = *in++;
}
else
{
// first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
// (the condition in () evaluates to true / 1)
*out++ = 0xc2 + ( *in > 0xbf );
// second byte is the lower six bits of the input byte
// with the highest bit set (and, implicitly, the second-
// highest bit unset)
*out++ = ( *in++ & 0x3f ) + 0x80;
}
}
The problem with a function processing a single (input) character is that the output could be either one or two bytes, making the function a bit awkward to use. You are usually better off (both in performance and cleanliness of code) with processing whole strings.
Note that the assumption of Latin-1 as input encoding is very likely to be wrong. For example, Latin-1 doesn't have the Euro sign (€), or any of these characters ŠšŽžŒœŸ, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yea, that sounds about right.")
All that being said, that's the C way to do it. The C++ way would (probably, hopefully) look more like this:
#include <unicode/unistr.h>
#include <unicode/bytestream.h>
// ...
icu::UnicodeString ustr( in, "ISO-8859-1" );
// ...work with a properly Unicode-aware string class...
// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );
That is using the International Components for Unicode (ICU) library. Note how easily this is adapted to a different input encoding. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.
I don't know how to solve that:
Imagine, we have 4 websites:
A: UTF-8
B: ISO-8859-1
C: ASCII
D: UTF-16
My Program written in C++ does the following: It downloads a website and parses it. But it has to understand the content. My problem is not the parsing which is done with ASCII-characters like ">" or "<".
The problem is that the program should find all words out of the website's text. A word is any combination of alphanumerical characters.
Then I send these words to a server. The database and the web-frontend are using UTF-8.
So my questions are:
How can I convert "any" (or the most used) character encoding to UTF-8?
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
I know about UTF8-CPP but it has no is*() functions. And as I read, it does not convert from other character encodings to UTF-8. Only from UTF-* to UTF-8.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
How can I convert "any" (or the most used) character encoding to UTF-8?
ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly, instead of wrappers (like Boost).
You create a converter for a given encoding...
#include <unicode/ucnv.h>
UConverter * converter;
UErrorCode err = U_ZERO_ERROR;
converter = ucnv_open( "8859-1", &err );
if ( U_SUCCESS( err ) )
{
// ...
ucnv_close( converter );
}
...and then use the UnicodeString class as appropriate.
I think wchar_t does not work because it is 2 bytes long.
The size of wchar_t is implementation-defined. AFAICR, on Windows it is 2 bytes (UCS-2 / UTF-16, depending on the Windows version), on Linux 4 bytes (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.
Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?
Not for their UTF-8 encoding, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 are the better choice. The above-mentioned functions do exist for Unicode code points (i.e., UChar32); see uchar.h.
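For instance, a rough sketch with the ICU C API from uchar.h (assuming ICU is linked; these functions operate on single code points, UChar32):
#include <unicode/uchar.h>
#include <iostream>
int main() {
    UChar32 c = 0x00C9;  // 'É', LATIN CAPITAL LETTER E WITH ACUTE
    if (u_isalnum(c)) {
        // u_tolower maps the code point, not a byte: U+00C9 -> U+00E9
        std::cout << "alphanumeric; lowercase is U+" << std::hex << u_tolower(c) << '\n';
    }
    // U+3000 IDEOGRAPHIC SPACE is whitespace that plain isspace() would not recognise
    std::cout << std::boolalpha << "U+3000 is space: " << (u_isspace(0x3000) != 0) << '\n';
    return 0;
}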
Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.
Check BreakIterator.
Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...
In case I haven't said it already, do use ICU, and save yourself tons of trouble. Even if it might seem a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (using it on Windows, Linux, and AIX myself), and you will use it again and again and again in projects to come, so time invested in learning its API is not wasted.
Not sure if this will give you everything you're looking for, but it might help a little.
Have you tried looking at:
1) Boost.Locale library ?
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
2) Or at the conversions that are part of C++11?
#include <codecvt>
#include <locale>
#include <string>
#include <cassert>
int main() {
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
std::string utf8 = convert.to_bytes(0x5e9);
assert(utf8.length() == 2);
assert(utf8[0] == '\xD7');
assert(utf8[1] == '\xA9');
}
How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...
This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string/std::wstring.
Then the user can elegantly operate on codepoints, while their representation is always encoded in chars.
How can I convert "any" (or the most used) character encoding to
UTF-8?
You might want to have a look at std::codecvt_utf8 and similar templates from <codecvt> (C++11).
UTF-8 is an encoding that uses multiple bytes for characters outside the 7-bit ASCII range, utilising the 8th bit. As such you won't find '\' or '/' inside a multi-byte sequence. And isdigit works (though not for Arabic and other digits).
It is a superset of ASCII and can hold all Unicode characters, so definitely to use with char and string.
Inspect the HTTP headers (case insensitive); they are in ISO-8859-1, and precede an empty line and then the HTML content.
Content-Type: text/html; charset=UTF-8
If not present, there also there might be
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8"> <!-- HTML5 -->
ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252, the Windows Latin-1 extension that uses 0x80 - 0x9F for some special characters like curly quotes, dashes and such.
Even browsers on macOS will understand these even though ISO-8859-1 was specified.
Conversion libraries: already mentioned by @syam.
Conversion
Let's not consider UTF-16. One can read the headers, and the start of the document up to a meta statement giving the charset, as single-byte chars.
The conversion from a single-byte encoding to UTF-8 can happen via a table, for instance one generated with Java: a const char* table[] indexed by the (unsigned) char value.
table[157] = "\xEF\xBF\xBD";
public static void main(String[] args) {
final String SOURCE_ENCODING = "windows-1252";
byte[] sourceBytes = new byte[1];
System.out.println(" const char* table[] = {");
for (int c = 0; c < 256; ++c) {
String comment = "";
System.out.printf(" /* %3d */ \"", c);
if (32 <= c && c < 127) {
// Pure ASCII
if (c == '\"' || c == '\\')
System.out.print("\\");
System.out.print((char)c);
} else {
if (c == 0) {
comment = " // Unusable";
}
sourceBytes[0] = (byte)c;
try {
byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
for (int j = 0; j < targetBytes.length; ++j) {
int b = targetBytes[j] & 0xFF;
System.out.printf("\\x%02X", b);
}
} catch (UnsupportedEncodingException ex) {
comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
}
}
System.out.print("\"");
if (c < 255) {
System.out.print(",");
}
System.out.println();
}
System.out.println(" };");
}
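On the C++ side, a minimal sketch of how such a generated table might be consumed (assuming the generated array is compiled in under the name `table`, as in the snippet above):
#include <string>
// Hypothetical: 'table' is the 256-entry array emitted by the Java generator above,
// mapping each Windows-1252 byte value to its UTF-8 byte sequence.
extern const char* table[256];
std::string windows1252_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);   // rough guess; UTF-8 output may be longer than the input
    for (unsigned char c : in)
        out += table[c];          // append the precomputed UTF-8 sequence for this byte
    return out;
}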
I'm changing a piece of software in C++, which processes text in ISO Latin-1 format, to store data in an SQLite database.
The problem is that SQLite works in UTF-8... and the Java modules that use same database work in UTF-8.
I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing in the database. I need it to work in Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple conversion system (preferably back and forth) for these two charsets.
How would I do that?
ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
if(ch < 0x80) {
append(ch);
} else {
append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
append(0x80 | (ch & 0x3f));
}
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
For C++ I use this:
std::string iso_8859_1_to_utf8(std::string &str)
{
std::string strOut;
for (std::string::iterator it = str.begin(); it != str.end(); ++it)
{
uint8_t ch = *it;
if (ch < 0x80) {
strOut.push_back(ch);
}
else {
strOut.push_back(0xc0 | ch >> 6);
strOut.push_back(0x80 | (ch & 0x3f));
}
}
return strOut;
}
If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence) and put together your own translation. Depending on what you use for string storage (char buffers, std::string, or whatever), it would look somewhat different, but the idea is: walk through the source string and replace each character with a code over 127 by its UTF-8 counterpart string. Since this can potentially increase the string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.
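A minimal sketch of that two-pass idea for plain Latin-1 input (no table needed there, since every byte below 0x80 stays one byte and everything else becomes exactly two bytes):
#include <string>
std::string latin1_to_utf8(const std::string& in)
{
    // Pass one: determine the necessary target string size.
    std::string::size_type outSize = 0;
    for (unsigned char c : in)
        outSize += (c < 0x80) ? 1 : 2;
    // Pass two: perform the translation.
    std::string out;
    out.reserve(outSize);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));   // lead byte: 0xC2 or 0xC3
            out.push_back(static_cast<char>(0x80 | (c & 0x3F))); // continuation byte
        }
    }
    return out;
}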
If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.
Of course, it needs to be real ISO Latin-1, not Windows CP-1252.
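A sketch of that widening approach with UTF8-CPP (assuming the header-only utf8.h and its utf8::utf16to8 iterator interface; each Latin-1 byte becomes one UTF-16 code unit, which is then re-encoded as UTF-8):
#include <utf8.h>       // UTF8-CPP
#include <string>
#include <vector>
#include <iterator>
std::string latin1_to_utf8_via_utf16(const std::string& in)
{
    // "Widen": Latin-1 code points 0x00-0xFF have the same values in Unicode,
    // so each byte maps directly to one UTF-16 code unit.
    std::vector<unsigned short> utf16;
    utf16.reserve(in.size());
    for (unsigned char c : in)
        utf16.push_back(c);
    std::string out;
    utf8::utf16to8(utf16.begin(), utf16.end(), std::back_inserter(out));
    return out;
}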