How to convert an ASCII string to an UTF8 string in C++? - c++

How to convert an ASCII std::string to an UTF8 (Unicode) std::string in C++?

std::string ASCIIToUTF8(std::string str) {
return str;
}
Every ASCII character has the same representation in UTF8, so there is nothing to convert.
Of course, if the input string uses an extended (8-bit) ASCII character set, the answer is more complex.

ASCII is a seven-bit encoding and maps identically onto the UTF-8 encoding of the subset of characters that can be represented in ASCII.
In short, there is nothing to do. Your ASCII string is already valid UTF-8.

I assume that by ASCII you mean CP1252 or other 8 bit character set (ASCII is only 7 bits and it is directly compatible with UTF-8, no conversion required). Standard C++ cannot do it. You need e.g. Glibmm, Qt, iconv or WINAPI to do it.

Related

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version:
If I wanted to write program that can efficiently perform operations with Unicode characters, being able to input and output files in UTF-8 or UTF-16 encodings. What is the appropriate way to do this with C++?
Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant C++ code that is leak-free. I need a clear answers to:
Which string container should I pick?
std::string with UTF-8?
std::wstring (don't really know much about it)
std::u16string with UTF-16?
std::u32string with UTF-32?
Should I stick entirely to one of the above containers or change them when needed?
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
What happens when i do the following?
std::string s = u8"foo";
s += 'x';
What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Which string container should I pick?
That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generically, UTF-8 is good to use for storage and communication purposes, and is backwards compatible with ASCII. Whereas UTF-16/32 is easier to use when processing Unicode data.
std::wstring (don't really know much about it)
The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings instead. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the descretion of the compiler vendors, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
Should I stick entirely to one of the above containers or change them when needed?
Use whatever containers suit your needs.
Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.
How to properly convert between them?
There are many ways to handle that.
C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.
There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.
There are C library functions, platform system calls, etc.
Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?
Yes, if you use appropriate string literal prefixes:
u8 for UTF-8.
L for UTF-16 or UTF-32 (depending on compiler/platform).
u16 for UTF-16.
u32 for UTF-32.
Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So make sure that whatever charset you choose to save your files in, like UTF-8, that you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
Each string character may be a single-byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.
Just as std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogates to encode.
When a string container contains a UTF encoded string, each "character" is just a UTF encoded codeunit. UTF-8 encodes a Unicode codepoint as 1-4 codeunits (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 codeunits (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 codeunit (1 char32_t in a std::u32string).
What happens when i do the following?
std::string s = u8"foo";
s += 'x';
Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
How can I distinguish UTF char[] and non-UTF char[] or std::string?
You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
What are differences between wchar_t and other multi-byte character types?
Byte size and UTF encoding.
char = ANSI/MBCS or UTF-8
wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform
char8_t = UTF-8
char16_t = UTF-16
char32_t = UTF-32
Is wchar_t character or wchar_t string literal capable of storing UTF encodings?
Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?
That is a very broad topic, worthy of its own separate question by itself.

Understanding unicode codecvt

I have a UTF-16 encoded stream and I'd like to convert it into plain ASCII, i.e. if there's an ASCII character -> print it. If a codeunit represents something else I don't care e.g. chinese characters) -> output garbage.
I'm using this code
typedef std::codecvt_utf16<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
std::string converted = converterX.from_bytes(str);
and it seems to work.. but why?
documentation for codecvt_utf16 states:
std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).
UCS2 is a version of unicode as far as I know.. so this code is converting to a sequence of wchar_t bytes that represent unicode characters right? How come I'm getting ASCII bytes?
The nice thing about unicode is that unicode values 0-127 represent ASCII characters 0-127.
So, you don't even need to waste your time with std::codecvt. All you have to do is scan your UTF-16 sequence, grab all UTF-16 values in the range of 0-127 (see the wikipedia entry for UTF-16 for the simple process of extracting UTF-16 values from the bytestream), and you'll end up with plain ASCII, as if by magic. That's because, by definition, values above 127 are not plain ASCII. You can do whatever you want with all other characters.
And, if you would like to expand your universe to iso-8859-1, rather than US-ASCII, you can expand your range to 0-255. Because unicode values 128-255 are also equivalent to characters 128-255 in the iso-8859-1 codeset.

std::string conversion to char32_t (unicode characters)

I need to read a file using fstream in C++ that has ASCII as well as Unicode characters using the getline function.
But the function uses only std::string and these simple strings' characters can not be converted into char32_t so that I can compare them with Unicode characters. So please could any one give any fix.
char32_t corresponds to UTF-32 encoding, which is almost never used (and often poorly supported). Are you sure that your file is encoded in UTF-32?
If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.
Unicode is generally encoded using:
UTF-8 in text files (and web pages, etc...)
A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t
So generally, universally encoded files are in UTF-8. They use a variable number of bytes for encoding characters, from 1(ASCII characters) to 4. This means you cannot directly test the individual chars using a std::string
For this, you need to convert the UTF-8 string to wchar_t string, stored in a std::wstring.
For this, use a converter defined like this:
std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
And convert like that:
std::wstring unicodeString = converter.from_bytes(utf8String);
You can then access the individual unicode characters. Don't forget to put a "L" before each string literals, to make it a unicode string literal. For instance:
if(unicodeString[i]==L'仮')
{
info("this is some japanese character");
}

How can I convert a decimal code of a character into a Unicode string in C++?

How can I convert a decimal code of a character into a Unicode string in C++?
For example, I give it the integer 241, that is the &apos;ñ&apos; spanish letter, and I want to convert it to a Unicode string.
If your source character set is ISO 8859-1 or 8859-15 (both of which have LATIN SMALL LETTER N WITH TILDE at code point 0xF1 = 241), then the conversion needs to create the correct encoding for Unicode character U+00F1.
Now, we need to know which Unicode encoding scheme you are using. If you use UTF-8, you will need the result:
\xC3 \xB1
If you use UTF-16 BE (big endian), you need:
\x00 \xF1
If you use UTF-16 LE (little endian), you need:
\xF1 \x00
If you are using UTF-32, then you need 4 bytes instead of 2.
And if you want a string, you will need to encode the U+0000 (NULL) as a following character.
If you don't know which form you need, you have big problems; to use Unicode, you need to understand something of how the different forms are encoded. Your library may save you from a lot of the bother of understanding, but ultimately, you need to know at least a minimum about Unicode.
If the character code is determined at runtime, and you cannot use literals like explained by Jonathan, you need to rely on your toolkit. For example, in Qt:
QString codepointToString(QString codepointDecimal) {
int codepoint = codepointDecimal.toInt(); //TODO: check errors
QChar character(codepoint);
return QString(character);
}

c++: getting ascii value of a wide char

let's say i have a char array like "äa".
is there a way to get the ascii value (e.g 228) of the first char, which is a multibyte?
even if i cast my array to a wchar_t * array, i'm not able to get the ascii value of "ä", because its 2 bytes long.
is there a way to do this, im trying for 2 days now :(
i'm using gcc.
thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in 8 bit character sets ISO-8859-1, CP1252 and some others. It also is the UCS value of ä in the Unicode system. If you use string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8 and you may wish to parse the UTF-8 coding to acquire Unicode UCS values.
More likely what you really want to do is converting from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify what exactly you want? A std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ template function to do that conversion, ctype::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for you current local, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
Depends on the encoding used in your char array.
If your char array is Latin 1 encoded, then it it 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific unicode-to-wchar_t conversion routine. On linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
// handle error
}
If it's SHIFT-JIS then this doesn't work...
what you want is called transliteration - converting letters of one language to another. it has nothing about unicode and wchars. you need to have a table of mapping.