c++: getting ascii value of a wide char - c++

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to a wchar_t * array, I'm not able to get the ASCII value of "ä", because it is 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!

You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ASCII value".
Whether you can get the code point of a single character, and in which form it will be, depends on the exact encoding of your two-character array.

You are very confused. ASCII only has values smaller than 128. The value 228 corresponds to ä in 8-bit character sets such as ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in the Unicode system. If you use the string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to acquire the Unicode UCS value.
More likely, what you really want to do is convert from one character set to another. How to do this depends heavily on your operating system, so more information is required. You also need to specify what exactly you want: a std::string or char* in ISO-8859-1, perhaps?
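For reference, if the two bytes really are the UTF-8 encoding of ä (0xC3 0xA4), the UCS value can be recovered by hand from the two-byte pattern. A minimal sketch (not from the original answer, just an illustration of the decoding):
#include <cstdio>
int main() {
    const char s[] = "\xC3\xA4" "a";    // the raw bytes of UTF-8 "äa"
    // A two-byte UTF-8 sequence is 110xxxxx 10yyyyyy; the code point is xxxxxyyyyyy.
    unsigned int cp = ((static_cast<unsigned char>(s[0]) & 0x1F) << 6)
                    |  (static_cast<unsigned char>(s[1]) & 0x3F);
    std::printf("%u\n", cp);            // prints 228
    return 0;
}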

There is a standard C++ facet to do that conversion, std::ctype<wchar_t>::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why narrow() takes a default character that it will return if there is no mapping.
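A minimal sketch of using the facet, assuming a Latin-1 locale such as "de_DE.ISO-8859-1" is installed on the system (the locale name is an assumption; adjust it for your platform):
#include <iostream>
#include <locale>
int main() {
    std::locale latin1("de_DE.ISO-8859-1");   // assumed locale name
    const std::ctype<wchar_t>& ct = std::use_facet<std::ctype<wchar_t> >(latin1);
    char c = ct.narrow(L'\u00E4', '?');       // '?' is returned if there is no mapping
    std::cout << static_cast<int>(static_cast<unsigned char>(c)) << "\n";   // 228 on success
    return 0;
}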

Depends on the encoding used in your char array.
If your char array is Latin-1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin-1 is not ASCII, and 0xE4 is not an ASCII value; it's a Latin-1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement the decoding yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific multibyte-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc, provided that the multibyte encoding defined for the current locale is in fact UTF-8:
setlocale(LC_ALL, "");    // mbtowc uses the current locale's multibyte encoding (assumed UTF-8 here)
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error: not a valid multibyte sequence in the current locale
}
If it's SHIFT-JIS then this doesn't work...

What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchars; you need a mapping table.

Related

Understanding unicode codecvt

I have a UTF-16 encoded stream and I'd like to convert it into plain ASCII, i.e. if there's an ASCII character, print it. If a code unit represents something else (e.g. Chinese characters) I don't care: output garbage.
I'm using this code
typedef std::codecvt_utf16<wchar_t> convert_typeX;
std::wstring_convert<convert_typeX, wchar_t> converterX;
std::wstring converted = converterX.from_bytes(str);
and it seems to work.. but why?
documentation for codecvt_utf16 states:
std::codecvt_utf16 is a std::codecvt facet which encapsulates conversion between a UTF-16 encoded byte string and UCS2 or UCS4 character string (depending on the type of Elem).
UCS2 is a version of Unicode as far as I know, so this code is converting to a sequence of wchar_t values that represent Unicode characters, right? How come I'm getting ASCII?
The nice thing about unicode is that unicode values 0-127 represent ASCII characters 0-127.
So, you don't even need to waste your time with std::codecvt. All you have to do is scan your UTF-16 sequence, grab all UTF-16 values in the range of 0-127 (see the wikipedia entry for UTF-16 for the simple process of extracting UTF-16 values from the bytestream), and you'll end up with plain ASCII, as if by magic. That's because, by definition, values above 127 are not plain ASCII. You can do whatever you want with all other characters.
And, if you would like to expand your universe to iso-8859-1, rather than US-ASCII, you can expand your range to 0-255. Because unicode values 128-255 are also equivalent to characters 128-255 in the iso-8859-1 codeset.
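As an illustration of the scanning approach described above, here is a minimal sketch; it assumes the UTF-16 code units have already been read out of the byte stream with the correct endianness, and keep_ascii is just an illustrative name:
#include <string>
// Keep only the UTF-16 code units in the ASCII range; everything else is dropped
// (surrogate halves are always >= 0xD800, so they are dropped too).
std::string keep_ascii(const std::u16string& utf16) {
    std::string out;
    for (char16_t unit : utf16) {
        if (unit < 128)
            out.push_back(static_cast<char>(unit));
    }
    return out;
}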

Conversion from std::wstring to ISO Latin1 format in c++

I have a need to convert from std::wstring to ISO Latin-1. After reading several forums I am confused. wstring supports the Unicode character set, in which each character is two bytes, whereas ISO Latin-1 uses just one byte. But the first 256 code points are the same in both.
1. Is ISO Latin-1 a multibyte encoding? If so, do I need to use wcstombs() to convert from wstring to ISO Latin-1?
2. Do I need to convert the input wstring to ISO Latin-1, and if so, how do I achieve that?
Please help me understand this.
In Windows wchar_t is 16 bits.
When there are no surrogate pairs (a character represented as 2 successive wchar_t values), you know that any wchar_t value < 256 is Latin-1, and otherwise not.
Surrogate pair values are easily recognizable as such because they lie in a value range reserved for that purpose.
Actually, this means that you know that any wchar_t value <256 is Latin-1, and otherwise not, regardless of surrogate pairs.
And no, Latin-1 is not a multibyte encoding. "Multibyte" refers to encodings where the number of bytes per character can vary.
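A minimal sketch of that rule, assuming that substituting '?' for anything outside Latin-1 is acceptable (to_latin1 is an illustrative name, not a library function):
#include <string>
// Map each wchar_t below 256 directly to its Latin-1 byte; substitute '?'
// for everything else (including the halves of surrogate pairs).
std::string to_latin1(const std::wstring& ws) {
    std::string out;
    out.reserve(ws.size());
    for (wchar_t wc : ws) {
        unsigned long value = static_cast<unsigned long>(wc);
        out.push_back(value < 256 ? static_cast<char>(value) : '?');
    }
    return out;
}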

std::string and UTF-8 encoded unicode

If I understand correctly, it is possible to use both string and wstring to store UTF-8 text.
With char, ASCII characters take a single byte, some chinese characters take 3 or 4, etc. Which means that str[3] doesn't necessarily point to the 4th character.
With wchar_t it's the same thing, but the minimal number of bytes used per character is always 2 (instead of 1 for char), and a 3- or 4-byte character will take 2 wchar_t.
Right?
So, what if I want to use string::find_first_of() or string::compare(), etc. with such a weirdly encoded string? Will it work? Does the string class handle the fact that characters have a variable size? Or should I only use them as dummy feature-less byte arrays, in which case I'd rather go for a wchar_t[] buffer?
If std::string doesn't handle that, second question: are there libraries providing string classes that could handle that UTF-8 encoding, so that str[3] actually points to the 3rd character (which would be a byte array of length 1 to 4)?
You are talking about Unicode. Unicode uses 32 bits to represent a character. However, since that wastes memory, there are more compact encodings. UTF-8 is one such encoding. It uses bytes as units and maps Unicode characters to 1, 2, 3 or 4 bytes. UTF-16 is another; it uses 16-bit words as units and maps Unicode characters to 1 or 2 words (2 or 4 bytes).
You can use both encodings with both string and wstring. UTF-8 tends to be more compact for English text/numbers.
Some things will work regardless of the encoding and type used (compare). However, all functions that need to understand a single character will be broken, i.e. the 5th character is not always the 5th entry in the underlying array. It might look like it's working with certain examples, but it will eventually break.
string::compare will work, but do not expect to get alphabetical ordering. That is language dependent.
string::find_first_of will work in some cases but not all. Long strings will likely work just because they are long, while shorter ones might get confused by character alignment and generate very hard-to-find bugs.
The best thing is to find a library that handles it for you and ignore the type underneath (unless you have strong reasons to pick one or the other).
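To illustrate the byte-count versus character-count point above, here is a minimal sketch that counts code points in a UTF-8 std::string by skipping continuation bytes (utf8_length is an illustrative name; it assumes the input is valid UTF-8):
#include <cstddef>
#include <string>
// Count code points by counting only non-continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
// utf8_length("\xC3\xA4" "a") == 2, even though the string holds 3 bytes.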
You can't handle Unicode with std::string or any other tools from the Standard Library. Use an external library such as http://utfcpp.sourceforge.net/
You are correct for those:
...Which means that str[3] doesn't necessarily point to the 4th character...only use them as dummy feature-less byte arrays...
C++'s string can only handle ASCII characters. This is different from Java's String, which can handle Unicode characters. You can store the encoding result (bytes) of Chinese characters in a string (a char in C/C++ is just a byte), but this is meaningless, as string just treats the bytes as ASCII chars, so you cannot use the string functions to process it.
wstring may be something you need.
There is something that should be clarified: UTF-8 is just an encoding method for Unicode characters (transforming characters from/to byte format).

How can I convert a decimal code of a character into a Unicode string in C++?

How can I convert a decimal code of a character into a Unicode string in C++?
For example, I give it the integer 241, that is the 'ñ' Spanish letter, and I want to convert it to a Unicode string.
If your source character set is ISO 8859-1 or 8859-15 (both of which have LATIN SMALL LETTER N WITH TILDE at code point 0xF1 = 241), then the conversion needs to create the correct encoding for Unicode character U+00F1.
Now, we need to know which Unicode encoding scheme you are using. If you use UTF-8, you will need the result:
\xC3 \xB1
If you use UTF-16 BE (big endian), you need:
\x00 \xF1
If you use UTF-16 LE (little endian), you need:
\xF1 \x00
If you are using UTF-32, then you need 4 bytes instead of 2.
And if you want a string, you will need to encode the U+0000 (NULL) as a following character.
If you don't know which form you need, you have big problems; to use Unicode, you need to understand something of how the different forms are encoded. Your library may save you from a lot of the bother of understanding, but ultimately, you need to know at least a minimum about Unicode.
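If you do end up producing UTF-8 by hand for a code point below U+0800 (which covers 241), a minimal sketch looks like this (encode_utf8 is an illustrative name, not part of any library):
#include <cstdint>
#include <string>
// Encode a code point below U+0800 as UTF-8 (one or two bytes).
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));               // plain ASCII
    } else {                                                // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
// encode_utf8(241) yields "\xC3\xB1", the UTF-8 form of U+00F1 (ñ).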
If the character code is determined at runtime, and you cannot use literals like explained by Jonathan, you need to rely on your toolkit. For example, in Qt:
QString codepointToString(QString codepointDecimal) {
    int codepoint = codepointDecimal.toInt(); // TODO: check errors
    QChar character(codepoint);
    return QString(character);
}

How does the UTF-8 support of TinyXML work?

I'm using TinyXML to parse/build XML files. Now, according to the documentation this library supports multibyte character sets through UTF-8. So far so good I think. But, the only API that the library provides (for getting/setting element names, attribute names and values, ... everything where a string is used) is through std::string or const char*. This has me doubting my own understanding of multibyte character set support. How can a string that only supports 8-bit characters contain a 16 bit character (unless it uses a code page, which would negate the 'supports Unicode' claim)? I understand that you could theoretically take a 16-bit code point and split it over 2 chars in a std::string, but that wouldn't transform the std::string to a 'Unicode' string, it would make it invalid for most purposes and would maybe accidentally work when written to a file and read in by another program.
So, can somebody explain to me how a library can offer an '8-bit interface' (std::string or const char*) and still support 'Unicode' strings?
(I probably mixed up some Unicode terminology here; sorry about any confusion coming from that).
First, UTF-8 is stored in const char * strings, as @quinmars said. And it's not only a superset of 7-bit ASCII (code points <= 127 are always encoded in a single byte as themselves), it's furthermore careful that bytes with those values are never used as part of the encoding of the multibyte values for code points >= 128. So if you see a byte == 60, it's a '<' character, etc. All of the metachars in XML are in 7-bit ASCII. So one can just parse the XML, breaking strings where the metachars say to, sticking the fragments (possibly including non-ASCII chars) into a char * or std::string, and the returned fragments remain valid UTF-8 strings even though the parser didn't specifically know UTF-8.
Further (not specific to XML, but rather clever), even more complex things generally just work (tm). For example, if you sort UTF-8 lexicographically by bytes, you get the same answer as sorting it lexicographically by code points, despite the variation in the number of bytes used, because the prefix bytes introducing the longer (and hence higher-valued) code points are numerically greater than those for lesser values.
UTF-8 is compatible with 7-bit ASCII code. If the value of a byte is larger than 127, it means a multibyte character starts. Depending on the value of the first byte you can see how many bytes the character will take; that can be 2-4 bytes including the first byte (technically 5- or 6-byte sequences are also possible, but they are not valid UTF-8). Here is a good resource about UTF-8: the UTF-8 and Unicode FAQ; the wiki page for UTF-8 is also very informative. Since UTF-8 is char based and 0-terminated, you can use the standard string functions for most things. The only important thing is that the character count can differ from the byte count. Functions like strlen() return the byte count but not necessarily the character count.
By using between 1 and 4 chars to encode one Unicode code point.
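As a small illustration of the lead-byte rule described in the answers above, a minimal sketch (utf8_sequence_length is an illustrative name; it reports 0 for anything that is not a valid lead byte):
#include <cstddef>
// How many bytes a UTF-8 sequence occupies, judged from its first byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80)           return 1;   // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // 10xxxxxx (continuation) or invalid
}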