How to iterate over unicode characters in C++?

How to iterate over unicode characters in C++? - c++

I know that to get a unicode character in C++ I can do:
std::wstring str = L"\u4FF0";
However, what if I want to get all the characters in the range 4FF0 to 5FF0? Is it possible to dynamically build a unicode character? What I have in mind is something like this pseudo-code:
for (int i = 20464; i < 24560; i++ { // From 4FF0 to 5FF0
std::wstring str = L"\u" + hexa(i); // build the unicode character
// do something with str
}
How would I do that in C++?

The wchar_t type held within a wstring is an integer type, so you can use it directly:
for (wchar_t c = 0x4ff0; c <= 0x5ff0; ++c) {
std::wstring str(1, c);
// do something with str
}
Be careful trying to do this with characters above 0xffff, since depending on the platform (e.g. Windows) they will not fit into a wchar_t.
If for example you wanted to see the Emoticon block in a string, you can create surrogate pairs:
std::wstring str;
for (int c = 0x1f600; c <= 0x1f64f; ++c) {
if (c <= 0xffff || sizeof(wchar_t) > 2)
str.append(1, (wchar_t)c);
else {
str.append(1, (wchar_t)(0xd800 | ((c - 0x10000) >> 10)));
str.append(1, (wchar_t)(0xdc00 | ((c - 0x10000) & 0x3ff)));
}
}

You cannot increment over Unicode characters as if it is an array, some characters are build up out of multiple 'char's (UTF-8) and multiple 'WCHAR's (UTF-16) that's because of the diacritics etc. If you're really serious about this stuff you should use an API like UniScribe or ICU.
Some resources to read:
http://en.wikipedia.org/wiki/UTF-16/UCS-2
http://en.wikipedia.org/wiki/Precomposed_character
http://en.wikipedia.org/wiki/Combining_character
http://scripts.sil.org/cms/scripts/page.php?item_id=UnicodeNames#4d2aa980
http://en.wikipedia.org/wiki/Unicode_equivalence
http://msdn.microsoft.com/en-us/library/dd374126.aspx

What about:
for (std::wstring::value_type i(0x4ff0); i <= 0x5ff0; ++i)
{
std::wstring str(1, i);
}
Note that the code has not been tested, so it may not compile as-is.
Also, given the platform you are working on a wstring's character unit may be 2, 4, or N bytes wide- so be intentional about how you use it.

Related

How to convert an integer to a unicode character?

So I wanted to try converting Unicode to an integer for a project of mine. I tried something like this :
unsigned int foo = (unsigned int)L'آ';
std::cout << foo << std::endl;
How do I convert it back? Or in other words, How do I convert an int to the respective Unicode character ?
EDIT : I am expecting the output to be the unicode value of an integer, example:
cout << (wchar_t) 1570 ; // This should print the unicode value of 1570 (which is :آ)
I am using Visual Studio 2013 Community with it's default compiler, Windows 10 64 bit Pro
Cheers

L'آ' will work okay as a signle wide character, because it is below 0xFFFF. But in general UTF16 includes surrogate pairs, so a unicode code point cannot be represented with a single wide character. You need wide string instead.
Your problem is also partly to do with printing UTF16 character in Windows console. If you use MessageBoxW to view a wide string it will work as expected:
wchar_t buf[2] = { 0 };
buf[0] = 1570;
MessageBoxW(0, buf, 0, 0);
However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:
int utf32 = 1570;
const int mask = (1 << 10) - 1;
std::wstring str;
if(utf32 < 0xFFFF)
{
str.push_back((wchar_t)utf32);
}
else
{
utf32 -= 0x10000;
int hi = (utf32 >> 10) & mask;
int lo = utf32 & mask;
hi += 0xD800;
lo += 0xDC00;
str.push_back((wchar_t)hi);
str.push_back((wchar_t)lo);
}
MessageBox(0, str.c_str(), 0, 0);
See related posts for printing UTF16 in Windows console.

The key here is setlocale(LC_ALL, "en_US.UTF-8");. en_US is the localization string which you may want to set to a different value like zh_CN for Chinese for example.
#include <stdio.h>
#include <iostream>
int main() {
setlocale(LC_ALL, "en_US.UTF-8");
// This does not work without setlocale(LC_ALL, "en_US.UTF-8");
for(int ch=30000; ch<30030; ch++) {
wprintf(L"%lc", ch);
}
printf("\n");
return 0;
}
Things to notice here is the use of wprintf and how the formatted string is given: L"%lc" which tells wprintf to treat the string and the character as long characters.
If you want to use this method to print some variables, use the type wchat_t.
Useful links:
setlocale
wprintf

What is the correct way of processing different strings encodings via c++ main char** args?

I need some clarifications.
The problem is I have a program for windows written in C++ which uses 'wmain' windows-specific function that accepts wchar_t** as its args. So, there is an opportunity to pass whatever-you-like as a command line parameters to such program: for example, Chinese symbols, Japanese ones, etc, etc.
To be honest, I have no information about the encoding this function is usually used with. Probably utf-32, or even utf-16.
So, the questions:
What is the not windows-specific, but unix/linux way to achieve this with standard main function? My first thoughts were about usage of utf-8 encoded input strings with some kind of locales specifying?
Can somebody give a simple example of such main function? How can a std::string hold a Chinese symbols?
Can we operate with Chinese symbols encoded in utf-8 and contained in std::strings as usual when we just access each char (byte) like this: string_object[i] ?

Disclaimer: All Chinese words provided by GOOGLE translate service.
1) Just proceed as normal using normal std::string. The std::string can hold any character encoding and argument processing is simple pattern matching. So on a Chinese computer with the Chinese version of the program installed all it needs to do is compare Chinese versions of the flags to what the user inputs.
2) For example:
#include <string>
#include <vector>
#include <iostream>
std::string arg_switch = "开关";
std::string arg_option = "选项";
std::string arg_option_error = "缺少参数选项";
int main(int argc, char* argv[])
{
const std::vector<std::string> args(argv + 1, argv + argc);
bool do_switch = false;
std::string option;
for(auto arg = args.begin(); arg != args.end(); ++arg)
{
if(*arg == "--" + arg_switch)
do_switch = true;
else if(*arg == "--" + arg_option)
{
if(++arg == args.end())
{
// option needs a value - not found
std::cout << arg_option_error << '\n';
return 1;
}
option = *arg;
}
}
std::cout << arg_switch << ": " << (do_switch ? "on":"off") << '\n';
std::cout << arg_option << ": " << option << '\n';
return 0;
}
Usage:
./program --开关 --选项 wibble
Output:
开关: on
选项: wibble
3) No.
For UTF-8/UTF-16 data we need to use special libraries like ICU
For character by character processing you need to use or convert to UTF-32.

In short:
int main(int argc, char **argv) {
setlocale(LC_CTYPE, "");
// ...
}
http://unixhelp.ed.ac.uk/CGI/man-cgi?setlocale+3
And then you use mulitbyte string functions. You can still use normal std::string for storing multibyte strings, but beware that characters in them may span multiple array cells. After successfully setting the locale, you can also use wide streams (wcin, wcout, wcerr) to read and write wide strings from the standard streams.

1) with linux, you'd get standard main(), and standard char. It would use UTF-8 encoding. So chineese specific characters would be included in the string with a multibyte encoding.
***Edit:**sorry, yes: you have to set the default "" locale like here as well as cout.imbue().*
2) All the classic main() examples would be good examples. As said, chineese specific characters would be included in the string with a multibyte encoding. So if you cout such a string with the default UTF-8 locale, the cout sream would interpret the special UTF8 encoded sequences, knowing it has to agregate between 2 and 6 of each in order to produce the chineese output.
3) you can operate as usual on strings. THere are some issues however if you cout the string length for example: there is a difference between memory (ex: 3 bytes) and the chars that the user sees (ex: only 1). Same if you move with a pointer forward or backward. You have to make sure you interpret mulrtibyte encoding correctly, in order not to output an invalid encoding.
You could be interested in this other SO question.
Wikipedia explains the logic of the UTF-8 multibyte encoding. From this article you'll understand that any char u is a multibyte encoded char if:
( ((u & 0xE0) == 0xC0)
|| ((u & 0xF0) == 0xE0)
|| ((u & 0xF8) == 0xF0)
|| ((u & 0xFC) == 0xF8)
|| ((u & 0xFE) == 0xFC) )
It is followed by one or several chars such as:
((u & 0xC0) == 0x80)
All other chars are ASCII chars (i.e. not multibyte).

C++: how to convert ASCII or ANSI to UTF8 and stores in std::string

My company use some code like this:
std::string(CT2CA(some_CString)).c_str()
which I believe it converts a Unicode string (whose type is CString)into ANSI encoding, and this string is for a email's subject. However, header of the email (which includes the subject) indicates that the mail client should decode it as a unicode (this is how the original code does). Thus, some German chars like "ä ö ü" will not be properly displayed as the title.
Is there anyway that I can put this header back to UTF8 and store into a std::string or const char*?
I know there are a lot of smarter ways to do this, but I need to keep the code sticking to its original one (i.e. sent the header as std::string or const char*).
Thanks in advance.

Becareful : it's '|' and not '&' !
*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3F);

This sounds like a plain conversion from one encoding to another encoding: You can use std::codecvt<char, char, mbstate_t> for this. Whether your implementation ships with a suitable conversion, I don't know, however. From the sounds of it you just try to convert ISO-Latin-1 into Unicode. That should be pretty much trivial: the first 128 characters map (0 to 127) identically to UTF-8 and the second half conveniently map to the corresponding Unicode code points, i.e., you just need to encode the corresponding value into UTF-8. Each character will be replaced by two characters. That it, I think the conversion is something like that:
// Takes the next position and the end of a buffer as first two arguments and the
// character to convert from ISO-Latin-1 as third argument.
// Returns a pointer to end of the produced sequence.
char* iso_latin_1_to_utf8(char* buffer, char* end, unsigned char c) {
if (c < 128) {
if (buffer == end) { throw std::runtime_error("out of space"); }
*buffer++ = c;
}
else {
if (end - buffer < 2) { throw std::runtime_error("out of space"); }
*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3f);
}
return buffer;
}

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

I'm changing a software in C++, wich process texts in ISO Latin 1 format, to store data in a database in SQLite.
The problem is that SQLite works in UTF-8... and the Java modules that use same database work in UTF-8.
I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing in the database. I need it to work in Windows and Mac.
I heard ICU would do that, but I think it's too bloated. I just need a simple convertion system(preferably back and forth) for these 2 charsets.
How would I do that?

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.
for each char:
uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */
if(ch < 0x80) {
append(ch);
} else {
append(0xc0 | (ch & 0xc0) >> 6); /* first byte, simplified since our range is only 8-bits */
append(0x80 | (ch & 0x3f));
}
See http://en.wikipedia.org/wiki/UTF-8#Description for more details.
EDIT: according to a comment by ninjalj, latin-1 translates direclty to the first 256 unicode code points, so the above algorithm should work.

TO c++ i use this:
std::string iso_8859_1_to_utf8(std::string &str)
{
string strOut;
for (std::string::iterator it = str.begin(); it != str.end(); ++it)
{
uint8_t ch = *it;
if (ch < 0x80) {
strOut.push_back(ch);
}
else {
strOut.push_back(0xc0 | ch >> 6);
strOut.push_back(0x80 | (ch & 0x3f));
}
}
return strOut;
}

If general-purpose charset frameworks (like iconv) are too bloated for you, roll your own.
Compose a static translation table (char to UTF-8 sequence), put together your own translation. Depending on what do you use for string storage (char buffers, or std::string or what) it would look somewhat differently, but the idea is - scroll through the source string, replace each character with code over 127 with its UTF-8 counterpart string. Since this can potentially increase string length, doing it in place would be rather inconvenient. For added benefit, you can do it in two passes: pass one determines the necessary target string size, pass two performs the translation.

If you don't mind doing an extra copy, you can just "widen" your ISO Latin 1 chars to 16-bit characters and thus get UTF-16. Then you can use something like UTF8-CPP to convert it to UTF-8.
In fact, I think UTF8-CPP could even convert ISO Latin 1 to UTF-8 directly (utf16to8 function) but you may get a warning.
Of course, it needs to be real ISO Latin 1, not Windows CP 1232.

Convert wchar_t to char

I was wondering is it safe to do so?
wchar_t wide = /* something */;
assert(wide >= 0 && wide < 256 &&);
char myChar = static_cast<char>(wide);
If I am pretty sure the wide char will fall within ASCII range.

Why not just use a library routine wcstombs.

assert is for ensuring that something is true in a debug mode, without it having any effect in a release build. Better to use an if statement and have an alternate plan for characters that are outside the range, unless the only way to get characters outside the range is through a program bug.
Also, depending on your character encoding, you might find a difference between the Unicode characters 0x80 through 0xff and their char version.

You are looking for wctomb(): it's in the ANSI standard, so you can count on it. It works even when the wchar_t uses a code above 255. You almost certainly do not want to use it.
wchar_t is an integral type, so your compiler won't complain if you actually do:
char x = (char)wc;
but because it's an integral type, there's absolutely no reason to do this. If you accidentally read Herbert Schildt's C: The Complete Reference, or any C book based on it, then you're completely and grossly misinformed. Characters should be of type int or better. That means you should be writing this:
int x = getchar();
and not this:
char x = getchar(); /* <- WRONG! */
As far as integral types go, char is worthless. You shouldn't make functions that take parameters of type char, and you should not create temporary variables of type char, and the same advice goes for wchar_t as well.
char* may be a convenient typedef for a character string, but it is a novice mistake to think of this as an "array of characters" or a "pointer to an array of characters" - despite what the cdecl tool says. Treating it as an actual array of characters with nonsense like this:
for(int i = 0; s[i]; ++i) {
wchar_t wc = s[i];
char c = doit(wc);
out[i] = c;
}
is absurdly wrong. It will not do what you want; it will break in subtle and serious ways, behave differently on different platforms, and you will most certainly confuse the hell out of your users. If you see this, you are trying to reimplement wctombs() which is part of ANSI C already, but it's still wrong.
You're really looking for iconv(), which converts a character string from one encoding (even if it's packed into a wchar_t array), into a character string of another encoding.
Now go read this, to learn what's wrong with iconv.

An easy way is :
wstring your_wchar_in_ws(<your wchar>);
string your_wchar_in_str(your_wchar_in_ws.begin(), your_wchar_in_ws.end());
char* your_wchar_in_char = your_wchar_in_str.c_str();
I'm using this method for years :)

A short function I wrote a while back to pack a wchar_t array into a char array. Characters that aren't on the ANSI code page (0-127) are replaced by '?' characters, and it handles surrogate pairs correctly.
size_t to_narrow(const wchar_t * src, char * dest, size_t dest_len){
size_t i;
wchar_t code;
i = 0;
while (src[i] != '\0' && i < (dest_len - 1)){
code = src[i];
if (code < 128)
dest[i] = char(code);
else{
dest[i] = '?';
if (code >= 0xD800 && code <= 0xD8FF)
// lead surrogate, skip the next code unit, which is the trail
i++;
}
i++;
}
dest[i] = '\0';
return i - 1;
}

Technically, 'char' could have the same range as either 'signed char' or 'unsigned char'. For the unsigned characters, your range is correct; theoretically, for signed characters, your condition is wrong. In practice, very few compilers will object - and the result will be the same.
Nitpick: the last && in the assert is a syntax error.
Whether the assertion is appropriate depends on whether you can afford to crash when the code gets to the customer, and what you could or should do if the assertion condition is violated but the assertion is not compiled into the code. For debug work, it seems fine, but you might want an active test after it for run-time checking too.

Here's another way of doing it, remember to use free() on the result.
char* wchar_to_char(const wchar_t* pwchar)
{
// get the number of characters in the string.
int currentCharIndex = 0;
char currentChar = pwchar[currentCharIndex];
while (currentChar != '\0')
{
currentCharIndex++;
currentChar = pwchar[currentCharIndex];
}
const int charCount = currentCharIndex + 1;
// allocate a new block of memory size char (1 byte) instead of wide char (2 bytes)
char* filePathC = (char*)malloc(sizeof(char) * charCount);
for (int i = 0; i < charCount; i++)
{
// convert to char (1 byte)
char character = pwchar[i];
*filePathC = character;
filePathC += sizeof(char);
}
filePathC += '\0';
filePathC -= (sizeof(char) * charCount);
return filePathC;
}

one could also convert wchar_t --> wstring --> string --> char
wchar_t wide;
wstring wstrValue;
wstrValue[0] = wide
string strValue;
strValue.assign(wstrValue.begin(), wstrValue.end()); // convert wstring to string
char char_value = strValue[0];

In general, no. int(wchar_t(255)) == int(char(255)) of course, but that just means they have the same int value. They may not represent the same characters.
You would see such a discrepancy in the majority of Windows PCs, even. For instance, on Windows Code page 1250, char(0xFF) is the same character as wchar_t(0x02D9) (dot above), not wchar_t(0x00FF) (small y with diaeresis).
Note that it does not even hold for the ASCII range, as C++ doesn't even require ASCII. On IBM systems in particular you may see that 'A' != 65

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to iterate over unicode characters in C++? - c++

Related

How to convert an integer to a unicode character?

What is the correct way of processing different strings encodings via c++ main char** args?

C++: how to convert ASCII or ANSI to UTF8 and stores in std::string

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

Convert wchar_t to char

Categories

Resources