"warning: multi-character character constant [-Wmultichar]" and program does not work - c++

So, I am writing a program to translate runes to the English alphabet.
It gives me the warning "warning: multi-character character constant [-Wmultichar]".
Here is the code (not from the actual program, but it has the same problem); the code is in C++:
string s = "ᛡᚣ"; // a string of two UTF-8 letters (runes)
if (s.at(0) == 'ᛡ')
    cout << "YES";
But the warning is not the main problem. The problem is that when I run it, it does not output "YES". In the actual program, when I try to translate runes to the alphabet, it just runs and produces a bunch of endl calls rather than translating the runes (basically it does nothing).
P.S. I tried using different compilers. In Visual Studio an error popped up: "Debug Assertion Failed!" "Expression: string subscript out of range".
Other compilers just do nothing. I even tried to build the program using Unicode escapes instead of characters, like "\u16B3"..., but it's the same. So what should I do? Do I need a specific library for UTF-8? Please help.

If you look at the representation of the characters in your std::string you'll see that each of the characters uses multiple bytes - hence the warning. When dealing with Unicode you'll either need to use something with 32 bits to represent individual code points or you need to use multiple bytes for each code point. The use of code points is probably sufficient but does rely on the characters not using combining characters.
Comparing Unicode strings isn't entirely trivial (and I don't know all the rules). When representing the data using UTF-8 you'll need to compare byte sequences. In addition, you need to make sure that your Unicode string is normalized: some strings have different valid representations. For example, the u-umlaut in my name can be represented with a code point for u-umlaut, or with a code point for u and a combining character for the dieresis. In your code I'd guess you could use
std::string expect("ᛡ");
if (expect.size() <= s.size() && s.substr(0, expect.size()) == expect)
    std::cout << "YES\n";
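To extend that idea to the original rune-to-Latin problem, here is a minimal sketch of my own (not the asker's program). It assumes the source file and the execution character set are both UTF-8, and the rune-to-letter pairs are placeholder transliterations for illustration only. It walks the UTF-8 string one code point at a time by inspecting the lead byte and looks each code point up in a table keyed by its UTF-8 byte sequence:

#include <iostream>
#include <map>
#include <string>

// Length of a UTF-8 sequence, deduced from its lead byte.
static std::size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx
    if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
    if ((lead >> 4) == 0xE) return 3;   // 1110xxxx
    return 4;                           // 11110xxx
}

int main() {
    // Hypothetical rune-to-Latin table; keys are UTF-8 byte sequences.
    std::map<std::string, std::string> runes = {
        {"ᛡ", "j"}, {"ᚣ", "y"}
    };

    std::string s = "ᛡᚣ";
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = utf8_len(static_cast<unsigned char>(s[i]));
        std::string cp = s.substr(i, n);   // one whole code point
        auto it = runes.find(cp);
        std::cout << (it != runes.end() ? it->second : cp);
        i += n;
    }
    std::cout << '\n';                     // prints "jy" with this table
}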

Related

Why C++ returns wrong codes of some characters, and how to fix this?

I have a simple line of code:
std::cout << std::hex << static_cast<int>('©');
This character is the copyright sign; its code is a9, but the app writes c2a9. The same happens with lots of Unicode characters. Another example: ™ (which is 2122) suddenly returns e284a2. Why does C++ return the wrong codes for some characters, and how do I fix this?
Note: I'm using Microsoft Visual Studio, and the file with my code is saved as UTF-8.
An ordinary character literal (one without prefix) usually has type char and can store only elements of the execution character set that are representable as a single byte.
If the character is not representable in this way, the character literal is only conditionally-supported with type int and implementation-defined value. Compilers typically warn when this happens with some of the generic warning flags since it is a mistake most of the time. That might depend on what warning flags exactly you have enabled.
A byte is typically 8 bits, and therefore it is impossible to store all of Unicode in it. I don't know what execution character set your implementation uses, but clearly neither © nor ™ is in it.
It also seems that your implementation chose to support the non-representable character by encoding it in UTF-8 and using that as the value of the literal. You are seeing a representation of the numeric value of the UTF-8 encoding of the two characters.
If you want the numeric value of the unicode code point for the character, then you should use a character literal with U prefix, which implies that the value of the character according to UTF-32 is given with type char32_t, which is large enough to hold all unicode code points:
std::cout << std::hex << static_cast<std::uint_least32_t>(U'©');
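For comparison, here is a small self-contained sketch of my own (it assumes the source is compiled with a UTF-8 execution character set, e.g. /utf-8 on MSVC) showing both behaviours side by side:

#include <cstdint>
#include <iostream>

int main() {
    // Ordinary literal: the value is the implementation-defined multi-byte
    // encoding of '©' in the execution character set (UTF-8 here), so 0xc2a9.
    std::cout << std::hex << static_cast<int>('©') << '\n';

    // UTF-32 literal: the value is the Unicode code point itself, 0xa9.
    std::cout << std::hex << static_cast<std::uint_least32_t>(U'©') << '\n';
}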

Get decimal value of Unicode Character C++

How do I get the decimal value of a Unicode character such as "Ồ"?
std::string a = "Ồ";
unsigned char c = a[0];
long val = long(c);
cout << val << endl;
OUTPUT
7,891;
Your question may look pretty straightforward, but as we delve into it we'll find it isn't as simple as it might first appear.
The first problem is that std::string is defined as std::basic_string<char>, which isn't really compatible with "Ồ". Thus the results you get from your code will probably depend on the compiler you use and/or the environment and OS you are running on. For example, my copy of Visual Studio treats "Ồ" as an invalid ASCII character and puts "?" (or 0x3F) in `a[0]`.
The second problem is that the character "Ồ" is more than eight bits wide, so it may not fit into the variable c. Whatever the compiler puts into a[0], the variable c will only hold as many bits of that value as fit in a char. Again, the results you get are likely to change depending on the compiler you use and/or the environment you run in.
Leaving that aside, let's start by assuming the character "Ồ" is LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE (0x1ED2). With that assumption, one might imagine that the answer we are seeking is 0x1ED2, right? But not necessarily.
There are several ways to encode a Unicode character. The UTF-32 encoding is 0x1ED2 (or 0x00001ED2 if we include all the leading zeros to get thirty-two bits). The UTF-8 encoding is 0xE1BB92.
So the decimal value of "Ồ" is 7,890 if it is encoded in UTF-32, or 14,793,618 if it is encoded in UTF-8 (I'm ignoring the effects of endianness to keep things simple).
The Unicode site has a FAQ on encodings and Wikipedia has a page too.
As you can see, the answer to your question (to some extent) depends on the encoding you want to use. One C++ way to deal with encodings is std::codecvt. Another solution is to just treat your string as a sequence of bytes - which your code attempts to do - but that rather depends on you knowing how your system encodes strings, what endianness you are dealing with, etc. And the code won't necessarily be portable.
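As an illustration of the "sequence of bytes" approach, here is a small sketch of my own (not from the answer above) that decodes the first UTF-8 sequence in a std::string by hand and prints the resulting code point in decimal. It assumes the string holds valid UTF-8 and skips all error checking:

#include <iostream>
#include <string>

// Decode the first UTF-8 sequence of s into a Unicode code point.
static unsigned long first_code_point(const std::string& s) {
    unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80)                              // 1-byte sequence (ASCII)
        return b0;
    if ((b0 >> 5) == 0x6)                       // 2-byte sequence
        return ((b0 & 0x1Ful) << 6) | (s[1] & 0x3Ful);
    if ((b0 >> 4) == 0xE)                       // 3-byte sequence
        return ((b0 & 0x0Ful) << 12) | ((s[1] & 0x3Ful) << 6) | (s[2] & 0x3Ful);
    return ((b0 & 0x07ul) << 18) | ((s[1] & 0x3Ful) << 12)   // 4-byte sequence
         | ((s[2] & 0x3Ful) << 6) | (s[3] & 0x3Ful);
}

int main() {
    std::string a = "Ồ";                        // UTF-8 bytes: 0xE1 0xBB 0x92
    std::cout << first_code_point(a) << '\n';   // prints 7890 (0x1ED2)
}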
Another wrinkle to consider is that, in the general case, "Ồ" might not be one character. Obviously it is one character in your code. But if you read a string in from a disk file, say, and that file, when printed or displayed, produces "Ồ", we can't assume the file contains a single "Ồ" character.
Unicode defines COMBINING CIRCUMFLEX ACCENT (0x0302) and COMBINING GRAVE ACCENT (0x0300) as separate characters which can be combined with other characters, and it also defines precomposed characters like LATIN CAPITAL LETTER O WITH CIRCUMFLEX, so there are actually several ways you can create a string in memory (or in a disk file) that would give you the same effect as the character "Ồ".
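A short sketch of that last point, mine rather than the answer's, assuming a UTF-8 execution character set (GCC/Clang defaults, or MSVC with /utf-8): the precomposed and decomposed spellings render identically but compare unequal byte for byte.

#include <iostream>
#include <string>

int main() {
    // Precomposed: U+1ED2 LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND GRAVE.
    std::string precomposed = "\u1ED2";
    // Decomposed: U+00D4 (O WITH CIRCUMFLEX) + U+0300 (COMBINING GRAVE ACCENT).
    std::string decomposed = "\u00D4\u0300";

    std::cout << precomposed << " vs " << decomposed << '\n';   // look the same
    std::cout << std::boolalpha
              << (precomposed == decomposed) << '\n';           // false
    std::cout << precomposed.size() << " and "
              << decomposed.size() << " bytes\n";               // 3 and 4
}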

Check if all characters in UTF16 string are valid?

I have a problem where I have UTF16 strings (std::wstring) that might have "invalid" characters which causes my console terminal to stop printing (see question).
I wonder if there is a fast way to check all the characters in a string and replace any invalid chars with ?.
I know I could do something along these lines with a regex, but it would be difficult to make it validate all valid chars, and it would also be slow. Is there, for example, a numeric range for the char codes that I could use, e.g. all char codes between 26 and 5466 are valid?
It should be possible to use std::ctype<wchar_t> to determine if a character is printable:
// needs <algorithm> for std::replace_if and <locale> for std::isprint
std::locale loc;
std::replace_if(string.begin(), string.end(),
                [&](wchar_t c) -> bool { return !std::isprint(c, loc); }, L'?');
I suspect your problem is not related to the validity of the characters, but to the console's ability to print them.
What Unicode defines as "printable" does not necessarily coincide with what the console itself can actually print.
Characters like '€' are "printable" but, for example, not on Windows XP consoles.

What exactly does U+ stand for and why can't I create a table of Unicode intermediate strings in my C++ application?

I'm trying to convert an application from Java + Swing to C++ + Qt. At one point I had to deal with some Unicode intermediates. In Java, this was fairly easy:
private static String[] hiraganaTable = {
"\u3042", "\u3044", "\u3046", "\u3048", "\u304a",
"\u304b", "\u304d", "\u304f", "\u3051", "\u3053",
...
}
...whereas in C++ I'm having problems:
QString hiraganaTable[] = {
"\x30\x42", "\x30\x44", "\x30\x46", "\x30\x48", "\x30\x4a",
"\x30\x4b", "\x30\x4d", "\x30\x4f", "\x30\x51", "\x30\x53",
...
};
I couldn't use \u in VS2008 because I got a heap of warnings of the form:
character represented by universal-character-name '\u3042' cannot be represented in the current code page (1250)
And don't call me stupid, I tried to use File->Advanced Save Options to no avail, the codepage didn't seem to change at all. Seems like this is a known problem: How to create a UTF-8 string literal in Visual C++ 2008
The table I'm using is fairly short, so with the help of Vim and some introductory-level regexp magic I was able to convert it to \x30\x42 notation. Unfortunately, the QStrings would not initialize properly from such an input. I tried everything: fromAscii(), fromUtf8(), fromLocal8Bit(), QString(QByteArray), the works. Then, after writing U+3042 without a BOM to a file and viewing it in hex mode, I found out it actually turns out to be "E3 81 82". Suddenly, an entry like this seemed to work with QString::fromAscii(). Now I'm left wondering what exactly the "U+" stands for in "U+3042" (since 0xE38182 - 0x3042 = E35140, maybe I'd better add this magic constant to all my would-be Unicode chars?). How should I proceed from here to get an array of proper UTF-8 strings?
The problem is that C++ is based on C, which dates back to the ASCII age. The "default" C strings "abc" use 8-bit chars. Your Visual C++ compiler does have 16-bit Unicode (UTF-16) literals, though, with a slightly different syntax: L"abc\u3042". The type of such literals is wchar_t[N] instead of char[N], and you can store them in a std::wstring.
Qt fully understands wchar_t and QStrings can be directly constructed from them without conversion problems.
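A minimal sketch of that approach, my own illustration rather than part of the answer; it assumes Qt's QString::fromWCharArray and QString::fromUtf8, and only the first few table entries are shown:

#include <QString>

// Built from wide string literals (UTF-16 code units on MSVC)...
static const QString hiraganaTable[] = {
    QString::fromWCharArray(L"\u3042"), QString::fromWCharArray(L"\u3044"),
    QString::fromWCharArray(L"\u3046"), QString::fromWCharArray(L"\u3048"),
    QString::fromWCharArray(L"\u304a"),
    // ...
};

// ...or, equivalently, from the UTF-8 byte sequences of the same code points.
static const QString hiraganaTableUtf8[] = {
    QString::fromUtf8("\xE3\x81\x82"), QString::fromUtf8("\xE3\x81\x84"),
    // ...
};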
What you're seeing is the UTF-8 encoding of that character.
>>> u'\u3042'.encode('utf-8').encode('hex')
'e38182'
If you write them all out in UTF-8 then you should be fine.
The "U+" just indicates that you're looking at a Unicode codepoint as opposed to some specific encoding.
EDIT:
A small scriptlet to help you get started, in Python (same language as above):
>>> print ',\n'.join(', '.join('"%s"' % (y.encode('utf-8').encode('string-escape')
,) for y in x) for x in [u'あいうえお', u'かきくけこ', u'さしすせそ'])
"\xe3\x81\x82", "\xe3\x81\x84", "\xe3\x81\x86", "\xe3\x81\x88", "\xe3\x81\x8a",
"\xe3\x81\x8b", "\xe3\x81\x8d", "\xe3\x81\x8f", "\xe3\x81\x91", "\xe3\x81\x93",
"\xe3\x81\x95", "\xe3\x81\x97", "\xe3\x81\x99", "\xe3\x81\x9b", "\xe3\x81\x9d"
"U+dddd" where each d is a hexadecimal digit denotes a Unicode code point.
You cannot store 16-bit values in 8-bit chars; that's the main problem you're having.
Use wide characters, e.g. (these are string literals) L"\x3042" or L"\u3042".
Then figure out how to make QString accept those.
Note: Visual C++ will emit sillywarnings for the \u notation used within literals, while g++ will emit sillywarnings for that notation used outside literals.
Cheers & hth.,

How to use Unicode in C++?

Assuming a very simple program that:
asks for a name.
stores the name in a variable.
displays the variable's contents on the screen.
It's so simple that it's the first thing one learns.
But my problem is that I don't know how to do the same thing if I enter the name using Japanese characters.
So, if you know how to do this in C++, please show me an example (that I can compile and test).
Thanks.
user362981: Thanks for your help. I compiled the code that you wrote without problems, but then the console window appears and I cannot enter any Japanese characters into it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it will not display them either.
Svisstack : Also thanks for your help. But when I compile your code I get the following error:
warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'
You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:
With the wchar_t wide character type, ANSI/ISO C provides for
inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.
and that
The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently,
programs that need to be portable across any C or C++ compiler should not use wchar_t
for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide
characters, which may be Unicode characters in some compilers.
So, it's implementation defined. Here's two implementations: On Linux, wchar_t is 4 bytes wide, and represents text in the UTF-32 encoding (regardless of the current locale). (Either BE or LE depending on your system, whichever is native.) Windows, however, has a 2 byte wide wchar_t, and represents UTF-16 code units with them. Completely different.
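A trivial sketch to check which of the two you have; the output is platform-dependent:

#include <iostream>

int main() {
    // Typically prints 2 on Windows/MSVC (UTF-16 code units)
    // and 4 on Linux/glibc or macOS (UTF-32 code units).
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}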
A better path: Learn about locales, as you'll need to know that. For example, because I have my environment setup to use UTF-8 (Unicode), the following program will use Unicode:
#include <clocale>
#include <iostream>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}
...
$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8
But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.
Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)
On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see the two ISO-8859-2 characters "Ă¨" instead. This barfing of incorrect characters has been called Mojibake.
Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)
Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.
In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
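As a concrete example of such a conversion, here is a hedged sketch using POSIX iconv(3); it is not part of standard C++ and assumes a platform that provides iconv (glibc, macOS, or libiconv). It converts ISO-8859-2 input bytes to UTF-8 before you serialize them:

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert an ISO-8859-2 byte string to UTF-8 using POSIX iconv.
std::string latin2_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string out(in.size() * 4, '\0');           // generous worst-case size
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    size_t src_left = in.size(), dst_left = out.size();

    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - dst_left);              // trim to the bytes written
    return out;
}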
Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:
#include <iostream>
#include <string>
int main() {
std::wstring name;
std::wcout << L"Enter your name: ";
std::wcin >> name;
std::wcout << L"Hello, " << name << std::endl;
}
There are other ways, but this is sort of the "minimal change" answer.
Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html
The above article is a must-read that explains what Unicode is, but a few lingering questions remain. Yes, Unicode has a unique code point for every character in every language, and those code points can be encoded and stored in memory differently from the code point values themselves. This way we can save memory, for example by using the UTF-8 encoding, which is great if the only language supported is English, since the memory representation is then essentially the same as ASCII, provided of course that we know the encoding. In theory, if we know the encoding, we can store these wider Unicode characters however we like and read them back. But the real world is a little different.
How do you store a Unicode character or string in a C++ program? Which encoding do you use? The answer is that you don't use an encoding at all; you directly store the Unicode code points in a Unicode character string, just as you store ASCII characters in an ASCII string. The question is what character size you should use, since Unicode characters have no fixed size. The simple answer is to choose a character size wide enough to hold the highest code point (language) that you want to support.
The theory that a Unicode character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn't we be storing code points in 3 or 4 bytes then, which is what would really represent all Unicode characters? Why does Visual C++ store Unicode in wchar_t, which is only 2 bytes, clearly not enough to store every Unicode code point?
The reason we store a Unicode code point in 2 bytes in Visual C++ is actually exactly the same reason we were storing ASCII (i.e. English) characters in one byte. At that time we were thinking only of English, so one byte was enough. Now we are thinking of most international languages out there, but not all of them, so we use 2 bytes, which is enough. Yes, it's true that this representation will not allow us to represent those code points which take 3 bytes or more, but we don't care about those yet because those folks haven't even bought a computer yet. And we don't use 3 or 4 bytes because we are still stingy with memory: why store an extra zero byte with every character when we are never going to use it (that language)? Again, these are exactly the same reasons ASCII stored each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters?
In theory 2 bytes are not enough to represent every Unicode code point, but it is enough to hold anything we may ever care about for now. A true Unicode string representation could store each character in 4 bytes, but we just don't care about those languages.
Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. A single Unicode character size will grow further, perhaps to 8 bytes, to accommodate all their code points. That doesn't mean we should start using 8 bytes for each Unicode character now. Memory is a limited resource; we allocate what we need.
Can I handle a Unicode string as a C-style string?
In C++, ASCII strings can still be handled the C way, and that's fairly common: grab the string by its char* pointer and apply C functions to it. However, applying current C-style string functions to a Unicode string makes no sense, because the string can contain single NULL bytes that would terminate a C string.
A Unicode string is no longer a plain buffer of text; well, it is, but it is now more complicated than a stream of single-byte characters terminated by a NULL byte. The buffer can still be handled through its pointer, even in C, but that requires Unicode-compatible calls, or a C library that can read and write such strings and perform operations on them.
This is made easier in C++ with a specialized class that represents a Unicode string. Such a class handles the complexity of the Unicode string buffer and provides an easy interface. It also decides whether each character of the Unicode string is 2 bytes or more; these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes per character to support more (less common) languages. This is why it is always better to use TCHAR, which maps to the right size when the implementation changes, than a fixed size.
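A minimal sketch of the TCHAR idea (Windows-specific; it assumes the <tchar.h> header and the usual UNICODE/_UNICODE build switch from the Win32 headers):

#include <stdio.h>
#include <tchar.h>   // TCHAR, _T(), _tcslen, _tprintf (Windows-only)

int main() {
    // TCHAR expands to wchar_t when _UNICODE is defined and to char otherwise,
    // so the same source builds as a narrow or a wide program without edits.
    TCHAR greeting[] = _T("Hello");
    _tprintf(_T("%u characters\n"), (unsigned)_tcslen(greeting));
    return 0;
}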
How do I index a UNICODE string?
It is also worth noting, particularly for C-style handling of strings, that C code uses an index to traverse a string or find a substring. In an ASCII string this index corresponds directly to the position of the item in the string, but it has no such meaning in a Unicode string and should be avoided.
What happens to the string terminating NULL byte?
Are Unicode strings still terminated by a NULL byte? Is a single NULL byte enough to terminate the string? This is an implementation question, but a NULL terminator is still one Unicode code point, and like every other code point it must be the same size as the others (especially when no encoding is applied). So the NULL character must be two bytes as well if the Unicode string implementation is based on wchar_t. All Unicode code points are represented with the same size, whether they are the NULL character or any other.
Does the Visual C++ debugger show Unicode text?
Yes. If the text buffer is of type LPWSTR or any other type that supports Unicode, Visual Studio 2005 and later can display international text in the debugger watch window (provided fonts and language packs are installed, of course).
Summary:
C++ doesn't use any encoding to store Unicode characters; it directly stores the Unicode code point of each character in the string. It must pick a character size large enough to hold the largest character of the desired languages (loosely speaking), and that character size is fixed and used for all characters in the string.
Right now 2 bytes are sufficient to represent most of the languages we care about, which is why 2 bytes are used per code point. In the future, if a friendly new space colony is discovered that we want to communicate with, we will have to assign new Unicode code points to their language and use a larger character size to store those strings.
You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t name[256];
    setlocale(LC_ALL, "");                 /* match the terminal's locale/encoding */
    wprintf(L"Type a name: ");
    wscanf(L"%255ls", name);               /* %ls reads into wchar_t; 255 caps the length */
    wprintf(L"Typed name is: %ls\n", name);
    return 0;
}