Dealing with Japanese characters [duplicate] - c++

So I was reading 'Accelerated C++', where I read about wchar_t. I Googled a Japanese character and threw the following statement into my program:
wchar_t japs = 'の';
It gave me this error:
input.cpp:20:20: warning: multi-character character constant [-Wmultichar]
wchar_t japs = 'の';
I don't know Japanese, and I am clueless about what is happening here. I Googled a bit; some answers said it was a Linux issue, others talked about UTF-8 encoding.
Can someone explain what is actually happening? My environment is Ubuntu.

Your editor supports UTF-8. If you enter the character 'の', it will be encoded as the byte sequence [ 0xe3, 0x81, 0xae ].
wchar_t is just an integer type. You should use the UTF-8 encoding and store the characters in strings, e.g. char japs[] = "の";
If your terminal supports UTF-8 (it normally does), you can use Japanese characters in C strings just as you would use Latin characters. But keep in mind that one Japanese character occupies three or more bytes in a C string.
This type of string is called a multi-byte string. If you like trouble, you can convert a string of UTF-8 encoded characters to an array of wchar_t; usually each character will then take 32 bits. See "man mbstowcs".
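Here is a minimal sketch of what that looks like in practice, assuming a UTF-8 source file and a UTF-8 terminal (the variable name is taken from the question):

#include <cstdio>
#include <cstring>

int main() {
    // Stored as UTF-8: three bytes plus the terminating NUL.
    const char japs[] = "の";

    const std::size_t n = std::strlen(japs);
    std::printf("bytes: %zu\n", n);                          // 3
    for (std::size_t i = 0; i < n; ++i)
        std::printf("  0x%02x\n", (unsigned char)japs[i]);   // e3 81 ae

    std::printf("%s\n", japs);   // a UTF-8 terminal renders の
}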

Related

How to use Unicode character literal for Han characters in Clojure [duplicate]

I'm trying to create a Unicode character for U+20BB7, but I can't seem to figure out a way.
\uD842\uDFB7
The above doesn't work. I'm starting to think that you can't use literal Unicode character syntax for characters above \uFFFF.
Is my only option to use a string instead?
"\uD842\uDFB7"
Since it works as a string?
You can only use a string here - you're basically trying to shove two char (16-bit) values into one. See [1]:
Unicode Character Representations
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which
defined characters as fixed-width 16-bit entities. The Unicode
Standard has since been changed to allow for characters whose
representation requires more than 16 bits. The range of legal code
points is now U+0000 to U+10FFFF, known as Unicode scalar value.
(Refer to the definition of the U+n notation in the Unicode Standard.)
1: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html
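For comparison, the same distinction exists in C++ (the main language on this page): a single 16-bit code unit cannot hold U+20BB7, but a 32-bit character literal or a UTF-16 string holding the surrogate pair can. A sketch, for illustration only:

#include <cstdio>

int main() {
    // One 32-bit code point fits in char32_t.
    char32_t cp = U'\U00020BB7';

    // In UTF-16 the same character needs two 16-bit code units (a surrogate pair),
    // so it fits in a char16_t string but not in a single char16_t.
    char16_t utf16[] = u"\U00020BB7";

    std::printf("code point:   U+%X\n", (unsigned)cp);
    std::printf("UTF-16 units: 0x%X 0x%X\n", (unsigned)utf16[0], (unsigned)utf16[1]);  // 0xD842 0xDFB7
}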

byte representation of ASCII symbols in std::wstring with different locales

Windows C++ app. We have a string that contains only ASCII symbols: std::wstring(L"abcdeABCDE ... any other ASCII symbol"). Note that this is std::wstring, which uses wchar_t.
Question: does the byte representation of this string depend on the localization settings, or anything else? Can I assume that if I receive such a string (for example, from the Windows API) while the app is running, its bytes will be the same as on my PC?
In general, for characters (not escape sequences), wchar_t and wstring have to use the same codes as ASCII (just extended to 2 bytes).
But I am not sure about codes less than 32, and of course codes greater than 128 can have a different meaning (than in ASCII) at the moment of output, so to avoid problems on output, set a particular locale explicitly, e.g.:
locale("en_US.UTF-8")
and for standard output:
wcout.imbue(locale("en_US.UTF-8"));
UPDATE:
I found one more suggestion about adding
std::ios_base::sync_with_stdio(false);
before setting the locale with imbue;
see details in How can I use std::imbue to set the locale for std::wcout?
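Putting both suggestions together, a minimal sketch (the locale name "en_US.UTF-8" is an assumption; it must exist on the system, and Windows spells locale names differently):

#include <iostream>
#include <locale>

int main() {
    // Detach the C++ streams from C stdio before imbuing a locale.
    std::ios_base::sync_with_stdio(false);

    // Convert wide characters on output through a UTF-8 locale.
    std::wcout.imbue(std::locale("en_US.UTF-8"));

    std::wcout << L"abcdeABCDE \u00E6\u00F8\u00E5" << L"\n";
}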
The byte representation of the literal string does not depend on the environment. It's hardcoded to the binary data from the editor. However, the way that binary data is interpreted depends on the current code page, so you can end up with different results when converted at runtime to a wide string (as opposed to defining the string using a leading L, which means that the wide characters will be set at compile time.)
To be safe, use setlocale() to guarantee the encoding used for conversion. Then you don't have to worry about the environment.
This might help: "By definition, the ASCII character set is a subset of all multibyte-character sets. In many multibyte character sets, each character in the range 0x00 – 0x7F is identical to the character that has the same value in the ASCII character set. For example, in both ASCII and MBCS character strings, the 1-byte NULL character ('\0') has value 0x00 and indicates the terminating null character."
From:
Visual Studio Character Sets 'Not set' vs 'Multi byte character set'
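A sketch of that last point, assuming an ASCII-compatible locale is available in the environment: the bytes of the wide literal are fixed at compile time, while a narrow string converted at runtime goes through whatever locale setlocale() selected.

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cwchar>

int main() {
    // Compile time: these wide characters are encoded into the binary by the compiler.
    const wchar_t fixed[] = L"abcdeABCDE";

    // Runtime: conversion of a narrow string depends on the locale in effect.
    std::setlocale(LC_ALL, "");           // pick up the environment's locale
    const char narrow[] = "abcdeABCDE";   // pure ASCII

    wchar_t converted[32];
    std::mbstowcs(converted, narrow, 32);

    // Prints 1 for any locale whose encoding contains ASCII as a subset.
    std::printf("identical: %d\n", std::wcscmp(fixed, converted) == 0);
}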

How does string work with non-ascii symbols while char does not?

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.
So naturally when I try char ch1 = 'ø' I get a compiler error; however, string str = "øæå" works fine, even though a string is made up of chars, right?
Does string somehow switch over to Unicode?
In C++ there is the source character set and the execution character set. The source character set is what you can use in your source code; but this doesn't have to coincide with which characters are available at runtime.
It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error; this means that your compiler's documentation should include an explanation of what it does for both of these code samples. Probably you will find that str does have some sort of sequence of bytes in it that form a string.
To avoid this you could use character literals instead of embedding characters in your source code, in this case '\xF8'. If you need to use characters that aren't in the execution character set either, you can use wchar_t and wstring.
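A small sketch of the two alternatives suggested above; the value 0xF8 assumes a Latin-1 (ISO-8859-1) execution character set for the narrow literal:

#include <iostream>
#include <string>

int main() {
    // Narrow char via an escape, avoiding a non-source-charset character in the literal.
    char ch1 = '\xF8';                           // 0xF8 is 'ø' in Latin-1

    // Wide character and wide string for characters the narrow execution charset can't hold.
    wchar_t wc = L'\u00F8';
    std::wstring str = L"\u00F8\u00E6\u00E5";    // øæå

    std::cout << static_cast<int>(static_cast<unsigned char>(ch1)) << "\n";  // 248
    std::cout << static_cast<int>(wc) << "\n";                               // 248
    std::cout << str.size() << "\n";                                         // 3 wide characters
}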
Compiling the source code char c = 'ø'; gives:
source_file.cpp:2:12: error: character too large for enclosing character literal type
char c = '<U+00F8>';
^
What's happening here is that the compiler is converting the character from the source code encoding and determining that there's no representation of that character in the execution encoding that fits inside a single char. (Note that this error has nothing to do with the initialization of c; it would happen with any such character literal.)
When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, such as UTF-8 is.
To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.
What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
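A small illustration of that, using a u8 literal (C++11/14/17, where u8"..." has type const char[]) so that the bytes are UTF-8 regardless of the compiler's default execution encoding:

#include <cstdio>

int main() {
    // One char is one byte, so 'ø' cannot fit in a single char.
    // A string, on the other hand, simply holds however many bytes are needed.
    const char str[] = u8"øæå";                                  // each letter takes 2 bytes in UTF-8

    std::printf("bytes (without NUL): %zu\n", sizeof str - 1);   // 6
}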
ASCII proper defines only 128 characters (7 bits).
'ø' is code 248 (0xF8) in the 8-bit "extended ASCII" / Latin-1 range, which includes the 7-bit ASCII set as its lower half.
You can try char ch1 = '\xF8';

Manipulating strings of multibyte characters

I am a novice C programmer. I am trying to write a C program which sometimes deals with English text (fits into 8-bit chars) and sometimes Japanese text (needs 16 bits).
Do I need to set aside 16 bits for every character, even the English text if I use the same code to manipulate either country's text?
What are some of the ways of encoding multibyte characters?
What if the compiler can't store multibyte strings compactly?
I'm confused. Please help me out here, and kindly support your answers with code examples. Also, please explain the same in the context of C++, as I am also learning C++ and have only beginner-level experience in that language.
Thanks in advance.
This was an interview question asked of an acquaintance of mine a few days back.
In C++ you can use std::wstring which uses wchar_t as the underlying char type. In C++11 you can also use std::u16string or std::u32string depending on the amount of storage for a character you need.
C also has wchar_t, defined in <wchar.h>.
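A short sketch of those C++11 options (the sample text is illustrative):

#include <iostream>
#include <string>

int main() {
    std::wstring   ws  = L"こんにちは";   // wchar_t: 16-bit on Windows, 32-bit on most Unix systems
    std::u16string s16 = u"こんにちは";   // UTF-16 code units
    std::u32string s32 = U"こんにちは";   // one 32-bit code point per element

    std::cout << ws.size() << " " << s16.size() << " " << s32.size() << "\n";  // 5 5 5
}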
Okay, after doing a little bit of research, I think I got an answer:
mbstowcs ("multibyte string to wide character string") and wcstombs ("wide character string to multibyte string") convert between arrays of wchar_t (in which every character takes 16 bits, or two bytes) and multibyte strings (in which individual characters are stored in one byte if possible).

c++: getting ascii value of a wide char

Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is multibyte?
Even if I cast my array to a wchar_t* array, I'm not able to get the value of "ä", because it's 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in the Unicode system. If you use the string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to acquire the Unicode UCS value.
More likely, what you really want to do is convert from one character set to another. How to do this depends heavily on your operating system, so more information is required. You also need to specify what exactly you want: a std::string or char* in ISO-8859-1, perhaps?
There is a standard C++ template function to do that conversion, ctype::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
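A sketch of that facet in use; '?' is the fallback returned when there is no narrow equivalent in the current locale:

#include <iostream>
#include <locale>

int main() {
    std::locale loc("");                                   // the user's preferred locale
    const auto& ct = std::use_facet<std::ctype<wchar_t>>(loc);

    char a = ct.narrow(L'a', '?');                         // plain ASCII always narrows
    char u = ct.narrow(L'\u00E4', '?');                    // 'ä': may narrow or fall back to '?', depending on the locale

    std::cout << a << " " << u << "\n";
}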
Depends on the encoding used in your char array.
If your char array is Latin 1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific Unicode-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc, provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
wchar_t i;
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting the letters of one language to another. It has nothing to do with Unicode and wchar_t; you need to have a mapping table.