this->textBox1->Name = L"textBox1";
Although it seems to work without the L, what is the purpose of the prefix? The way it is used doesn't even make sense to a hardcore C programmer.
It's a wchar_t literal, for the extended character set. Wikipedia has a little discussion on this topic, and C++ examples.
'L' means wchar_t, which, as opposed to a normal character, requires 16 bits of storage rather than 8 bits (on Windows; on most Unix-like systems it is 32 bits). Here's an example:
"A" = 41
"ABC" = 41 42 43
L"A" = 00 41
L"ABC" = 00 41 00 42 00 43
A wchar_t is twice as big as a simple char (on Windows). In daily use you don't need wchar_t, but if you are using windows.h you are going to need it.
It means the text is stored as wchar_t characters rather than plain old char characters.
(I originally said it meant unicode. I was wrong about that. But it can be used for unicode.)
It means that it is a wide character, wchar_t.
Similar to 1L being a long value.
It means it's an array of wide characters (wchar_t) instead of narrow characters (char).
It's a just a string of a different kind of character, not necessarily a Unicode string.
L is a prefix used for wide strings. Each character uses several bytes (depending on the size of wchar_t). The encoding used is independent of this prefix; it is not necessarily UTF-16, unlike what other answers here state.
Here is an example of the usage:
By adding L before the character you get a wide character literal (wchar_t), which can then be returned as char32_t, since it converts implicitly:
char32_t utfRepresentation()
{
    if (m_is_white)   // m_is_white: presumably a bool member of the enclosing class
    {
        return L'♔';
    }
    return L'♚';
}
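Note that since C++11 you can also write the literal directly as a char32_t with the U prefix, which avoids relying on the implicit wchar_t-to-char32_t conversion (variable names here are just illustrative):

char32_t white_king = U'♔';   // U+2654, a char32_t literal directly
char32_t black_king = U'♚';   // U+265A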
While debugging on gcc, I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135". Which makes sense -- 万 is 0x4E07, and 0x4E in octal is 116.
Now on Apple LLVM 9.1.0 on an Intel-powered MacBook, I find that that same literal is not handled as the same string, i.e.:
u16string{u"万不得已"} == u16string{u"\007\116\015\116\227\137\362\135"}
goes from true to false. I'm still on a little-endian system, so I don't understand what's happening.
NB. I'm not trying to use the correspondence u"万不得已" == u"\007\116\015\116\227\137\362\135". I just want to understand what's happening.
I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135"
No, actually it is not. And here's why...
u"..." string literals are encoded as a char16_t-based UTF-16 encoded string on all platforms (that is what the u prefix is specifically meant for).
u"万不得已" is represented by this UTF-16 codeunit sequence:
4E07 4E0D 5F97 5DF2
On a little-endian system, that UTF-16 sequence is represented by this raw byte sequence:
07 4E 0D 4E 97 5F F2 5D
In octal, that would be represented by "\007\116\015\116\227\137\362\135" ONLY WHEN using a char-based string (note the lack of a string prefix, or u8 would also work for this example).
u"\007\116\015\116\227\137\362\135" is NOT a char-based string! It is a char16_t-based string, where each octal number represents a separate UTF-16 codeunit. Thus, this string actually represents this UTF-16 codeunit sequence:
0007 004E 000D 004E 0097 005F 00F2 005D
That is why your two u16string objects are not comparing as the same string value. Because they are really not equal.
You can see this in action here: Live Demo
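Along the same lines, here is a minimal sketch (assuming a UTF-8 encoded source file) that prints the code units of both literals and makes the mismatch visible:

#include <cstdio>
#include <string>

int main() {
    std::u16string a = u"万不得已";
    std::u16string b = u"\007\116\015\116\227\137\362\135";

    for (char16_t c : a) std::printf("%04X ", static_cast<unsigned>(c)); // 4E07 4E0D 5F97 5DF2
    std::printf("\n");
    for (char16_t c : b) std::printf("%04X ", static_cast<unsigned>(c)); // 0007 004E 000D 004E 0097 005F 00F2 005D
    std::printf("\n%s\n", a == b ? "equal" : "not equal");               // not equal
}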
A char in C++ occupies 1 byte of memory, but most Unicode characters require 2 or more bytes.
Does this mean that Unicode can't be stored in char variables in C++?
No, char isn't the only character type. If you are on Windows there is wchar_t (WCHAR), and in general a short is 2 bytes as well, but it's more about the way you want to implement and use it, i.e. the protocol. For example:
#if !defined(_NATIVE_WCHAR_T_DEFINED)
typedef unsigned short WCHAR;
#else
typedef wchar_t WCHAR;
#endif
const WCHAR* strDemo = L"consider the L";
But you need to dig deeper on the web; related material is often discussed under "multi-byte strings", so consider that in your searches.
For example, the more general old-school cross-platform BSD way:
https://www.freebsd.org/cgi/man.cgi?query=multibyte&apropos=0&sektion=0&format=html
And do not miss this: http://utf8everywhere.org
Also, since you asked the question in the first place, I assume you should know about Boost too.
C and C++ also support the wide character type wchar_t, which on Windows is 16 bits and used for Unicode UTF-16.
It is often used via the macros WCHAR or TCHAR.
You can write wide character literals / source code constants:
wchar_t c = L'a';
and the same with wide character strings:
wchar_t s[256] = L"utf-16";
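As a Windows-only sketch of the TCHAR side of this (TCHAR expands to wchar_t when UNICODE is defined and to char otherwise; the _T() macro adjusts the literal's prefix to match):

#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int main() {
    // TCHAR and _T() let the same source line build as either wide or narrow.
    TCHAR text[256] = _T("utf-16 or ansi, depending on the build");
    _tprintf(_T("%s\n"), text);   // maps to wprintf or printf accordingly
    return 0;
}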
First of all, you have to be aware that there is something called encoding.
There are multiple ways to represent non-ASCII characters.
The most popular encoding nowadays is UTF-8, which represents a single non-ASCII character as multiple bytes (2-4). In this encoding you CAN'T store such a character in a single char variable.
There are other encodings where a small subset of non-ASCII characters is represented as a single byte, for example ISO 8859-2. The encoding is defined by the locale, and Windows prefers such encodings; that is why Java Rookie's answer had a chance to work for you.
Other systems usually use UTF-8 for std::string, so a single character can be represented by multiple bytes.
Another approach is to use wchar_t, std::wstring, std::wcout, and std::wcin; note there are still some issues with that.
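A minimal sketch of that wide-character approach (the string is just an example written with universal character names; whether it displays correctly still depends on your locale and terminal, which is part of the issues mentioned above):

#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale(""));   // pick up the user's locale from the environment
    std::wcout.imbue(std::locale());        // wcout was constructed earlier, so imbue it explicitly

    std::wstring s = L"za\u017C\u00F3\u0142\u0107";   // "zażółć"
    std::wcout << s << L'\n';
    std::wcout << s.size() << L'\n';        // 6 wchar_t elements, one per character
}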
To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.
// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character
// encoding supports this character
Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, if the terminal emulator uses an encoding that supports this character, and if that encoding matches the compiler's execution encoding, then you can do the following:
#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}
You can also use wchar_t.
This is a very basic question, but please clarify it for me:
#define TLVTAG_APPLICATIONMESSAGE_V "\xDF01"
printf("%s\n", TLVTAG_APPLICATIONMESSAGE_V);
What will be printed?
To go step by step (using the C++ standard, 2.13.2 and 2.13.4 as references):
The #define means that you substitute the second thing wherever the first appears, so the printf is processed as printf("%s\n", "\xDF01");.
The "\xDF01" is a string of one character (plus the zero-byte terminator), and the \x means to take the next characters as a hex value, so it attempts to treat DF01 as a number in hex, and fit it into a char.
Since a standard quoted string contains chars, not wchar_ts, and you're almost certainly working with an 8-bit char, the result is implementation-defined, and without the documentation for your implementation it's really impossible to speculate further.
Now, if the string were L"\xDF01", its elements would be wchar_ts, which are wide characters, normally 16 or 32 bits, and the DF01 value would turn into one character (presumably Unicode) value, and the print statement would print characters \xDF and characters \x01, not necessarily in that order, since printf prints char, not wchar_t. wprintf would print out the whole wchar_t.
It seems somebody is trying to print a Unicode character -> �
I could have sworn I used a chr() function 40 minutes ago but can't find the file. I know it can go up to 256 so I use this:
std::string chars = "";
chars += (char) 42; //etc
So that's alright, but I really want to access Unicode characters. Can I do (wchar_t) 512? Or maybe there is something just like the unichr() function in Python? I just can't find a way to access any of those characters.
The Unicode character type is probably wchar_t in your compiler. Also, you will want to use std::wstring to hold a wide string.
std::wstring chars = L"";
chars += (wchar_t) 442; //etc
Why would you ever want to do that instead of just 'x' or whatever the character is? L'x' for wide characters.
However, if you have a desperate need, you can do (wchar_t)1004. wchar_t can be 16-bit (normally Visual Studio) or 32-bit (normally GCC). C++0x comes with char16_t and char32_t, and std::u16string.
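A minimal sketch of the char16_t/char32_t route mentioned above (the numeric values are just examples):

#include <iostream>
#include <string>

int main() {
    std::u16string s16;
    s16 += static_cast<char16_t>(442);       // U+01BA, appended by numeric value
    s16 += u'\u01BA';                        // the same character written as a literal

    std::u32string s32;
    s32 += static_cast<char32_t>(0x1F600);   // a code point that does not fit in 16 bits

    std::cout << s16.size() << ' ' << s32.size() << '\n';   // prints "2 1"
}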
Let's say I have a char array like "äa".
Is there a way to get the ASCII value (e.g. 228) of the first char, which is a multibyte character?
Even if I cast my array to a wchar_t* array, I'm not able to get the ASCII value of "ä", because it's 2 bytes long.
Is there a way to do this? I've been trying for 2 days now :(
I'm using gcc.
Thanks!
You're contradicting yourself. International characters like ä are (by definition) not in the ASCII character set, so they don't have an "ascii value".
It depends on the exact encoding of your two-character array, if you can get the code point for a single character or not, and if so which format it will be in.
You are very confused. ASCII only has values smaller than 128. Value 228 corresponds to ä in the 8-bit character sets ISO-8859-1, CP1252 and some others. It is also the UCS value of ä in the Unicode system. If you use the string literal "ä" and get a string of two characters, the string is in fact encoded in UTF-8, and you may wish to parse the UTF-8 encoding to acquire Unicode UCS values.
More likely what you really want to do is convert from one character set to another. How to do this heavily depends on your operating system, so more information is required. You also need to specify exactly what you want: a std::string or char* of ISO-8859-1, perhaps?
There is a standard C++ facility to do that conversion, std::ctype::narrow(). It is part of the localization library. It will convert the wide character to the equivalent char value for your current locale, if possible. As the other answers have pointed out, there isn't always a mapping, which is why ctype::narrow() takes a default character that it will return if there is no mapping.
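A minimal sketch of using it (whether the mapping succeeds depends on the locale you construct; '?' is the fallback character here):

#include <iostream>
#include <locale>

int main() {
    std::locale loc("");                      // the user's current locale
    wchar_t wc = L'\u00E4';                   // 'ä'
    // narrow() returns the supplied default ('?') when the locale has no mapping.
    char c = std::use_facet<std::ctype<wchar_t>>(loc).narrow(wc, '?');
    std::cout << c << '\n';
}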
Depends on the encoding used in your char array.
If your char array is Latin-1 encoded, then it is 2 bytes long (plus maybe a NUL terminator, we don't care), and those 2 bytes are:
0xE4 (lower-case a umlaut)
0x61 (lower-case a).
Note that Latin 1 is not ASCII, and 0xE4 is not an ASCII value, it's a Latin 1 (or Unicode) value.
You would get the value like this:
int i = (unsigned char) my_array[0];
If your char array is UTF-8 encoded, then it is three bytes long, and those bytes are:
binary 11000011 (first byte of UTF-8 encoded 0xE4)
binary 10100100 (second byte of UTF-8 encoded 0xE4)
0x61 (lower-case a)
To recover the Unicode value of a character encoded with UTF-8, you either need to implement it yourself based on http://en.wikipedia.org/wiki/UTF-8#Description (usually a bad idea in production code), or else you need to use a platform-specific Unicode-to-wchar_t conversion routine. On Linux this is mbstowcs or iconv, although for a single character you can use mbtowc provided that the multi-byte encoding defined for the current locale is in fact UTF-8:
#include <cstdlib>   // for mbtowc

wchar_t i;
// assumes the current locale's multi-byte encoding is UTF-8, as noted above
if (mbtowc(&i, my_array, 3) == -1) {
    // handle error
}
If it's SHIFT-JIS then this doesn't work...
What you want is called transliteration: converting letters of one language to another. It has nothing to do with Unicode and wchars; you need to have a mapping table.
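For illustration only, such a mapping table might look like the sketch below (the entries are invented examples; a real table would have to cover whatever characters your input actually uses):

#include <iostream>
#include <map>
#include <string>

std::string transliterate(const std::wstring& in) {
    // Invented sample entries mapping a few accented characters to ASCII approximations.
    static const std::map<wchar_t, std::string> table = {
        { L'\u00E4', "ae" },   // ä
        { L'\u00F6', "oe" },   // ö
        { L'\u00FC', "ue" },   // ü
        { L'\u00DF', "ss" },   // ß
    };

    std::string out;
    for (wchar_t c : in) {
        auto it = table.find(c);
        if (it != table.end())
            out += it->second;
        else if (c < 128)
            out += static_cast<char>(c);   // plain ASCII passes through unchanged
        else
            out += '?';                    // no mapping available
    }
    return out;
}

int main() {
    std::cout << transliterate(L"Stra\u00DFe") << '\n';   // prints "Strasse"
}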