I could have sworn I used a chr() function 40 minutes ago but can't find the file. I know char values can go up to 255, so I use this:
std::string chars = "";
chars += (char) 42; //etc
So that's alright, but I really want to access Unicode characters. Can I do (wchar_t) 512? Or maybe something like the unichr() function in Python? I just can't find a way to access any of those characters.
The Unicode character type is probably wchar_t in your compiler. Also, you will want to use std::wstring to hold a wide string.
std::wstring chars = L"";
chars += (wchar_t) 442; //etc
Why would you ever want to do that instead of just writing 'x' (or L'x' for a wide character)?
However, if you really need to, you can write (wchar_t)1004. Note that wchar_t can be 16-bit (typically Visual Studio) or 32-bit (typically GCC). C++0x (C++11) adds char16_t and char32_t, along with std::u16string and std::u32string.
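If your compiler has those newer types, a minimal sketch of building a string by code point might look like this (the code point values are just examples):
#include <string>
int main() {
    std::u32string chars;
    chars += static_cast<char32_t>(0x0444); // U+0444 CYRILLIC SMALL LETTER EF
    chars += static_cast<char32_t>(512);    // U+0200, same technique as (char)42
}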
A char in C++ occupies 1 byte, but most Unicode characters require 2 or more bytes.
Does this mean that Unicode can't be stored in char in C++?
No, char isn't the only option. On Windows there is wchar_t (WCHAR), and more generally a short is also 2 bytes. But it's really about how you want to implement and use it, e.g.:
// From the Windows headers: WCHAR is wchar_t when the compiler treats
// wchar_t as a native type, otherwise an unsigned short.
#if !defined(_NATIVE_WCHAR_T_DEFINED)
typedef unsigned short WCHAR;
#else
typedef wchar_t WCHAR;
#endif

const WCHAR* strDemo = L"consider the L";
But you will need to dig further on the web; these are also called multi-byte strings, so include that term in your searches. For example, the old-school cross-platform BSD way:
https://www.freebsd.org/cgi/man.cgi?query=multibyte&apropos=0&sektion=0&format=html
And do not miss http://utf8everywhere.org.
Also, given the question you're asking, you should look into Boost too.
C and C++ also support the wide character type wchar_t, which on Windows is 16-bit and used for Unicode (UTF-16), often via the macros WCHAR or TCHAR.
You can write wide (on Windows, 16-bit) character literals and source code constants:
wchar_t c = L'a';
and the same for wide-character strings:
wchar_t s[256] = L"utf-16";
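A quick sanity check of those sizes (a sketch; the exact numbers depend on your compiler, as noted above):
#include <iostream>
int main() {
    wchar_t s[256] = L"utf-16";
    std::cout << sizeof(wchar_t) << '\n';          // 2 with MSVC, 4 with GCC/Linux
    std::cout << sizeof(s) / sizeof(s[0]) << '\n'; // 256 elements either way
}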
First of all, you have to be aware that there is something called an encoding.
There are multiple ways to represent non-ASCII characters.
The most popular encoding nowadays is UTF-8, which represents a single non-ASCII character as 2 to 4 bytes. In this encoding you CAN'T store such a character in a single char variable.
There are other encodings where a small subset of non-ASCII characters is represented as a single byte, for example ISO 8859-2. The encoding is defined by the locale, and Windows prefers such encodings; that is why Java Rookie's answer had a chance to work for you.
Other systems usually use UTF-8 for std::string, so a single character can be represented by multiple bytes.
Another approach is to use wchar_t, wstring, wcout, and wcin, though note there are still some issues with that.
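For example, a minimal sketch showing that one non-ASCII character occupies several bytes in a UTF-8 encoded std::string (this assumes the source file itself is saved as UTF-8):
#include <iostream>
#include <string>
int main() {
    std::string s = "ф";           // U+0444, stored as the two bytes 0xD1 0x84
    std::cout << s.size() << '\n'; // prints 2, not 1
}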
To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444, so in C++ you can write it as '\u0444' or '\U00000444'. Also, if the source code encoding supports this character, you can just write it literally in your source code.
// Both of these assume that the character can be represented with
// a single char in the execution encoding.
char b = '\u0444';
char a = 'ф'; // additionally assumes the source encoding supports this character
Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator uses an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:
#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}
You can also use wchar_t; a minimal sketch of that variant follows.
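A minimal sketch, assuming the environment's locale supports the character (whether the output renders correctly depends on your terminal):
#include <iostream>
#include <locale>
int main() {
    std::locale::global(std::locale("")); // pick up the environment's locale
    std::wcout.imbue(std::locale());
    std::wcout << L"Hello, \u0444!\n";
}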
Recently I have been coming across conversions between UTF-8 encoding and strings, and vice versa. I understand that UTF-8 encoding can hold almost all the characters in the world, while char, the built-in data type used for string, can only store ASCII values. A character in UTF-8 encoding can require anywhere from 1 to 4 bytes of memory, but a char is usually 1 byte.
My question is: what happens in a conversion from wstring to string, or from wchar_t to char?
Are the characters which require more than one byte skipped? It seems to depend on the implementation, but I want to know what the correct way of doing it is.
Also, is wchar_t required to store Unicode characters? As far as I understand, Unicode characters can be stored in a normal string as well. Why should we use wstring or wchar_t?
It depends on how you convert them.
You need to specify the source encoding type and the target encoding type.
wstring is not a format; it just defines a data type.
Now, usually when one says "Unicode", one means UTF-16, which is what Microsoft Windows uses, and that is usually what a wstring contains.
So, the right way to convert from UTF-8 to UTF-16 (note that std::wstring_convert was deprecated in C++17, but it is the approach the standard library offers):
std::string utf8String = "blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::wstring utf16String = convert.from_bytes( utf8String );
And the other way around:
std::wstring utf16String = L"blah blah";
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8String = convert.to_bytes( utf16String );
And to add to the confusion: when you use std::string on a Windows platform (as in a multibyte compilation), it's NOT UTF-8. It's ANSI; more specifically, the default code page your Windows installation is using.
The Windows API functions come in two variants and expect these formats:
FunctionNameA - multibyte - ANSI
FunctionNameW - Unicode - UTF-16
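For instance (MessageBox is used here purely as an illustration of the A/W pairing):
#include <windows.h>
int main() {
    MessageBoxA(nullptr, "ANSI text", "A variant", MB_OK);
    MessageBoxW(nullptr, L"UTF-16 text: \u0444", L"W variant", MB_OK);
}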
Make your source files UTF-8 encoded and set the character set to Unicode in your IDE.
Use std::string internally and widen the strings for Windows API calls.
std::string somestring = "こんにちは";
WindowsApiW(widen(somestring).c_str());
I know it sounds kind of hacky, but a more profound explanation of this issue can be found at utf8everywhere.org.
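For reference, a minimal sketch of what such a widen() helper could look like on Windows, using MultiByteToWideChar (the helper name and the minimal error handling are assumptions, not a standard API):
#include <string>
#include <windows.h>

// Hypothetical helper: convert a UTF-8 std::string to a UTF-16 std::wstring.
std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring result(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &result[0], len);
    return result;
}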
In my code I can do:
wchar_t *s = L"...some chinese/japanese/etc string..";
and this works okay.
but if I do:
char *s = "...some chinese/japanese/etc string...";
I end up with s assigned to "???????" (not a display problem; there are actual question marks in the value).
Given that I'm on a US/1252 Windows 7 machine (VS2010) building Unicode-compiled apps, how do I create an MBCS Chinese string from a constant string literal? I do not want Unicode, but rather the MBCS representation of the Chinese characters.
So far the only way I've been able to do it is to use the Unicode version and convert it to MBCS using WideCharToMultiByte. Do I really need to do that, or enter it as a byte array?
Yes, you really do need to do that. There are no MBCS string literals in C++.
(In theory you could write something like char *s = "...\xa7\xf6\xd5..."; with the right bytes,
but that would be difficult to write and read.)
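If you do use the WideCharToMultiByte route, a minimal sketch could look like this (to_mbcs is a hypothetical helper name; CP_ACP is the system's ANSI code page, which must actually be a Chinese code page such as 936 for the conversion to succeed; on a US/1252 system the unmappable characters become '?', which is exactly what you're seeing):
#include <string>
#include <windows.h>

// Hypothetical helper: convert a UTF-16 literal to the system's ANSI/MBCS encoding.
std::string to_mbcs(const wchar_t* wide) {
    // First call with a null buffer computes the required size in bytes.
    int len = WideCharToMultiByte(CP_ACP, 0, wide, -1, nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string result(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, wide, -1, &result[0], len, nullptr, nullptr);
    result.resize(len - 1); // drop the terminating NUL that the -1 length includes
    return result;
}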
How can I convert an ANSI character (char) to a Unicode character (wchar_t) and vice versa?
Is there any cross-platform source code for this purpose?
Yes, in <cstdlib> you have mbstowcs() and wcstombs().
I've previously posted some code on how to use this, maybe that's helpful. Make sure you run the function twice: once to get the length and once to do the actual conversion, as in the sketch after this answer. (Here's a little discussion of what the functions mean.) Instead of a manual char array, I would probably prefer a std::vector<char> or std::vector<wchar_t>, come to think of it.
Note that wchar_t has nothing to do with Unicode. If you need Unicode, you need to further convert from wchar_t to Unicode using a separate library (like iconv()), and don't use wchar_t as the data type for Unicode codepoints. Instead, use uint32_t on legacy systems or char32_t on modern ones.
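A minimal sketch of the two-call pattern (the conversion is locale-dependent, hence the setlocale call):
#include <clocale>
#include <cstdlib>
#include <vector>
int main() {
    std::setlocale(LC_ALL, "");                        // use the environment's locale
    const char* src = "hello";
    std::size_t len = std::mbstowcs(nullptr, src, 0);  // first call: length only
    if (len == static_cast<std::size_t>(-1)) return 1; // invalid multibyte sequence
    std::vector<wchar_t> buf(len + 1);                 // +1 for the terminating NUL
    std::mbstowcs(buf.data(), src, buf.size());        // second call: convert
}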
Apparently this works. I don't know if it will always work or if it's a coincidence, but I thought it was worth showing (the iterator-range constructor widens each char to wchar_t one at a time, which is why it only behaves for plain ASCII):
#include <iostream>
#include <string>

const char* c = "hey yo";
std::wstring s(c, c + 6); // widens each char individually; safe only for ASCII
std::wcout << s << std::endl;
std::wcin.get();
prints
hey yo
Look at libraries like ICU and iconv if you are really using Unicode and not just 16-bit characters. That is, Unicode does not deal only in single fixed-width characters, not even 16-bit ones, the way plain wchar_t does.
I have Code::Blocks 10.05 rev 0 and gcc 4.5.2 on 64-bit Linux (Unicode build), with wxWidgets version 2.8.12.0-0.
I have a simple problem:
#define _TT(x) wxT(x)
string file_procstatus;
file_procstatus.assign("/PATH/TO/FILE");
printf("%s",file_procstatus.c_str());
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
printf outputs "/PATH/TO/FILE" normally, while the wxLogVerbose output turns into garbage. When I want to convert a std::string to a wxString I have to do the following:
wxString buf;
buf = wxString::From8BitData(file_procstatus.c_str());
Does anybody have an idea what might be wrong? Why do I need to convert from 8-bit data?
This is to do with how the character data is stored in memory. Writing "string" produces a string of char using the ASCII character set, whereas I would assume that the _TT macro expands to L"string", which creates a string of wchar_t using a Unicode encoding (UTF-32 on Linux, I believe).
The printf function expects a char string, whereas wxLogVerbose, I assume, expects a wchar_t string. This is where the need for conversion comes from. ASCII uses one byte per character (8-bit data), but wchar_t strings use multiple bytes per character, so the problem comes down to the character encoding.
If you don't want to have to call this conversion function then do something like the following:
wstring file_procstatus = wxT("/PATH/TO/FILE");
wxLogVerbose(_TT("%s"),file_procstatus.c_str());
The following article gives a good explanation of the differences between the Unicode and ASCII character sets, how they are stored in memory, and how string functions work with them:
http://allaboutcharactersets.blogspot.in/