So, I want to be able to use Chinese characters in my C++ program, and I need some type that can hold such characters beyond the ASCII range.
However, I tried to run the following code, and it worked:
#include <iostream>
int main() {
    char snet[4];
    snet[0] = '你';
    snet[1] = '爱';
    snet[2] = '我';
    std::cout << snet << std::endl;
    int conv = static_cast<int>(snet[0]);
    std::cout << conv << std::endl; // -96
}
This doesn't make sense to me: sizeof(char) in C++ evaluates to 1 for the g++ compiler, yet Chinese characters cannot be expressed in a single byte.
Why are the Chinese characters here being allowed to be housed in a char type?
What type should be used to house Chinese characters or non-ASCII characters in C++?
When you compile the code with the -Wall flag you will see warnings like:
warning: overflow in implicit constant conversion [-Woverflow]
snet[2] = '我';
warning: multi-character character constant [-Wmultichar]
snet[1] = '爱';
Visual C++ in Debug mode, gives the following warning:
c:\users\you\temp.cpp(9): warning C4566: character represented by universal-character-name '\u4F60' cannot be represented in the current code page (1252)
What is happening under the hood is that your multi-byte Chinese characters are implicitly converted to a char. That conversion overflows, and therefore you see a negative value or something weird when you print it in the console.
Why are the Chinese characters here being allowed to be housed in a char type?
You can, but you shouldn't, in the same way that you can write char c = 1000000;
What type should be used to house Chinese characters or non-ASCII characters in C++?
If you want to store Chinese characters and you can use C++11, go for UTF-8 encoding with std::string (live example).
std::string msg = u8"你爱我";
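A minimal sketch of how that looks in a full program (assuming a UTF-8 terminal and C++11 through C++17; the byte count in the comment is simply what UTF-8 uses for these particular characters):
#include <iostream>
#include <string>

int main() {
    // u8"" guarantees a UTF-8 encoded literal (C++11 through C++17;
    // in C++20 the literal's type changes to const char8_t[]).
    std::string msg = u8"你爱我";

    std::cout << msg << '\n';               // renders correctly on a UTF-8 terminal
    std::cout << msg.size() << " bytes\n";  // 9: each of these characters takes 3 UTF-8 code units
}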
Related
Being in a non-English-speaking country, I wanted to do a test with a char array and a non-ASCII character.
I compiled this code with MSVC and MinGW GCC:
#include <iostream>
int main()
{
    constexpr char const* c = "é";
    int i = 0;
    char const* s;
    for (s = c; *s; s++)
    {
        i++;
    }
    std::cout << "Size: " << i << std::endl;
    std::cout << "Char size: " << sizeof(char) << std::endl;
}
Both display Char size: 1, but MSVC displays Size: 1 and MinGW GCC displays Size: 2.
Is this undefined behaviour caused by the non-ASCII character, or is there another reason behind it (GCC encoding in UTF-8 and MSVC in UTF-16, maybe)?
The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined.
GCC defaults to UTF-8 in which the character é uses two code units and my guess is that MSVC uses code page 1252, in which the same character uses up only one code unit. (That encoding uses a single code unit per character anyway.)
Compilers typically have switches to change the ordinary literal and execution character set encoding, e.g. for GCC with the -fexec-charset option.
Also be careful that the source file is encoded in an encoding that the compiler expects. If the file is UTF-8 encoded but the compiler expects it to be something else, then it is going to interpret the bytes in the file corresponding to the intended character é as a different (sequence of) characters. That is however independent of the ordinary literal encoding mentioned above. GCC for example has the -finput-charset option to explicitly choose the source encoding and defaults to UTF-8.
If you intend the literal to be UTF-8 encoded into bytes, then you should use u8-prefixed literals, which are guaranteed to use this encoding:
constexpr auto c = u8"é";
Note that the type auto here will be const char* in C++17, but const char8_t* since C++20. s must be adjusted accordingly. This will then guarantee an output of 2 for the length (number of code units). Similarly there are u and U for UTF-16 and UTF-32 in both of which only one code unit would be used for é, but the size of code units would be 2 or 4 bytes (assuming CHAR_BIT == 8) respectively (types char16_t and char32_t).
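A hedged sketch contrasting the three Unicode literal prefixes (compiled as C++20 because of char8_t; the code unit counts follow from the encodings described above, and the source file is assumed to be saved as UTF-8 so that é is read correctly):
#include <iostream>

int main() {
    constexpr char8_t  u8s[]  = u8"é";  // UTF-8:  2 code units + terminator (char8_t is C++20)
    constexpr char16_t u16s[] = u"é";   // UTF-16: 1 code unit  + terminator
    constexpr char32_t u32s[] = U"é";   // UTF-32: 1 code unit  + terminator

    std::cout << "UTF-8 code units:  " << sizeof(u8s)  / sizeof(u8s[0])  - 1 << '\n';  // 2
    std::cout << "UTF-16 code units: " << sizeof(u16s) / sizeof(u16s[0]) - 1 << '\n';  // 1
    std::cout << "UTF-32 code units: " << sizeof(u32s) / sizeof(u32s[0]) - 1 << '\n';  // 1

    std::cout << "sizeof(char16_t): " << sizeof(char16_t) << '\n';  // typically 2
    std::cout << "sizeof(char32_t): " << sizeof(char32_t) << '\n';  // typically 4
}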
char str[] = "C:\Windows\system32";
auto raw_string = convert_to_raw(str);
std::cout << raw_string;
Desired output:
C:\Windows\system32
Is it possible? I am not a big fan of cluttering my path strings with extra backslashes. Nor do I like an explicit R"()" notation.
Any other work-around of reading a backslash in a string literally?
That's not possible: \ has special meaning inside a non-raw string literal, and raw string literals exist precisely to give you a way to avoid escaping. Give up; what you need is R"(...)".
Indeed, when you write something like
char const * str{"a\nb"};
you can verify yourself that strlen(str) is 3, not 4, which means that once you compile that line, the binary/object file contains only a single character, the newline character, corresponding to \n; there's no \ nor n anywhere in it, so there's no way you can retrieve them.
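A minimal check of that claim (just a sketch):
#include <cstring>
#include <iostream>

int main() {
    char const* str{"a\nb"};
    std::cout << std::strlen(str) << '\n';  // 3: the characters 'a', '\n', 'b'
}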
As a personal taste, I find raw string literals great! You can even put a real Enter in there, often for the price of just 3 extra characters (R, (, and )) in addition to those you would write anyway. Otherwise, you would have to write more characters to escape anything that needs escaping.
Look at
std::string s{R"(Hello
world!
This
is
Me!)"};
That's 29 keystrokes from the R to the last " included, and you can see at a glance that it spans 5 lines.
The equivalent non-raw string
std::string s{"Hello\nworld!\nThis\nis\nMe!"};
is 30 keystrokes from the first " to the last " included, and you have to parse it carefully to count the lines.
A pretty short string, and you already see the advantage.
To answer the question, as asked, no it is not possible.
As an example of the impossibility, assume we have a path specified as "C:\a\b";
Now, str is actually represented in memory (in your running program) by a statically allocated array of five characters with values {'C', ':', '\007', '\010', '\000'}, where '\xyz' is an octal representation (so '\010' is a char numerically equal to 8 in decimal).
The problem is that there is more than one way to produce that array of five characters using a string literal.
char str[] = "C:\a\b";
char str1[] = "C:\007\010";
char str2[] = "C:\a\010";
char str3[] = "C:\007\b";
char str4[] = "C:\x07\x08"; // \xmn uses hex coding
In the above, str1, str2, str3, and str4 are all initialised using equivalent arrays of five char.
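As a quick sanity check (just a sketch reusing three of the arrays above), you can verify that they are byte-for-byte identical:
#include <cstring>
#include <iostream>

int main() {
    char str[]  = "C:\a\b";
    char str1[] = "C:\007\010";
    char str4[] = "C:\x07\x08"; // \xmn uses hex coding

    // All three arrays hold the same five bytes, terminator included.
    std::cout << std::boolalpha
              << (std::memcmp(str, str1, sizeof str) == 0) << '\n'  // true
              << (std::memcmp(str, str4, sizeof str) == 0) << '\n'; // true
}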
That means convert_to_raw("C:\a\b") could quite legitimately assume it is passed ANY of the strings above AND
std::cout << convert_to_raw("C:\a\b") << '\n';
could quite legitimately produce output of
C:\007\010
(or any one of a number of other strings).
The practical problem with this, if you are working with Windows paths, is that c:\a\b, C:\007\010, C:\a\010, C:\007\b, and C:\x07\x08 are all valid filenames under Windows that (unless they are hard links or junctions) name DIFFERENT files.
In the end, if you want to have string literals in your code representing filenames or paths, then use \\ or a raw string literal when you need a single backslash. Alternatively, write your paths as string literals in your code using all forward slashes (e.g. "C:/a/b") since windows API functions accept those too.
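For illustration, a small sketch of the workable spellings just mentioned (path reused from the question):
#include <iostream>

int main() {
    const char* escaped = "C:\\Windows\\system32";   // escape each backslash
    const char* raw     = R"(C:\Windows\system32)";  // raw literal, nothing to escape
    const char* fwd     = "C:/Windows/system32";     // forward slashes also work with Windows APIs

    std::cout << escaped << '\n'  // C:\Windows\system32
              << raw     << '\n'  // C:\Windows\system32
              << fwd     << '\n'; // C:/Windows/system32
}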
In C++ Primer 5th Edition I saw this
when I tried to use it---
At this point it didn't work: the program's output gave a weird symbol for the unsigned char, while the signed one was totally blank, and the compiler also gave some warnings. But C++ Primer and so many websites said it should work... So I don't think they give the wrong information. Did I do something wrong?
I am a newbie, btw :)
But C++ primer ... said it should work
No, it doesn't. The quote from C++ Primer doesn't use std::cout at all. The output that you see doesn't contradict what the book says.
So I don't think they give the wrong information
No [1].
did I do something wrong?
It seems that you've possibly misunderstood what the value of a character means, or possibly misunderstood how character streams work.
Character types are integer types (but not all integer types are character types). The values of unsigned char are 0..255 (on systems where a byte is 8 bits). Each [2] of those values represents some textual symbol. The mapping from a set of values to a set of symbols is called a "character set" or "character encoding".
std::cout is a character stream. << is the stream insertion operator. When you insert a character into a stream, the behaviour is not to show the numerical value. Instead, the behaviour is to show the symbol that the value is mapped to [3] in the character set that your system uses. In this case, it appears that the value 255 is mapped to whatever strange symbol you saw on the screen.
If you wish to print the numerical value of a character, what you can do is convert to a non-character integer type and insert that to the character stream:
int i = c;
std::cout << i;
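A complete sketch of that conversion (assuming the value 255 discussed above; static_cast is just one way to spell the conversion):
#include <iostream>

int main() {
    unsigned char c = 255;

    std::cout << c << '\n';                    // inserts whatever symbol 255 maps to
    std::cout << static_cast<int>(c) << '\n';  // inserts the numeric value: 255
}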
[1] At least, there's no wrong information regarding your confusion. The quote is a bit inaccurate and outdated in the case of c2. Before C++20, the value was "implementation defined" rather than "undefined". Since C++20, the value is actually defined, and it is 0, which is the null terminator character that signifies the end of a string. If you try to print this character, you'll see no output.
[2] This was a bit of a lie for simplicity's sake. Some characters are not visible symbols. For example, there is the null terminator character as well as other control characters. The situation becomes even more complex in the case of variable-width encodings such as the ubiquitous UTF-8, where a symbol may consist of a sequence of several char. In such an encoding, an individual char cannot necessarily be interpreted correctly without the other char that are part of the sequence.
[3] And this behaviour should feel natural once you grok the purpose of character types. Consider the following program:
unsigned char c = 'a';
std::cout << c;
It would be highly confusing if the output would be a number that is the value of the character (such as 97 which may be the value of the symbol 'a' on the system) rather than the symbol 'a'.
For extra meditation, think about what this program might print (and feel free to try it out):
char c = 57;
std::cout << c << '\n';
int i = c;
std::cout << i << '\n';
c = '9';
std::cout << c << '\n';
i = c;
std::cout << i << '\n';
This is due to the behavior of the << operator on the char type and the character stream cout. Note that << performs formatted output, which means it does some implicit formatting.
We can say that the value of a variable is not the same as its representation in certain contexts. For example:
int main() {
    bool t = true;
    std::cout << t << std::endl; // Prints 1, not "true"
}
Think of it this way: why would we need char if it still behaved like a number when printed? Why not just use int or unsigned? In essence, we have different types so as to have different behaviors, which can be deduced from these types.
So the underlying numeric value of a char is probably not what we are looking for when we print one.
Check this for example:
int main() {
    unsigned char c = -1;
    int i = c;
    std::cout << i << std::endl; // Prints 255
}
If I recall correctly, you are somewhat close in the Primer to the topic of built-in type conversions; things will become clearer once you know those rules better. Anyway, I'm sure you will benefit greatly from looking into this article, especially the "Printing chars as integers via type casting" part.
I am using a Raspberry Pi and trying to print Unicode characters with something like this:
test.cpp:
#include<iostream>
using namespace std;
int main() {
    char a = L'\u1234';
    cout << a << endl;
    return 0;
}
When I compile with g++, I get this warning:
test.cpp: In function "int main()":
test.cpp:4:9: warning: large integer implicitly truncated to unsigned type [-Woverflow]
And the output is:
4
Also, this is not in the GUI, and my distribution is Raspbian Wheezy, if that is relevant.
In response to one of the previous answers: you should not use wchar_t and the w* functions on Linux. POSIX APIs use the char data type, and most POSIX implementations use UTF-8 as the default encoding. Quoting the C++ standard (ISO/IEC 14882:2011):
5.3.3 Sizeof
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
The result of sizeof applied to any other fundamental type (3.9.1) is
implementation-defined. [ Note: in particular, sizeof(bool),
sizeof(char16_t), sizeof(char32_t), and sizeof(wchar_t) are
implementation-defined. 74 — end note ]
UTF-8 uses 1-byte code units and up to 4 code units to represent a code point, so char is enough to store UTF-8 strings, though to manipulate them you are going to need to find out if a specific code unit is represented by multiple bytes and build your processing logic with that in mind. wchar_t has an implementation-defined size and the Linux distributions that I have seen have a size of 4 bytes for this data type.
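As a hedged sketch of what that processing logic can look like, here is one common way to count code points in a UTF-8 string by skipping continuation bytes (it assumes valid UTF-8 input and a UTF-8 execution character set, which is GCC's default):
#include <iostream>
#include <string>

// Count code points in a valid UTF-8 string: continuation bytes
// (those of the form 10xxxxxx) never start a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

int main() {
    std::string s = "\u1234";                        // three bytes: E1 88 B4
    std::cout << s.size() << " code units, "         // 3
              << utf8_length(s) << " code point\n";  // 1
}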
There is another issue: the mapping from source code to object code may transform your encoding in a compiler-specific way:
2.2 Phases of translation
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary.
Anyway, in most cases no conversion is applied to your source code, so the strings that you put into char* stay unmodified. If you encode your source code as UTF-8, then you are going to have bytes representing UTF-8 code units in your char*s.
As for your code example: it does not work as expected because 1 char has a size of 1 byte. Unicode code points may require several (up to 4) UTF-8 code units to be serialized (for UTF-8 1 code unit == 1 byte). You can see here that U+1234 requires three bytes E1 88 B4 when UTF-8 is used and, therefore, cannot be stored in a single char. If you modify your code as follows it's going to work just fine:
#include <iostream>
int main() {
    const char* str = "\u1234";
    std::cout << str << std::endl;
    return 0;
}
This is going to output ሴ, though you may see nothing depending on your console and the installed fonts; the actual bytes are going to be there either way. Note that with double quotes you also have a \0 terminator in memory.
You could also use an array, but not with single quotes since you would need a different data type (see here for more information):
#include <iostream>
int main() {
    const char* str = "\u1234";
    std::cout << str << std::endl;
    // size of the array is 4 because \0 is appended
    // for string literals and there are 3 bytes
    // needed to represent the code point
    char arr[4] = "\u1234";
    std::cout.write(arr, 3);
    std::cout << std::endl;
    return 0;
}
The output is going to be ሴ on two different lines in this case.
You must set the locale before you can use it, unless your system is already using it natively.
setlocale(LC_CTYPE,"");
To print the string, use wcout instead of cout.
#include <iostream>
#include <clocale>

int main()
{
    std::setlocale(LC_CTYPE, "");
    wchar_t a = L'\u1234';
    std::wcout << a << std::endl;
    return 0;
}
You have to use wide characters:
try with:
#include <iostream>
using namespace std;

int main()
{
    wchar_t a = L'\u1234';
    wcout << a << endl;
}
I ran the same code, which determines the number of characters in a wide-character string, on two platforms. The tested string contains ASCII, numbers, and Korean.
#include <iostream>
#include <string>
using namespace std;

template <class T, class trait>
void DumpCharacters(T& a)
{
    size_t length = a.size();
    for (size_t i = 0; i < length; i++)
    {
        trait n = a[i];
        cout << i << " => " << n << endl;
    }
    cout << endl;
}

int main(int argc, char* argv[])
{
    wstring u = L"123abc가1나1다";
    wcout << u << endl;
    DumpCharacters<wstring, wchar_t>(u);

    string s = "123abc가1나1다";
    cout << s << endl;
    DumpCharacters<string, char>(s);
    return 0;
}
The notable thing is that wstring.size() in Visual C++ 2010 returns the number of letters (11 characters), regardless of whether they are ASCII or international characters. However, it returns the byte count of the string data (17 bytes) in Xcode 4.2 on Mac OS X.
Please tell me how to get the character length of a wide-character string, not the byte count, in Xcode.
--- added on 12 Feb --
I found that wcslen() also returns 17 in Xcode. It returns 11 in VC++.
Here's the tested code:
const wchar_t *p = L"123abc가1나1다";
size_t plen = wcslen(p);
--- added on 18 Feb --
I found that LLVM 3.0 causes the wrong length. The problem was fixed after changing the compiler front end from LLVM 3.0 to 4.2.
See "wcslen() works differently in Xcode and VC++" for the details.
It is an error if the std::wstring version uses 17 characters: it should use only 11. Using recent SVN heads of gcc and clang, it uses 11 characters for the std::wstring and 17 characters for the std::string. I think this is what's expected.
Please note that the standard C++ library internally has a different idea of what a "character" is than what might be expected when multi-word encodings (e.g. UTF-8 for words of type char and UTF-16 for words with 16 bits) are used. Here is the first paragraph of the chapter describing string (21.1 [strings.general]):
This Clause describes components for manipulating sequences of any non-array POD (3.9) type. In this Clause such types are called char-like types , and objects of char-like types are called char-like objects or simply characters.
This basically means that when using Unicode the various functions won't pay attention to what constitutes a code point but will rather process the strings as sequences of words. This has severe impacts on what will happen, e.g., when producing substrings, because these may easily split multi-byte characters apart. Currently, the standard C++ library doesn't have any support for processing multi-byte encodings internally, because it is assumed that the translation from an encoding to characters is done when reading data (and correspondingly the other way when writing data). If you are processing multi-byte encoded strings internally, you need to be aware of this, as there is no support at all.
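A small sketch of the substring pitfall described above (it assumes the narrow string is UTF-8 encoded, as in the Xcode case discussed here):
#include <iostream>
#include <string>

int main() {
    std::string s = "123abc가1나1다";  // 17 bytes when the literal is UTF-8 encoded

    // substr() counts char elements (bytes), not code points, so this cuts
    // the three-byte sequence of '가' in the middle and yields invalid UTF-8.
    std::string cut = s.substr(0, 7);

    std::cout << cut << '\n';  // "123abc" followed by a broken character
}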
It is recognized that this state of affairs is actually a problem. For C++2011 the character type char32_t was added, which should support Unicode characters better than wchar_t (because Unicode uses 21 bits while wchar_t was allowed to support only 16 bits, a choice made on some platforms at a time when Unicode promised to use at most 16 bits). However, this still does not deal with combining characters. The C++ committee recognizes that this is a problem and that proper character processing in the standard C++ library would be nice to have, but so far nobody has come forward with a comprehensive proposal to address it (if you feel you want to propose something like this but you don't know how, please feel free to contact me and I will help you submit a proposal).
Xcode 4.2 apparently used UTF-8 (or something very similar) as the narrow multibyte encoding to represent your string literal "123abc가1나1다" when initializing string s. The UTF-8 representation of that string happens to be 17 bytes long.
The wide character representation (stored in u) is 11 wide characters. There are many ways to convert from narrow to wide encoding. Try this:
#include <iostream>
#include <clocale>
#include <cstdlib>
#include <string>

int main()
{
    std::wstring u = L"123abc가1나1다";
    std::cout << "Wide string contains " << u.size() << " characters\n";

    std::string s = "123abc가1나1다";
    std::cout << "Narrow string contains " << s.size() << " bytes\n";

    std::setlocale(LC_ALL, "");
    std::cout << "Which can be converted to "
              << std::mbstowcs(NULL, s.c_str(), s.size())
              << " wide characters in the current locale,\n";
}
Use .length(), not .size() to get the string length.
std::string and std::wstring are typedefs of std::basic_string templated on char and wchar_t. The size() member function returns the number of elements in the string - the number of char's or wchar_t's. "" and L"" don't deal with encodings.
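To see that element counting concretely, a short sketch (the narrow count assumes the literal ends up UTF-8 encoded, which is implementation-dependent):
#include <iostream>
#include <string>

int main() {
    std::string  narrow = "가";   // the code units of U+AC00 in the narrow execution charset
    std::wstring wide   = L"가";  // one wchar_t wherever wchar_t can hold the code point

    std::cout << narrow.size() << '\n';  // 3 with a UTF-8 narrow encoding
    std::cout << wide.size()   << '\n';  // 1 on typical platforms
}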