Here is the very simple C++ code:
char a00 = 'Z';
char a01 = '\u0444';
char a02[5] = {'H','e','l','l','o'};
char a03[] = {'W','o','r','l','d','\0','Z','Z'};
cout << "Simple char: " << a00
<< "\nUTF-8 char: " << a01
<< "\nFull char array: " << a02
<< "\n2nd in char array: " << a02[1]
<< "\nWith null character: " << a03 << endl;
My problem is that when NetBeans 8.1 shows the output of this program, it does not display the UTF-8 character.
The character should look like this: ф (see: link)
Instead, I get the following output:
(image)
I have tried adding -J-Dfile.encoding=UTF-8 to netbeans_default_options in the netbeans.conf file inside the etc folder. It made no difference.
UTF-8 is a multibyte character encoding: only the ASCII characters fit in one byte, and every other character occupies two to four bytes. A single char therefore cannot hold a character such as ф (U+0444), which needs two bytes.
You can store them in a string like this:
std::string s = "\u0444";
I am trying to print these data types, but I get very strange output instead of what I expect.
#include <iostream>
using namespace std;
int main(void)
{
char data1 = 0x11;
int data2 = 0XFFFFEEEE;
char data3 = 0x22;
short data4 = 0xABCD;
cout << data1 << endl;
cout << data2 << endl;
cout << data3 << endl;
cout << data4 << endl;
}
You most likely expect data1 and data3 to be printed as numbers. However, their type is char, so C++ (or C) treats them as characters: 0x11 is an ASCII control character (DC1) and 0x22 is the double-quote character (see an ASCII table).
If you want to print those characters as numbers, convert them to int before printing, like so (this cast syntax works in both C and C++):
cout << (int)data1 << endl;
Or more C++ style would be:
cout << static_cast<int>(data1) << endl;
If you want to display the numbers in hexadecimal, you need to change the default output base using the hex IO manipulator. Afterwards all output is done in hexadecimal. If you want to switch back to decimal output, use dec. See cppreference.com for details.
cout << hex << static_cast<int>(data1) << endl;
I am trying to read a string (ver) from a binary file. The number of characters (numc) in the string is also read from the file. This is how I read the file:
uint32_t numc;
inFile.read((char*)&numc, sizeof(numc));
char* ver = new char[numc];
inFile.read(ver, numc);
cout << "the version is: " << ver << endl;
What I get is the string that I expect plus some other symbols. How can I solve this problem?
A char* string is a nul-terminated sequence of characters. Your code ignores the nul termination part. Here's how it should look:
uint32_t numc;
inFile.read((char*)&numc, sizeof(numc));
char* ver = new char[numc + 1]; // allocate one extra character for the nul terminator
inFile.read(ver, numc);
ver[numc] = '\0'; // add the nul terminator
cout << "the version is: " << ver << endl;
Also sizeof(numc) not size(numc) although maybe that's a typo.
I want to write a small program that is able to display unicode characters not included in ASCII or LATIN_1 using wchar_t.
I'm using C++14 and I've configured my text editor to store characters according to the UTF-8 standard. I've tried using both char16_t and char32_t but the result stays the same.
inside main()
wchar_t spade = L'\u2660';
wchar_t heart = L'\u2665';
wchar_t diamond = L'\u2666';
wchar_t clover = L'\u2663';
cout << spade << endl;
cout << heart << endl;
cout << diamond << endl;
cout << clover << endl;
The code above outputs the decimal values 9824 9829 9830 9827, instead of the Unicode character symbols.
You need to use std::wcout to print Unicode characters.
std::cout does not have any overloads of operator<< that accept wchar_t, char16_t or char32_t as input. So the compiler promotes those values to int, which is why you see numeric values outputted.
You need to use std::wcout instead of std::cout when outputting wchar_t data.
Alternatively, if your console supports UTF-8, you can use std::cout with UTF-8 strings, instead of wide (UTF-16/32) strings.
const char *spade = u8"♠";
const char *heart = u8"♥";
const char *diamond = u8"♦";
const char *clover = u8"♣";
cout << spade << endl;
cout << heart << endl;
cout << diamond << endl;
cout << clover << endl;
I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.
Sample code:
#include <iostream>
#include "sys/types.h"
using namespace std;
int main()
{
std::basic_string<u_int16_t> ustr1(std::basic_string<u_int16_t>((u_int16_t*)"ยฤขฃ", 4));
std::basic_string<u_int16_t> ustr2(std::basic_string<u_int16_t>((u_int16_t*)"abcd", 4));
for (int i = 0; i < ustr1.length(); i++)
cout << "Char: " << ustr1[i] << endl;
for (int i = 0; i < ustr2.length(); i++)
cout << "Char: " << ustr2[i] << endl;
if (ustr1 == ustr2)
cout << "Strings are equal" << endl;
cout << "string length: " << ustr1.length() << "\t" << ustr2.length() << endl;
return 0;
}
The strings contain Thai characters and ascii characters, and the intent behind using basic_string<u_int16_t> is to facilitate storage of characters which cannot be accommodated within a single byte. The code was run on a Linux box, whose encoding type is en_US.UTF-8. The output is:
$ ./a.out
Char: 47328
Char: 57506
Char: 42168
Char: 47328
Char: 25185
Char: 25699
Char: 17152
Char: 24936
string length: 4 4
A few questions:
Do the character values in the output correspond to en_US.UTF-8 code points? If not, what are they?
Would the std::string operators like ==, !=, < etc., be able to work with Unicode code points? If so, would it be a mere comparison of each code points in the corresponding locations? Would std::map work on similar lines?
Would changing the locale to UTF-16 result in the strings getting stored as UTF-16 code points?
Thanks!
I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.
They don't.
std::string is a sequence of chars or bytes. It is not a "high-level" string taking any encoding into account. You must do that yourself, e.g. by using a library dedicated to that purpose such as ICU.
Switching from std::string (i.e. std::basic_string<char>) to std::basic_string<u_int16_t> doesn't change that; it just means you have a sequence of "wide" characters instead.
And std::map has nothing to do with this at all.
Further reading:
https://stackoverflow.com/a/17106065/560648
https://www.reddit.com/r/cpp/comments/1y3n33/why_does_c_seem_to_pretend_unicode_doesnt_exist/
I've been trying to understand the working principle of the operator<< of std::cout in C++. I've found that it prints UTF-8 symbols, for instance:
The simple program is:
#include <iostream>
unsigned char t[] = "ي";
unsigned char m0 = t[0];
unsigned char m1 = t[1];
int main()
{
std::cout << t << std::endl; // Prints ي
std::cout << (int)t[0] << std::endl; // Prints 217
std::cout << (int)t[1] << std::endl; // Prints 138
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
}
How does the terminal that produces output determine that it must interpret t as a single symbol ي, but not as two symbols � �?
You are dealing with two different types, unsigned char[] and unsigned char.
If you were to do sizeof on t, you'd find that it occupies three bytes, and strlen( t ) will return 2. On the other hand, m0 and m1 are single characters.
When you output an unsigned char[], it is converted to an unsigned char*, and the stream outputs all of the bytes until it encounters a '\0' (which is the third byte in t). When you output an unsigned char, the stream outputs just that byte. So in your first line, the output device receives 2 bytes, and then the end of line. In the last two, it receives 1 byte, and then the end of line. And that byte, followed by the end of line, is not a legal UTF-8 sequence, so the display device displays something to indicate that there was an error, or that it did not understand.
When working with UTF-8 (or any other multibyte encoding), you cannot extract single bytes from a string and expect them to have any real meaning.
The terminal is determining how to display the bytes you are feeding it. You are feeding it a newline (std::endl) between the two bytes of the 2-byte UTF-8-encoded Unicode character. Instead of this:
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
Try this:
std::cout << m0 << m1 << std::endl; // Prints ي
Why do m0 and m1 print as � in your original code?
Because your code is sending the bytes [217, 10, 138, 10], which is not valid UTF-8. (Assuming std::endl emits the \n character, value 10.)