UTF-8 symbol written to the terminal output - c++

I've been trying to understand how the operator<< of std::cout works in C++. I've found that it prints UTF-8 symbols. For instance, take this simple program:
#include <iostream>
unsigned char t[] = "ي";
unsigned char m0 = t[0];
unsigned char m1 = t[1];
int main()
{
std::cout << t << std::endl; // Prints ي
std::cout << (int)t[0] << std::endl; // Prints 217
std::cout << (int)t[1] << std::endl; // Prints 138
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
}
How does the terminal that produces output determine that it must interpret t as a single symbol ي, but not as two symbols � �?

You are dealing with two different types, unsigned char[] and unsigned char.
If you were to do sizeof on t, you'd find that it occupied
three bytes, and strlen( t ) will return 2. On the other
hand, m0 and m1 are single characters.
When you output an unsigned char[], it is converted to an
unsigned char*, and the stream outputs all of the bytes until
it encounters a '\0' (which is the third byte in t). When
you output an unsigned char, the stream outputs just that
byte. So in your first line, the output device receives
2 bytes, and then the end of line. In the last two, it receives
1 byte, and then the end of line. And that byte, followed by
the end of line, is not a legal UTF-8 character, so the display
device displays something to indicate that there was an error,
or that it did not understand.
When working with UTF-8 (or any other multibyte encoding), you
cannot extract single bytes from a string and expect them to
have any real meaning.
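To make that last point concrete, here is a minimal sketch that counts the characters in a UTF-8 string by skipping continuation bytes (bytes of the form 10xxxxxx). The name count_code_points is my own, not a standard facility, and the sketch assumes the input is already well-formed UTF-8; it counts code points, not what a user perceives as characters.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string (assumed well-formed).
// Every code point contributes exactly one byte that is NOT a
// continuation byte, so counting those bytes is enough.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // not 10xxxxxx
            ++count;
        }
    }
    return count;
}
```

For the two bytes from the question (217 and 138, i.e. "\xD9\x8A"), this reports one code point even though the string's size() is 2.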

The terminal is determining how to display the bytes you are feeding it. You are feeding it a newline (std::endl) between the two bytes of the 2-byte UTF-8-encoded Unicode character. Instead of this:
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
Try this:
std::cout << m0 << m1 << std::endl; // Prints ي
Why do m0 and m1 print as � in your original code?
Because your code is sending the bytes [217, 10, 138, 10], which is not interpretable as UTF-8. (Assuming std::endl corresponds to the \n character, value 10.)
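If you want to check this yourself, here is a rough sketch of a structural UTF-8 check. The name looks_like_utf8 is hypothetical, and the check is deliberately simplified: it only verifies that each lead byte is followed by the right number of continuation bytes, and it does not reject overlong encodings or surrogates the way a full validator would.

```cpp
#include <cstddef>
#include <string>

// Simplified structural check: each lead byte must be followed by the
// number of continuation bytes (10xxxxxx) that its bit pattern announces.
bool looks_like_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                extra = 0;  // 0xxxxxxx (ASCII)
        else if ((lead & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i - 1 < extra) return false;  // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With this, "\xD9\x8A\n" (both bytes, then the newline) passes, while "\xD9\n\x8A\n" (a newline between the two bytes) fails.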


Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout`

I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
std::wcout << std::hex << (int)foo[i] << " ";
std::wcout << (char)foo[i];
}
Characteristics of output I get:
The first print shows: ???
The loop prints the hex for the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed
Environmental variable $LANG is set to default en_US.UTF-8
In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
Short explanation why:
First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe another).
Now let's have a look at this simple program:
#include <iostream>
#include <string>
int main()
{
std::string foo = "ψA"; // character 'A' is just control sample
std::wstring bar = L"ψA"; // --
for (int i = 0; i < foo.length(); ++i) {
std::cout << static_cast<int>(foo[i]) << " ";
}
std::cout << std::endl;
for (int i = 0; i < bar.length(); ++i) {
std::cout << static_cast<int>(bar[i]) << " "; // the values are ints, so std::cout is fine; avoids mixing cout and wcout
}
std::cout << std::endl;
return 0;
}
The output is:
-49 -120 65
968 65
What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 each correspond to 'ψ'.
With char, the character 'ψ' actually takes two chars. With wchar_t, it's just one wchar_t.
Let's also check sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
A byte on my machine has the standard 8 bits. char is 1 byte (8 bits), while wchar_t is 4 bytes (32 bits).
UTF-8 operates on, nomen omen, code units of 8 bits. There is a fixed-length UTF-32 encoding that uses exactly 32 bits (4 bytes) per Unicode code point, but it's UTF-8 that Linux uses.
Ergo, the terminal expects to get those two negative values in order to print the character 'ψ', not one value that is way above the ASCII table (whose codes are defined up to 127, half of the possible char values).
That's why std::cout << char(-49) << char(-120); will also print ψ.
But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.
The character was already encoded differently; the values in there are different, and a simple cast won't be enough to convert them.
And as I've shown, the size of char is 1 byte and the size of wchar_t is 4 bytes. You can safely cast upward, not downward.
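To show what an actual conversion involves (as opposed to a cast), here is a minimal sketch that encodes one code point as UTF-8. The function name encode_utf8 is my own, and the sketch assumes the value is a valid code point no greater than 0x10FFFF; it does not reject surrogates.

```cpp
#include <cassert>
#include <string>

// Encode a single Unicode code point as a UTF-8 byte sequence.
// Assumption: cp <= 0x10FFFF; surrogates are not filtered out.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

encode_utf8(968) yields the two bytes 0xCF 0x88, which are exactly the -49 and -120 seen above when read as signed chars. Likewise 0xC6 (Æ) becomes 0xC3 0x86, which is why truncating it to a single char prints nothing useful on a UTF-8 terminal.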

C++ convert string of bytes to ints from file

I'm completely new to C++, so I guess this might be a very trivial question. If this is a duplicate of an already answered question (I bet it is...), please point me to that answer!
I have a file with the following cut from hexdump myfile -n 4:
00000000 02 00 04 00 ... |....|
00000004
My problem/confusion comes when trying to read these values and convert them to ints ( [0200]_hex --> [512]_dec and [0400]_hex --> [1024]_dec).
A minimum working example based on this answer:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(void){
char fn[] = "myfile";
ifstream file;
file.open(fn, ios::in | ios::binary);
string fbuff = " ";
file.read((char *)&fbuff[0], 2);
cout << "fbuff: " << fbuff << endl;
// works
string a = "0x0200";
cout << "a: " << a << endl;
cout << "stoi(a): " << stoi(a, nullptr, 16) << endl;
// doesn't work
string b = "\x02\x00";
cout << "b: " << b << endl;
cout << "stoi(b): " << stoi(b, nullptr, 16) << endl;
// doesn't work
cout << "stoi(fbuff): " << stoi(fbuff, nullptr, 16) << endl;
file.close();
return(0);
}
What I can't get my head around is the difference between a and b: the former is defined with 0x (which works perfectly) and the latter is defined with \x and breaks stoi. My guess is that what's being read from the file is in the \x format, based on the output when running the code within sublime-text3 (below), and every example I've seen only deals with inputs formatted like 0x0200.
// Output from sublime, which just runs g++ file.cpp && ./file.cpp
fbuff: <0x02> <0x00>
a: 0x0200
stoi(a): 512
b:
terminate called after throwing an instance of 'std::invalid_argument'
what(): stoi
[Finished in 0.8s with exit code -6]
Is there a simple way to read two, or more, bytes, group them and convert into a proper short/int/long?
The literal string "0x0200" is really an array of seven bytes:
0x30 0x78 0x30 0x32 0x30 0x30 0x00
The first six are ASCII encoded characters for '0', 'x', '0', '2', '0' and '0'. The last is the null-terminator that all strings have.
The literal string "\x02\x00" is really an array of three bytes:
0x02 0x00 0x00
That is not really what is normally called a "string", but rather just a collection of bytes. It's nothing that std::stoi can parse as a number written in text, and as std::stoi can't parse it, the function will throw an exception.
You might want to get a couple of good books to read and learn more about strings.
Note: This answer assumes ASCII encoding and 8-bit bytes, which is by far the most common.
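As for grouping the raw bytes into integers, one way is to combine them with shifts rather than going through std::stoi. Below is a minimal sketch; the names from_big_endian and read_u16_be are my own, and it assumes the first byte of each pair is the most significant one, which is what the expected results in the question (02 00 giving 512) imply.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file

// Combine two bytes into one 16-bit value, first byte most significant.
std::uint16_t from_big_endian(unsigned char high, unsigned char low)
{
    return static_cast<std::uint16_t>((high << 8) | low);
}

// Read the next two bytes from a binary stream.
// Returns false if fewer than two bytes were available.
bool read_u16_be(std::istream& in, std::uint16_t& value)
{
    char raw[2];
    if (!in.read(raw, 2)) {
        return false;
    }
    value = from_big_endian(static_cast<unsigned char>(raw[0]),
                            static_cast<unsigned char>(raw[1]));
    return true;
}
```

Called twice on a std::ifstream opened with std::ios::binary on the file from the question, this should give 512 and then 1024. If the file were little-endian instead, the two arguments to from_big_endian would simply be swapped.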

String and integer multiplication in C++

I wrote the following code
#include <iostream>
#define circleArea(r) (3.1415*r*r)
int main() {
std::cout << "Hello, World!" << std::endl;
std::cout << circleArea('10') << std::endl;
std::cout << 3.1415*'10'*'10' << std::endl;
std::cout << 3.1415*10*10 << std::endl;
return 0;
}
The output was the following
Hello, World!
4.98111e+08
4.98111e+08
314.15
My question is why the value of 3.1415 * '10'*'10' is 4.98111e+08. I thought that when I multiply a string by a number, the number would be converted to a string, yielding a string. Am I missing something here?
EDIT: Rephrasing the question based on comments: I understand that single and double quotes are not the same. So '1' represents a single character. But what does '10' represent?
'10' is a multicharacter literal; note well the use of single quotation marks. It has a type int, and its value is implementation defined. Cf. "10" which is a literal of type const char[3], with the final element of that array set to NUL.
Typically its value is '1' * 256 + '0', which in ASCII (a common encoding supported by C++) is 49 * 256 + 48 which is 12592.
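The arithmetic can be reproduced without relying on the implementation-defined literal itself. This is a small sketch with hypothetical helper names (typical_two_char_value, circle_area); it mirrors the "first character in the high byte" scheme described above, which is what common compilers do but is not guaranteed, and the concrete number assumes an ASCII execution character set.

```cpp
// Value that many compilers assign to a two-character literal such as '10':
// the first character ends up in the higher byte. This is only the typical
// scheme; the standard leaves the value implementation-defined.
constexpr int typical_two_char_value(char first, char second)
{
    return static_cast<unsigned char>(first) * 256
         + static_cast<unsigned char>(second);
}

// Same formula as the circleArea macro in the question, as a function.
constexpr double circle_area(double r)
{
    return 3.1415 * r * r;
}
```

With ASCII, typical_two_char_value('1', '0') is 12592, and circle_area(12592) is roughly 4.98111e+08, matching the output in the question.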

Using memcpy trying to copy one struct into a char[] buffer

#include <cstring>
#include <iostream>
#define ECHOMAX 100
struct tDataPacket
{
int iPacket_number;
char sData[ECHOMAX];
};
int main () {
tDataPacket packet;
packet.iPacket_number=10;
strcpy(packet.sData,"Hello world");
char buffer[sizeof(tDataPacket)];
memcpy(buffer,&packet.iPacket_number,sizeof(int));
memcpy(buffer+sizeof(int),packet.sData,ECHOMAX);
std::cout<<"Buffer = "<<buffer<<"END";
return 0;
}
In the above code I am trying to pack my structure in a char[] buffer so that I can send it to a UDP socket. But the output of the program is "" string. So nothing is getting copied to 'buffer'. Am I missing anything??
When you copy the int, at least one of the first "n" characters of the buffer will be zero (where "n" is the size of an int on your platform). For example for a 4-byte int:
0x00 0x00 0x00 0x0a or 0x0a 0x00 0x00 0x00
depending on the endianness of your processor.
Printing out the zero will have the effect of terminating the output string.
You have no code to sensibly print the contents of the buffer, so you are expecting this to work by magic. The stream's operator << function expects a pointer to a C-style string, which the buffer isn't.
It's "" because int iPacket_number is probably laid out in memory as:
0x00 0x00 0x00 0x0a
which is an empty string (nul-terminator in the first character).
Firstly you probably want some sort of marshalling so that the on-the-wire representation is well established and portable (think endian differences between platforms).
Secondly you shouldn't need to "print" the resulting string; it makes no sense.
Thirdly you want unsigned char, not (signed) char.
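As an illustration of the marshalling idea, here is a minimal sketch. The wire layout (4-byte packet number, most significant byte first, followed by the ECHOMAX data bytes) and the function name marshal are assumptions of mine, not an established protocol, and the sketch assumes int is no wider than 32 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t ECHOMAX = 100;

struct tDataPacket
{
    int iPacket_number;
    char sData[ECHOMAX];
};

// Hypothetical wire format: packet number as 4 bytes in network byte order
// (big-endian), then the ECHOMAX bytes of sData unchanged.
std::vector<unsigned char> marshal(const tDataPacket& packet)
{
    const std::uint32_t n = static_cast<std::uint32_t>(packet.iPacket_number);
    std::vector<unsigned char> wire(4 + ECHOMAX);
    wire[0] = static_cast<unsigned char>(n >> 24);
    wire[1] = static_cast<unsigned char>(n >> 16);
    wire[2] = static_cast<unsigned char>(n >> 8);
    wire[3] = static_cast<unsigned char>(n);
    std::memcpy(wire.data() + 4, packet.sData, ECHOMAX);
    return wire;
}
```

Because the byte order is fixed by the shifts, the result is the same on little-endian and big-endian machines. The vector's data() and size() can then be handed to whatever send call you use.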
You can't print an integer as text, because it's not text.
You will need to do a loop (or something like that) to print the actual contents of the buffer:
std::cout << "Buffer=";
for(size_t i = 0; i < sizeof(tDataPacket); i++)
{
// Convert via unsigned char first so bytes >= 0x80 are not sign-extended.
std::cout << std::hex << (unsigned int)(unsigned char)buffer[i] << " ";
if ((i & 0xf) == 0xf) std::cout << std::endl; // Newline every 16.
}
std::cout << "END" << std::endl;
You can do this, but displaying binary data like that isn't very meaningful:
std::cout<<"Buffer = ";
for (auto c : buffer)
{
std::cout<< c;
}
std::cout <<"END";

Reading from a file byte per byte C++

I'm trying to write a program in C++ that will take 2 files and compare them byte by byte.
I was looking at the following post
Reading binary istream byte by byte
I'm not really sure about parts of this. When using get(char& c) it reads in a char and stores it in c. Is this storing as, say, 0x0D, or is it storing the actual char value "c" (or whatever)?
If I wish to use this method to compare two files byte by byte, would I just use get(char& c) on both and then compare the chars that were read, or do I need to cast to byte?
(I figured starting a new post would be better since the original is quite an old one)
chars are nothing but a "special type of storage" (excuse the expression) for integers, in memory there is no difference between 'A' and the decimal value 65 (ASCII assumed).
c will in other words contain the read byte from the file.
To answer your added question: no, there is no cast required; doing c1 == c2 will be just fine.
char c1 = 'A', c2 = 97, c3 = 0x42;
std::cout << c1 << " " << c2 << " " << c3 << std::endl;
std::cout << +c1 << " " << +c2 << " " << +c3 << std::endl;
/* Writing +c1 above promotes c1 to int; it has the same effect as writing (int)c1 or the more idiomatic C++ static_cast<int>(c1). */
output:
A a B
65 97 66
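Putting that together for the file comparison, a minimal sketch could look like this. The function names same_bytes and same_file_contents are my own; the files are opened in binary mode so that no newline translation gets in the way.

```cpp
#include <fstream>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without files
#include <string>

// True when both streams yield exactly the same sequence of bytes.
bool same_bytes(std::istream& a, std::istream& b)
{
    char c1, c2;
    while (true) {
        const bool got1 = static_cast<bool>(a.get(c1));
        const bool got2 = static_cast<bool>(b.get(c2));
        if (!got1 || !got2) {
            return got1 == got2;  // equal only if both ended at the same time
        }
        if (c1 != c2) {           // plain char comparison, no cast needed
            return false;
        }
    }
}

// Convenience wrapper: false if either file cannot be opened.
bool same_file_contents(const std::string& path1, const std::string& path2)
{
    std::ifstream f1(path1, std::ios::binary);
    std::ifstream f2(path2, std::ios::binary);
    return f1 && f2 && same_bytes(f1, f2);
}
```

Reading one byte at a time keeps the sketch close to the get(char&) call discussed above; for large files, reading into buffers and comparing those would likely be faster.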
Ehm,
a char contains 1 Byte
The interpretation of that value is indeed depending on you, the programmer.
If you print that byte to the cout stream, it is interpreted as an ASCII code, and therefore if your char was 0x63 it will print 'c' on the screen.
If you just use the value, you can use it as you like.
char c = 0x63;
c++;
// c is now: 0x64
Note that you can also input decimals