std::string and std::map operations on Unicode string - c++

I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.
Sample code:
#include <iostream>
#include "sys/types.h"
using namespace std;
int main()
{
std::basic_string<u_int16_t> ustr1(std::basic_string<u_int16_t>((u_int16_t*)"ยฤขฃ", 4));
std::basic_string<u_int16_t> ustr2(std::basic_string<u_int16_t>((u_int16_t*)"abcd", 4));
for (int i = 0; i < ustr1.length(); i++)
cout << "Char: " << ustr1[i] << endl;
for (int i = 0; i < ustr2.length(); i++)
cout << "Char: " << ustr2[i] << endl;
if (ustr1 == ustr2)
cout << "Strings are equal" << endl;
cout << "string length: " << ustr1.length() << "\t" << ustr2.length() << endl;
return 0;
}
The strings contain Thai characters and ascii characters, and the intent behind using basic_string<u_int16_t> is to facilitate storage of characters which cannot be accommodated within a single byte. The code was run on a Linux box, whose encoding type is en_US.UTF-8. The output is:
$ ./a.out
Char: 47328
Char: 57506
Char: 42168
Char: 47328
Char: 25185
Char: 25699
Char: 17152
Char: 24936
string length: 4 4
A few questions:
Do the character values in the output correspond to en_US.UTF-8 code points? If not, what are they?
Would the std::string operators like ==, !=, < etc., be able to work with Unicode code points? If so, would it be a mere comparison of each code points in the corresponding locations? Would std::map work on similar lines?
Would changing the locale to UTF-16 result in the strings getting stored as UTF-16 code points?
Thanks!

I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.
They don't.
std::string is a sequence of chars or bytes. It is not a "high-level" string taking any encoding into account. You must do that yourself, e.g. by using a library dedicated to that purpose such as ICU.
Switching from std::string (i.e. std::basic_string<char>) to std::basic_string<u_int16_t> doesn't change that; it just means you have a sequence of "wide" characters instead.
And std::map has nothing to do with this at all.
Further reading:
https://stackoverflow.com/a/17106065/560648
https://www.reddit.com/r/cpp/comments/1y3n33/why_does_c_seem_to_pretend_unicode_doesnt_exist/

Related

Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout`

I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
std::wcout << std::hex << (int)foo[i] << " ";
std::wcout << (char)foo[i];
}
Characteristics of output I get:
The first print shows: ???
The loop prints the hex for the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed
Environmental variable $LANG is set to default en_US.UTF-8
In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
Short explanation why:
First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe another).
Now let's have a look at such simple program:
#include <iostream>
int main()
{
std::string foo = "ψA"; // character 'A' is just control sample
std::wstring bar = L"ψA"; // --
for (int i = 0; i < foo.length(); ++i) {
std::cout << static_cast<int>(foo[i]) << " ";
}
std::cout << std::endl;
for (int i = 0; i < bar.length(); ++i) {
std::wcout << static_cast<int>(bar[i]) << " ";
}
std::cout << std::endl;
return 0;
}
The output is:
-49 -120 65
968 65
What does it tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 correspond to 'ψ'.
In case of char character 'ψ' takes actually two chars. In case of wchar_t it's just one wchar_t.
Let's also check sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
1 byte on my machine has standard 8 bits. char has 1 byte (8 bits), while wchar_t has 4 bytes (32 bits).
UTF-8 operates on, nomen omen, 8-bit code units. There is a fixed-length UTF-32 encoding that uses exactly 32 bits (4 bytes) per code point, but it's UTF-8 which Linux uses.
Ergo, terminal expects to get those two negatively signed values to print character 'ψ', not one value which is way above ASCII table (codes are defined up to number 127 - half of char possible values).
That's why std::cout << char(-49) << char(-120); will also print ψ.
But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.
The character was already encoded differently; the stored values are different, so a simple cast won't be enough to convert them.
And as I've shown, the size of char is 1 byte and the size of wchar_t is 4 bytes. You can safely cast upward, not downward.

Unicode chars shown as decimal numbers instead of symbols, how do I fix this?

I want to write a small program that is able to display unicode characters not included in ASCII or LATIN_1 using wchar_t.
I'm using C++14 and I've configured my text editor to store characters according to the UTF-8 standard. I've tried using both char16_t and char32_t but the result stays the same.
inside main()
wchar_t spade = L'\u2660';
wchar_t heart = L'\u2665';
wchar_t diamond = L'\u2666';
wchar_t clover = L'\u2663';
cout << spade << endl;
cout << heart << endl;
cout << diamond << endl;
cout << clover << endl;
The code above outputs the decimal values 9824 9829 9830 9827, instead of the unicode character symbols.
you need to use std::wcout to print Unicode characters
std::cout does not have any overloads of operator<< that accept wchar_t, char16_t or char32_t as input. So the compiler promotes those values to int, which is why you see numeric values outputted.
You need to use std::wcout instead of std::cout when outputting wchar_t data.
Alternatively, if your console supports UTF-8, you can use std::cout with UTF-8 strings, instead of wide (UTF-16/32) strings.
const char *spade = u8"♠";
const char *heart = u8"♥";
const char *diamond = u8"♦";
const char *clover = u8"♣";
cout << spade << endl;
cout << heart << endl;
cout << diamond << endl;
cout << clover << endl;

Why do I obtain this strange character?

Why does my C++ program create the strange character shown below in the pictures? The picture on the left with the black background is from the terminal. The picture on the right with the white background is from the output file. Before, it was a "\v" now it changes to some sort of astrological symbol or symbol to denote males. 0_o This makes no sense to me. What am I missing? How can I have my program output just a backslash v?
Please see my code below:
// SplitActivitiesFoo.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <vector>
#include <fstream>
using namespace std;
int main()
{
string s = "foo:bar-this-is-more_text#\venus \"some more text here to read.\"";
vector<string> first_part;
fstream outfile;
outfile.open("out.foobar");
for (int i = 0; i < s.size(); ++i){
cout << "s[" << i << "]: " << s[i] << endl;
outfile << s[i] << endl;
}
return 0;
}
Also, assume that I do not want to modify my string 's' in this case. I want to be able to parse each character of the string and work around the strange character somehow. This is because in the actual program the string will be read in from a file and parsed then sent to another function. I guess I could figure out a way to programmatically add backslashes...
How can I have my program output just a backslash v?
If you want a backslash, then you need to escape it: "#\\venus".
This is required because a backslash denotes that the next character should be interpreted as something special (note that you were already using this when you wanted double-quotes). So the compiler has no way of knowing you actually wanted a backslash unless you tell it.
A literal backslash character therefore has the syntax \\. This is the case in both string literals ("\\") and character literals ('\\').
Why does my C++ program create the strange character shown below in the picture?
Your string contains the \v control character (vertical tab), and the way it's displayed is dependent on your terminal and font. It looks like your terminal is using symbols from the traditional MSDOS code page.
I found an image for you here, which shows exactly that symbol for the vertical tab (vt) entry at value 11 (0x0b):
Also, assume that I do not want to modify my string 's' in this case. I want to be able to parse each character of the string and work around the strange character somehow.
Well, I just saw you add the above part to your question. Now you're in difficult territory. Because your string literal does not actually contain the character v or any backslashes. It only appears that way in code. As already said, the compiler has interpreted those characters and substituted them for you.
If you insist on printing v instead of a vertical tab for some crazy reason that is hopefully not related to an XY Problem, then you can construct a lookup-table for every character and then replace undesirables with something else:
char lookup[256];
std::iota( lookup, lookup + 256, 0 ); // Using iota from <numeric>
lookup['\v'] = 'v';
for (int i = 0; i < s.size(); ++i)
{
unsigned char c = s[i]; // index as unsigned so bytes above 127 stay in range
cout << "s[" << i << "]: " << lookup[c] << endl;
outfile << lookup[c] << endl;
}
Now, this won't print the backslashes. To undo the string further check out std::iscntrl. It's locale-dependent, but you could utilise it. Or just something naive like:
const char *lookup[256] = { 0 };
lookup['\f'] = "\\f";
lookup['\n'] = "\\n";
lookup['\r'] = "\\r";
lookup['\t'] = "\\t";
lookup['\v'] = "\\v";
lookup['\"'] = "\\\"";
// Maybe add other controls such as 0x0E => "\\x0e" ...
for (int i = 0; i < s.size(); ++i)
{
// index as unsigned so bytes above 127 stay in range
const char * x = lookup[static_cast<unsigned char>(s[i])];
if( x ) {
cout << "s[" << i << "]: " << x << endl;
outfile << x << endl;
} else {
cout << "s[" << i << "]: " << s[i] << endl;
outfile << s[i] << endl;
}
}
Be aware there is no way to correctly reconstruct the escaped string as it originally appeared in code, because there are multiple ways to escape characters. Including ordinary characters.
Most likely the terminal that you are using cannot decipher the vertical space code "\v", thus printing something else. On my terminal it prints:
foo:bar-this-is-more_text#
enus "some more text here to read."
To print the "\v" change your code to:
string s = "foo:bar-this-is-more_text#\\venus \"some more text here to read.\"";
What am I missing? How can I have my program output just a backslash v?
You are escaping the letter v. To print backslash and v, escape the backslash.
That is, print double backslash and a v.
\\v

Trying to output everything inside an exe file

I'm trying to output the plaintext contents of this .exe file. It's got plaintext stuff in it like "Changing the code in this way will not affect the quality of the resulting optimized code." all the stuff microsoft puts into .exe files. When I run the following code I get the output of M Z E followed by a heart and a diamond. What am I doing wrong?
ifstream file;
char inputCharacter;
file.open("test.exe", ios::binary);
while ((inputCharacter = file.get()) != EOF)
{
cout << inputCharacter << "\n";
}
file.close();
I would use something like std::isprint to make sure the character is printable and not some weird control code before printing it.
Something like this:
#include <cctype>
#include <fstream>
#include <iostream>
int main()
{
std::ifstream file("test.exe", std::ios::binary);
char c;
while(file.get(c)) // don't loop on EOF
{
if(std::isprint(static_cast<unsigned char>(c))) // check if is printable; the cast avoids undefined behaviour for negative char values
std::cout << c;
}
}
You have opened the stream in binary, which is good for the intended purpose. However, you print every byte as it is: some of these characters are not printable, giving weird output.
Potential solutions:
If you want to print the content of an exe, you'll get more non-printable chars than printable ones. So one approach could be to print the hex value instead:
while ( file.get(inputCharacter ) )
{
cout << setw(2) << setfill('0') << hex << (int)(inputCharacter&0xff) << "\n";
}
Or you could use the debugger approach of displaying the hex value, and then display the char if it's printable or '.' if not:
while (file.get(inputCharacter)) {
cout << setw(2) << setfill('0') << hex << (int)(inputCharacter&0xff)<<" ";
if (isprint(inputCharacter & 0xff))
cout << inputCharacter << "\n";
else cout << ".\n";
}
Well, for the sake of ergonomy, if the exe file contains any real exe, you'd better opt for displaying several chars on each line ;-)
A binary file is a collection of bytes. A byte has a range of values 0..255. The printable characters that can be safely "printed" form a much narrower range. Assuming the most basic ASCII encoding:
32..63
64..95
96..126
plus, maybe, some higher than 128, if your codepage has them
see ascii table.
Every character that falls out of that range may, at least:
print out as invisible
print out as some weird trash
be in fact a control character that will change settings of your terminal
Some terminals support "end of text" character and will simply stop printing any text afterwards. Maybe you hit that.
I'd say, if you are interested only in text, then print only the printable characters and ignore the others. Or, if you want everything, then maybe write them out in hex form instead?
This worked:
ifstream file;
char inputCharacter;
string Result;
file.open("test.exe", ios::binary);
while (file.get(inputCharacter))
{
if ((inputCharacter > 31) && (inputCharacter < 127))
Result += inputCharacter;
}
cout << Result << endl;
cout << "These are the ascii characters in the exe file" << endl;
file.close();

Right Justifying output stream in C++

I'm working in C++. I'm given a 10 digit string (char array) that may or may not have 3 dashes in it (making it up to 13 characters). Is there a built in way with the stream to right justify it?
How would I go about printing to the stream right justified? Is there a built in function/way to do this, or do I need to pad 3 spaces into the beginning of the character array?
I'm dealing with ostream to be specific, not sure if that matters.
You need to use std::setw in conjunction with std::right.
#include <iostream>
#include <iomanip>
int main(void)
{
std::cout << std::right << std::setw(13) << "foobar" << std::endl;
return 0;
}
Yes. You can use setw() to set the width. The default justification is right-justified, and the default padding is space, so this will add spaces to the left.
stream << setw(13) << yourString
See: setw(). You'll need to include <iomanip>.
See "setw" and "right" in your favorite C++ (iostream) reference for further details:
cout << setw(13) << right << your_string;
Not a unique answer, but an additional "gotcha" that I discovered and is too long for a comment...
The width set by setw applies only to the next item written, yourString in this case. Anything additional, like << yourString2, is not padded to that width, because the width resets after each output (the justification and fill character do persist). For instance if I want to right-justify two strings and pad with asterisks (easier to see) to a total width of 24, this doesn't work:
std::ostringstream oss;
std::string h = "hello ";
std::string t = "there";
oss << std::setfill('*') << std::right << std::setw(24) << h << t;
std::cout << oss.str() << std::endl;
// this outputs
******************hello there
That will apply the correct padding to "hello " only (that's 18 asterisks, making the entire width including the trailing space 24 long), and then "there" gets tacked on at the end, making the end result longer than I wanted. Instead, I wanted
*************hello there
Not sure if there's another way (you could simply redo the formatting I'm sure), but I found it easiest to simply combine the two strings into one:
std::ostringstream oss;
std::string h = "hello ";
std::string t = "there";
// + concatenates t onto h, creating one string
oss << std::setfill('*') << std::right << std::setw(24) << h + t;
std::cout << oss.str() << std::endl;
// this outputs
*************hello there
The whole output is 24 long like I wanted.