Printing Latin characters in Linux terminal using `std::wstring` and `std::wcout` - c++

I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.
Trying to debug, I have something like the following:
std::wstring foo = L"ÆØÅ";
std::wcout << foo;

for(int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}
Characteristics of output I get:
The first print shows: ???
The loop prints the hex for the three characters as c6 d8 c5
When foo[i] is cast to char (or wchar_t), nothing is printed
The environment variable $LANG is set to the default en_US.UTF-8
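
For reference, here is a minimal sketch of the locale-based workaround that makes std::wcout usable for such strings; it assumes the en_US.UTF-8 locale mentioned above is actually installed:
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Pick up the locale from the environment ($LANG), e.g. en_US.UTF-8,
    // so that wchar_t output is converted to UTF-8 bytes for the terminal.
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    std::wstring foo = L"ÆØÅ";
    std::wcout << foo << std::endl;   // now prints ÆØÅ instead of ???
    return 0;
}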

In the conclusion of the answer I linked (which I still recommend reading) we can find:
When I should use std::wstring over std::string?
On Linux? Almost never, unless you use a toolkit/framework.
Short explanation why:
First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to e.g. Windows, where files have one encoding and cmd.exe uses another).
Now let's have a look at a simple program:
#include <iostream>

int main()
{
    std::string foo = "ψA";    // character 'A' is just control sample
    std::wstring bar = L"ψA";  // --

    for (int i = 0; i < foo.length(); ++i) {
        std::cout << static_cast<int>(foo[i]) << " ";
    }
    std::cout << std::endl;

    for (int i = 0; i < bar.length(); ++i) {
        std::wcout << static_cast<int>(bar[i]) << " ";
    }
    std::cout << std::endl;

    return 0;
}
The output is:
-49 -120 65
968 65
What does it tell us? 65 is the ASCII code of the character 'A'; it means that -49 -120 and 968 correspond to 'ψ'.
In the case of char, the character 'ψ' actually takes two chars. In the case of wchar_t, it's just one wchar_t.
Let's also check sizes of those types:
std::cout << "sizeof(char) : " << sizeof(char) << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
Output:
sizeof(char) : 1
sizeof(wchar_t) : 4
A byte on my machine has the standard 8 bits: char is 1 byte (8 bits), while wchar_t is 4 bytes (32 bits).
UTF-8 operates on, nomen omen, code units of 8 bits. There is a fixed-length UTF-32 encoding that uses exactly 32 bits (4 bytes) per Unicode code point, but it's UTF-8 which Linux uses.
Ergo, the terminal expects to receive those two (negative, when viewed as signed char) values to print the character 'ψ', not one value way above the ASCII table (ASCII codes are defined only up to 127 - half of char's possible values).
That's why std::cout << char(-49) << char(-120); will also print ψ.
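
The same two bytes can also be written as a narrow string literal; a tiny sketch of the same effect (0xCF 0x88 is simply the unsigned view of -49 -120, and the UTF-8 encoding of U+03C8):
#include <iostream>

int main()
{
    std::cout << char(-49) << char(-120) << std::endl;   // prints ψ, byte by byte
    std::cout << "\xCF\x88" << std::endl;                // prints ψ again from the same two bytes
    return 0;
}
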
But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.
The character was already encoded differently; different values are stored there, and a simple cast won't be enough to convert them.
And as I've shown, the size of char is 1 byte and that of wchar_t is 4 bytes. You can safely cast upward, not downward.
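
If you do end up with a std::wstring on Linux and want to print it through the narrow std::cout, one approach is to convert it to UTF-8 first; a sketch using std::wstring_convert (standard since C++11, though deprecated in C++17):
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wstring wide = L"ψA";
    // Convert the UTF-32 wchar_t data to a UTF-8 encoded std::string.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(wide);
    std::cout << utf8 << std::endl;   // the terminal receives plain UTF-8 bytes
    return 0;
}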

Related

How do I find 8-bit substrings in strings with ascii values exceeding 127?

I'm struggling to work through an issue I'm running into trying to work with bitwise substrings in strings. In the example below, this simple little function does what it is supposed to for values 0-127, but fails if I attempt to work with ASCII values greater than 127. I assume this is because the string itself is signed. However, if I make it unsigned, I not only run into issues because apparently strlen() doesn't operate on unsigned strings, but I get a warning that it is a multi-char constant. Why the multiple chars? I think I have tried everything. Is there something I could do to make this work on values > 127?
#include <iostream>
#include <cstring>

const unsigned char DEF_KEY_MINOR = 0xAD;
const char *buffer = { "jhsi≠uhdfiwuui73" };

size_t isOctetInString(const char *buffer, const unsigned char octet)
{
    size_t out = 0;
    for (size_t i = 0; i < strlen(buffer); ++i)
    {
        if(!(buffer[i] ^ octet))
        {
            out = i;
            break;
        }
    }
    return out;
}

int main() {
    std::cout << isOctetInString(buffer, 'i') << "\n";
    std::cout << isOctetInString(buffer, 0x69) << "\n";
    std::cout << isOctetInString(buffer, '≠') << "\n";
    std::cout << isOctetInString(buffer, 0xAD) << "\n";
    return 0;
}
output
3
3
0
0
Edit
Based on comments I have tried a few different things including casting the octet and buffer to unsigned int, and wchar_t, and removing the unsigned char from the octet parameter type. With either of these the outputs I am getting are
3
3
6
0
I even tried substituting the ≠ char in the buffer with
const char *buffer = {'0xAD', "jhsiuhdfiwuui73"};
however I still get warnings about multibyte characters.
As I said before, my main concern is to be able to find the bit sequence 0xAD within a string, but I am seeing now that using ascii characters or any construct making use of the ascii character set will cause issues. Since 0xAD is only 8 bits, there must be a way of doing this. Does anyone know a method for doing so?
Sign extension -- buffer[i]^octet is really (int)buffer[i] ^ (int)octet, and the sign extension happens when a negative char value is converted to int. If you want buffer[] to be unsigned char, you have to define it that way.
There are multiple sources of confusion in your problem:
searching for an unsigned char value in a string can be done with strchr() which converts both the int argument and the characters in the char array to unsigned char for the comparison.
your function uses if(!(buffer[i] ^ octet)) to detect a match, which does not work if char is signed because the expression is evaluated as if(!((int)buffer[i] ^ (int)octet)) and the sign extension only occurs for buffer[i]. A simple solution (see the corrected sketch after this answer) is:
if ((unsigned char)buffer[i] == octet)
Note that the character ≠ might be encoded as multiple bytes on your target system, both in the source code and in the terminal handling: for example, the code point of ≠ is 8800 (0x2260), which is encoded as 0xE2 0x89 0xA0 in UTF-8. The syntax '≠' would then pose a problem. I'm not sure how C++ deals with multi-byte character constants, but C accepts them with an implementation-specific value.
To see how your system handles non-ASCII bytes, you could add these lines to your main() function:
std::cout << "≠ uses " << sizeof("≠") - 1 << " bytes\n";
std::cout << "'≠' has the value " << (int)'≠' << "\n";
or more explicitly:
printf("≠ is encoded as");
for (size_t i = 0; i < sizeof("≠") - 1; i++) {
    printf(" %02hhX", "≠"[i]);
}
printf(" and '≠' has a value of 0x%X\n", '≠');
On my linux system, the latter outputs:
≠ is encoded as E2 89 A0 and '≠' has a value of 0xE289A0
On my MacBook, compilation fails with this error:
notequal.c:8:48: error: character too large for enclosing character literal type
printf(" and '≠' has a value of 0x%X\n", '≠');
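
Putting the sign-extension fix together, here is a sketch of the corrected function from the question; the only change is comparing through unsigned char, and the 0xAD byte is embedded with a \xAD escape so the test buffer does not depend on how ≠ is encoded:
#include <iostream>
#include <cstring>

// Returns the index of the first byte in buffer equal to octet,
// or 0 if it is not found (keeping the question's convention).
size_t isOctetInString(const char *buffer, const unsigned char octet)
{
    for (size_t i = 0; i < strlen(buffer); ++i)
    {
        if ((unsigned char)buffer[i] == octet)   // compare as unsigned char: no sign extension
            return i;
    }
    return 0;
}

int main()
{
    const char *buffer = "jhsi\xADuhdfiwuui73";           // 0xAD placed directly as a raw byte
    std::cout << isOctetInString(buffer, 'i') << "\n";    // 3
    std::cout << isOctetInString(buffer, 0x69) << "\n";   // 3
    std::cout << isOctetInString(buffer, 0xAD) << "\n";   // 4
    return 0;
}
Alternatively, strchr(buffer, 0xAD) from <cstring> performs the same byte search on a NUL-terminated buffer.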

UTF-8 symbol written to the terminal output

I've been trying to understand the working principle of the operator<< of std::cout in C++. I've found that it prints UTF-8 symbols, for instance:
The simple program is:
#include <iostream>

unsigned char t[] = "ي";
unsigned char m0 = t[0];
unsigned char m1 = t[1];

int main()
{
    std::cout << t << std::endl;         // Prints ي
    std::cout << (int)t[0] << std::endl; // Prints 217
    std::cout << (int)t[1] << std::endl; // Prints 138
    std::cout << m0 << std::endl;        // Prints �
    std::cout << m1 << std::endl;        // Prints �
}
How does the terminal that produces output determine that it must interpret t as a single symbol ي, but not as two symbols � �?
You are dealing with two different types, unsigned char[] and unsigned char.
If you were to do sizeof on t, you'd find that it occupies three bytes, and strlen(t) will return 2. On the other hand, m0 and m1 are single characters.

When you output an unsigned char[], it is converted to an unsigned char*, and the stream outputs all of the bytes until it encounters a '\0' (which is the third byte in t). When you output an unsigned char, the stream outputs just that byte. So in your first line, the output device receives 2 bytes and then the end of line. In the last two, it receives 1 byte and then the end of line. And that single byte, followed by the end of line, is not a legal UTF-8 sequence, so the display device shows something to indicate that there was an error, or that it did not understand.

When working with UTF-8 (or any other multibyte encoding), you cannot extract single bytes from a string and expect them to have any real meaning.
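
To build on that last point, here is a small sketch that walks a UTF-8 std::string code point by code point rather than byte by byte, simply by skipping continuation bytes (those whose top two bits are 10); it is an illustration of the idea, not a full validator:
#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string s = "ي and ψ";          // mixes 1-byte and 2-byte UTF-8 characters
    std::size_t codePoints = 0;
    for (unsigned char byte : s) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++codePoints;
    }
    std::cout << s.size() << " bytes, " << codePoints << " code points\n";
    return 0;
}
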
The terminal is determining how to display the bytes you are feeding it. You are feeding it a newline (std::endl) between the two bytes of the 2-byte UTF-8-encoded Unicode character. Instead of this:
std::cout << m0 << std::endl; // Prints �
std::cout << m1 << std::endl; // Prints �
Try this:
std::cout << m0 << m1 << std::endl; // Prints ي
Why do m0 and m1 print as � in your original code?
Because your code is sending the bytes [217, 10, 138, 10], which is not interpretable as UTF-8. (std::endl corresponds to the \n character, value 10.)

Initializing an unsigned char array with hex values in C++

I would like to initialize an unsigned char array with 16 hex values. However, I don't seem to know how to properly initialize/access those values. When I try to access them as I might want to intuitively, I'm getting no value at all.
This is my output
The program was run with the following command: 4
Please be a value! -----> p
Here's some plaintext
when run with the code below -
int main(int argc, char** argv)
{
    int n;
    if (argc > 1) {
        n = std::stof(argv[1]);
    } else {
        std::cerr << "Not enough arguments\n";
        return 1;
    }

    char buff[100];
    sprintf(buff, "The program was run with the following command: %d", n);
    std::cout << buff << std::endl;

    unsigned char plaintext[16] =
        {0x0f, 0xb0, 0xc0, 0x0f,
         0xa0, 0xa0, 0xa0, 0xa0,
         0x00, 0x00, 0xa0, 0xa0,
         0x00, 0x00, 0x00, 0x00};

    unsigned char test = plaintext[1] ^ plaintext[2];
    std::cout << "Please be a value! -----> " << test << std::endl;
    std::cout << "Here's some plaintext " << plaintext[3] << std::endl;

    return 0;
}
By way of context, this is part of a group project for school. We are ultimately trying to implement the Serpent cipher, but keep on getting tripped up by unsigned char arrays. Our project specification says that we must have two functions that take what would be Byte arrays in Java. I assume the closest relative in C++ is an unsigned char[]. Otherwise I would use vector. Elsewhere in the code I've implemented a setKey function which takes an unsigned char array, packs its values into 4 long long ints (the key needs to be 256 bits) and performs various bit-shifting and xor operations on those ints to generate the keys necessary for the cryptographic algorithm. Hope that's enough background on what I'm looking to do. I'm guessing I'm just overlooking some basic C++ functionality here. Thanks for any and all help!
A char is an 8-bit value capable of storing -128 <= n <= +127, frequently used to store character representations in different encodings; commonly - in Western, Roman-alphabet installations - char is used to hold ASCII or UTF-8 encoded values. 'Encoded' means the symbols/letters in the character set have been assigned numeric values. Think of the periodic table as an encoding of elements, so that 'H' (Hydrogen) is encoded as 1 and Germanium as 32. In the ASCII (and UTF-8) tables, position 32 represents the character we call "space".
When you use operator << on a char value, the default behavior is to assume you are passing it a character encoding, e.g. an ASCII character code. If you do
char c = 'z';
char d = 122;
char e = 0x7A;
char f = '\x7a';
std::cout << c << d << e << f << "\n";
All four assignments are equivalent. 'z' is a shortcut/syntactic-sugar for char(122), 0x7A is hex for 122, and '\x7a' is an escape that forms the ascii character with a value of 0x7a or 122 - i.e. z.
Where many new programmers go wrong is that they do this:
char n = 8;
std::cout << n << endl;
this does not print "8", it prints the ASCII character at position 8 in the ASCII table (the backspace control character).
Think for a moment:
char n = 8; // stores the value 8
char n = 'a'; // what does this store?
char n = '8'; // why is this different than the first line?
Lets rewind a moment: when you store 120 in a variable, it can represent the ASCII character 'x', but ultimately what is being stored is just the numeric value 120, plain and simple.
Specifically: when you pass 122 to a function that will ultimately use it to look up a font entry from a character set using the Latin1, ISO-8859-1, UTF-8 or similar encodings, then 122 means 'z'.
At the end of the day, char is just one of the standard integer value types, it can store values -128 <= n <= +127, it can trivially be promoted to a short, int, long or long long, etc, etc.
While it is generally used to denote characters, it also frequently gets used as a way of saying "I'm only storing very small values" (such as integer percentages).
int incoming = 5000;
int outgoing = 4000;
char percent = char(outgoing * 100 / incoming);
If you want to print the numeric value, you simply need to promote it to a different value type:
std::cout << (unsigned int)test << "\n";
std::cout << unsigned(test) << "\n";
or the preferred C++ way
std::cout << static_cast<unsigned int>(test) << "\n";
I think (it's not completely clear what you are asking) that the answer is as simple as this
std::cout << "Please be a value! -----> " << static_cast<unsigned>(test) << std::endl;
If you want to output the numeric value of a char or unsigned char, you have to cast it to an int or unsigned first.
Not surprisingly, by default, chars are output as characters not integers.
BTW this funky code
char buff[100];
sprintf(buff,"The program was run with the following command: %d",n);
std::cout << buff << std::endl;
is more simply written as
std::cout << "The program was run with the following command: " << n << std::endl;
std::cout and std::cin always treat a char variable as a character.
If you want to input or output it as an int, you must do so manually, like below:
std::cin >> int_var; c = int_var;
std::cout << (int)c;
If using scanf or printf, there is no such problem, as the format specifier ("%d", "%c", "%s") tells how to convert the input buffer (integer, char, string).
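
For example, a quick sketch of the printf route using the values from the question:
#include <cstdio>

int main()
{
    unsigned char test = 0xb0 ^ 0xc0;                 // plaintext[1] ^ plaintext[2] from the question
    printf("Please be a value! -----> %d\n", test);   // %d prints the numeric value 112
    printf("Or in hex: %02x\n", (unsigned)test);      // prints 70
    return 0;
}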

unsigned char limited to 127 on osx lion?

I am facing a strange issue, working on my mac osx lion (under xcode4/clang, though it is reproducible with gcc4.2).
It seems that I cannot assign any value above 127 to an unsigned char variable. So, when I assign
v = (unsigned char) 156;
or, simply
std::cout << (unsigned char) 231 << std::endl;
my program does not produce any output.
When I run this code
std::cout << "Unsigned chars range up to " << UCHAR_MAX << std::endl;
I get the following output:
Unsigned chars range up to 255
However, when I run something like this, the program generates output up to some different arbitrary value (such as c = 114, c = 252, etc.) each time.
for (unsigned char c = 0; c < CHAR_MAX; c++)
    std::cout << "c = " << 2*c << std::endl;
Changing the CHAR_MAX to UCHAR_MAX, the program ends without an output again :(
Thanks in advance
cout is converting the numeric value to a character from the character set (Well, it's attempting to ... when you don't see anything it's not a valid character for your charset, and technically it's the terminal that's deciding this).
Cast it to unsigned int instead.
Edit to add: ildjarn makes a very valid point in his comment to your question; if you ran this in the debugger you'd see that the value was indeed what you expected.
What symbol are you expecting to see to represent character (unsigned char)231? On most systems, these are extended characters that need special terminal settings to be displayed as anything coherent if even visible.
Try this:
unsigned char testChar = (unsigned char)231;
unsigned int testInt = (unsigned int)testChar;
std::cout << testInt << std::endl;
The value of an unsigned char is not limited to 127; however, standard ASCII is only 7 bits (128 values). Any value above 127 does not represent any character (unless you use some kind of extended ASCII), so nothing is printed.
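
A related shorthand for the same idea, not mentioned above, is the unary plus, which promotes the value to int before it reaches operator<<:
#include <iostream>

int main()
{
    unsigned char v = 231;
    std::cout << +v << std::endl;                            // unary plus promotes to int, prints 231
    std::cout << static_cast<unsigned int>(v) << std::endl;  // equivalent, more explicit
    return 0;
}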

How to convert a char array to a list of HEX values?

What I would like to be able to do is convert a char array (which may be binary data) to a list of HEX values of the form: ab 0d 12 f4 etc.
I tried doing this with
lHexStream << "<" << std::hex << std::setw (2) << character << ">";
but this did not work since I would get the data printing out as:
<ffe1><2f><ffb5><54>< 6><1b><27><46><ffd9><75><34><1b><ffaa><ffa2><2f><ff90><23><72><61><ff93><ffd9><60><2d><22><57>
Note here that some of the values have 4 HEX digits in them, e.g. <ffe1>.
What I would be looking for is what they have in Wireshark, where they represent a char array (or binary data) in a HEX format like:
08 0a 12 0f
where each character value is represented by just 2 HEX characters of the form shown above.
It looks like byte values greater than 0x80 are being sign-extended to short (I don't know why it's stopping at short, but that's not important right now). Try this:
lHexStream << '<' << std::hex << std::setw(2) << std::setfill('0')
           << static_cast<unsigned int>(static_cast<unsigned char>(character))
           << '>';
You may be able to remove the outer cast but I wouldn't rely on it.
EDIT: added std::setfill call, which you need to get <06> instead of < 6>. Hat tip to jkerian; I hardly ever use iostreams myself. This would be so much shorter with fprintf:
fprintf(ihexfp, "<%02x>", (unsigned char)character);
As Zack mentions, the 4-hex-digit values appear because values of 128 and above are interpreted as negative (the base type is signed char), and that 'negative value' is then sign-extended as it is expanded to a signed short.
Personally, I found the following to work fairly well:
char *myString = inputString;
for (int i = 0; i < length; i++)
    std::cout << std::hex << std::setw(2) << std::setfill('0')
              << static_cast<unsigned int>(static_cast<unsigned char>(myString[i]))  // go through unsigned char to avoid sign extension
              << " ";
I think the problem is that the binary data is being interpreted as a multi-byte encoding when you're reading the characters. This is evidenced by the fact that each of the 4-character hex codes in your example has the high bit set in the lower byte.
You probably want to read the binary stream in ascii mode.
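
For completeness, here is a self-contained sketch of the Wireshark-style dump the question asks for (two hex digits per byte, space separated); it assumes the data lives in a std::string, which may hold arbitrary bytes including '\0':
#include <iomanip>
#include <iostream>
#include <string>

void dumpHex(const std::string& data)
{
    for (unsigned char byte : data) {             // iterate as unsigned char to avoid sign extension
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<unsigned int>(byte) << ' ';
    }
    std::cout << std::dec << std::endl;
}

int main()
{
    dumpHex(std::string("\x08\x0a\x12\x0f", 4));  // prints: 08 0a 12 0f
    return 0;
}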