I get a std::string which should contain bytes (an array of chars). I'm trying to display the bytes, but the first byte always contains weird data:
4294967169. How can this be a byte/char?!
void do_log(const std::string & data) {
    std::stringstream ss;
    ss << "do_log: ";
    const int count = std::min<int>(static_cast<int>(data.length()), 20); // first 20 bytes
    for (int i = 0; i < count; i++)
    {
        ss << std::setfill('0') << std::setw(2) << std::hex << (unsigned int)data.at(i) << " ";
    }
    log(ss.str());
}
Data I log is:
ffffff81 0a 53 3a 30 30 38 31 30 36 43 38
Why and how does this ffffff81 appear, if string::at should return a char?
When you write (unsigned int)data.at(i), data.at(i) is a char, which is then subject to integer promotion. If char is signed on your system, values greater than 127 are interpreted as negative numbers. The sign bit is preserved through the promotion (sign extension), giving such strange results.
You can verify if char is signed by looking at numeric_limits<char>::is_signed
You can easily solve the issue by getting rid of the bits added by the integer promotion, by ANDing the integer with 0xff: (static_cast<int>(data.at(i)) & 0xff)
Another way is to force your compiler to work with unsigned chars. For example, with option -funsigned-char on gcc or /J with MSVC.
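For example, here is a fixed version of the question's loop with that mask applied. This is only a sketch; it assumes the same log() helper the question uses:

#include <algorithm>
#include <iomanip>
#include <sstream>
#include <string>

void log(const std::string &); // the question's own logging helper (declaration assumed)

void do_log(const std::string & data) {
    std::stringstream ss;
    ss << "do_log: ";
    const int count = std::min<int>(static_cast<int>(data.length()), 20); // first 20 bytes
    for (int i = 0; i < count; i++)
    {
        // Mask off the bits added by sign extension, so 0x81 is printed as "81".
        ss << std::setfill('0') << std::setw(2) << std::hex
           << (static_cast<int>(data.at(i)) & 0xff) << " ";
    }
    log(ss.str());
}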
std::string contains chars, which are signed on your platform.
So a character with value 0x81 is interpreted as a negative number.
When you cast this character to an unsigned int, sign extension turns it into a very large number: 0xFFFFFF81.
Related
When I write this code:
#include <iostream>
using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x;
    return 0;
}
I noticed that the compiler gave me the output I expected, γεια σας, although the type of the array is char; that is, it should only accept ASCII characters.
So why didn't the compiler give an error?
Here's some code showing what C++ really does:
#include <iostream>
#include <iomanip>
#include <cstring>
using namespace std;

int main(){
    char x[] = "γεια σας";
    cout << x << endl;
    auto len = strlen(x);
    cout << "Length (in bytes): " << len << endl;
    for (size_t i = 0; i < len; i++)
        cout << "0x" << setw(2) << hex << static_cast<int>(static_cast<unsigned char>(x[i])) << ' ';
    cout << endl;
    return 0;
}
The output is:
γεια σας
Length (in bytes): 15
0xce 0xb3 0xce 0xb5 0xce 0xb9 0xce 0xb1 0x20 0xcf 0x83 0xce 0xb1 0xcf 0x82
So the string takes up 15 bytes and is encoded as UTF-8. UTF-8 is a Unicode encoding using between 1 and 4 bytes per character (in the sense of the smallest unit you can select with the text cursor). UTF-8 can be saved in a char array. Even though it's called char, it basically corresponds to a byte and not what we typically think of as a character.
What you have got with 99.99% likelihood is Unicode code points stored in UTF-8 format. Each code point is turned into one to four chars.
Unicode code points in the ASCII range are turned into one ASCII byte, from 0x00 to 0x7f. 2048 code points are translated to two bytes with the binary pattern 110x xxxx 10yy yyyy, 65536 are translated to three bytes with the pattern 1110 xxxx 10yy yyyy 10zz zzzz, and the rest become four bytes: 1111 0xxx 10yy yyyy 10zz zzzz 10uu uuuu.
Most C and C++ string functions work just fine with UTF-8. An exception is strncpy or strncat, which could create an incomplete code point. The old interview problem "reverse the characters in a string" becomes more complicated, because reversing the bytes inside a code point produces nonsense.
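Not part of the answer above, just a quick illustration of those byte patterns: a sketch that counts code points by skipping continuation bytes (bytes of the form 10yy yyyy), assuming the source file is saved as UTF-8:

#include <cstring>
#include <iostream>

int main() {
    const char utf8[] = "γεια σας";                    // 15 bytes of UTF-8
    std::size_t codePoints = 0;
    for (std::size_t i = 0; i < std::strlen(utf8); ++i) {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        if ((b & 0xC0) != 0x80)                        // not a 10yy yyyy continuation byte
            ++codePoints;
    }
    std::cout << codePoints << '\n';                   // prints 8
}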
Although the type of the array is char, that is, it should only accept ASCII characters.
You've assumed wrongly.
Unicode has several transformation formats. One popular such format is UTF-8. The code units of UTF-8 are 8 bits wide, as implied by the name. It is always possible to use char to represent the code units of UTF-8, because char is guaranteed to be at least 8 bits wide.
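As a quick check, here is a small sketch (it assumes the source file itself is saved as UTF-8):

#include <iostream>

int main() {
    const char s[] = "γ";                   // one Greek letter
    std::cout << sizeof(s) - 1 << '\n';     // prints 2: two 8-bit code units, each held in a char
}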
I have a weird input file with all kinds of control characters like nulls. I want to remove all control characters from this Windows-1252 encoded text file, but if I do this:
std::string test = "tést";
for (int i = 0; i < test.length(); i++)
{
    if (test[i] < 32) test[i] = 32; // change all control characters into spaces
}
It will change the é into a space as well.
So if you have a string like this, encoded in Windows-1252:
std::string test="tést";
The hex values would be:
t é s t
74 E9 73 74
See https://en.wikipedia.org/wiki/ASCII and https://en.wikipedia.org/wiki/Windows-1252
test[0] equals decimal 116 (= 0x74), but apparently with é/0xE9, test[1] does not equal the decimal value 233.
So how can you recognize that é properly?
32 is a signed integer literal, so the comparison of the char with it is performed as a signed comparison: 0xE9 (-23) < 32, which returns true.
Using an unsigned literal, that is 32u, makes the comparison be performed on unsigned values: 0xE9 is then treated as a large positive value, so 0xE9 < 32 returns false.
Replace:
if (test[i]<32) test[i]=32;
By:
if (test[i]<32u) test[i]=32u;
And you should get the expected result.
Test this here:
https://onlinegdb.com/BJ8tj0kbd
Note: you can check that char is signed with the following code:
#include <limits>
...
std::cout << std::numeric_limits<char>::is_signed << std::endl;
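The effect of the two literals can be seen in a small sketch (assuming char is signed on your platform, which you can check as above):

#include <iostream>

int main() {
    char c = static_cast<char>(0xE9);        // é in Windows-1252
    std::cout << (c < 32) << '\n';           // prints 1: signed comparison, -23 < 32
    std::cout << (c < 32u) << '\n';          // prints 0: unsigned comparison, the value becomes a huge positive number
}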
Change
if (test[i]<32)
to
if (test[i] >= 0 && test[i] < 32)
char is often a signed type, and 0xE9 is a negative value in a signed eight-bit integer.
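For example, the question's loop with that guard applied (just a sketch):

std::string test = "tést";                  // Windows-1252: 74 E9 73 74
for (int i = 0; i < test.length(); i++)
{
    // 0xE9 (é) is negative when char is signed, so only non-negative
    // values below 32 are treated as control characters.
    if (test[i] >= 0 && test[i] < 32) test[i] = 32;
}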
I need to read the value of a character as a number and find the corresponding hexadecimal value for it.
#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    char c = 197;
    cout << hex << uppercase << setw(2) << setfill('0') << (short)c << endl;
}
Output:
FFC5
Expected output:
C5
The problem is that when you write char c = 197 you are overflowing the char type, producing a negative number (-59). From there it doesn't matter what conversion you make to larger types; it will remain a negative number.
To fully understand why, you must know how two's complement works.
Basically, -59 and 197 have the same binary representation: 1100 0101. Depending on the data type, it is interpreted one way or the other. When you print it using hexadecimal format, the binary representation (the actual value stored in memory) is the one used, producing C5.
When the char is converted to a short/unsigned short, the -59 is converted to its short/unsigned short representation, which is 1111 1111 1100 0101 (FFC5) in both cases.
The correct way to do it would be to store the initial value (197) in a variable whose data type is able to represent it (unsigned char, short, unsigned short, ...) from the very beginning.
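For example, a minimal sketch that keeps the value in an unsigned char from the start:

#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    unsigned char c = 197;   // unsigned char can represent 0..255, so 197 keeps its value
    cout << hex << uppercase << setw(2) << setfill('0') << (unsigned short)c << endl; // prints C5
}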
I have been using stringstream to convert my data and it has been working great except for one case.
I am subtracting two integer values, and the result can end up negative or positive. I take that value and send it to my stringstream object using std::hex; it also gets dumped to std::cout.
My problem is that the field for this value can only be 3 digits long, and when I get a negative value it is padded with too many leading F's. I can't seem to get any std functions to help (setw, setfill, ...).
Can anyone point me in the right direction?
Example:
Value - Value = -9, so what I want is FF9, but what I get is FFFFFFF9.
My code to send the value to my stringstream object ss:
ss << hex << value - LocationCounter;
You are trying to output a value that is 12 bits max in size. There is no 12-bit data type, so the closest you can get is to use a 16-bit signed type with its high 4 bits set to 0. For instance, calculate your desired value into an 8-bit signed type first (which will reduce its effective range to -128 .. 127), then sign-extend it to a 16-bit signed type, zero the high 4 bits, and finally output the result as hex:
signed char diff = (signed char)(value - LocationCounter);
// the setw() and setfill() are used to pad
// values that are 8 bits or fewer in size...
ss << hex << setw(3) << setfill('0') << (((signed short)diff) & 0x0fff);
To read the value back, read the 12-bit hex into a signed short and then truncate its value to a signed char:
signed short tmp;
ss >> hex >> tmp;
signed char diff = (signed char)tmp;
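Putting those pieces together, a self-contained sketch; the value and LocationCounter numbers are made up for illustration:

#include <iostream>
#include <iomanip>
#include <sstream>
using namespace std;

int main() {
    int value = 10, LocationCounter = 19;               // hypothetical inputs; the difference is -9
    signed char diff = (signed char)(value - LocationCounter);

    stringstream ss;
    ss << hex << setw(3) << setfill('0') << (((signed short)diff) & 0x0fff);
    cout << ss.str() << endl;                           // prints ff7, the low 12 bits of -9

    signed short tmp;
    ss >> hex >> tmp;                                   // read the 12-bit hex value back
    signed char restored = (signed char)tmp;            // truncate to recover the signed value
    cout << (int)restored << endl;                      // prints -9
}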
When an array is declared as unsigned char, initialized with values in the range 0x00-0xff, and printed using cout, I get garbage values like these:
+ ( �
~ � � �
� O
� � <
May I know how to use a single byte for the numbers and yet be able to use cout?
Because it's an unsigned char, std::cout passes each value to the terminal and it is displayed as a character (well, it attempts to, anyway: the values are outside the range of valid printable characters for the character set you're using).
Cast to unsigned int when outputting with cout.
Char types are displayed as characters by default. If you want them displayed as integers, you will have to convert them first:
unsigned char value = 42;
std::cout << static_cast<unsigned int>(value);
Those aren't garbage values. Those are what the characters represent. To print them as ints, simply cast to unsigned int at output time:
cout << (unsigned int) some_char;
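For a whole array, the same cast in a loop works; a sketch with made-up sample bytes:

#include <iostream>
#include <iomanip>
using namespace std;

int main() {
    unsigned char data[] = { 0x00, 0x2b, 0x7e, 0xc5 };  // hypothetical sample values
    for (unsigned char byte : data)
        cout << setw(2) << setfill('0') << hex << (unsigned int) byte << ' ';
    cout << endl;                                       // prints: 00 2b 7e c5
}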