I have a problem which I do not understand. I add characters to a standard string. When I take them out, the value printed is not what I expected.
#include <iostream>
#include <string>
using namespace std;

int main (int argc, char *argv[])
{
    string x;
    unsigned char y = 0x89, z = 0x76;
    x += y;
    x += z;
    cout << hex << (int) x[0] << " " << (int) x[1] << endl;
}
The output:
ffffff89 76
What I expected:
89 76
Any ideas as to what is happening here?
And how do I fix it?
The string operator [] yields a char, which is a signed type on your platform, so when you cast it to an int for output the result is a signed value too.
The value 0x89 stored in a char is negative, and therefore the int is as well. Thus you see the output you described.
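The answer does not spell the fix out, but based on this explanation one way to avoid it is to widen through unsigned char first; a minimal self-contained sketch (assuming an 8-bit char):
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string x;
    unsigned char y = 0x89, z = 0x76;
    x += y;
    x += z;
    // Convert to unsigned char before widening to int so the value cannot be sign-extended.
    cout << hex << (int)(unsigned char) x[0] << " " << (int)(unsigned char) x[1] << endl;  // 89 76
}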
Most likely char is signed on your platform, so 0x89 becomes negative when it is represented as a char.
You have to make sure that the string has unsigned char as its value_type, so this should work:
typedef basic_string<unsigned char> ustring; //string of unsigned char!
ustring ux;
ux += y;
ux += z;
cout << hex << (int) ux[0] << " " << (int) ux[1] << endl;
It prints what you expect:
89 76
Online demo : http://www.ideone.com/HLvcv
You have to account for the fact that char may be signed. If you promote it to int directly, the signed value will be preserved. Rather, you first have to convert it to the unsigned type of the same width (i.e. unsigned char) to get the desired value, and then promote that value to an integer type to get the correct formatted printing.
Putting it all together, you want something like this:
std::cout << (int)(unsigned char)(x[0]);
Or, using the C++-style cast:
std::cout << static_cast<int>(static_cast<unsigned char>(x[0]));
The number 0x89 is 137 in the decimal system. It exceeds the signed char maximum of 127 and so becomes a negative number, which is why you see those ffffff characters. You could simply insert a (unsigned char) cast after the (int) cast and you would get the required result.
-Sandip
A char stores a numeric value from 0 to 255, but there also seems to be an implication that this type should be printed as a letter rather than a number by default.
This code produces 22:
int Bits = 0xE250;
signed int Test = ((Bits & 0x3F00) >> 8);
std::cout << "Test: " << Test <<std::endl; // 22
But I don't need Test to be 4 bytes long. One byte is enough. But if I do this:
int Bits = 0xE250;
signed char Test = ((Bits & 0x3F00) >> 8);
std::cout << "Test: " << Test <<std::endl; // "
I get " (a double quote symbol). Because char doesn't just make it an 8 bit variable, it also says, "this number represents a character".
Is there some way to specify a variable that is 8 bits long, like char, but also says, "this is meant as a number"?
I know I can cast or convert char, but I'd like to just use a number type to begin with. It there a better choice? Is it better to use short int even though it's twice the size needed?
Cast your character variable to int before printing:
signed char Test = ((Bits & 0x3F00) >> 8);
std::cout << "Test: " << (int) Test << std::endl;
I have a simple program converting a dynamically allocated char array to its hex string representation.
#include <iostream>
#include <sstream>
#include <iomanip>
#include <string>
using namespace std;
int main(int argc, char const* argv[]) {
    int length = 2;
    char *buf = new char[length];
    buf[0] = 0xFC;
    buf[1] = 0x01;

    stringstream ss;
    ss << hex << setfill('0');
    for (int i = 0; i < length; ++i) {
        ss << std::hex << std::setfill('0') << std::setw(2) << (int) buf[i] << " ";
    }
    string mystr = ss.str();
    cout << mystr << endl;

    delete[] buf;   // free the dynamically allocated buffer
}
Output:
fffffffc 01
Expected output:
fc 01
Why is this happening? What are those ffffff before fc? This happens only on certain bytes, as you can see the 0x01 is formatted correctly.
Three things you need to know to understand what's happening:
1. char can be either signed or unsigned; it's implementation (compiler) specific.
2. When converting a small signed type to a larger signed type (e.g. a signed char to an int), the value is sign-extended.
3. Negative values are most commonly stored using the two's complement system, where the highest bit of a value defines whether the value is negative (bit is set) or not (bit is clear).
What happens here is that char seems to be signed, and 0xfc is considered a negative value, and when you convert 0xfc to an int it will be sign-extended to 0xfffffffc.
To solve it, explicitly use unsigned char and convert to unsigned int.
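A minimal sketch of that fix, reusing ss, buf and length from the question's loop:
for (int i = 0; i < length; ++i) {
    // Going through unsigned char first means the widening conversion cannot sign-extend.
    ss << std::hex << std::setfill('0') << std::setw(2)
       << static_cast<unsigned int>(static_cast<unsigned char>(buf[i])) << " ";
}
// ss.str() is now "fc 01 "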
This is called "sign extension".
char is evidently a signed type on your platform, so 0xfc becomes a negative value when you force it into a char.
Its decimal value is -4.
When you cast it to int, sign extension preserves that value.
(That is what happens here: (int) buf[i].)
On your system, int is 4 bytes, so you get the extra bytes filled with ff.
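Not part of this answer as written, but another common idiom for the same problem (assuming an 8-bit char) is to mask off everything above the low byte after the promotion:
ss << std::hex << std::setw(2) << std::setfill('0') << ((int) buf[i] & 0xff) << " ";  // fc for buf[0], 01 for buf[1]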
When I do:
cout << std::hex << (short)('\x3A') << std::endl;
cout << std::hex << (short)('\x8C') << std::endl;
I expect the following output:
3a
8c
but instead, I have:
3a
ff8c
I suppose that this is due to the way char—and more precisely a signed char—is stored in memory: everything below 0x80 would not be prefixed; the value 0x80 and above, on the other hand, would be prefixed with 0xFF.
When given a signed char, how do I get a hexadecimal representation of the actual character inside it? In other words, how do I get 0x3A for \x3A, and 0x8C for \x8C?
I don't think conditional logic is well suited here. While I can subtract 0xFF00 from the resulting short when needed, it doesn't seem very clear.
Your output might make more sense if you looked at it in decimal instead of hexadecimal:
std::cout << std::dec << (short)('\x3A') << std::endl;
std::cout << std::dec << (short)('\x8C') << std::endl;
output:
58
-116
The values were cast to short, so we are (most commonly) dealing with 16-bit values. The 16-bit binary representation of -116 is 1111 1111 1000 1100, which is FF8C in hexadecimal. So the output is correct given what you requested (on systems where char is a signed type). It is not so much the way the char is stored in memory as the way the bits are interpreted: as a signed value, the 8-bit pattern 1000 1100 represents -116, and the conversion to short is supposed to preserve this value rather than the bit pattern.
Your desired output of a hexadecimal 8C corresponds (for a short) to the decimal value 140. To get this value out of 8 bits, the value has to be interpreted as an unsigned 8-bit value (since the largest signed 8-bit value is 127). So the data needs to be interpreted as an unsigned char before it gets expanded to some flavor of short. For a character literal like in the example code, this would look like the following.
std::cout << std::hex << (unsigned short)(unsigned char)('\x3A') << std::endl;
std::cout << std::hex << (unsigned short)(unsigned char)('\x8C') << std::endl;
Most likely, the real code would have variables instead of character literals. If that is the case, then rather than casting to an unsigned char, it might be more convenient to declare the variable to be of unsigned char type. Which is possibly the type you should be using anyway, based on the fact that you want to see its hexadecimal value. Not definitively, but this does suggest that the value is seen simply as a byte of data rather than as a number, and that suggests that an unsigned type is appropriate. Have you looked at std::byte?
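For what it's worth, a minimal C++17 sketch of the std::byte idea mentioned above:
#include <cstddef>
#include <iostream>

int main()
{
    std::byte b{0x8C};                    // a raw byte with no arithmetic meaning attached
    std::cout << std::hex
              << std::to_integer<int>(b)  // converts to int without sign extension
              << std::endl;               // prints 8c
}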
One other nifty thought to throw out: the following also gives the desired output as a reasonable facsimile of using an unsigned char variable.
#include <iostream>
unsigned char operator "" _u (char c) { return c; } // Suffix for unsigned char literals
int main()
{
    std::cout << std::hex << (unsigned short)('\x3A'_u) << std::endl;
    std::cout << std::hex << (unsigned short)('\x8C'_u) << std::endl;
}
A more straightforward approach is to cast a signed char to an unsigned char. In other words, this:
cout << std::hex << (short)(unsigned char)('\x3A') << std::endl;
cout << std::hex << (short)(unsigned char)('\x8C') << std::endl;
produces the expected result:
3a
8c
Not sure this is particularly clear, though.
I'm making the transition from C to C++11 and am trying to learn more about casting.
At the end of this question you see a small program which asks for a number as input and then shows it as a number and as a character. Then it is cast to a char, and after that I cast it back to a size_t.
When I give 200 as input, the first cout prints 200, but the second cout prints 18446744073709551560.
How do I make it print 200 again? Am I using the wrong cast? I have already tried different casts such as dynamic_cast and reinterpret_cast.
#include <iostream>
using namespace std;

int main() {
    size_t value;
    cout << "Give a number between 32 and 255: ";
    cin >> value;
    cout << "unsigned value: " << value << ", as character: `" << static_cast<char>(value) << "\'\n";
    char ch = static_cast<char>(value);
    cout << "unsigned value: " << static_cast<size_t>(ch) << ", as character: `" << ch << "\'\n";
}
size_t is unsigned, plain char's signed-ness is implementation-defined.
Casting 200 to a signed char will yield a negative result as 200 is larger than CHAR_MAX which is 127 for the most common situation, an 8-bit char. (Advanced note - this conversion is also implementation-defined but for all practical purposes you can assume a negative result; in fact usually -56).
Casting that negative value back to an unsigned (but wider) integer type will yield a rather large value, because unsigned arithmetic wraps around.
You might cast the value to an unsigned char first (yielding the expected small positive value), then cast that to the wider unsigned type.
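A minimal sketch of that two-step cast applied to the last output line of the program in the question (reusing its ch variable):
cout << "unsigned value: "
     << static_cast<size_t>(static_cast<unsigned char>(ch))  // 200 again, no sign extension
     << ", as character: `" << ch << "\'\n";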
Most compilers have a switch that lets you toggle plain char over to unsigned so you could experiment with that. When you come to write portable code, try to write code that will work correctly in both cases!
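For example (the file name is just a placeholder; check your own toolchain's documentation):
g++ -funsigned-char main.cpp    (GCC/Clang: treat plain char as unsigned)
cl /J main.cpp                  (MSVC: treat plain char as unsigned)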
Consider this code:
#include <iostream>

typedef union
{
    int integer_;
    char mem_[4];
} MemoryView;

int main()
{
    MemoryView mv;

    mv.integer_ = (int)'\xff';
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \xff\xff\xff\xff

    mv.integer_ = 0xff;
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \xff\x00\x00\x00

    // now I try with a value less than 0x80
    mv.integer_ = (int)'\x7f';
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \x7f\x00\x00\x00

    mv.integer_ = 0x7f;
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \x7f\x00\x00\x00

    // now I try with 0x80
    mv.integer_ = (int)'\x80';
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \x80\xff\xff\xff

    mv.integer_ = 0x80;
    for (int i = 0; i < 4; i++)
        std::cout << mv.mem_[i];   // output is \x80\x00\x00\x00
}
I tested it with both GCC 4.6 and MSVC 2010, and the results were the same.
When I try values less than 0x80 the output is correct, but with character values of 0x80 and above,
the left three bytes are '\xff'.
CPU: Intel Core 2 Duo
Endianness: little
OS: Ubuntu 12.04 LTS (64-bit), Windows 7 (64-bit)
It's implementation-specific whether type char is signed or unsigned.
Assigning the value 0xFF to a variable of type char will yield either 255 (if the type is really unsigned) or -1 (if the type is really signed) in most implementations (where the number of bits in char is 8).
Values less than or equal to 0x7F (127) fit in both an unsigned char and a signed char, which explains why you are getting the result you describe.
#include <iostream>
#include <limits>
int
main (int argc, char *argv[])
{
    std::cerr << "unsigned char: "
              << +std::numeric_limits<unsigned char>::min ()
              << " to "
              << +std::numeric_limits<unsigned char>::max ()
              << ", 0xFF = "
              << +static_cast<unsigned char> ('\xFF')
              << std::endl;
    std::cerr << " signed char: "
              << +std::numeric_limits<signed char>::min ()
              << " to "
              << +std::numeric_limits<signed char>::max ()
              << ", 0xFF = "
              << +static_cast<signed char> ('\xFF')
              << std::endl;
}
Typical output:
unsigned char: 0 to 255, 0xFF = 255
signed char: -128 to 127, 0xFF = -1
To circumvent the problem you are experiencing, explicitly declare your variable as either signed or unsigned; in this case, casting your value to an unsigned char will be sufficient:
mv.integer_ = static_cast<unsigned char> ('\xFF'); /* 255, NOT -1 */
Side note:
You are invoking undefined behaviour when reading a member of a union that is not the last member you wrote to. The standard doesn't specify what happens in this case. Sure, under most implementations it will work as expected; accessing union.mem_[0] will most probably yield the first byte of union.integer_, but this is not guaranteed.
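Not part of the original answer, but if you want to avoid the union question entirely, a well-defined alternative is to copy the object's bytes out with std::memcpy; a small sketch:
#include <cstring>
#include <iostream>

int main()
{
    int value = (int)'\xff';
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);     // inspecting the representation this way is well defined
    for (unsigned char b : bytes)
        std::cout << std::hex << (int) b << " ";  // ff ff ff ff when plain char is signed, so value is -1
    std::cout << std::endl;
}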
The type of '\xff' is char. char is a signed integral type on a lot of platforms, so the value of '\xff' is negative (-1 rather than 255). When you convert (cast) that to an int (also signed), you get an int with the same, negative, value.
Anything strictly less than 0x80 will be positive, and you'll get a positive out of the conversion.
Because '\xff' is a signed char (char defaults to signed on many architectures, but not always), it is sign-extended when converted to an integer, to make it a 32-bit (in this case) int.
In binary arithmetic, nearly all negative representations use the highest bit to indicate "this is negative" and some sort of "inverse" logic to represent the value. The most common is "two's complement", where there is no "negative zero". In this form, all ones is -1, and the "most negative number" is a 1 followed by a lot of zeros, so 0x80 in 8 bits is -128, 0x8000 in 16 bits is -32768, and 0x80000000 in 32 bits is -2147483648.
A solution, in this case, would be to use static_cast<unsigned char>('\xff').
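A small sketch illustrating both the sign extension described here and the suggested cast (assuming an 8-bit signed char and a 32-bit int):
#include <iostream>

int main()
{
    std::cout << std::hex
              << (int)'\xff' << " "                               // ffffffff: -1, sign-extended
              << (int) static_cast<unsigned char>('\xff') << " "  // ff: 255, no sign extension
              << (int)'\x7f' << std::endl;                        // 7f: positive either way
}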
Basically, 0xff stored in a signed 8-bit char is -1. Whether a char without a signed or unsigned specifier is signed or unsigned depends on the compiler and/or platform, and in this case it seems to be signed.
Cast to an int, it keeps the value -1, which stored in a 32-bit signed int is 0xffffffff.
0x7f, on the other hand, stored in an 8-bit signed char is 127, which cast to a 32-bit int is 0x0000007f.