Can anyone explain the following behaviour to a relative newbie...
const char cInputFilenameAndPath[] = "W:\\testerfile.bin";
int filesize = 4584;
char * fileinrampointer;
fileinrampointer = (char*) malloc(filesize);
ifstream fsInputFileStream;
fsInputFileStream.open(cInputFilenameAndPath, fstream::in | fstream::binary);
fsInputFileStream.read((char *)(fileinrampointer), filesize);
for(int f=0; f<4; f++)
{
printf("%x\n", *fileinrampointer);
fileinrampointer++;
}
I was expecting the above code to read the first 4 bytes of the file I just read into memory. In the loop I am just displaying the current byte pointed to by the pointer, then incrementing the pointer ready to display the next byte.
When I run the code I get:
37
ffffff94
42
ffffffd2
The values are correct but every other value seems to be padded up to a 64 bit number.
Because I'm asking it to display the value indicated by a 'char sized' pointer, I was expecting char size results but every other result comes out as a long long.
If I assign *fileinrampointer to an unsigned __int8 it leaves me with the value I want (without the leading 1s), which solves the problem, but I'm just wondering if anyone can explain what is happening above?
The expression *fileinrampointer is of type char, which is signed on your platform, and it is promoted to a signed int when passed to printf. Thus, the sign bit propagates. You then print it with %x, which means unsigned int in hex, so all the 1's are printed (as opposed to being interpreted as part of a 2's complement signed integer). Also, ffffffd2 is 8 hex digits, which means it's a 32-bit integer, not a 64-bit one.
If you declare fileinrampointer as a pointer to unsigned char or unsigned __int8, the sign bit doesn't propagate during promotion. You may as well leave it signed and cast it:
printf("%x\n", static_cast<unsigned char>(*fileinrampointer) );
ISO/IEC 9899:1999 6.5.2.2:
6. If the expression that denotes the called function has a type that does not include a prototype, the integer promotions are performed on each argument, and arguments that have type float are promoted to double. These are called the default argument promotions. [...]
[...]
7. If the expression that denotes the called function has a type that does include a prototype, the arguments are implicitly converted, as if by assignment, to the types of the corresponding parameters, taking the type of each parameter to be the unqualified version of its declared type. The ellipsis notation in a function prototype declarator causes argument type conversion to stop after last declared parameter. The default argument promotions are performed on trailing arguments.
This clearly backs up my statement that this is integer promotion, and not printf interpretation.
Also see
ISO/IEC 9899:1999 7.15.1.1
glibc manual A.2.2.4
glibc manual 12.12.4
securecoding.cert.org
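You are not asking it to display a value indicated by a char-sized pointer; you are asking it to display a hexadecimal integer (%x) using the contents of a char pointer. Not tried it, but you could try casting it. Note that the cast has to go through unsigned char first, since casting a negative char straight to unsigned int keeps the leading f's:
printf("%x\n", (unsigned int)(unsigned char)(*fileinrampointer));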
#include <stdio.h>
int main() {
int i,n;
int a = 123456789;
void *v = &a;
unsigned char *c = (unsigned char*)v;
for(i=0;i< sizeof a;i++) {
printf("%u ",*(c+i));
}
char *cc = (char*)v;
printf("\n %d", *(cc+1));
char *ccc = (char*)v;
printf("\n %u \n", *(ccc+1));
}
This program generates the following output on my 32 bit Ubuntu machine.
21 205 91 7
-51
4294967245
First two lines of output I can understand =>
1st Line : sequence of storing of bytes in memory.
2nd Line : signed value of the second byte value (2's complement).
3rd Line : why such a large value ?
please explain the last line of output. WHY three bytes of 1's are added
because (11111111111111111111111111001101) = 4294967245 .
Apparently your compiler uses signed characters and it is a little endian, two's complement system.
123456789d = 075BCD15h
Little endian: 15 CD 5B 07
Thus v+1 gives value 0xCD. When this is stored in a signed char, you get -51 in signed decimal format.
When passed to printf, the character *(ccc+1) containing value -51 first gets implicitly promoted to int, because variadic functions like printf have a rule stating that all small integer arguments get promoted to int (the default argument promotions). During this promotion, the sign is preserved. You still have the value -51, but for a 32-bit signed integer, this gives the bit pattern 0xFFFFFFCD.
And finally the %u specifier tells printf to treat this as an unsigned integer, so you end up with 4.29 bil something.
The important part to understand here is that %u has nothing to do with the actual type promotion, it just tells printf how to interpret the data after the promotion.
-51 stored in 8 bits is 0xCD in hex (assuming a 2's complement binary system).
When you pass it to a variadic function like printf, default argument promotion takes place and char is promoted to int with representation 0xFFFFFFCD (for 4 byte int).
0xFFFFFFCD interpreted as int is -51 and interpreted as unsigned int is 4294967245.
Further reading: Default argument promotions in C function calls
please explain the last line of output. WHY three bytes of 1's are
added
This is called sign extension. When a smaller signed number is assigned (converted) to a larger type, its sign bit gets replicated to ensure it represents the same number (for example in 1's and 2's complement).
Bad printf format specifier
You are attempting to print a char with the specifier "%u", which specifies unsigned [int]. Passing an argument that does not match the conversion specifier in printf is undefined behavior, per 7.19.6.1 paragraph 9:
If a conversion specification is invalid, the behavior is undefined. If
any argument is not the correct type for the corresponding conversion
specification, the behavior is undefined.
Use of char to store signed value
Also, to ensure the variable holds a signed value, explicitly use signed char, as plain char may behave as either signed char or unsigned char. (In the latter case, the output of your snippet may be 205 205.) In gcc you can force char to behave as unsigned char with the -funsigned-char option.
I would like to store an unsigned char into a char by means of a shift. As the two data types have the same length (1 byte on my machine), I would have expected the following code to work:
#include <iostream>
#include <cstring>
#include <cstdio>
using namespace std;
int main () {
printf ("%d\n", sizeof(char));
printf ("%d\n", sizeof(unsigned char));
unsigned char test = 49;
char testchar = (char) (test - 127);
printf ("%x\n", testchar);
return 0;
}
but it doesn't. In particular, I got the following output:
1
1
ffffffb2
which suggests that the char has been cast to int. Does anybody have an explanation and, hopefully, a solution?
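%x is a specifier for an unsigned int (4 bytes on most platforms). To print a one-byte char use %hhx.
printf does not convert its arguments according to the format specifiers passed to it. testchar was promoted to int by the default argument promotions, which apply to every variadic call before printf ever looks at the format string.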
printf is a variable argument function, and as such its arguments are subject to the default promotion rules. For this case, your char is promoted to an int, and in that process is sign extended.
A 2's complement int of 4 bytes with the binary pattern 0xffffffb2 is -78. Print it as a char with the %hhx specifier.
See also Which integral promotions do take place when printing a char?
%x is only for printing unsigned int, however you supply a char.
Using %x with a negative value of char causes undefined behaviour.
Aside: The C Standard specification of printf is not particularly clear; some feel that passing anything except exactly an unsigned int causes undefined behaviour. Others (including myself) feel that it's OK to pass arguments that are not specifically unsigned int, but after the default argument promotions, have type int with a non-negative value. The standard does guarantee that non-negative ints have the same representation as the unsigned int with the same value.
Some of the other answers suggest %hhx, but that is not any better than %x. The standard (on a sensible interpretation) specifies that %hhx only be used with an unsigned char argument, and %hhd only be used with a signed char argument. There is actually no modifier for plain char.
Either way you look at it, nowhere can printf be used to convert negative values to positive representations in a well-defined manner. You must convert the argument yourself and then use a matching format specifier. In this case:
printf ("%hhx\n", (unsigned char)testchar);
would be one option. IMO %x could be used here, but as mentioned above, some disagree.
NB. The wrong format specifier is used in printf ("%d\n", sizeof(char)); and the line following that. The specifier for size_t is %zu. So you could either use %zu, or cast the argument to int, or even better:
printf("1\n");
What happens is:
1) unsigned char test = 49; // hex value 31 gets assigned
2) char testchar = (char) (test - 127); // 49-127 = -78, i.e. bit pattern 0xb2 stored in the (signed) char
3) printf ("%x\n", testchar); // testchar is promoted to int, which sign-extends it with F's, and %x then prints the full 4-byte value ffffffb2
So try what #Don't You Worry Child said.
I would have expected the following code to work:
It won't.
Ignoring the issues other people have pointed out with how you're printing the character, there is no guarantee in the standard that your code will work. Why?
Because char does not have to be signed. Whether char is signed or unsigned is implementation-dependent. Some implementations make char signed, others make it unsigned.
As such, there's no guarantee that (char) (test - 127) will produce a value that can be represented by char.
C++(14) does allow lossless conversion between unsigned char and char. The standard says (3.9.1/1):
For each value i of type unsigned char in the range 0 to 255 inclusive, there exists a value j of type char such that the result of an integral conversion (4.7) from i to char is j, and the result of an integral conversion from j to unsigned char is i.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers? To make sense of my question, have a look at the code below -
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
both the printf's output 𤭢 correctly, where f0 a4 ad a2 is the encoding for the Unicode code-point U+24B62 (𤭢) in hex.
Even memcpy also correctly copied the bits held by a char.
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification. But as the above example showed, the output doesn't seem to be affected by any padding as such.
I have used VC++ Express 2010 and MinGW to compile the above. Although VC gave the warning
warning C4309: '=' : truncation of constant value
the output doesn't seem to reflect that.
P.S. This could be marked a possible duplicate of Should a buffer of bytes be signed or unsigned char buffer? but my intent is different. I am asking why something which seems to be working as fine with char should be typed unsigned char?
Update: To quote from N3337,
Section 3.9 Types
2 For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes (1.7) making up the object can be copied into
an array of char or unsigned char. If the content of the array of char
or unsigned char is copied back into the object, the object shall
subsequently hold its original value.
In view of the above fact, and given that my original example was on an Intel machine where char defaults to signed char, I am still not convinced that unsigned char should be preferred over char.
Anything else?
In C the unsigned char data type is the only data type that has all of the following three properties simultaneously:
it has no padding bits, that is, all storage bits contribute to the value of the data
no bitwise operation starting from a value of that type, when converted back into that type, can produce overflow, trap representations or undefined behavior
it may alias other data types without violating the "aliasing rules", that is, access to the same data through a pointer that is typed differently will be guaranteed to see all modifications
If these are the properties of a "binary" data type you are looking for, you definitely should use unsigned char.
For the second property we need a type that is unsigned. For these, all conversions are defined with modulo arithmetic, here modulo UCHAR_MAX+1, which is 256 on nearly all architectures. Any conversion of a wider value to unsigned char thereby just corresponds to truncation to the least significant byte.
The two other character types generally don't work the same. signed char is signed, anyhow, so conversion of values that don't fit it is not well defined. char is not fixed to be signed or unsigned, so on a particular platform to which your code is ported it might be signed even if it is unsigned on yours.
You'll get most of your problems when comparing the contents of individual bytes:
char c[5];
c[0] = 0xff;
/*blah blah*/
if (c[0] == 0xff)
{
printf("good\n");
}
else
{
printf("bad\n");
}
can print "bad", because, depending on your compiler, c[0] will be sign-extended to -1, which is not in any way the same as 0xff.
The plain char type is problematic and shouldn't be used for anything but strings. The main problem with char is that you can't know whether it is signed or unsigned: this is implementation-defined behavior. This makes char different from int etc, int is always guaranteed to be signed.
Although VC gave the warning ... truncation of constant value
It is telling you that you are trying to store int literals inside char variables. This might be related to the signedness: if you try to store an integer with value > 0x7F inside a signed character, unexpected things might happen. Formally, the result of such a conversion is implementation-defined in C (or an implementation-defined signal is raised), though practically you'd just get a weird output if attempting to print the result as an integer value stored inside a (signed) char.
In this specific case, the warning shouldn't matter.
EDIT :
In other related questions unsigned char is highlighted because it is the only (byte/smallest) data type which is guaranteed to have no padding by the C-specification.
In theory, all integer types except unsigned char and signed char are allowed to contain "padding bits", as per C11 6.2.6.2:
"For unsigned integer types other than unsigned char, the bits of the
object representation shall be divided into two groups: value bits and
padding bits (there need not be any of the latter)."
"For signed integer types, the bits of the object representation shall
be divided into three groups: value bits, padding bits, and the sign
bit. There need not be any padding bits; signed char shall not have
any padding bits."
The C standard is intentionally vague and fuzzy, allowing these theoretical padding bits because:
It allows different symbol tables than the standard 8-bit ones.
It allows implementation-defined signedness and weird signed integer formats such as one's complement or "sign and magnitude".
An integer may not necessarily use all bits allocated.
However, in the real world outside the C standard, the following applies:
Symbol tables are almost certainly 8 bits (UTF8 or ASCII). Some weird exceptions exist, but clean implementations use the standard type wchar_t when implementing symbol tables larger than 8 bits.
Signed integers are always two's complement.
An integer always uses all bits allocated.
So there is no real reason to use unsigned char or signed char just to dodge some theoretical scenario in the C standard.
Bytes are usually intended as unsigned 8 bit wide integers.
Now, char doesn't specify the sign of the integer: on some compilers char could be signed, on other it may be unsigned.
If I add a bit shift operation to the code you wrote, then I will have implementation-defined behaviour (right-shifting a negative signed value). The added comparison will also have an unexpected result.
char c[5], d[5];
c[0] = 0xF0;
c[1] = 0xA4;
c[2] = 0xAD;
c[3] = 0xA2;
c[4] = '\0';
c[0] >>= 1; // If char is signed, will the 7th bit go to 0 or stay the same?
bool isBiggerThan0 = c[0] > 0; // FALSE if char is signed!
printf("%s\n", c);
memcpy(d, c, 5);
printf("%s\n", d);
Regarding the warning during compilation: if char is signed then you are trying to assign the value 0xf0, which cannot be represented in a signed char (range -128 to +127), so it will be converted to a signed value (-16).
Declaring the char as unsigned will remove the warning, and it is always good to have a clean build without any warnings.
The signed-ness of the plain char type is implementation defined, so unless you're actually dealing with character data (a string using the platform's character set - usually ASCII), it's usually better to specify the signed-ness explicitly by either using signed char or unsigned char.
For binary data, the best choice is most probably unsigned char, especially if bitwise operations will be performed on the data (specifically bit shifting, which doesn't behave the same for signed types as for unsigned types).
I am asking why something which seems to be working as fine with char should be typed unsigned char?
If you do things which are not "correct" in the sense of the standard, you rely on undefined behaviour. Your compiler might do it the way you want today, but you don't know what it does tomorrow. You don't know what GCC does or VC++ 2012. Or even if the behaviour depends on external factors or Debug/Release compiles etc. As soon as you leave the safe path of the standard, you might run into trouble.
Well, what do you call "binary data"? This is a bunch of bits, without any meaning assigned to them by that specific part of software that calls them "binary data". What's the closest primitive data type, which conveys the idea of the lack of any specific meaning to any one of these bits? I think unsigned char.
Is it really necessary to use unsigned char to hold binary data as in some libraries which work on character encoding or binary buffers?
"really" necessary? No.
It is a very good idea though, and there are many reasons for this.
Your example uses printf, which is not type-safe. That is, printf takes its formatting cues from the format string and not from the data type. You could just as easily have tried:
printf("%s\n", (void*)c);
... and the result would have been the same. If you try the same thing with c++ iostreams, the result will be different (depending on the signed-ness of c).
What reasoning could possibly advocate the use of unsigned char instead of a plain char?
Signed specifies that the most significant bit of the data (for unsigned char the 8-th bit) represents the sign. Since you obviously do not need that, you should specify your data is unsigned (the "sign" bit represents data, not the sign of the other bits).
I am reading chapter 2 of Advanced Linux Programming:
http://www.advancedlinuxprogramming.com/alp-folder/alp-ch02-writing-good-gnu-linux-software.pdf
In the section 2.1.3 Using getopt_long, there is an example program that goes a bit like this:
int main (int argc, char* argv[]) {
int next_option;
// ...
do {
next_option = getopt_long (argc, argv, short_options, long_options, NULL);
switch (next_option) {
case 'h': /* -h or --help */
// ...
}
// ...
The bit that caught my attention is that next_option is declared as an int. The function getopt_long() apparently returns an int representing the short command line argument which is used in the following switch statement. How come that integer can be compared to a character in the switch statement?
Is there an implicit conversion from a char (a single character?) to an int? How is the code above valid? (see full code in linked pdf)
Neither C nor C++ has a type that can store "characters" as values with some dedicated character-specific properties. In that sense, there's no "character" type in either C or C++.
In both the C and C++ languages char is an integral type. It contains numbers. It is just the smallest (in terms of range) integral type. Conversion between char and int exists, just like it exists between int and long or int and short. char has no special status among other integral types (aside from the fact that the char type is distinct from the signed char and unsigned char types).
A literal of the form 'h' in C++ has type char, but as any other integral type it is comparable to int. That's why you can use it in case label the way it is used in your original example.
In other words, your original code is as "strange" as
switch (next_option) {
case 1L: ...
// ...
}
would be. In this case the switch argument is an int, but the case label is a long. The code is valid. Do you find it surprising? Probably not. Your example with 'h' is not much different.
You are mistaken -- getopt_long(3) returns an int.
Several functions return int in C, but char in C++. Returning an int when a char would make more sense is simply an old C cultural decision. Plus, in a few cases, it's necessary so that a function can return sentinels like EOF.
As the other answerer says, you're asking the wrong question here. But to answer the question you did ask:
No implicit conversion from char* to an int is available. On 32-bit x86 machines, both int and char* are 32 bits long, so it's "safe" to explicitly cast:
int x = (int) &someChar;
BUT HIGHLY NOT RECOMMENDED!!!
On x64 machines, this will not work! int remains 32 bits long, but all pointers are now 64 bits long... so you'll lose data in the process!
According to the man page, getopt_long returns an int. And yes, there is an implicit cast from char to int; a char is just a one-byte integer value.
So in this case, the cast is not happening when assigning to next_option, but in the case statement where you have a character constant being compared to an int. Of course, this is assuming you compile this as C++. In C++, a character constant is of type char, but in C it's of type int, so if you compile this code as C then there's no type conversion at all.
(And in your question you mention char*, but you probably meant char; there are no pointers being used here.)
Think of a char as an 8-bit int. You can perform integer operations on chars and you can even declare them as unsigned. You wouldn't be surprised if you could compare a short and a long. Why should comparing a char and an int be different?
Is it safe to convert, say, from an unsigned char * to a signed char * (or just a char *)?
The access is well-defined, you are allowed to access an object through a pointer to signed or unsigned type corresponding to the dynamic type of the object (3.10/15).
Additionally, signed char is guaranteed not to have any trap values and as such you can safely read through the signed char pointer no matter what the value of the original unsigned char object was.
You can, of course, expect that the values you read through one pointer will be different from the values you read through the other one.
Edit: regarding sellibitze's comment, this is what 3.9.1/1 says.
A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.9); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. For unsigned character types, all possible bit patterns of the value representation represent numbers.
So indeed it seems that signed char may have trap values. Nice catch!
The conversion should be safe, as all you're doing is converting from one type of character to another, which should have the same size. Just be aware of what sort of data your code is expecting when you dereference the pointer, as the numeric ranges of the two data types are different. (i.e. if your number pointed by the pointer was originally positive as unsigned, it might become a negative number once the pointer is converted to a signed char* and you dereference it.)
Casting changes the type, but does not affect the bit representation. Casting from unsigned char to signed char does not change the bits at all (on a 2's complement machine), but it changes how they are interpreted.
Here is an example:
#include <stdio.h>
int main(int argc, char** argv) {
/* example 1 */
unsigned char a_unsigned_char = 192;
signed char a_signed_char = a_unsigned_char;
printf("%d, %d\n", a_signed_char, a_unsigned_char); // -64, 192
/* example 2 */
unsigned char b_unsigned_char = 32;
signed char b_signed_char = b_unsigned_char;
printf("%d, %d\n", b_signed_char, b_unsigned_char); // 32, 32
return 0;
}
In the first example, you have an unsigned char with value 192, or 11000000 in binary. After the cast to signed char, the bits are still 11000000, but that happens to be the 2's complement representation of -64. Signed values are stored in 2's complement representation.
In the second example, our unsigned initial value (32) is less than 128, so it seems unaffected by the cast. The binary representation is 00100000, which is still 32 in 2s-complement representation.
To "safely" cast from unsigned char to signed char, ensure the value is less than 128.
It depends on how you are going to use the pointer. You are just converting the pointer type.
You can safely convert an unsigned char* to a char * as the function you are calling will be expecting the behavior from a char pointer, but, if your char value goes over 127 then you will get a result that will not be what you expected, so just make certain that what you have in your unsigned array is valid for a signed array.
I've seen it go wrong in a few ways, converting to a signed char from an unsigned char.
One, if you're using it as an index to an array, that index could go negative.
Secondly, if inputted to a switch statement, it may result in a negative input which often is something the switch isn't expecting.
Third, it has different behavior on an arithmetic right shift:
int x = ...;
char c = 128;
unsigned char u = 128;
c >> x;
has a different result than
u >> x;
Because the former is sign-extended and the latter isn't.
Fourth, a signed character causes underflow at a different point than an unsigned character.
So a common overflow check,
(c + x > c)
could return a different result than
(u + x > u)
Safe if you are dealing with only ASCII data.
I'm astonished it hasn't been mentioned yet: Boost numeric cast should do the trick - but only for the data of course.
Pointers are always pointers. By casting them to a different type, you only change the way the compiler interprets the data pointed to.