I just have a quick question on the extraneous number (1684096032) outputted to the screen. I expected the integer ASCII value (97) to be outputted, as that is the ASCII value of lowercase 'a'. The console gave me a ginormous number instead...
#include <iostream>
using namespace std;

int main() {
    char dadum[50] = "Papa Dadi";
    char* wicked = dadum;
    int* baboom = (int*)wicked;
    cout << baboom[1] << endl;
    cout << "Hello World";
    return 0;
}
A Word of Warning
Casting a pointer to int* that wasn’t derived from an int* is undefined behavior, and the compiler is allowed to do literally anything. This isn’t just theoretical: on some workstations I’ve coded on, an int* must be aligned on a four-byte boundary whereas a char* need not be, and trying to load a misaligned 32-bit value causes a CPU fault, so code like this might or might not crash at runtime with a “bus error.”
Where 1684096032 Comes From
The compiler here is doing the “common-sense” thing for a desktop computer in 2019. It’s letting you “shoot yourself in the foot,” so to speak, by emitting a load instruction from the address you cast to int*, no questions asked. What you get is the garbage that CPU instruction gives you.
What’s going on is that "Papa Dadi" is stored as an array of bytes in memory, including the terminating zero byte. It’s equivalent to {'P','a','p','a',' ','D','a','d','i','\0'}. ("Papa Dadi" is just syntactic sugar; you could write it either way.) These are stored as their ASCII values1: { 0x50, 0x61, 0x70, 0x61, 0x20, 0x44, 0x61, 0x64, 0x69, 0x00 }.
You happen to be compiling on a machine with four-byte int, so when you alias wicked to the int* baboom, baboom[0] aliases bytes 0–3 of wicked and baboom[1] bytes 4–7. Therefore, on the implementation you tested, you happened to get back the bytes " Dad", or in Hex, 20 44 61 64.
Next, you happen to be compiling for a little-endian machine, so that gets loaded from memory in “back-words order,” 0x64614420.
This has the decimal value 1684096032.
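If you want to check the arithmetic yourself, here is a minimal sketch (assuming a 4-byte int, as on the machine in question) that reassembles those four bytes the way a little-endian load does:

#include <cstdint>
#include <iostream>

int main() {
    // The four bytes that baboom[1] aliases: ' ', 'D', 'a', 'd'
    unsigned char bytes[4] = { 0x20, 0x44, 0x61, 0x64 };

    // A little-endian CPU treats the byte at the lowest address as the
    // least significant, so the loaded value is 0x64614420.
    std::uint32_t value = bytes[0]
                        | (bytes[1] << 8)
                        | (bytes[2] << 16)
                        | (static_cast<std::uint32_t>(bytes[3]) << 24);

    std::cout << value << '\n';   // prints 1684096032
    return 0;
}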
What You Meant to Do
Based on your remarks:2
cout << (int)wicked[1];
1 Evidently you aren’t running on the sole exception, an IBM mainframe compiler. These still use EBCDIC by default.
2 The consensus here is never to write using namespace std;. Personally, I learned to use <iostream.h> and the C standard library long before namespace std, and I still prefer to omit std:: for those, so you can tell my code by the block of declarations like using std::cout;.
You are trying to print out the second element of the integer array that points to the string "Papa Dadi".
You should know that an integer is 4 bytes and a char is 1 byte, so each element of the integer array skips 4 bytes (characters).
The printout you see is 0x64614420 in hex; taking its bytes from least significant to most significant (their order in memory on a little-endian machine), you will see the following:
0x20 is space character.
0x44 is 'D'
0x61 is 'a'
0x64 is 'd'
So you need to swap the byte order if you care about the original order of the characters. However, I am not sure what you are trying to achieve.
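If what you want is to recover those characters from the printed number, a minimal sketch (assuming a 4-byte int and the little-endian layout described above) would be:

#include <cstdint>
#include <iostream>

int main() {
    std::uint32_t value = 0x64614420;   // the number that was printed, 1684096032

    // Peel off one byte at a time, least significant first,
    // which is the order the bytes sit in memory on a little-endian machine.
    for (int shift = 0; shift < 32; shift += 8) {
        char c = static_cast<char>((value >> shift) & 0xFF);
        std::cout << c;                 // prints " Dad"
    }
    std::cout << '\n';
    return 0;
}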
Related
I'm trying to convert a BYTE array into an equivalent unsigned long long int value but my coding is not working as expected. Please help with fixing it or suggest an alternative method for the same.
Extra Information: These 4 bytes are combined as a hexadecimal number and an equivalent decimal number is the output. Say for a given byteArray = {0x00, 0xa8, 0x4f, 0x00}, the hexadecimal number is 00a84f00 and its equivalent decimal number is 11030272.
#include <iostream>
#include <string>
#include <cstdlib>   // std::strtoull, EXIT_SUCCESS
#include <cstdio>    // printf

typedef unsigned char BYTE;

int main(int argc, char *argv[])
{
    BYTE byteArray[4] = { 0x00, 0x08, 0x00, 0x00 };
    std::string str(reinterpret_cast<char*>(&byteArray[0]), 4);
    std::cout << str << std::endl;
    unsigned long long ull = std::strtoull(str.c_str(), NULL, 0);
    printf("The decimal equivalents are: %llu", ull);
    return EXIT_SUCCESS;
}
I'm getting the following output:
The decimal equivalents are: 0
While the expected output was:
The decimal equivalents are: 2048
When you call std::strtoull(str.c_str(), NULL, 0);, the first argument you supply is equivalent to an empty string: a C string is a null-terminated sequence of characters, and the very first byte of your array is 0x00.
Second, std::strtoull() does not convert byte sequences; it converts the textual representation of a number, i.e. you'll get 2048 from std::strtoull("2048", NULL, 10).
Another thing to note is that unsigned long long is a 64-bit data type, whereas your byte array only provides 32 bits. You need to fill the other 32 bits with zeros to get the correct result. I use a direct assignment below, but you could also use std::memset() here.
What you want to do is:
ull = 0ULL;
std::memcpy(&ull, byteArray, 4);
Given that your platform is little-endian, the result should be 2048.
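Put together as a complete program (a sketch under this answer's assumptions: a 4-byte input array and a little-endian host), that looks like:

#include <cstdio>
#include <cstring>

typedef unsigned char BYTE;

int main()
{
    BYTE byteArray[4] = { 0x00, 0x08, 0x00, 0x00 };

    // Zero the whole 64-bit value, then copy the four input bytes into
    // its low-order end. On a little-endian machine those bytes form
    // the least significant half, giving 0x0000000000000800.
    unsigned long long ull = 0ULL;
    std::memcpy(&ull, byteArray, 4);

    std::printf("The decimal equivalents are: %llu\n", ull);   // 2048
    return 0;
}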
What you first must remember is that a string is really a null-terminated string. Secondly, a string is a string of characters, which is not what you have. The third problem is that you have an array of four bytes, which corresponds to an unsigned 32-bit integer, while you want an (at least) 64-bit type, which is 8 bytes.
You can solve all these problems with a temporary variable, a simple call to std::memcpy, and an assignment:
uint32_t temp;
std::memcpy(&temp, byteArray, 4);
ull = temp;
Of course, this assumes that the endianness is correct.
Note that I use std::memcpy instead of std::copy (or std::copy_n) because std::memcpy is explicitly mentioned to be able to bypass strict aliasing this way, while I don't think the std::copy functions are. Also the std::copy functions are more for copying elements and not anonymous bytes (even if they can do that too, but with a clunkier syntax).
Given the answers are using std::memcpy, I want to point out that there's a more idiomatic way of doing this operation:
char byteArray[] = { 0x00, 0x08, 0x00, 0x00 };
uint32_t cp;
std::copy(byteArray, byteArray + sizeof(cp), reinterpret_cast<char*>(&cp));
std::copy is similar to std::memcpy, but is the C++ way of doing it.
Note that you need to cast the address of the output variable cp to one of: char *, unsigned char *, signed char *, or std::byte *, because otherwise the operation wouldn't be byte oriented.
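For completeness, the same idea as a self-contained sketch (the names follow the snippet above; the 2048 result again assumes a little-endian machine):

#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    char byteArray[] = { 0x00, 0x08, 0x00, 0x00 };

    std::uint32_t cp;
    // The destination is viewed through a char* so the copy is byte oriented.
    std::copy(byteArray, byteArray + sizeof(cp), reinterpret_cast<char*>(&cp));

    std::cout << cp << '\n';   // 2048 on a little-endian machine
    return 0;
}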
char b = 'a';
int *a = (int*)&b;
std::cout << *a;
What could be the content of *a? It is showing a garbage value. Can anyone please explain why?
Suppose char takes one byte in memory and int takes two bytes (the exact number of bytes depends on the platform, but usually they are not the same for char and int). You set a to point to the same memory location as b. Dereferencing b considers only one byte, because it is of type char. Dereferencing a accesses two bytes and thus prints the integer stored at those locations. That's why you get garbage: the first byte is 'a', the second is a random byte; together they give you a random integer value.
Either the first or the last byte should be hex 61, depending on byte order. The other three bytes are garbage. It is best to change the int to an unsigned int and change the cout to print in hex.
I don't know why anyone would want to do this.
You initialize a variable with the datatype char ...
A char in C++ is 1 byte and an int is at least 2 bytes. Your a points to the address of the b variable; an address is just a number, usually written in hexadecimal. Every time you run this program you will likely see a different hexadecimal number, because the operating system can place the variable at a different address on each run.
Think of it as byte blocks. A char has one byte block (8 bits). If you apply the conversion (int*) and dereference it, the read also covers the sizeof(int) - 1 byte blocks that follow the char's address. Those extra byte blocks are effectively random, which means you'll get a random integer. That's why you get a garbage value.
The code invokes undefined behavior; garbage is one possible outcome of undefined behavior, but your program could also cause an access violation and crash, with worse consequences.
int *a = (int*)&b; initializes a pointer to int with the address of a char. Dereferencing this pointer will attempt to read an int from that address:
If the address is misaligned and the processor does not support misaligned accesses, you may get a system specific signal or exception.
If the address is close enough to the end of a segment that accessing beyond the first byte causes a segment violation, that's what you can get.
If the processor can read the sizeof(int) bytes at the address, only one of those will be 'a' (0x61 in ASCII), but the others have undetermined values (aka garbage). As a matter of fact, in some environments reading from uninitialized memory may cause problems: under valgrind, for example, this will cause a warning to be displayed to the user.
All the above are speculations, undefined behavior means anything can happen.
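If the goal is just to inspect the bytes of an object, the well-defined alternative is to copy them with std::memcpy (or read them through an unsigned char*), which the aliasing rules allow. A minimal sketch:

#include <cstdio>
#include <cstring>

int main()
{
    char b = 'a';
    int value = 0;

    // Copy the single byte of b into a zeroed int. Unlike dereferencing a
    // type-punned, possibly misaligned int*, this is well-defined, though
    // the numeric result still depends on endianness.
    std::memcpy(&value, &b, sizeof b);

    std::printf("%d\n", value);   // 97 on a little-endian machine
    return 0;
}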
On my machine, the following program writes 1234 to its output.
const char str[] = "1234";
printf("%c%c%c%c\n",
(int) (0xff & (*(uint32_t*) str) >> 0),
(int) (0xff & (*(uint32_t*) str) >> 8),
(int) (0xff & (*(uint32_t*) str) >> 16),
(int) (0xff & (*(uint32_t*) str) >> 24));
This implies that str is internally represented as 0x34333231, and the first byte str[0] represents the least significant 8 bits.
Does this mean str is encoded in little endian? And is the output of this program platform-dependent?
Also, is there a convenient way to use 1, 2, 4 and 8 character string literals in switch case statements? I can't find any way to convert the strings to integers, as *(const uint32_t* const) "1234" is not a constant expression, and 0x34333231/0x31323334 might be platform dependent and must be notated in hexadecimal.
edit:
In other words, is 0xff & *(uint32_t*) str always equal to str[0]?
Eh, never mind, just realized it is and also why.
You're confusing the endianness of a string (which doesn't exist, so long as we're talking about ASCII strings) with the endianness of an integer. The integer on your system is little endian.
To answer your second question: no, you can't switch on strings. If you're really desperate for the increase in speed, you could write one set of integer constants for little-endian systems and another for big-endian systems.
Endianness refers to the order of bytes in a larger value. Strings are (at least in C and C++) an array of bytes so endianness doesn't apply.
You actually can do what you mention in the last paragraph using multicharacter literals, though exactly how it works is implementation-defined and the literal must be no longer than sizeof(int) characters.
C++ standard, §2.14.3/1 - Character literals
(...) An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value.
For instance, 'abcd' is a value of type int with an implementation-defined value. This value probably would depend on endianness. Since it is an integer, you are allowed to switch on it.
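A minimal sketch of switching on such a literal (most compilers warn about multi-character constants, and the values are implementation-defined, so treat this as illustrative rather than portable):

#include <iostream>

int main()
{
    int v = 'abcd';          // type int, implementation-defined value

    switch (v) {             // legal, because v is just an int
    case 'abcd':
        std::cout << "matched 'abcd' (value " << v << ")\n";
        break;
    case 'dcba':
        std::cout << "matched 'dcba'\n";
        break;
    default:
        std::cout << "no match\n";
        break;
    }
    return 0;
}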
The bytes are laid out at increasing memory addresses as 0x31, 0x32, 0x33, 0x34.
Read as a 32-bit integer, that gives 0x34333231 on a little-endian machine and 0x31323334 on a big-endian one.
(Also, in general, integers are aligned on even or 4-byte addresses.)
In languages like C/C++, when we do:
char c = 'A';
We allocate memory to store number 65 in binary:
stuff_to_the_left_01000001_stuff_to_the_right
Then if we do:
int i = (int) c;
As I understand it, we're saying to the compiler that it should interpret the bit pattern laid out as stuff_to_the_left_01000001__00000000_00000000_00000000_stuff_to_the_right, which may or may not turn out to be 65.
The same happens when we perform a cast during an operation
cout << (int) c << endl;
In all of the above, I got 'A' for character and 65 in decimal. Am I being lucky or am I missing something fundamental?
Casts in C do not reinterpret anything. They are value conversions. (int)c means take the value of c and convert it to int, which is a no-op on essentially all systems. (The only way it could fail to be a no-op is if the range of char is larger than the range of int, for example if char and int are both 32-bit but char is unsigned.)
If you want to reinterpret the representation (bit pattern) underlying a value, that value must first exist as an object (lvalue), not just the value of an expression (typically called "rvalue" though this language is not used in the C standard). Then you can do something like:
*(new_type *)&object;
However, except in the case where new_type is a character type, this invokes undefined behavior by violating the aliasing rules. C++ has a sort of "reinterpret cast" to do this which can presumably avoid breaking aliasing rules, but as I'm not familiar with C++, I can't provide you with good details on it.
In your C++ example, the reason you get different results is operator overloading. (int)'A' does not change the value or how it's interpreted; rather, the expression having a different type causes a different overload of the operator<< function to be called. In C, on the other hand, (int)'A' is always a no-op, because 'A' has type int to begin with in C.
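To see the overload-selection point concretely, here is a small sketch (the 65 assumes an ASCII-based character set):

#include <iostream>

int main()
{
    char c = 'A';

    std::cout << c << '\n';        // char overload of operator<<: prints A
    std::cout << (int)c << '\n';   // same value, int overload: prints 65
    std::cout << +c << '\n';       // integer promotion has the same effect: prints 65
    return 0;
}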
Am I being lucky or am I missing something fundamental?
Yes, you are missing something fundamental: the compiler does not read the char from memory as if the memory represented an int. Instead, it reads a char as a char, and then sign-extends the value to fit in an int, so char -1 becomes int -1 as well. Sign-extending means filling the new high-order bits with copies of the sign bit: 1s for a negative number, 0s for a non-negative one. Unsigned types are always padded with zeros*.
Sign extension is usually done in a register by executing a dedicated hardware instruction, so it runs very fast.
* As Eric Postpischil noted in a comment, char type may be signed or unsigned, depending on the C implementation.
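A short sketch of that difference (assuming an 8-bit char and a 32-bit int):

#include <iostream>

int main()
{
    signed char   sc = -1;    // bit pattern 0xFF
    unsigned char uc = 0xFF;  // the same bit pattern

    int from_signed   = sc;   // sign-extended: still -1
    int from_unsigned = uc;   // zero-extended: 255

    std::cout << from_signed << ' ' << from_unsigned << '\n';   // -1 255
    return 0;
}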
When you allocate a char, there's no such thing as stuff to the left or right. It's eight bits, nothing more. So when you cast that eight-bit value to 32 bits, you still get 65:
0100.0001 to 0000.0000 0000.0000 0000.0000 0100.0001
No magic, no luck.
In your code, "i" has its own address and "c" has its own. The value is copied from c to i.
As for "(int) c", the same is done again, though the compiler does that for us, as follows.
|--- i ---|- c-|
0x01 0x02 0x03 0x04
+--------------------......
| 00 | 00 | 08 | 08 |......
+--------------------......
You would have been correct if this were pointer-based access.
e.g.
0x01 0x02 0x03
+---------------......
| 07 | 10 | 08 |......
+---------------......
int *p;
char c = 10;
p = (int*)&c; // the cast is needed for this to compile
print(*p); //not a real method just something that can print.
Here *p would have combined the values from memory addresses 0x02 and 0x03.
Well, the thing is that this behavior can change depending on the platform you're compiling for and the compiler you're using.
The ISO standard defines (int) to be a cast.
In this case, your compiler will interpret (int)c like static_cast<int>(c) // in C++
Now, you're lucky: your compiler interprets (int) as a simple value cast. That's the common behavior for any C/C++ compiler, but there might be some evil, no-name C++ compilers which do a reinterpreting cast on that one, ending up with an unpredictable result (depending on the platform).
That is why you should use static_cast<int>(c) to be 100% sure,
and, if you want to reinterpret it, of course reinterpret_cast.
But, again, a C-style cast here is a value cast, and therefore the char will be converted into an integer.
For example:
int* x = new int;
int y = reinterpret_cast<int>(x);
y now holds the integer value of the memory address of variable x.
Variable y is of size int. Will that int size always be large enough to store the converted memory address of ANY TYPE being converted to int?
EDIT:
Or is safer to use long int to avoid a possible loss of data?
EDIT 2: Sorry people, to make this question more understandable: the thing I want to find out here is the size of the returned HEX value as a number, not the size of int nor the size of a pointer to int, but the plain hex value. I need to get that value in human-readable notation. That's why I'm using reinterpret_cast to convert that memory address to a DEC value. But to store the value safely I also need to find out what kind of variable to store it in: int, long, or what type is big enough?
No, that's not safe. There's no guarantee sizeof(int) == sizeof(int*)
On a 64 bit platform you're almost guaranteed that it's not.
As for the "hexadecimal value" ... I'm not sure what you're talking about. If you're talking about the textual representation of the pointer in hexadecimal ... you'd need a string.
Edit to try and help the OP based on comments:
Because computers don't work in hex. I don't know how else to explain it. An int stores some number of bits (binary), as does a long. Hexadecimal is a textual representation of those bits (specifically, the base-16 representation). Strings are used for textual representations of values. If you need a hexadecimal representation of a pointer, you need to convert that pointer to text (hex).
Here's a c++ example of how you would do that:
test.cpp
#include <string>
#include <iostream>
#include <sstream>
int main()
{
int i = 0; // something for the pointer to point at (reading an uninitialized pointer is undefined)
int *p = &i; // declare a pointer to an int.
std::ostringstream oss; // create a stringstream
std::string s; // create a string
// this takes the value of p (the memory address), converts it to
// the hexadecimal textual representation, and puts it in the stream
oss << std::hex << p;
// Get a std::string from the stream
s = oss.str();
// Display the string
std::cout << s << std::endl;
}
Sample output:
roach$ g++ -o test test.cpp
roach$ ./test
0x7fff68e07730
It's worth noting that the same thing is needed when you want to see the base10 (decimal) representation of a number - you have to convert it to a string. Everything in memory is stored in binary (base2)
On most 64-bit targets, int is still 32-bit while pointers are 64-bit, so it won't work.
http://en.wikipedia.org/wiki/64-bit#64-bit_data_models
What you probably want is to use std::ostream's formatting of addresses:
int x(0);
std::cout << &x << '\n';
As to the length of the produced string: you need to determine the size of the respective pointer. For each byte, the output uses two hex digits, because each hex digit can represent 16 values. All bytes are typically printed even though it is unlikely that every one is significant: when pointers are 8 bytes, as on 64-bit systems, the stack often grows downward from near the top of the address range, while the executable code starts near the beginning of the range (the very first page is usually left unmapped so that touching it causes a segmentation violation). Above the executable code live some data segments, followed by the heap, and lots of unused pages.
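If you really do need the address in an integer variable rather than as text, the type intended for that is std::uintptr_t (or std::intptr_t) from <cstdint>, which, where the implementation provides it, is wide enough to hold any object pointer. A small sketch:

#include <cstdint>
#include <iostream>

int main()
{
    int x = 0;
    int *p = &x;

    // uintptr_t is an unsigned integer type big enough to round-trip a pointer.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);

    std::cout << std::hex << addr << '\n';   // the address as a hex number
    std::cout << std::dec << addr << '\n';   // the same value in decimal
    return 0;
}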
There is a question addressing a similar topic:
https://stackoverflow.com/a/2369593/1010666
Summary: do not try to store pointers in non-pointer variables.
If you need to print out the pointer value, there are other solutions.