Understanding ARM assembly instructions and C/C++ pointers - c++

I am trying to decode an assembly instruction stored at an address, a 16-bit ARM Thumb instruction. So I don't think I should care about the data type, because I'm only interested in the 16 bits stored there. I have a separate interpreter to make sense of those bits; I don't want to use them as data anyway.
If I have a pointer p and I want to read 4 bytes (i.e., the data from address p to p+3), will casting p to int * and dereferencing give me the data?

You have a pointer to some type. Pointer arithmetic and dereferencing honor that data type.
Please note, you can only access the stored value of any variable (object) through an lvalue expression that has either a compatible type or a character type. Blindly casting a pointer to a different, non-compatible type and attempting to dereference it violates the strict aliasing rule, and you'll face undefined behavior (see the sketch after the quote below).
Quoting C11, chapter §6.5
An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:88)
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the
object,
— a type that is the signed or unsigned type corresponding to a qualified version of the
effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
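For instance, here is a minimal sketch of exactly the kind of access this rule forbids (the cast itself compiles, but the read is undefined behavior):
int main() {
    float f = 1.0f;
    int *ip = reinterpret_cast<int *>(&f); // int is not compatible with float
    int bits = *ip;  // undefined behavior: reads a float object through an int lvalue
    (void)bits;
}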
You can, however, always use a char * (or unsigned char *) to point to any type, then dereference and increment (and repeat) to get the individual byte values, but you need to take care of endianness yourself; a sketch follows the quote below.
Related, quoting C11, chapter §6.3.2.3
[....] When a pointer to an object is converted to a pointer to a character type,
the result points to the lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining bytes of the object.
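Applied to the original question, here is a minimal sketch that reads one 16-bit Thumb halfword byte by byte, assuming little-endian instruction memory (the usual case on ARM); assembling the value from individual bytes avoids both strict-aliasing and alignment problems:
#include <cstdint>

std::uint16_t read_halfword(const void *p) {
    const unsigned char *bytes = static_cast<const unsigned char *>(p);
    // Combine the two bytes, least significant first (little-endian).
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}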

Related

Is converting an integer to a pointer always well defined?

Is this valid C++?
int main() {
    int *p;
    p = reinterpret_cast<int*>(42);
}
Assuming I never dereference p.
Looking up the C++ standard, we have
C++17 §6.9.2/3 [basic.compound]
3 Every value of pointer type is one of the following:
— a pointer to an object or function (the pointer is said to point to the object or function), or
— a pointer past the end of an object ([expr.add]), or
— the null pointer value ([conv.ptr]) for that type, or
— an invalid pointer value.
A value of a pointer type that is a pointer to or past the end of an
object represents the address of the first byte in memory
([intro.memory]) occupied by the object or the first byte in memory
after the end of the storage occupied by the object, respectively. [
Note: A pointer past the end of an object ([expr.add]) is not
considered to point to an unrelated object of the object's type that
might be located at that address. A pointer value becomes invalid when
the storage it denotes reaches the end of its storage duration; see
[basic.stc]. — end note ] For purposes of pointer arithmetic
([expr.add]) and comparison ([expr.rel], [expr.eq]), a pointer past
the end of the last element of an array x of n elements is considered
to be equivalent to a pointer to a hypothetical array element n of x
and an object of type T that is not an array element is considered to
belong to an array with one element of type T.
p = reinterpret_cast<int*>(42); does not fit into the list of possible values. And:
C++17 §8.2.10/5 [expr.reinterpret.cast]
A value of integral type or enumeration type can be explicitly
converted to a pointer. A pointer converted to an integer of
sufficient size (if any such exists on the implementation) and back to
the same pointer type will have its original value; mappings between
pointers and integers are otherwise implementation-defined. [ Note:
Except as described in 6.7.4.3, the result of such a conversion will
not be a safely-derived pointer value. — end note ]
The C++ standard does not seem to say more about the integer-to-pointer conversion. Looking up the C17 standard:
C17 §6.3.2.3/5 (emphasis mine)
An integer may be converted to any pointer type. Except as
previously specified, the result is implementation-defined, might not
be correctly aligned, might not point to an entity of the referenced
type, and might be a trap representation.68)
and
C17 §6.2.6.1/5
Certain object representations need not represent a value of the
object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does not have
character type, the behavior is undefined. If such a representation is
produced by a side effect that modifies all or any part of the object
by an lvalue expression that does not have character type, the
behavior is undefined.50) Such a representation is called a trap
representation.
To me, it seems like any value that does not fit into the list in [basic.compound] is a trap representation, thus p = reinterpret_cast<int*>(42); is UB. Am I correct? Is there something else making p = reinterpret_cast<int*>(42); undefined?
This is not UB but implementation-defined, and you already cited why (§8.2.10/5 [expr.reinterpret.cast]). If a pointer has an invalid pointer value, it doesn't necessarily mean that it has a trap representation. It can have a trap representation, but then the compiler must document this. All you have here is a pointer that is not safely derived.
Note that we generate pointers with invalid pointer values all the time: if an object is freed by delete, all the pointers which pointed to this object acquire an invalid pointer value.
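For example:
int *p = new int(5);
int *q = p;   // q points to the same object
delete p;
// p and q now hold invalid pointer values even though their bit patterns
// did not change; dereferencing them is undefined, and even reading their
// values is implementation-defined (see the quote below).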
Using the resulting pointer is implementation defined as well (not UB):
[...] if the object to which the glvalue refers contains an invalid pointer value ([basic.stc.dynamic.deallocation], [basic.stc.dynamic.safety]), the behavior is implementation-defined.
The example shown is valid C++. On some platforms this is how you access "hardware resources" (and if it's not valid, you have found a bug/mistake in the standard text).
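For example, memory-mapped I/O is commonly written like this (a minimal sketch; the register address 0x40021000 is made up for illustration):
#include <cstdint>

void enable_peripheral() {
    // Treat a fixed, platform-specific address as a hardware register.
    volatile std::uint32_t *reg =
        reinterpret_cast<volatile std::uint32_t *>(0x40021000);
    *reg |= 1u; // set a bit in the (hypothetical) register
}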
See also this answer for a better explanation.
Update:
The first sentence of the reinterpret_cast paragraph, as you quoted yourself:
A value of integral type or enumeration type can be explicitly converted to a pointer.
I recommend you stop reading and rest your case at this point. The rest is just a lot of detail, including possible implementation-specified behavior, etc. None of that makes it UB/invalid.
Trap Representations
What: As covered by [C17 §6.2.6.1/5], a trap representation is a non-value. It is a bit pattern that fills the space allocated for an object of a given type, but this pattern does not correspond to a value of that type. It is a special pattern that can be recognized for the purpose of triggering behavior defined by the implementation. That is, the behavior is not covered by the standard, which means it falls under the banner of "undefined behavior". The standard sets out the possibilities for when a trap could be (not must be) triggered, but it makes no attempt to limit what a trap might do. For more information, see A: trap representation.
The undefined behavior associated with a trap representation is interesting in that an implementation has to check for it. The more common cases of undefined behavior were left undefined so that implementations do not need to check for them. The need to check for trap representations is a good reason to want few trap representations in an efficient implementation.
Who: The decision of which bit patterns (if any) constitute trap representations falls to the implementation. The standards do not force the existence of trap representations; when trap representations are mentioned, the wording is permissive, as in "might be", as opposed to demanding, as in "shall be". Trap representations are allowed, not required. In fact, N2091 came to the conclusion that trap representations are largely unused in practice, leading up to a proposal to remove them from the C standard. (It also proposes a backup plan if removal proves infeasible: explicitly call out that implementations must document which representations are trap representations, as there is no other way to know for sure whether or not a given bit pattern is a trap representation.)
Why: Theoretically, a trap representation could be used as a debugging aid. For example, an implementation could declare that 0xDDDD is a trap representation for pointer types, then choose to initialize all otherwise uninitialized pointers to this bit pattern. Reading this bit pattern could trigger a trap that alerts the programmer to the use of an uninitialized pointer. (Without the trap, a crash might not occur until later, complicating the debugging process. Sometimes early detection is the key.) In any event, a trap representation requires a trap of some sort to serve a purpose. An implementation would not define a trap representation without also defining its trap.
My point is that trap representations must be specified. They are deliberately removed from the set of values of a given type. They are not simply "everything else".
Pointer Values
C++17 §6.9.2/3 [basic.compound]
This section defines what an invalid pointer value is. It states "Every value of pointer type is one of the following" before listing four possibilities. That means that if you have a pointer value, then it is one of the four possibilities. The first three are fully specified (pointer to object or function, pointer past the end, and null pointer). The last possibility (invalid pointer value) is not fully specified elsewhere, so it becomes the catch-all "everything else" entry in the list (it is a "wild card", to borrow terminology from the comments). Hence this section defines "invalid pointer value" to mean a pointer value that does not point to something, does not point to the end of something, and is not null. If you have a pointer value that does not fit one of those three categories, then it is invalid.
In particular, if we agree that reinterpret_cast<int*>(42) does not point to something, does not point to the end of something, and is not null, then we must conclude that it is an invalid pointer value. (Admittedly, one could assume that the result of the cast is a trap representation for pointers in some implementation. In that case, yes, it does not fit into the list of possible pointer values because it would not be a pointer value, hence it's a trap representation. However, that is circular logic. Furthermore, based upon N2091, few implementations define any trap representations for pointers, so the assumption is likely groundless.)
[ Note: [...] A pointer value becomes invalid when the storage it denotes reaches the end of its storage duration; see [basic.stc]. — end note ]
I should first acknowledge that this is a note. It explains and clarifies without adding new substance. One should expect no definitions in a note.
This note gives an example of an invalid pointer value. It clarifies that a pointer can (perhaps surprisingly) change from "points to an object" to "invalid pointer value" without changing its value. Looking at this from a formal logic perspective, this note is an implication: "if [something] then [invalid pointer]". Viewing this as a definition of "invalid pointer" is a fallacy; it is merely an example of one of the ways one can get an invalid pointer.
Casting
C++17 §8.2.10/5 [expr.reinterpret.cast]
A value of integral type or enumeration type can be explicitly converted to a pointer.
This explicitly permits reinterpret_cast<int*>(42). Therefore, the behavior is defined.
To be thorough, one should make sure there is nothing in the standard that makes 42 "erroneous data" to the degree that undefined behavior results from the cast. The rest of [§8.2.10/5] does not do this, and:
C++ standard does not seem to say more about the integer to pointer conversion.
Is this valid C++?
Yes.

Can a reinterpret_cast change the object representation?

My mental model for a reinterpret_cast has always been to treat the sequence of bits of an expression as if they were of a different type, and cppreference (note: this is not a quote from the C++ Standard) seems to agree with that:
Unlike static_cast, but like const_cast, the reinterpret_cast expression does not compile to any CPU instructions. It is purely a compiler directive which instructs the compiler to treat the sequence of bits (object representation) of expression as if it had the type new_type.
Looking for guarantees, I stumbled across a note under [expr.reinterpret.cast]:
[ Note: The mapping performed by reinterpret_­cast might, or might not, produce a representation different from the original value. — end note ]
That left me wondering: Under which conditions does a reinterpret_cast produce a value with object representation different from the original value?
Here's an example. Reading the 4th bullet point:
A pointer can be explicitly converted to any integral type large enough to hold all values of its type. The mapping function is implementation-defined. [ Note: It is intended to be unsurprising to those who know the addressing structure of the underlying machine. — end note ]
Now, it is implementation-defined what value i will have here:
#include <cstdint>

int obj;
void *ptr = &obj;  // some valid pointer value
std::uintptr_t i = reinterpret_cast<std::uintptr_t>(ptr);
It can be anything, provided that reinterpret_casting i back yields ptr.
The representations of ptr and i could differ. The standard just says that the value of i should be "unsurprising". Even if we reinterpret_cast ptr to a wider integer (for example, casting a 32-bit pointer to unsigned long long int), the representations must differ, because the sizes of the variables differ.
So I think the cppreference description is misleading, because there can be reinterpret_casts which actually need CPU instructions.
Here is another case (found by IInspectable), a comment by Keith Thompson:
The C compiler for Cray vector machines, such as the T90, do something similar. Hardware addresses are 8 bytes, and point to 8-byte words. void* and char* are handled in software, and are augmented with a 3-bit offset within the word -- but since there isn't actually a 64-bit address space, the offset is stored in the high-order 3 bits of the 64-bit word. So char* and int* are the same size, but have different internal representations -- and code that assumes that pointers are "really" just integers can fail badly.
char * and int * have different representations on the Cray T90, so:
int x = 0;
int *i = &x;  // some int pointer value
char *c = reinterpret_cast<char *>(i);
Here, i and c will have differing representations on the Cray T90 (and doing this conversion definitely uses CPU instructions).
(I've verified this in chapter 3.1.2.7.1 of the Cray C/C++ Reference Manual SR–2179 2.0.)
You are correct that reinterpret_cast does not change the bit values; however, that doesn't mean the resulting value doesn't change.
One simple example would be casting a 32-bit integral type to a char[4] with each element representing one octet of an IPv4 address.
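A minimal sketch of that idea, inspecting the bytes of a 32-bit value (the octet order you observe depends on the platform's endianness):
#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t ip = 0xC0A80001; // 192.168.0.1 packed into one 32-bit value
    // Reading the object representation through unsigned char* is always allowed.
    const unsigned char *octets = reinterpret_cast<const unsigned char *>(&ip);
    // Prints 1.0.168.192 on a little-endian machine, 192.168.0.1 on a big-endian one.
    std::printf("%d.%d.%d.%d\n", octets[0], octets[1], octets[2], octets[3]);
}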

Why are C++ array index values signed and not built around the size_t type (or am I wrong in that)?

It's getting harder and harder for me to keep track of the ever-evolving C++ standard but one thing that seems clear to me now is that array index values are meant to be integers (not long long or size_t or some other seemingly more appropriate choice for a size). I've surmised this both from the answer to this question (Type of array index in C++) and also from practices used by well established C++ libraries (like Qt) which also use a simple integer for sizes and array index operators. The nail in the coffin for me is that I am now getting a plethora of compiler warnings from MSVC 2017 stating that my const unsigned long long (aka const size_t) variables are being implicitly converted to type const int when used as an array index.
The answer given by Mat in the question linked above quotes the ISO C++ standard draft n3290 as saying
it shall be an integral constant expression and its value shall be greater than zero.
I have no background in reading these specs and precisely interpreting their language, so maybe a few points of clarification:
Does an "integral constant expression" specifically forbid things like long long which to me is an integral type, just a larger sized one?
Does what they're saying specifically forbid a type that is tagged unsigned like size_t?
If all I am seeing here is true, and array index values are meant to be signed int types, why? This seems counter-intuitive to me. The specs even state that the expression "shall be greater than zero", so we're wasting a bit if it is signed. Sure, we still might want to compare the index with 0 in some way, and this is dangerous with unsigned types, but there should be cheaper ways to solve that problem that waste only a single value, not an entire bit.
Also, with registers ever widening, a more future-proof solution would be to allow larger types for the index (like long long) rather than sticking with int, which is a historically problematic type anyway (it changed size when processors moved to 32 bits, but not when they moved to 64 bits). I even see some people talking about size_t anecdotally as if it were designed to be a more future-proof type for use with sizes (and not JUST the type returned in service of the sizeof operator). But of course, that might be apocryphal.
I just want to make sure my foundational programming understanding here is not flawed. When I see experts like the ISO C++ group doing something, or the engineers of Qt, I give them the benefit of the doubt that they have a good reason! For something like an array index, so fundamental to programming, I feel like I need to know what that reason is or I might be missing something important.
Looking at [expr.sub]/1 we have
A postfix expression followed by an expression in square brackets is a postfix expression. One of the expressions shall be a glvalue of type “array of T” or a prvalue of type “pointer to T” and the other shall be a prvalue of unscoped enumeration or integral type. The result is of type “T”. The type “T” shall be a completely-defined object type.67 The expression E1[E2] is identical (by definition) to *((E1)+(E2)), except that in the case of an array operand, the result is an lvalue if that operand is an lvalue and an xvalue otherwise. The expression E1 is sequenced before the expression E2.
emphasis mine
So the index of the subscript operator needs to be an unscoped enumeration or integral type. Looking in [basic.fundamental], we see that the standard integer types are signed char, short int, int, long int, and long long int, and their unsigned counterparts.
So any of the standard integer types will work, and any other integer type, like size_t, is a valid type to use as an array index. The value supplied to the subscript operator can even be negative, so long as it would access a valid element.
I would argue that the standard library API prefers indexes to be an unsigned type. If you look at the documentation for std::size_t, it notes
When indexing C++ containers, such as std::string, std::vector, etc, the appropriate type is the member typedef size_type provided by such containers. It is usually defined as a synonym for std::size_t.
This is reinforced when looking at signatures for functions such as std::vector::at
reference at( size_type pos );
const_reference at( size_type pos ) const;
I think you are confusing two types:
The first is the type of object/value that can be used to define the size of an array. Unfortunately, the question that you link to uses index where it should have used array size. This must be an expression that can be evaluated at compile time, and its value must be greater than zero.
int array[SomeExpression]; // Valid as long as SomeExpression can be evaluated
// at compile time and the value is greater than zero.
The second type is the type of object/value that can be used to access an array. Given the above array,
array[i] = SomeValue; // i is an index to access the array
i does not need to be evaluated at compile time; i must be in the range [0, SomeExpression-1]. However, it is possible to use negative values as the index to access an array. Since array[i] is evaluated as *(array+i) (ignoring for the time being overloaded operator[] functions), i can be a negative value if array happens to point into the middle of an array. My answer to another SO post has more information on the subject.
Just as an aside, since array[i] is evaluated as *(array+i), it is legal to use i[array] and is the same as array[i].
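A quick illustration of both points:
int arr[5] = {10, 11, 12, 13, 14};
int *mid = &arr[2];
int a = mid[-1]; // fine: *(mid - 1) is arr[1], i.e. 11
int b = 2[arr];  // legal: same as *(2 + arr), i.e. arr[2] == 12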

Using reinterpret_cast to convert integer to pointer and back to integer [duplicate]

According to http://en.cppreference.com/w/cpp/language/reinterpret_cast, it is known that reinterpret_casting a pointer to an integral of sufficient size and back yields the same value. I'm wondering whether the converse is also true by the standard. That is, does reinterpret_casting an integral to a pointer type of sufficient size and back yield the same value?
No, that is not guaranteed by the standard. Quoting all parts of C++14 (n4140) [expr.reinterpret.cast] which concern pointer–integer conversions, emphasis mine:
4 A pointer can be explicitly converted to any integral type large enough to hold it. The mapping function is
implementation-defined. [ Note: It is intended to be unsurprising to those who know the addressing structure
of the underlying machine. —end note ] ...
5 A value of integral type or enumeration type can be explicitly converted to a pointer. A pointer converted
to an integer of sufficient size (if any such exists on the implementation) and back to the same pointer type
will have its original value; mappings between pointers and integers are otherwise implementation-defined.
[ Note: Except as described in 3.7.4.3, the result of such a conversion will not be a safely-derived pointer
value. —end note ]
So starting with an integral value, converting it to a pointer, and converting it back (assuming no size issues) is implementation-defined, which means you must consult your compiler's documentation to learn whether such a round trip preserves values or not. As such, it is certainly not portable.
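A minimal sketch contrasting the two directions:
#include <cstdint>

int x = 0;
int *p = &x;

// Guaranteed: pointer -> sufficiently large integer -> pointer
// restores the original pointer value.
std::uintptr_t n = reinterpret_cast<std::uintptr_t>(p);
int *q = reinterpret_cast<int *>(n); // q == p

// Only implementation-defined: integer -> pointer -> integer.
std::uintptr_t m = 42;
int *r = reinterpret_cast<int *>(m);
std::uintptr_t m2 = reinterpret_cast<std::uintptr_t>(r); // m2 == m only if the implementation says so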
I ran into exactly this problem in a library exporting pointers to objects as opaque identifiers: attempting to recover these pointers from external calls didn't work on old x86 CPUs (back in the Windows 98 era). So, while we can expect that behaviour, it is false in the general case. On 386-class CPUs an address is composed of overlapping segment:offset pairs, so the address of any memory position is not unique, and I found that the conversion back didn't recover the original value.

What does it mean to reinterpret_cast a pointer as long?

I see this type of code a lot in a project I work on:
reinterpret_cast<long>(somePointer)
and I don't understand the point of this. It is usually used for user-defined classes; that is, somePointer is a pointer to an instance of a user-defined class.
Thanks for any help
It's used to convert the address held by the pointer to its numeric representation and store it as a long.
It's typically used when you need to store a pointer in a context where an actual pointer isn't supported. For example, some APIs allow you to pass a numeric key which they will give back to you in any callback from that API. You can create an object, cast its pointer to a number, and give that to the API. Then, in the callback, you cast the number back to a pointer with pointer = reinterpret_cast<type*>(number) and access your data.
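A minimal sketch of that pattern against a hypothetical C-style API (api_register and its long key parameter are made up for illustration):
// Hypothetical API: it can only carry a numeric key, not a pointer.
using Callback = void (*)(long key);
void api_register(Callback cb, long key); // assumed to exist

struct Widget { int id; };

void on_event(long key) {
    // Recover the original pointer from the numeric key.
    Widget *w = reinterpret_cast<Widget *>(key);
    // ... use w ...
}

void setup(Widget *w) {
    // Hand the pointer to the API as a number.
    api_register(on_event, reinterpret_cast<long>(w));
}
As the answers below note, intptr_t or uintptr_t is a safer choice than long where pointers may be wider than long.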
A pointer is actually nothing more than an integer value; it's just that the compiler interprets it as a memory address.
This means that you can store it in an other integer type, if that other integer type is big enough to hold the address (for example, you can't store a 64-bit address in a 16-bit integer variable).
But since C++ considers pointers and normal integer values to be distinct and different types, you can't just do a normal cast; you have to ask the compiler to reinterpret the pointer as a long, which is what reinterpret_cast does.
reinterpret_cast<long>(somePointer)
This piece of code casts the address held by somePointer to a long. In C++11 it's better to use intptr_t or uintptr_t for the signed/unsigned integral type here. They are capable of holding a value converted from a void pointer and of being converted back to that type, yielding a value that compares equal to the original pointer.
It is, as others have said, converting the pointer to an integer value. This is somewhat "dangerous", as a pointer may not actually fit in a long. There is a type in cstdint that is called intptr_t, which is intended for this purpose.
The reinterpret_cast<new_type>(value) construct really means "take the 'bits' of value and convert them so that they can be used as new_type", and the compiler will just do what you ask of it, whether it makes much sense or not. So it's up to YOU as the programmer to make sure you only do this in a way that actually does something useful. If you happen to cut off 32 bits from a 64-bit pointer and then try to make it into a pointer again later on, it's YOUR fault if the pointer "isn't any good" afterwards. reinterpret_cast should really only be used as a last resort; there are often other methods to do the same thing.
reinterpret_cast<x>(exp)
reinterprets the underlying bit pattern, returning a value of type x
It instructs the compiler to treat the sequence of bits of exp as if it had the type x
With reinterpret_cast you can cast a pointer to a different type, and then reinterpret_cast it back to the original type to get the original value.
Like Sean said, it's useful for manipulating pointers as if they were numbers.
Here's an example from the MSDN website that uses it to implement a basic hash function (note that this uses an unsigned int rather than a long, but the principle is the same)
unsigned short Hash( void *p ) {
    unsigned int val = reinterpret_cast<unsigned int>( p );
    return ( unsigned short )( val ^ (val >> 16) );
}
This takes an arbitrary pointer, and produces an (effectively) unique hash value for that pointer - by treating the pointer as an unsigned int, XORing the integral value with a bit-shifted version of itself, and then truncating the result into an unsigned short.
However, like everyone else has said, this is very dangerous and should be avoided.