Background: I have a scenario in which I must allow comparison between two functor objects, using a unique ID to test if they're equal (I can't simply check if their addresses are the same, as the function pointers themselves aren't stored in the object). Initially, I had this idea, to simply start the id generator at 0 and increment ad infinitum:
struct GenerateUniqueID {
    static inline std::size_t id_count = 0; // C++17 inline variable; otherwise define it out of class
    auto operator()() -> std::size_t { return id_count++; }
};
...However, as I have literally thousands upon thousands of these objects created every few seconds, I actually managed to run into the case of id_count overflowing back to 0! The results were... unpleasant. Now, the second idea I had was that, since these functors are, obviously, wrappers around a function, I could perform the comparison by converting the address of the function pointer into a 64-bit integer, and storing that in the class for comparison. See:
// pseudocode
struct Functor {
    std::uint64_t id;
    auto generate_id_from_function_address(void (*f)()) -> void {
        id = reinterpret_cast<std::uint64_t>(f); // the function's own address, not &f
    }
};
Now, my concern here is simple: is casting function pointers to 64-bit integers ill-behaved/undefined? On 32-bit architectures? On 64-bit architectures? On both? My main concern here is with virtual functions, as I know that for inline functions the compiler simply creates a non-inlined version, so there's no issue there.
Converting a regular object pointer (let alone a function pointer) to uint64_t is implementation-defined, and it won't even compile if pointers are wider than 64 bits. The conversion is guaranteed to be valid if you use uintptr_t (provided that type exists).
Converting a function pointer to any integer type is implementation-defined (even if you use uintptr_t), because function pointers may be wider than regular pointers. Some other standards like POSIX explicitly allow this, so under POSIX it is safe to cast function pointers to data pointers like void* and to uintptr_t.
(Converting a pointer-to-member to an integer, data pointer, or regular function pointer is undefined, and in practice likely to always fail since they're bigger than regular pointers.)
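For what it's worth, here is a minimal sketch of the cast under discussion, assuming uintptr_t exists and that the implementation (e.g. a POSIX one) supports converting function pointers this way; callback_t and id_from_function are my names:

#include <cstdint>

using callback_t = void (*)();               // hypothetical function-pointer type

std::uintptr_t id_from_function(callback_t f) {
    // Implementation-defined mapping: POSIX promises this is meaningful,
    // ISO C++ on its own does not.
    return reinterpret_cast<std::uintptr_t>(f);
}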
However, it may be simpler to just use uint64_t instead of size_t for your unique IDs. It is basically impossible to overflow a uint64_t by incrementing it repeatedly, due to its enormous range.
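Concretely, a minimal sketch of that suggestion (next_unique_id is my name; the std::atomic is an addition in case IDs are handed out from several threads):

#include <atomic>
#include <cstdint>

// A 64-bit counter: even at a billion increments per second it would take
// roughly 585 years to wrap around.
inline std::uint64_t next_unique_id() {
    static std::atomic<std::uint64_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed);
}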
Related
Here's a little puzzle I couldn't find a good answer for:
Given a struct with bitfields, such as
struct A {
    unsigned foo:13;
    unsigned bar:19;
};
Is there a (portable) way in C++ to get the correct mask for one of the bitfields, preferably as a compile-time constant function or template?
Something like this:
constinit unsigned mask = getmask<A::bar>(); // mask should be 0xFFFFE000
In theory, at runtime, I could crudely do:
unsigned getmask_bar() {
    union AA {
        unsigned mask;
        A fields;
    } aa{};
    aa.fields.bar -= 1;
    return aa.mask;
}
That could even be wrapped in a macro (yuck!) to make it "generic".
But I guess you can readily see the various deficiencies of this method.
Is there a nicer, generic C++ way of doing it? Or even a not-so-nice way? Is there something useful coming up for the next C++ standard(s)? Reflection?
Edit: Let me add that I am trying to find a way of making bitfield manipulation more flexible, so that it is up to the programmer to modify multiple fields at the same time using masking. I am after terse notation, so that things can be expressed concisely without lots of boilerplate. Think working with hardware registers in I/O drivers as a use case.
Unfortunately, there is no better way - in fact, there is no way to extract individual adjacent bit fields from a struct by inspecting its memory directly in C++.
From Cppreference:
The following properties of bit-fields are implementation-defined:
The value that results from assigning or initializing a signed bit-field with a value out of range, or from incrementing a signed bit-field past its range.
Everything about the actual allocation details of bit-fields within the class object
For example, on some platforms, bit-fields don't straddle bytes, on others they do
Also, on some platforms, bit-fields are packed left-to-right, on others right-to-left
Your compiler might give you stronger guarantees; however, if you do rely on the behavior of a specific compiler, you can't expect your code to work with a different compiler/architecture pair. GCC doesn't even document their bit field packing, as far as I can tell, and it differs from one architecture to the next. So your code might work on a specific version of GCC on x86-64 but break on literally everything else, including other versions of the same compiler.
If you really want to be able to extract bitfields from a random structure in a generic way, your best bet is to pass a function pointer around (instead of a mask); that way, the function can access the field in a safe way and return the value to its caller (or set a value instead).
Something like this:
template<typename T>
auto extractThatBitField(const void *ptr) {
    return static_cast<const T *>(ptr)->m_thatBitField;
}

auto *extractor1 = &extractThatBitField<Type1>;
auto *extractor2 = &extractThatBitField<Type2>;
/* ... */
Now, if you have a pair of {pointer, extractor}, you can get the value of the bitfield safely. (Of course, the extractor function has to match the type of the object behind that pointer.) It's not much overhead compared to having a {pointer, mask} pair instead; the function pointer is maybe 4 bytes larger than the mask on a 64-bit machine (if at all). The extractor function itself will just be a memory load, some bit twiddling, and a return instruction. It'll still be super fast.
This is portable and supported by the C++ standard, unlike inspecting the bits of a bitfield directly.
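A quick usage sketch (SomeType and its m_thatBitField member are hypothetical stand-ins for whatever struct you pair the extractor with):

struct SomeType { int m_thatBitField : 7; }; // hypothetical example type

int readIt() {
    SomeType obj{};
    const void *erased = &obj;
    auto *extractor = &extractThatBitField<SomeType>;
    return extractor(erased);                // safely reads obj.m_thatBitField
}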
Alternatively, C++ gives some guarantees about standard-layout structs that share a common initial sequence of members, which lets you read the shared fields through a common "prefix" struct. (Though keep in mind that this falls apart as soon as inheritance or private/protected members get involved! The first solution, above, works for all those cases as well.)
struct Common {
    int m_a : 13;
    int m_b : 19;
    int : 0; //Needed to ensure the bit fields end on a byte boundary
};

struct Type1 {
    int m_a : 13;
    int m_b : 19;
    int : 0;

    Whatever m_whatever;
};

struct Type2 {
    int m_a : 13;
    int m_b : 19;
    int : 0;

    Something m_something;
};
#include <cstring> // for std::memcpy

int getFieldA(const void *ptr) {
    // We still can't do type punning directly due to weirdness in various
    // compilers' aliasing resolution. std::memcpy is the official way to do
    // type punning; this won't compile to an actual memcpy call.
    Common tmp;
    std::memcpy(&tmp, ptr, sizeof(Common));
    return tmp.m_a;
}
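And a quick usage sketch, reusing the Type1 placeholder declared above:

void example() {
    Type1 t1{};
    t1.m_a = 42;
    int a = getFieldA(&t1); // copies the common prefix into a Common and reads m_a
    (void)a;
}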
See also: Can memcpy be used for type punning?
I know legacy is always a justification, but I wanted to check out this example from MariaDB and see if I understand it enough to critique what's going on,
static int show_open_tables(THD *, SHOW_VAR *var, char *buff) {
    var->type = SHOW_LONG;
    var->value = buff;
    *((long *)buff) = (long)table_cache_manager.cached_tables();
    return 0;
}
Here they take in a char* and assign it to var->value, which is also a char*. Then they cast buff to long* and write a long through it, setting the type to SHOW_LONG to indicate as much.
I'm wondering why they would use a char* for this rather than a uintptr_t -- especially since they're forcing pointers to longs and other types into it.
Wasn't the norm pre-uintptr_t to use void* for polymorphism in C++?
There seem to be two questions here, so I've split my answer up.
Using char*
Using a char* is fine. Character types (char, signed char, and unsigned char) are specially treated by the C and C++ standards. The C standard defines the following rules for accessing an object:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
a type compatible with the effective type of the object,
a qualified version of a type compatible with the effective type of the object,
a type that is the signed or unsigned type corresponding to the effective type of the object,
a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
a character type.
This effectively means character types are the closest the standards come to defining a 'byte' type (std::byte in C++17 is just defined as enum class byte : unsigned char {}).
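For illustration, a minimal sketch of what the character-type exception permits (the variable names are mine):

long counter = 42;
unsigned char *bytes = reinterpret_cast<unsigned char *>(&counter);
// Inspecting or copying bytes[0] .. bytes[sizeof counter - 1] is allowed,
// because the access goes through a character type.
unsigned char first_byte = bytes[0]; // which byte of the value this is depends on endianness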
However, as per the above rules, casting a char* to a long* and then assigning through it is incorrect (although it generally works in practice). memcpy should be used instead. For example:
long cached_tables = table_cache_manager.cached_tables();
memcpy(buf, &cached_tables, sizeof(cached_tables));
void* would also be a legitimate choice. Whether it is better is a matter of opinion. I would say the clearest option would be to add a type alias for char to convey the intent to use it as a byte type (e.g. typedef char byte_t). Off the top of my head, though, I can think of several examples of prominent libraries which use char as-is as a byte type. For example, the Boost memory-mapped file code gives you a char*, and leveldb uses std::string as a byte buffer type (presumably to take advantage of SSO).
Regarding uintptr_t:
uintptr_t is an optional type defined as an unsigned integer capable of holding a pointer. If you want to store the address of a pointed-to object in an integer, then it is a suitable type to use. It is not a suitable type to use here.
they take in a char* and assign it to var->value, which is also a char*. Then they cast buff to long* and write a long through it, setting the type to SHOW_LONG to indicate as much.
Or something. That code is hideous.
I'm wondering why they would use a char* for this rather than a uintptr_t -- especially since they're forcing pointers to longs and other types into it.
Who knows? Who knows what the guy was on when he wrote it? Who cares? That code is hideous, we certainly shouldn't be trying to learn from it.
Wasn't the norm pre-uintptr_t to use void* for polymorphism in C++?
Yes, and it still is. The purpose of uintptr_t is to define an integer type that is big enough to hold a pointer.
I wanted to check out this example from MariaDB and see if I understand it enough to critique what's going on
You might have reservations about doing so, but I certainly don't: that API is just a blatant lie. The way to do it (if you absolutely have to) would (obviously) be:
static int show_open_tables(THD *, SHOW_VAR *var, long *buff) {
    var->type = SHOW_LONG;
    var->value = (char *) buff;
    *buff = (long)table_cache_manager.cached_tables();
    return 0;
}
Then at least it is no longer a ticking time bomb.
Hmmm, OK, maybe (just maybe) that function is used in a dispatch table somewhere and therefore needs (unless you cast it) to have a specific signature. If so, I'm certainly not going to dig through 10,000 lines of code to find out (and anyway, I can't, it's so long it crashes my tablet).
But if anything, that would just make it worse. Now that timebomb has become a stealth bomber. And anyway, I don't believe it's that for a moment. It's just a piece of dangerous nonsense.
Most of the existing run-time memory functions accept or return void*, which enables passing of arguments without explicitly casting the pointer types. Should this pattern be replicated when creating custom memory functions?
In other words, which of the following is more correct, and why:
int read_bytes( void * dest, size_t count );
or
int read_bytes( uint8_t * dest, size_t count );
?
I would recommend using void*. Otherwise, every call to the function will look like:
int n = read_bytes((uint8_t*)&myVar, sizeof(myVar));
instead of just
int n = read_bytes(&myVar, sizeof(myVar));
void* is the general-purpose pointer in C/C++. It is used in cases where you don't want to tie a specific type to the data, and it avoids the need to cast the pointer at the call site. It is also the pointer type you want to use with raw addresses.
You would use uint8_t* where you want to specify that you are really dealing with unsigned 8-bit integers, i.e. with the bytes themselves.
If you treat memory only as raw, opaque memory, and not as a sequence of bytes, then void * is an appropriate type. This may be idiomatic, for example, when using placement-new to create an object in memory. There are also some traditional C APIs that use void pointers for memory references, like memcpy or memchr, so occasionally it can be convenient to use the same type.
On the other hand, if you're thinking of memory as an array of bytes, and especially if you want to access arbitrary bytes in memory (i.e. perform pointer or iterator arithmetic), you should absolutely use a char pointer type. There's a certain debate about which one is best: typically, for I/O you want plain char, the "system's I/O data type"; if you want to operate on arithmetic byte values, unsigned char is more appropriate. The two types have the same size and object representation, though, so feel free to treat one as the other when necessary.
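As a small sketch of the "array of bytes" view (the function and its name are mine):

#include <cstddef>

void zero_fill(void *raw, std::size_t n) {
    // Treat the opaque memory as a sequence of bytes so it can be indexed.
    unsigned char *bytes = static_cast<unsigned char *>(raw);
    for (std::size_t i = 0; i < n; ++i)
        bytes[i] = 0;
}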
Is there a difference between pointer to integer-pointer (int**) and pointer to character-pointer (char**), and any other case of pointer to pointer?
Isn't the memory block size for any pointer the same, so that the pointed-to type doesn't play a role here?
Is it just a semantic distinction with no other significance?
Why not just use void**?
Why should you use void** when you want a pointer to a char *? Why should you not use char **?
With char **, you have type safety. If the pointer is correctly initialized and not null, you know that by dereferencing it once you get a valid char * - and by dereferencing that pointer, in turn, you get a char.
Why should you ignore this advantage in type safety, and instead play pointer Russian roulette with void**?
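A tiny sketch of the type safety in question (the variable names are mine):

char c = 'x';
char *p = &c;
char **pp = &p;
char got = **pp;                          // type-checked all the way down

void *vp = p;
void **vpp = &vp;
// char bad = **vpp;                      // error: void* cannot be dereferenced
char ok = *static_cast<char *>(*vpp);     // a manual cast is required instead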
The difference is type safety. Dereferencing a T** gives you a T* directly, with no cast; a void** has to be cast to the correct type manually first. And no, pointers are not all 4 / 8 bytes on 32- / 64-bit architectures respectively. Member function pointers, for instance, also contain offset information, which needs to be stored in the pointer itself (in the most common implementation).
Most C implementations use the same size and format for all pointers, but this is not required by the C standard.
Some machines do not have byte addressing, so the C implementation implements it by using shifts and other operations. In these implementations, pointers to larger types, such as int, may be normal addresses, but pointers to char would have to have both a machine address and a byte-within-word offset.
Additionally, C makes use of the type information for a variety of purposes, including reducing mistakes made by programmers (possibly giving warnings or errors when you attempt to use a pointer to int where a pointer to float is needed) and optimization. Regarding optimization, consider this example:
void foo(float *array, int *limit)
{
    for (int i = 0; i < *limit; ++i)
        array[i] = <some calculation>;
}
The C standard says a compiler may use the fact that array and limit are pointers to different types to conclude that they do not overlap. Given this rule, the C implementation may evaluate *limit once when the loop starts, because it knows it will not change during the loop. Without this rule, the compiler would have to assume that one of the assignments to array[i] might change *limit, and it would have to load *limit from memory in each iteration.
Is it possible to portably hash a pointer in C++03, which does not have std::hash defined?
It seems really weird for hashables containing pointers to be impossible in C++, but I can't think of any way of making them.
The closest way I can think of is doing reinterpret_cast<uintptr_t>(ptr), but uintptr_t is not required to be defined in C++03, and I'm not sure if the value could be legally manipulated even if it was defined... is this even possible?
No, in general. In fact it's not even possible in general in C++11 without std::hash.
The reason why lies in the difference between values and value representations.
You may recall the very common example used to demonstrate the difference between a value and its representation: the null pointer value. Many people mistakenly assume that the representation for this value is all bits zero. This is not guaranteed in any fashion. You are guaranteed behavior by its value only.
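A small sketch of the point (purely illustrative; the memcmp result depends on the implementation):

#include <cstring>

int *p = 0;                               // the null pointer value
unsigned char zeros[sizeof p] = {};
// Not guaranteed to be true: the null pointer's representation need not be
// all-bits-zero; only its behavior as a value is specified.
bool null_is_all_zero_bits = std::memcmp(&p, zeros, sizeof p) == 0;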
For another example, consider:
int i;
int* x = &i;
int* y = &i;
x == y; // this is true; the two pointer values are equal
Underneath that, though, the value representation for x and y could be different!
Let's play compiler. We'll implement the value representation for pointers. Let's say we need (for hypothetical architecture reasons) the pointers to be at least two bytes, but only one is used for the value.
I'll just jump ahead and say it could be something like this:
struct __pointer_impl
{
    std::uint8_t byte1; // contains the address we're holding
    std::uint8_t byte2; // needed for architecture reasons, unused
    // (assume no padding; we are the compiler, after all)
};
Okay, this is our value representation; now let's implement the value semantics. First, equality:
bool operator==(const __pointer_impl& first, const __pointer_impl& second)
{
    return first.byte1 == second.byte1;
}
Because the pointer's value is really only contained in the first byte (even though its representation has two bytes), that's all we have to compare. The second byte is irrelevant, even if they differ.
We need the address-of operator implementation, of course:
__pointer_impl __address_of(int& i)
{
    __pointer_impl result;
    result.byte1 = /* hypothetical architecture magic */;
    return result;
}
This function, our stand-in for the address-of operator, gets us a pointer value representation for a given int. Note that the second byte is left uninitialized! That's okay: it's not important for the value.
This is really all we need to drive the point home. Pretend the rest of the implementation is done. :)
So now consider our first example again, "compiler-ized":
int i;
/* int* x = &i; */
__pointer_impl x = __address_of(i);
/* int* y = &i; */
__pointer_impl y = __address_of(i);
x == y; // this is true; the two pointer values are equal
For our tiny example on the hypothetical architecture, this sufficiently provides the guarantees required by the standard for pointer values. But note you are never guaranteed that x == y implies memcmp(&x, &y, sizeof(__pointer_impl)) == 0. There simply aren't requirements on the value representation to do so.
Now consider your question: how do we hash pointers? That is, we want to implement:
template <typename T>
struct myhash;

template <typename T>
struct myhash<T*> :
    std::unary_function<T*, std::size_t>
{
    std::size_t operator()(T* const ptr) const
    {
        return /* ??? */;
    }
};
The most important requirement is that if x == y, then myhash()(x) == myhash()(y). We also already know how to hash integers. What can we do?
The only thing we can do is try to somehow convert the pointer to an integer. Well, C++11 gives us std::uintptr_t, so we can do this, right?
return myhash<std::uintptr_t>()(reinterpret_cast<std::uintptr_t>(ptr));
Perhaps surprisingly, this is not correct. To understand why, imagine again we're implementing it:
// okay because we assumed no padding:
typedef std::uint16_t __uintptr_t; // will be used for std::uintptr_t implementation

__uintptr_t __to_integer(const __pointer_impl& ptr)
{
    __uintptr_t result;
    std::memcpy(&result, &ptr, sizeof(__uintptr_t));
    return result;
}

__pointer_impl __from_integer(const __uintptr_t& ptrint)
{
    __pointer_impl result;
    std::memcpy(&result, &ptrint, sizeof(__pointer_impl));
    return result;
}
So when we reinterpret_cast a pointer to integer, we'll use __to_integer, and going back we'll use __from_integer. Note that the resulting integer will have a value depending upon the bits in the value representation of pointers. That is, two equal pointer values could end up with different integer representations...and this is allowed!
This is allowed because the result of reinterpret_cast is totally implementation-defined; you're only guaranteed that the opposite reinterpret_cast gives you back the original value.
So there's the first issue: on this implementation, our hash could end up different for equal pointer values.
This idea is out. Maybe we can reach into the representation itself and hash the bytes together. But this obviously ends up with the same issue, which is what the comments on your question are alluding to. Those pesky unused representation bits are always in the way, and there's no way to figure out where they are so we can ignore them.
We're stuck! It's just not possible. In general.
Remember, in practice we compile for certain implementations, and because the results of these operations are implementation-defined they are reliable if you take care to only use them properly. This is what Mats Petersson is saying: find out the guarantees of the implementation and you'll be fine.
In fact, most consumer platforms you use will handle the std::uintptr_t attempt just fine. If it's not available on your system, or if you want an alternative approach, just combine the hashes of the individual bytes in the pointer. All this requires to work is that the unused representation bits always take on the same value. In fact, this is the approach MSVC2012 uses!
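A hedged sketch of that byte-combining fallback (the function name and the FNV-style constants are my choice; it relies on the assumption just stated about unused representation bits):

#include <cstddef>
#include <cstring>

std::size_t hash_pointer_bytes(const void* p)
{
    unsigned char bytes[sizeof p];
    std::memcpy(bytes, &p, sizeof p);      // copy the pointer's object representation

    std::size_t h = 2166136261u;           // FNV-1a offset basis (32-bit)
    for (std::size_t i = 0; i < sizeof p; ++i) {
        h ^= bytes[i];
        h *= 16777619u;                    // FNV-1a prime (32-bit)
    }
    return h;
}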
Had our hypothetical pointer implementation simply always initialized byte2 to a constant, it would work there as well. But there just isn't any requirement for implementations to do so.
Hope this clarifies a few things.
The answer to your question really depends on how portable you want it to be. Many architectures will have a uintptr_t, but if you want something that can compile on DSPs, Linux, Windows, AIX, old Cray machines, IBM 390-series machines, etc., then you may have to have a config option where you define your own "uintptr_t" when it doesn't exist on that architecture.
Casting a pointer to an integer type should be fine. If you were to cast it back, you may be in trouble. Of course, if you have MANY pointers and you allocate fairly large sections of memory on a 64-bit machine, there is a chance of lots of collisions if you use a 32-bit integer. Note that 64-bit Windows still has long as a 32-bit type.
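A rough sketch of such a config option (HAVE_UINTPTR_T is a hypothetical macro your build system would define):

// Hypothetical configuration shim: fall back to a hand-picked integer type
// on platforms where <stdint.h> does not provide uintptr_t.
#if defined(HAVE_UINTPTR_T)
    #include <stdint.h>
    typedef uintptr_t my_uintptr_t;
#else
    typedef unsigned long my_uintptr_t;   // chosen per platform in the build config
#endif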