Is it possible to hash pointers in portable C++03 code?

Is it possible to portably hash a pointer in C++03, which does not have std::hash defined?
It seems really weird for hashables containing pointers to be impossible in C++, but I can't think of any way of making them.
The closest way I can think of is doing reinterpret_cast<uintptr_t>(ptr), but uintptr_t is not required to be defined in C++03, and I'm not sure if the value could be legally manipulated even if it was defined... is this even possible?

No, in general. In fact it's not even possible in general in C++11 without std::hash.
The reason why lies in the difference between values and value representations.
You may recall the very common example used to demonstrate the difference between a value and its representation: the null pointer value. Many people mistakenly assume that the representation for this value is all bits zero. This is not guaranteed in any fashion. You are guaranteed behavior by its value only.
For another example, consider:
int i;
int* x = &i;
int* y = &i;
x == y; // this is true; the two pointer values are equal
Underneath that, though, the value representation for x and y could be different!
Let's play compiler. We'll implement the value representation for pointers. Let's say we need (for hypothetical architecture reasons) the pointers to be at least two bytes, but only one is used for the value.
I'll just jump ahead and say it could be something like this:
struct __pointer_impl
{
std::uint8_t byte1; // contains the address we're holding
std::uint8_t byte2; // needed for architecture reasons, unused
// (assume no padding; we are the compiler, after all)
};
Okay, this is our value representation; now let's implement the value semantics. First, equality:
bool operator==(const __pointer_impl& first, const __pointer_impl& second)
{
return first.byte1 == second.byte1;
}
Because the pointer's value is really only contained in the first byte (even though its representation has two bytes), that's all we have to compare. The second byte is irrelevant, even if they differ.
We need the address-of operator implementation, of course:
__pointer_impl __address_of(int& i)
{
__pointer_impl result;
result.byte1 = /* hypothetical architecture magic */;
return result;
}
This particular overload gets us a pointer value representation for a given int. Note that the second byte is left uninitialized! That's okay: it's not important for the value.
This is really all we need to drive the point home. Pretend the rest of the implementation is done. :)
So now consider our first example again, "compiler-ized":
int i;
/* int* x = &i; */
__pointer_impl x = __address_of(i);
/* int* y = &i; */
__pointer_impl y = __address_of(i);
x == y; // this is true; the two pointer values are equal
For our tiny example on the hypothetical architecture, this sufficiently provides the guarantees required by the standard for pointer values. But note you are never guaranteed that x == y implies memcmp(&x, &y, sizeof(__pointer_impl)) == 0. There simply aren't requirements on the value representation to do so.
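To make the distinction concrete, here is a minimal runnable sketch of the same idea. PointerImpl, same_representation and demo_value_vs_representation are hypothetical names chosen for illustration (the double-underscore names above are reserved for the implementation), and the unused byte is set explicitly here because reading an uninitialized byte would be undefined behavior in real code.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical two-byte pointer representation, mirroring the
// __pointer_impl idea above under a non-reserved name.
struct PointerImpl {
    unsigned char byte1; // carries the address, i.e. the value
    unsigned char byte2; // unused by the value; may hold anything
};

// Value equality looks only at the byte that carries the address.
inline bool operator==(const PointerImpl& a, const PointerImpl& b) {
    return a.byte1 == b.byte1;
}

// Byte-wise comparison of the two representations.
inline bool same_representation(const PointerImpl& a, const PointerImpl& b) {
    return std::memcmp(&a, &b, sizeof(PointerImpl)) == 0;
}

// Two "pointers" to the same address whose unused bytes differ.
inline void demo_value_vs_representation() {
    PointerImpl x = {0x2A, 0x00};
    PointerImpl y = {0x2A, 0xFF};
    assert(x == y);                     // equal values
    assert(!same_representation(x, y)); // different bytes underneath
}
```

Any hash computed from the raw bytes of x and y would be free to differ here, even though x == y holds.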
Now consider your question: how do we hash pointers? That is, we want to implement:
template <typename T>
struct myhash;
template <typename T>
struct myhash<T*> :
std::unary_function<T*, std::size_t>
{
std::size_t operator()(T* const ptr) const
{
return /* ??? */;
}
};
The most important requirement is that if x == y, then myhash()(x) == myhash()(y). We also already know how to hash integers. What can we do?
The only thing we can try is to somehow convert the pointer to an integer. Well, C++11 gives us std::uintptr_t, so we can do this, right?
return myhash<std::uintptr_t>()(reinterpret_cast<std::uintptr_t>(ptr));
Perhaps surprisingly, this is not correct. To understand why, imagine again we're implementing it:
// okay because we assumed no padding:
typedef std::uint16_t __uintptr_t; // will be used for std::uintptr_t implementation
__uintptr_t __to_integer(const __pointer_impl& ptr)
{
__uintptr_t result;
std::memcpy(&result, &ptr, sizeof(__uintptr_t));
return result;
}
__pointer_impl __from_integer(const __uintptr_t& ptrint)
{
__pointer_impl result;
std::memcpy(&result, &ptrint, sizeof(__pointer_impl));
return result;
}
So when we reinterpret_cast a pointer to an integer, we'll use __to_integer, and going back we'll use __from_integer. Note that the resulting integer will have a value that depends on the bits in the value representation of the pointer. That is, two equal pointer values could end up as different integers... and this is allowed!
This is allowed because the result of the reinterpret_cast is implementation-defined; you're only guaranteed that converting the integer back with the opposite reinterpret_cast gives you the original pointer value.
So there's the first issue: on this implementation, our hash could end up different for equal pointer values.
This idea is out. Maybe we can reach into the representation itself and hash the bytes together. But this obviously ends up with the same issue, which is what the comments on your question are alluding to. Those pesky unused representation bits are always in the way, and there's no way to figure out where they are so we can ignore them.
We're stuck! It's just not possible. In general.
Remember, in practice we compile for certain implementations, and because the results of these operations are implementation-defined they are reliable if you take care to only use them properly. This is what Mats Petersson is saying: find out the guarantees of the implementation and you'll be fine.
In fact, most consumer platforms you use will handle the std::uintptr_t attempt just fine. If it's not available on your system, or if you want an alternative approach, just combine the hashes of the individual bytes in the pointer. All this requires to work is that the unused representation bits always take on the same value. In fact, this is the approach MSVC2012 uses!
Had our hypothetical pointer implementation simply always initialized byte2 to a constant, it would work there as well. But there just isn't any requirement for implementations to do so.
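For what it's worth, the two practical approaches could look roughly like this. This is only a sketch that leans on the implementation-specific assumptions discussed above: that std::uintptr_t exists, that equal pointers convert to equal integers, and that any unused representation bits are stable. The names hash_via_uintptr, hash_via_bytes and demo_pointer_hashes are made up for illustration, and the byte mixer is an FNV-1a-style loop picked for brevity, not what any particular standard library does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Approach 1: go through std::uintptr_t. The type is optional and the
// pointer-to-integer mapping is implementation-defined.
template <typename T>
std::size_t hash_via_uintptr(T* ptr) {
    const std::uintptr_t n = reinterpret_cast<std::uintptr_t>(ptr);
    return static_cast<std::size_t>(n);
}

// Approach 2: combine the bytes of the pointer's representation.
// Only valid if unused bits are identical for equal pointer values.
template <typename T>
std::size_t hash_via_bytes(T* ptr) {
    unsigned char bytes[sizeof(T*)];
    std::memcpy(bytes, &ptr, sizeof(T*));
    std::size_t h = 2166136261u;      // FNV-1a 32-bit offset basis
    for (std::size_t i = 0; i < sizeof(T*); ++i) {
        h ^= bytes[i];
        h *= 16777619u;               // FNV-1a 32-bit prime
    }
    return h;
}

// On a typical flat-memory platform both hashes agree for equal pointers.
inline void demo_pointer_hashes() {
    int i = 0;
    int* x = &i;
    int* y = &i;
    assert(hash_via_uintptr(x) == hash_via_uintptr(y));
    assert(hash_via_bytes(x) == hash_via_bytes(y));
}
```

Neither function is portable in the strict sense; both just encode a guarantee you have to look up for your implementation.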
Hope this clarifies a few things.

The answer to your question really depends on how portable you want it to be. Many architectures will have a uintptr_t, but if you want something that can compile on DSPs, Linux, Windows, AIX, old Cray machines, IBM 390 series machines, and so on, then you may need a config option where you define your own "uintptr_t" if it doesn't exist on that architecture.
Casting a pointer to an integer type should be fine. If you were to cast it back, you may be in trouble. Of course, if you have MANY pointers, and you allocate fairly large sections of memory on a 64-bit machine, using a 32-bit integer, there is a chance you get lots of collisions. Note that 64-bit Windows still has a 32-bit "long".
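A config option of that kind might look something like the sketch below. MYLIB_HAVE_UINTPTR_T, my_uintptr and pointer_to_integer are hypothetical names, and falling back to std::size_t is an assumption that happens to hold on common flat-memory targets, not a guarantee. The array typedef is an old C++03-compatible trick: it fails to compile if the chosen type is narrower than a pointer.

```cpp
#include <cassert>
#include <cstddef>

// MYLIB_HAVE_UINTPTR_T is a hypothetical macro the build system would
// define on platforms whose <stdint.h> provides uintptr_t.
#if defined(MYLIB_HAVE_UINTPTR_T)
#include <stdint.h>
typedef uintptr_t my_uintptr;
#else
typedef std::size_t my_uintptr; // assumption: wide enough; checked below
#endif

// C++03-style compile-time check: the array size is -1 (an error)
// when my_uintptr cannot hold all the bytes of a void*.
typedef char my_uintptr_is_wide_enough[
    sizeof(my_uintptr) >= sizeof(void*) ? 1 : -1];

// The conversion itself; the resulting value is implementation-defined.
inline my_uintptr pointer_to_integer(const void* p) {
    return reinterpret_cast<my_uintptr>(p);
}

inline void demo_pointer_to_integer() {
    int i = 0;
    assert(pointer_to_integer(&i) == pointer_to_integer(&i));
}
```

Using a 32-bit type such as "long" on 64-bit Windows as the fallback would trip the size check, which is the failure mode you want to catch at build time.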

Related

Mask from bitfield in C++

Here's a little puzzle I couldn't find a good answer for:
Given a struct with bitfields, such as
struct A {
unsigned foo:13;
unsigned bar:19;
};
Is there a (portable) way in C++ to get the correct mask for one of the bitfields, preferably as a compile-time constant function or template?
Something like this:
constinit unsigned mask = getmask<A::bar>(); // mask should be 0xFFFFE000
In theory, at runtime, I could crudely do:
unsigned getmask_bar() {
union AA {
unsigned mask;
A fields;
} aa{};
aa.fields.bar -= 1;
return aa.mask;
}
That could even be wrapped in a macro (yuck!) to make it "generic".
But I guess you can readily see the various deficiencies of this method.
Is there a nicer, generic C++ way of doing it? Or even a not-so-nice way? Is there something useful coming up for the next C++ standard(s)? Reflection?
Edit: Let me add that I am trying to find a way of making bitfield manipulation more flexible, so that it is up to the programmer to modify multiple fields at the same time using masking. I am after terse notation, so that things can be expressed concisely without lots of boilerplate. Think working with hardware registers in I/O drivers as a use case.
Unfortunately, there is no better way - in fact, there is no way to extract individual adjacent bit fields from a struct by inspecting its memory directly in C++.
From Cppreference:
The following properties of bit-fields are implementation-defined:
The value that results from assigning or initializing a signed bit-field with a value out of range, or from incrementing a signed bit-field past its range.
Everything about the actual allocation details of bit-fields within the class object
For example, on some platforms, bit-fields don't straddle bytes, on others they do
Also, on some platforms, bit-fields are packed left-to-right, on others right-to-left
Your compiler might give you stronger guarantees; however, if you do rely on the behavior of a specific compiler, you can't expect your code to work with a different compiler/architecture pair. GCC doesn't even document their bit field packing, as far as I can tell, and it differs from one architecture to the next. So your code might work on a specific version of GCC on x86-64 but break on literally everything else, including other versions of the same compiler.
If you really want to be able to extract bitfields from a random structure in a generic way, your best bet is to pass a function pointer around (instead of a mask); that way, the function can access the field in a safe way and return the value to its caller (or set a value instead).
Something like this:
template<typename T>
auto extractThatBitField(const void *ptr) {
return static_cast<const T *>(ptr)->m_thatBitField;
}
auto *extractor1 = &extractThatBitField<Type1>;
auto *extractor2 = &extractThatBitField<Type2>;
/* ... */
Now, if you have a pair of {pointer, extractor}, you can get the value of the bitfield safely. (Of course, the extractor function has to match the type of the object behind that pointer.) It's not much overhead compared to having a {pointer, mask} pair instead; the function pointer is maybe 4 bytes larger than the mask on a 64-bit machine (if at all). The extractor function itself will just be a memory load, some bit twiddling, and a return instruction. It'll still be super fast.
This is portable and supported by the C++ standard, unlike inspecting the bits of a bitfield directly.
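Here is a slightly fuller sketch of the {pointer, extractor} pair. The layouts of Type1 and Type2, the Extractor typedef, FieldRef and demo_field_ref are all invented for illustration; the only thing the two types need in common is a bit-field named m_thatBitField.

```cpp
#include <cassert>

// Two unrelated types (hypothetical layouts) that both happen to
// have a bit-field called m_thatBitField.
struct Type1 { unsigned m_flag : 3;           unsigned m_thatBitField : 13; };
struct Type2 { unsigned m_thatBitField : 19;  unsigned m_other : 5; };

// A plain function pointer that reads the field from a type-erased object.
typedef unsigned (*Extractor)(const void*);

template <typename T>
unsigned extractThatBitField(const void* ptr) {
    // The compiler knows where the bit-field lives; we never inspect bits.
    return static_cast<const T*>(ptr)->m_thatBitField;
}

// The {pointer, extractor} pair. The extractor must match the real
// type of the object behind the pointer.
struct FieldRef {
    const void* object;
    Extractor   extract;
    unsigned get() const { return extract(object); }
};

inline void demo_field_ref() {
    Type1 a = Type1();
    a.m_thatBitField = 1234;
    FieldRef ref = { &a, &extractThatBitField<Type1> };
    assert(ref.get() == 1234u);
}
```

A matching setter would follow the same pattern, with a function taking a void* and the new value.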
Alternatively, C++ allows casting between standard-layout structs that have common initial members. (Though keep in mind that this falls apart as soon as inheritance or private/protected members get involved! The first solution, above, works for all those cases as well.)
struct Common {
int m_a : 13;
int m_b : 19;
int : 0; //Needed to ensure the bit fields end on a byte boundary
};
struct Type1 {
int m_a : 13;
int m_b : 19;
int : 0;
Whatever m_whatever;
};
struct Type2 {
int m_a : 13;
int m_b : 19;
int : 0;
Something m_something;
};
int getFieldA(const void *ptr) {
//We still can't do type punning directly due
//to weirdness in various compilers' aliasing resolution.
//std::memcpy is the official way to do type punning.
//This won't compile to an actual memcpy call.
Common tmp;
std::memcpy(&tmp, ptr, sizeof(Common));
return tmp.m_a;
}
See also: Can memcpy be used for type punning?

reinterpret_cast and static_cast - What is the difference here?

Does it generally hold true that static_cast<T*>(static_cast<void*>(a)) == reinterpret_cast<T*>(a) for some pointer a that can be cast to T*? In what cases will this not be true? I have an example below that shows a case where this holds true:
#include <iostream>
#include <string>
int main() {
std::string a = "Hello world";
float* b = static_cast<float*>(static_cast<void *>(&a[0]));
float* c = reinterpret_cast<float*>(&a[0]);
std::cout << (b == c) << std::endl; // <-- This prints 1
}
Please let me know if my question is unclear.
Static casting to/from void* of another pointer is the same as reinterpret_casting.
Using a pointer to a non-float as a pointer to a float can be UB, including things as innocuous as comparing one pointer value to another.
A hardware reason behind this is that on some platforms, floats have mandatory alignment, and == on two float* may assume that they are pointing to aligned values. On at least one platform, both char* and void* take up more memory than a float* because there is no byte-addressing on the platform; byte addressing is emulated by storing an additional "offset" and doing bitmasking operations under the hood.
On other platforms, different address spaces entirely are used for different types of pointers. One I'm aware of has function and value address spaces being distinct.
The standard, mindful of such quirks and in order to permit certain optimizations, does not fully define what happens when you use incorrectly typed pointers. Implementations are free to document that they behave like a flat memory model that you might expect but are not mandated to.
I cannot remember if this is implementation defined (which means they must document if it is defined or UB) or just UB here (which means compilers are free to define its behavior, but do not have to).
For cases where the T* is fully valid and proper (i.e., there is actually an object of type T there, and we just had a void* or char* pointing to it, or some such), static_cast<T*>(static_cast<void*>(addy)) == reinterpret_cast<T*>(addy) in every case.
Aliasing issues -- having pointers to the wrong type -- are very tricky.
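As a small sketch of the well-behaved case, where a real T object sits behind the pointer, something like the following could be used to compare the two spellings. casts_agree and demo_casts_agree are hypothetical helper names; the example deliberately starts from an actual int so that the resulting int* is valid to compare and dereference.

```cpp
#include <cassert>

// Compares the two ways of spelling the pointer conversion.
template <typename T, typename U>
bool casts_agree(U* p) {
    T* via_void        = static_cast<T*>(static_cast<void*>(p));
    T* via_reinterpret = reinterpret_cast<T*>(p);
    return via_void == via_reinterpret;
}

inline void demo_casts_agree() {
    int value = 42;
    // View the int's storage as bytes, which is always permitted...
    unsigned char* bytes = reinterpret_cast<unsigned char*>(&value);
    // ...then go back to int*. An int really lives there, so both
    // conversions yield the same, usable pointer.
    assert(casts_agree<int>(bytes));
    assert(*reinterpret_cast<int*>(bytes) == 42);
}
```

The float* example from the question is the other situation: no float object exists at that address, so the caveats above apply.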

Inline assembly inside C++ for data conversion

I am trying to write C++ code that converts the assembly constant dq 3FA999999999999Ah into a C++ double. What should I put inside the asm block? I don't know how to get the value out.
int main()
{
double x;
asm
{
dq 3FA999999999999Ah
mov x,?????
}
std::cout<<x<<std::endl;
return 0;
}
From the comments, it sounds like you want a reinterpret_cast here. Essentially, this tells the compiler to treat the sequence of bits as if it were of the type it was cast to, without attempting to convert the value.
uint64_t raw = 0x3FA999999999999A;
double x = reinterpret_cast<double&>(raw);
See this in action here: http://coliru.stacked-crooked.com/a/37aec366eabf1da7
Note that I've used the specific 64bit integer type here to make sure the bit representation required matches that of the 64bit double. Also the cast has to be to double& because of the C++ rules forbidding the plain cast to double. This is because reinterpret cast deals with memory and not type conversions, for more details see this question: Why doesn't this reinterpret_cast compile?. Additionally you need to be sure that the representation of the 64 bit unsigned here will match up with the bit reinterpretation of the double for this to work properly.
EDIT: Something worth noting is that the compiler warns about this breaking strict aliasing rules. The quick summary is that more than one value refers to the same place in memory now and the compiler might not be able to tell which variables are changed if the change occurs via the other way it can be accessed. In general you don't want to ignore this, I'd highly recommend reading the following article on strict aliasing to get to know why this is an issue. So while the intent of the code might be a little less clear you might find a better solution is to use memcpy to avoid the aliasing problems:
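#include <cstdint>  // std::uint64_t
#include <cstring>  // std::memcpy
#include <iostream>
int main()
{
double x;
const std::uint64_t raw = 0x3FA999999999999A;
// copy the bit pattern into x; assumes sizeof(double) == sizeof raw
std::memcpy(&x, &raw, sizeof raw);
std::cout<<x<<std::endl;
return 0;
}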
See this in action here: http://coliru.stacked-crooked.com/a/5b738874e83e896a
This avoids the aliasing problem: x is now a double with the correct constituent bits, but because memcpy copies the bytes, x does not share a memory location with the 64-bit integer that supplied the bit pattern. Because memcpy treats the variable as if it were an array of char, you still need to make sure you get any endianness considerations correct.
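If you need this in more than one place, the memcpy approach can be wrapped in a pair of small helpers. This is a sketch that assumes an IEEE 754 double with the same byte order as 64-bit integers, which the static_asserts only partly check; bits_to_double, double_to_bits and demo_bits_round_trip are illustrative names, not standard functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "double must be 64 bits wide for this sketch");
static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 doubles");

// Reinterprets a 64-bit pattern as a double without aliasing issues.
inline double bits_to_double(std::uint64_t bits) {
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// The inverse: extracts the bit pattern of a double.
inline std::uint64_t double_to_bits(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

inline void demo_bits_round_trip() {
    // The pattern from the question decodes to 0.05 (1.6 * 2^-5).
    assert(bits_to_double(0x3FA999999999999AULL) == 0.05);
    assert(double_to_bits(0.05) == 0x3FA999999999999AULL);
}
```

Compilers typically turn each helper into a single register move, so there is no real cost compared with the reinterpret_cast version.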

Why mapped_file::data returns char* instead of void*

Or even better a template <T*>?
In case the memory mapped file contains a sequence of 32 bit integers, if data() returned a void*, we could be able to static cast to std::uint32_t directly.
Why did boost authors choose to return a char* instead?
EDIT: as pointed out, in case portability is an issue, a translation is needed. But saying that a file (or a chunk of memory in this case) is a stream of bytes more than it is a stream of bits, or of IEEE754 doubles, or of complex data structures, seems to me a very broad statement that needs some more explanation.
Even having to handle endianness, being able to directly map to a vector of be_uint32_t as suggested (and as implemented here) would make the code much more readable:
struct be_uint32_t {
std::uint32_t raw;
operator std::uint32_t() { return ntohl(raw); }
};
static_assert(sizeof(be_uint32_t)==4, "POD failed");
Is it allowed/advised to cast to a be_uint32_t*? Why, or why not?
Which kind of cast should be used?
EDIT2: Since it seems difficult to get to the point instead of discussing whether the memory model of a computer is made of bits, bytes or words, I will rephrase by giving an example:
#include <cassert>  // assert
#include <cstddef>  // offsetof
#include <cstdint>
#include <memory>
#include <vector>
#include <iostream>
#include <boost/iostreams/device/mapped_file.hpp>
struct entry {
std::uint32_t a;
std::uint64_t b;
} __attribute__((packed)); /* compiler specific, but supported
in other ways by all major compilers */
static_assert(sizeof(entry) == 12, "entry: Struct size mismatch");
static_assert(offsetof(entry, a) == 0, "entry: Invalid offset for a");
static_assert(offsetof(entry, b) == 4, "entry: Invalid offset for b");
int main(void) {
boost::iostreams::mapped_file_source mmap("map");
assert(mmap.is_open());
const entry* data_begin = reinterpret_cast<const entry*>(mmap.data());
const entry* data_end = data_begin + mmap.size()/sizeof(entry);
for(const entry* ii=data_begin; ii!=data_end; ++ii)
std::cout << std::hex << ii->a << " " << ii->b << std::endl;
return 0;
}
Given that the map file contains the expected bits in the correct order, is there any other reason to avoid using the reinterpret_cast to use my virtual memory without copying it first?
If there is not, why force the user to do a reinterpret_cast by returning a typed pointer?
Please answer all the questions for bonus points :)
In case the memory mapped file contains a sequence of 32 bit integers, if data() returned a void*, we could be able to static cast to std::uint32_t directly.
No, not really. You still have to consider (if nothing else) endianness. This "one step conversion" idea would lead you into a false sense of security. You're forgetting about an entire layer of translation between the bytes in the file and the 32-bit integer you want to get into your program. Even when that translation happens to be a no-op on your present system and for a given file, it's still a translation step.
It's much better to get an array of bytes (literally what a char* points to!) then you know you have to do some thinking to ensure that your pointer conversion is valid and that you are performing whatever other work is required.
char* represents an array of raw bytes, which is what mapped_file::data is in the most general case.
void* would be misleading, as it provides less information about the contained type and requires more setup to work with than char*: we know that the file contents are some bytes, which is what char* represents.
Template return type would require conversion to that type be performed inside the library, while it makes more sense to do that on the caller side (since library just provides an interface to raw file contents, and the caller knows specifically what those contents are).
Returning a char* seems to be just a (peculiar) design decision of the boost::iostreams implementation.
Other APIs, such as Boost.Interprocess, return void*.
As observed by sehe the UNIX mmap specification (and malloc) use void* as well.
It is somewhat a duplicate of void* or char* for generic buffer representation?
As a note of caution, the layer of translation mentioned by Lightness in another answer may be needed when the memory is written on one architecture and read on a different one. Endianness is easy to solve using a conversion type, but alignment needs to be considered as well.
About static cast: http://en.cppreference.com/w/cpp/language/static_cast mentions:
A prvalue of type pointer to void (possibly cv-qualified) can be
converted to pointer to any type. If the value of the original pointer
satisfies the alignment requirement of the target type, then the
resulting pointer value is unchanged, otherwise it is unspecified.
Conversion of any pointer to pointer to void and back to pointer to
the original (or more cv-qualified) type preserves its original value.
So if the file to be memory mapped was created on a different architecture with a different alignment, the loading may fail (e.g. with a SIGBUS) depending on the architecture and the OS.
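One way to sidestep both concerns is to decode values from the char* byte by byte instead of casting the pointer. The sketch below assumes the file stores 32-bit integers in big-endian order; read_be32 and demo_read_be32 are hypothetical helpers, not part of Boost. Because it only ever reads individual bytes, it works at any offset and on any host byte order.

```cpp
#include <cassert>
#include <cstdint>

// Decodes a big-endian 32-bit integer starting at p. Reads single
// bytes only, so p needs no particular alignment.
inline std::uint32_t read_be32(const char* p) {
    const unsigned char* b = reinterpret_cast<const unsigned char*>(p);
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8)  |
            static_cast<std::uint32_t>(b[3]);
}

inline void demo_read_be32() {
    // Stand-in for the bytes returned by mapped_file_source::data().
    const char buf[] = "\x12\x34\x56\x78\xDE\xAD\xBE\xEF";
    assert(read_be32(buf) == 0x12345678u);
    assert(read_be32(buf + 1) == 0x345678DEu); // unaligned read is fine
}
```

The trade-off is an explicit decode per field rather than a zero-copy view, which is the translation step the other answer describes.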

Converting Function Address to 64-bit Integer: Undefined/Ill-behaved?

Background: I have a scenario in which I must allow comparison between two functor objects, using a unique ID to test if they're equal (I can't simply check if their addresses are the same, as the function pointers themselves aren't stored in the object). Initially, I had this idea, to simply start the id generator at 0 and increment ad infinitum:
struct GenerateUniqueID{
inline static std::size_t id_count = 0; // C++17; a non-const static member needs 'inline' to be initialized in-class
auto operator()() -> std::size_t { return (id_count++); }
};
...However, as I have literally thousands upon thousands of these objects created every few seconds, I actually managed to run into the case of id_count overflowing back to 0! The results were... unpleasant. Now, the second idea I had was that, since these functors are, obviously, wrappers around a function, I could perform the comparison by converting the address of the function pointer into a 64-bit integer, and storing that in the class for comparison. See:
// pseudocode
struct Functor{
std::uint64_t id;
auto generate_id_from_function_address(function f) -> void {
id = reinterpret_cast<std::uint64_t>(&f);
}
};
Now, my concern here is simple: is casting function pointers to 64-bit integers ill-behaved/undefined? On 32-bit architectures? On 64-bit architectures? On both? My main concern here is with virtual functions, as I know that for inline functions the compiler simply creates a non-inlined version, so there's no issue there.
Converting a regular pointer (let alone a function pointer) to uint64_t is implementation-defined, since pointers could be wider than 64 bits. The conversion is well-defined if you use uintptr_t (and that type exists).
Converting a function pointer to any integer type is implementation-defined (even if you use uintptr_t), because function pointers may be wider than regular pointers. Some other standards like POSIX explicitly allow this, so under POSIX it is safe to cast function pointers to data pointers like void* and to uintptr_t.
(Converting a pointer-to-member to an integer, data pointer, or regular function pointer is undefined, and in practice likely to always fail since they're bigger than regular pointers.)
However, it may be simpler to just use uint64_t instead of size_t for your unique IDs. It is basically impossible to overflow a uint64_t by incrementing it repeatedly due to their enormous range.
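A minimal sketch of that suggestion, keeping the GenerateUniqueID name from the question but moving the counter into a function-local static so it compiles as plain C++11. It is not thread-safe as written; swapping the counter for a std::atomic<std::uint64_t> would be the usual fix, which is an addition on my part rather than something from the question. demo_unique_ids is an illustrative helper.

```cpp
#include <cassert>
#include <cstdint>

struct GenerateUniqueID {
    // Returns 0, 1, 2, ... across all instances of the generator.
    std::uint64_t operator()() const {
        static std::uint64_t next_id = 0; // shared counter, not thread-safe
        return next_id++;
    }
};

inline void demo_unique_ids() {
    GenerateUniqueID gen;
    const std::uint64_t first  = gen();
    const std::uint64_t second = gen();
    assert(second == first + 1); // consecutive calls give consecutive IDs
}
```

For scale: even at a billion IDs per second, a 64-bit counter takes roughly 584 years to wrap around.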