I have read that converting a function pointer to a data pointer and vice versa works on most platforms but is not guaranteed to work. Why is this the case? Shouldn't both be simply addresses into main memory and therefore be compatible?
An architecture doesn't have to store code and data in the same memory. With a Harvard architecture, code and data are stored in completely different memory. Most architectures are Von Neumann architectures with code and data in the same memory but C doesn't limit itself to only certain types of architectures if at all possible.
Some computers have (had) separate address spaces for code and data. On such hardware it just doesn't work.
The language is designed not only for current desktop applications, but to allow it to be implemented on a large set of hardware.
It seems like the C language committee never intended void* to be a pointer to function, they just wanted a generic pointer to objects.
The C99 Rationale says:
6.3.2.3 Pointers
C has now been implemented on a wide range of architectures. While some of these
architectures feature uniform pointers which are the size of some integer type, maximally
portable code cannot assume any necessary correspondence between different pointer types and the integer types. On some implementations, pointers can even be wider than any integer type.
The use of void* (“pointer to void”) as a generic object pointer type is an invention of the C89 Committee. Adoption of this type was stimulated by the desire to specify function prototype arguments that either quietly convert arbitrary pointers (as in fread) or complain if the argument type does not exactly match (as in strcmp). Nothing is said about pointers to functions, which may be incommensurate with object pointers and/or integers.
Note Nothing is said about pointers to functions in the last paragraph. They might be different from other pointers, and the committee is aware of that.
For those who remember MS-DOS, Windows 3.1 and older the answer is quite easy. All of these used to support several different memory models, with varying combinations of characteristics for code and data pointers.
So for instance for the Compact model (small code, large data):
sizeof(void *) > sizeof(void(*)())
and conversely in the Medium model (large code, small data):
sizeof(void *) < sizeof(void(*)())
In this case you didn't have separate storage for code and date but still couldn't convert between the two pointers (short of using non-standard __near and __far modifiers).
Additionally there's no guarantee that even if the pointers are the same size, that they point to the same thing - in the DOS Small memory model, both code and data used near pointers, but they pointed to different segments. So converting a function pointer to a data pointer wouldn't give you a pointer that had any relationship to the function at all, and hence there was no use for such a conversion.
Pointers to void are supposed to be able to accommodate a pointer to any kind of data -- but not necessarily a pointer to a function. Some systems have different requirements for pointers to functions than pointers to data (e.g, there are DSPs with different addressing for data vs. code, medium model on MS-DOS used 32-bit pointers for code but only 16-bit pointers for data).
In addition to what is already said here, it is interesting to look at POSIX dlsym():
The ISO C standard does not require that pointers to functions can be cast back and forth to pointers to data. Indeed, the ISO C standard does not require that an object of type void * can hold a pointer to a function. Implementations supporting the XSI extension, however, do require that an object of type void * can hold a pointer to a function. The result of converting a pointer to a function into a pointer to another data type (except void *) is still undefined, however. Note that compilers conforming to the ISO C standard are required to generate a warning if a conversion from a void * pointer to a function pointer is attempted as in:
fptr = (int (*)(int))dlsym(handle, "my_function");
Due to the problem noted here, a future version may either add a new function to return function pointers, or the current interface may be deprecated in favor of two new functions: one that returns data pointers and the other that returns function pointers.
C++11 has a solution to the long-standing mismatch between C/C++ and POSIX with regard to dlsym(). One can use reinterpret_cast to convert a function pointer to/from a data pointer so long as the implementation supports this feature.
From the standard, 5.2.10 para. 8, "converting a function pointer to an object pointer type or vice versa is conditionally-supported." 1.3.5 defines "conditionally-supported" as a "program construct that an implementation is not required to support".
Depending on the target architecture, code and data may be stored in fundamentally incompatible, physically distinct areas of memory.
undefined doesn't necessarily mean not allowed, it can mean that the compiler implementor has more freedom to do it how they want.
For instance it may not be possible on some architectures - undefined allows them to still have a conforming 'C' library even if you can't do this.
Another solution:
Assuming POSIX guarantees function and data pointers to have the same size and representation (I can't find the text for this, but the example OP cited suggests they at least intended to make this requirement), the following should work:
double (*cosine)(double);
void *tmp;
handle = dlopen("libm.so", RTLD_LAZY);
tmp = dlsym(handle, "cos");
memcpy(&cosine, &tmp, sizeof cosine);
This avoids violating the aliasing rules by going through the char [] representation, which is allowed to alias all types.
Yet another approach:
union {
double (*fptr)(double);
void *dptr;
} u;
u.dptr = dlsym(handle, "cos");
cosine = u.fptr;
But I would recommend the memcpy approach if you want absolutely 100% correct C.
They can be different types with different space requirements. Assigning to one can irreversibly slice the value of the pointer so that assigning back results in something different.
I believe they can be different types because the standard doesn't want to limit possible implementations that save space when it's not needed or when the size could cause the CPU to have to do extra crap to use it, etc...
The only truly portable solution is not to use dlsym for functions, and instead use dlsym to obtain a pointer to data that contains function pointers. For example, in your library:
struct module foo_module = {
.create = create_func,
.destroy = destroy_func,
.write = write_func,
/* ... */
};
and then in your application:
struct module *foo = dlsym(handle, "foo_module");
foo->create(/*...*/);
/* ... */
Incidentally, this is good design practice anyway, and makes it easy to support both dynamic loading via dlopen and static linking all modules on systems that don't support dynamic linking, or where the user/system integrator does not want to use dynamic linking.
A modern example of where function pointers can differ in size from data pointers: C++ class member function pointers
Directly quoted from https://blogs.msdn.microsoft.com/oldnewthing/20040209-00/?p=40713/
class Base1 { int b1; void Base1Method(); };
class Base2 { int b2; void Base2Method(); };
class Derived : public Base1, Base2 { int d; void DerivedMethod(); };
There are now two possible this pointers.
A pointer to a member function of Base1 can be used as a pointer to a
member function of Derived, since they both use the same this
pointer. But a pointer to a member function of Base2 cannot be used
as-is as a pointer to a member function of Derived, since the this
pointer needs to be adjusted.
There are many ways of solving this. Here's how the Visual Studio
compiler decides to handle it:
A pointer to a member function of a multiply-inherited class is really
a structure.
[Address of function]
[Adjustor]
The size of a pointer-to-member-function of a class that uses multiple inheritance is the size of a pointer plus the size of a size_t.
tl;dr: When using multiple inheritance, a pointer to a member function may (depending on compiler, version, architecture, etc) actually be stored as
struct {
void * func;
size_t offset;
}
which is obviously larger than a void *.
On most architectures, pointers to all normal data types have the same representation, so casting between data pointer types is a no-op.
However, it's conceivable that function pointers might require a different representation, perhaps they're larger than other pointers. If void* could hold function pointers, this would mean that void*'s representation would have to be the larger size. And all casts of data pointers to/from void* would have to perform this extra copy.
As someone mentioned, if you need this you can achieve it using a union. But most uses of void* are just for data, so it would be onerous to increase all their memory use just in case a function pointer needs to be stored.
I know that this hasn't been commented on since 2012, but I thought it would be useful to add that I do know an architecture that has very incompatible pointers for data and functions since a call on that architecture checks privilege and carries extra information. No amount of casting will help. It's The Mill.
Related
Recently I try to use FlatBuffers in C++. I found FlatBuffers seems to use a lot of type punning with things like reinterpret_cast in C++. This make me a little uncomfortable because I've learned it's undefined behavior in many cases.
e.g. Rect in fbs file:
struct Rect {
left:int;
top:int;
right:int;
bottom:int;
}
turns into this C++ code for reading it from a table:
const xxxxx::Rect *position() const {
return GetStruct<const xxxxx::Rect *>(VT_POSITION);
}
and the definition of GetStruct simply uses reinterpret_cast.
My questions are:
Is this really undefined behavior in C++?
In practice, will this kind of usage actually be problematic?
Update:
The buffer can just came from network or disk. I don't know if it's different if the buffer actually came from same memory written by writer of the same C++ program.
But the writer's auto-generated method is:
void add_position(const xxxxx::Rect *position) {
fbb_.AddStruct(Char::VT_POSITION, position);
}
which will use this method and this method and so use reinterpret_cast also.
I didn't analyze the whole FlatBuffers' source code, but I didn't see where these objects are created: I see no new expression, which would create P objects here:
template<typename P> P GetStruct(voffset_t field) const {
auto field_offset = GetOptionalFieldOffset(field);
auto p = const_cast<uint8_t *>(data_ + field_offset);
return field_offset ? reinterpret_cast<P>(p) : nullptr;
}
So, it seems that this code does have undefined behavior.
However, this is only true for C++17 (or pre). In C++20, there will be implicit-lifetime objects (for example, scalar types, aggregates are implicit-lifetime types). If P has implicit lifetime, then this code can be well defined. Provided that the same memory area are always accessed by a type, which doesn't violate type-punning rules (for example, it always accessed by the same type).
I think both your questions are answered by the Flatbuffers: Use in C++ page:
Direct memory access
As you can see from the above examples, all elements in a buffer are accessed through generated accessors. This is because everything is stored in little endian format on all platforms (the accessor performs a swap operation on big endian machines), and also because the layout of things is generally not known to the user.
For structs, layout is deterministic and guaranteed to be the same across platforms (scalars are aligned to their own size, and structs themselves to their largest member), and you are allowed to access this memory directly by using sizeof() and memcpy on the pointer to a struct, or even an array of structs.
These paragraphs guarantee that – given a valid flatbuffer – all memory accesses are valid, as the memory at that specific location will match the expected layout.
If you are processing untrusted flatbuffers, you first need to use the verifier functions to ensure the flatbuffer is valid:
This verifier will check all offsets, all sizes of fields, and null termination of strings to ensure that when a buffer is accessed, all reads will end up inside the buffer.
I am using a third-party C++ library to do some heavy lifting in Julia. On the Julia side, data is stored in an object of type Array{Float64, 2} (this is roughly similar to a 2D array of doubles). I can pass this to C++ using a pointer to a double. On the C++ side, however, data is stored in a struct called vector3:
typedef struct _vector3
{
double x, y, z;
} vector3;
My quick and dirty approach is a five-step process:
Dynamically allocate an array of structs on the C++ side
Copy input data from double* to vector3*
Do heavy lifting
Copy output data from vector3* to double*
Delete dynamically allocated arrays
Copying large amounts of data is very inefficient. Is there some arcane trickery I can use to avoid copying data from double to struct and back? I want to somehow interpret a 1D array of double (with a size that is a multiple of 3) as a 1D array of a struct with 3 double members.
Unfortunately no you cannot. And this is because of the aliasing rule that C++ has. In short if you have an object T you cannot legally access it from a pointer of incompatible type U. In this sense you cannot access an object of type double or double* via a pointer of type struct _vector3 or vice-versa.
If you dig deep enough you will find reinterpret_cast and maybe think "Oh this is exactly what I need" but it is not. No matter what trickery you do (reinterpret_cast or otherwise) to bypass the language restrictions (also known as just make it compile) the fact remains that you can legally access the object of type double only via pointers of type double.
One trick that is often used to type-pune is to use union. Legal in C it is however illegal in C++ but some compilers allow it. In your case however I don't think there is a way to use union.
The ideal situation would be to do the heavy lifting on the double* data directly. If that is feasible on your workflow.
Strictly speaking, you cannot. I asked a similar question some times ago (Aliasing struct and array the C++ way), and answers explained why direct aliasing would invoke Undefined Behaviour, and gave some hints on possible workarounds.
That being said you are already in a corner case, because the original data come from a different language. That means that the processing of that data is not covered by the C++ standard and is only defined by the implementation that you are using (gcc/version or clang/version or...)
For your implementation, it might be legal to alias an external array to a C++ struct or an external struct to a C++ array. You shuld carefully the documentation of mixed language programming for your precise implementation.
The other answers mention real issues (the conversion may not work properly). I will add a small runtime check which verifies that aliasing works, and provide a placeholder where you would call/use your copy-heavy code.
int aliasing_supported_internal() {
double testvec[6];
_vector3* testptr = (_vector3*)(void*)testvec;
// check that the pointer wasn't changed
if (testvec != (void*)testptr) return 0;
// check for structure padding
if (testvec+3 != (void*)(testptr+1)) return 0;
// TODO other checks?
return 1;
}
int aliasing_supported() {
static int cached_result = aliasing_supported_internal();
return cached_result;
}
This code converts a small array of doubles to an array of structures aliasing (not copying), then checks if it is valid. If the conversion works (the function returns 1), you are likely to be able to use the same kind of aliasing (converting via void pointer) yourself.
BEWARE THAT THE CODE CAN STILL BE BROKEN IN UNEXPECTED WAYS. The strict aliasing rules state that even the above check is undefined behavior. This may work or may fail horribly. Only conversions that are allowed to work properly are to void* and back to the original pointer type. Also, the above check may be completely wrong in multiple inheritance hierarchies or with virtual base classes (both are in a sense unsafe to convert to void* because the actual pointer value may be shifted, that is, may change due to constraints other than alignment and by quite a few bytes)
Pointer types like int*, char*, and float* point to different types. But I have heard that pointers are simply implemented as links to other addresses - then how is this link associated with a type that the compiler can match with the type of the linked address (the variable at this location)?
Types are mostly compile time things in c++. A variable's type is used at compile time to determine what the operations (in other C++ code) do on that variable.
So a variable bob of type int* when you ++ it, maps at runtime to a generic pointer-sized integer being increased by sizeof(int).
To a certain extent this is a lie; C++'s behavior is specified in terms of an abstract machine, not a concrete one. The compiler interprets your code as expressing operations on that abtract machine (that doesn't exist), then writes concrete assembly code that realizes those operations (insofar as they are defined) on concrete hardware.
In that abstract machine, int* and double* are not just numbers. If you dereference an int* and write to some memory, then do the same with a double*, and the memory overlaps, in the abstract machine the result is undefined behavior.
In the concrete implementation of that abstract machine, pointers-as-numbers as int* or double* dereferenced with the same address results in quite well defined behavior.
This difference is important. The compiler is free to assume the abstract machine (where int* and double* are very distinct things) is the only reality that matters. So if you write to a int*, write to a double* then read back from the int* the compiler can skip the read back, because it can prove that in the abstract machine writing to a double* cannot change a the value that an int* points to.
So
int buf[10]={0};
int* a = &buff[0];
double* d = reinterpret_cast<double*>(&buff[0]);
*a = 77;
*d = 3.14;
std::cout << *a;
the apparent read at std::cout << *a can be skipped by the compiler. Meanwhile, if it actually happened on real hardware, it would read bits generated by the *d write.
When reasoning about C++ you have to think of 3 things at once; what happens at compile time, the abstract machine behavior, and the concrete implementation of your code. In two of these (compile time and abstract machine) int* is implemented differently than float*. At actual runtime, int* and float* are both going to be 64 or 32 bit integers in a register or in memory somewhere.
Type checking is done at compile time. The error happens then, or never, excluding cases of RTTI (runtime type information).
RTTI is things like dynamic_cast, which does not work on pointers to primitives like float* or int*.
At compile time that variable carries with it the fact it is a int* everywhere it goes. In the abstract machine, ditto. In the concrete compiled output, it has forgotten it is an int*.
There's no particular "link" at this stage, nor any hidden meta-data stored somewhere. Since C and C++ are compiled and eventually produce a standalone executable, the compiler "trusts" the programmer and simply provides him with a data type that represents a memory address.
If there's nothing explicitly defined at this address, you can use void * pointer. If you know that this will be the location of something in particular, you can qualify it with a certain data type like int * or char *. The compiler will therefore be able to directly access the object that lies behind but the way this address is stored remains the same in every case, and keep the same format.
Note that this qualification is done at compilation time only. It totally disappear in the definitive executable code. This means that this generated code will be produced to handle certain kinds of objects, but nothing will tell you which ones at first if you disassemble the machine code. You'll have to figure this out by yourself.
Variables represent data which is stored in one or more memory cells or "bytes". The compiler will associate this group of bytes with a name and a type when the variable is defined.
The hardware uses a binary number to access a memory cell. This is known as the "address" of the memory cell.
When you store some data in a variable, the compiler will look up the name of the variable and check that the data you want to store is compatible with its type. If it is, it then generates code which will save it in the memory cell(s) at that address.
Since this address is a number, it can itself be stored in a variable. The type of this address variable will be "pointer to T", where T is the type of the data stored in that address.
It is the responsibility of the programmer to make sure that this address variable does correspond to valid data and not some random area of memory. The compiler will not check this for you.
i'm pretty new to box2d and i'm trying to use the userdata (of type void*) field in the b2body object to store an int value (an enum value, so i know the type of the object).
right now i'm doing something this:
int number = 1023;
void* data = (void*)(&number);
int rNumber = *(int*)data;
and i get the value correctly, but as i've been reading around casting to void* it's not portable or recommendable... is my code cross-platform? is it's behavior defined or implementation dependent?
Thanks!
Casting to void * is portable. It is not recommended because you are losing type safety. Anything can be put into a void * and anything can be gotten out. It makes it easier to shoot yourself in the foot. Otherwise void * is fine as long as you are careful and extra cautious.
You are actually not casting int to void*, you cast int* to void*, which is totally different.
A pointer to any type can be stored in a void*, and be cast back again to the same type. That is guaranteed to work.
Casting to any other type is not portable, as the language standard doesn't say that different pointers have to be the same size, or work the same way. Just that void* has to be wide enough to contain them.
One of the problems with void* is that you need to know (keep track of) what type it originally was in order to cast it properly. If it originally was a float and you case it to an int the compiler would just take your word for it.
To avoid this you could create a wrapper for your data which contains the type, that way you will be able to always cast it to the right type.
Edit: you should also make a habit of using C++ casting style instead of C i.e. reinterpret_cast
void * is somehow a relic of the past (ISO C), but very convenient. You can use it safely as far as you are careful casting back and forward the type you want. Consider other alternatives like the c++ class system or overloading a function
anyways you will have better cast operators, some times there is no other way around (void*), some other times they are just too convenient.
It can lead to non portable code, not because of the casting system, but because some people are tempted to do non portable operations with them. The biggest problem lies in the fact that (void*) is a as big as a memory address, which in many platforms happens to be also the length of the platform integers.
However in some rare exceptions size(void*) != size(int)
If you try to do some type of operations/magic with them without casting back to the type you want, you might have problems. You might be surprised of how many times I have seen people wanting to store an integer into a void* pointer
To answer your question, yes, it's safe to do.
To answer the question you didn't ask, that void pointer isn't meant for keeping an int in. Box2D has that pointer for you to point back to an Entity object in your game engine, so you can associate an Entity in your game to a b2Body in the physics simulation. It allows you to more easily program your entities interact with one another when one b2Body interacts with another.
So you shouldn't just be putting an enum in that void*. You should be pointing it directly to the game object represented by that b2body, which could have an enum in it.
In attempting to answer another question, I was intrigued by a bout of curiousity, and wanted to find out if an idea was possible.
Is it possible to dynamically dereference either a void * pointer (we assume it points to a valid referenced dynamically allocated copy) or some other type during run time to return the correct type?
Is there some way to store a supplied type (as in, the class knows the void * points to an int), if so how?
Can said stored type (if possible) be used to dynamically dereference?
Can a type be passed on it's own as an argument to a function?
Generally the concept (no code available) is a doubly-linked list of void * pointers (or similar) that can dynamically allocated space, which also keep with them a copy of what type they hold for later dereference.
1) Dynamic references:
No. Instead of having your variables hold just pointers, have them hold a struct containing both the actual pointer and a tag defining what type the pointer is pointing to
struct Ref{
int tag;
void *ref;
};
and then, when "dereferencing", first check the tag to find out what you want to do.
2) Storing types in your variables, passing them to functions.
This doesn't really make sense, as types aren't values that can be stored around. Perhaps what you just want is to pass around a class / constructor function and that is certainly feasible.
In the end, C and C++ are bare-bones languages. While a variable assignment in a dynamic language looks a lot like a variable assignment in C (they are just a = after all) in reality the dynamic language is doing a lot of extra stuff behind the scenes (something it is allowed to do, since a new language is free to define its semantics)
Sorry, this is not really possible in C++ due to lack of type reflection and lack of dynamic binding. Dynamic dereferencing is especially impossible due to these.
You could try to emulate its behavior by storing types as enums or std::type_info* pointers, but these are far from practical. They require registration of types, and huge switch..case or if..else statements every time you want to do something with them. A common container class and several wrapper classes might help achieving them (I'm sure this is some design pattern, any idea of its name?)
You could also use inheritance to solve your problem if it fits.
Or perhaps you need to reconsider your current design. What exactly do you need this for?