C++ and structs (with multiple inheritance) - c++

I've been looking into C++ and structs for a project I'm working on; at the moment I'm using 'chained' template structures to add in data fields in as pseudo-traits.
Whilst it works, I think I'd prefer something like multiple inheritance as in the example below:
struct a {
int a_data;
}; // 'Trait' A
struct b {
int b_data;
}; // 'Trait' B
struct c : public a, public b {
int c_data;
}; // A composite structure with 'traits' A and B.
struct d : public b {
int d_data;
}; // A composite structure with 'trait' B.
My experimental code examples show they work fine, but I'm a bit perplexed as to how its actually working when things get complex.
For example:
b * basePtr = new c;
cout << basePtr->b_data << endl;
b * basePtr = new d;
cout << basePtr->b_data << endl;
This works fine every time, even through function calls with the a pointer as a parameter.
My question is how does the code know where b_data is stored in one of the derived structs? As far as I can tell, the structs still use a compacted structure with no extra data (i.e. 3 int structs only take up 12 bytes, 2 ints 8 bytes, etc). Surely it needs some sort of extra data field to say where a_data and b_data are stored in a given structure?
It's more of a curiosity question as it all seems to work regardless, and if there are multiple implementations in use, I'll happily accept a single example. Though I do have a bit of a concern as I want to transfer the bytes behind these structs through a inter-process message queue and want to know if they'll be decoded OK on the other end (all the programs using the queue will be compiled by the same compiler and run on a single platform).

In both cases, basePtr truly is a pointer to an object of type b, so there is no problem. The fact that this object is not a complete object, but rather a subobject of a more-derived object (this is actually the technical term), is not material.
The (static, implicit) conversion from d * to b *, as well as from c * to b *, takes care of adjus­ting the pointer value so that it really points to the b subobject. All the information is known statically, so the compiler makes all those computations automatically.

You should read the wikipedia value on C++ classes , under the memory management and class inheritance content.
Basically, the compiler creates the class structure, so at compile time it knows the offset to each part of the class.
When you call a variable, the compiler knows the type and therefore its structure, and if you cast it to a base class, it just needs to jump to the right off set.

On most implementations, a pointer conversion, say from c* to b*, will automatically adjust the address if necessary. In the statement
b * basePtr = new c;
the new expression allocates a c object, which contains an a base class subobject, a b base class subobject, and a c_data member subobject. In raw memory, this will probably look like just three ints. The new expression returns the address of the created complete c object, which is (on most implementations) the same as the address of the a base class subobject and the address of the a_data member subobject.
But then the expression new c, with type c*, is used to initialize a b* pointer, which causes an implicit conversion. The compiler sets basePtr to the address of the b base class subobject within the complete c object. Not hard, since the compiler knows the offset from a c object to its unique b subobject.
Afterward, an expression like basePtr->b_data doesn't need to know what the complete object type was. It just knows that b_data is at the very beginning of b, so it can simply dereference the b* pointer.

The details of this are up to the C++ implementation, but in a case like this, with non-virtual inheritance, you can think of it like this:
c has two sub-objects, one with type a and one with type b.
When you cast a pointer to c to a pointer to b, the compiler is smart enough so that the result of the cast is a pointer to the b sub-object of the c object referenced by the original pointer. This may involve changing the numerical value of the returned pointer.
Generally, with single inheritance, the sub-object pointer will have the same numerical value as the original pointer. With multi-inheritance, it might not.

Yes, there are extra fields that define the offset each sub-component has into the aggregate. But they are not stored in the aggregate itself, but most likely (although the ultimate choice about how to do that is left to the compiler designers) in auxiliary structure residing in a hidden side of the data segment.
Your objects are not polymorphic (and you used them wrongly, but I'll came to this later), but just compounds like:
c[a[a_data],b[b_data],c_data];
^
b* points here
d[b[b_data],d_data]
^
b* points here
(Note that the real layout may depend on the particular compiler and even optimization flags used)
The offsets of the beginning of b respect to the beginning of c or d does not depend on the particular object instance, so it is not a value required to stay into the object, but just in a general d and c descriptions known to the compiler but not necessarily available to you.
The compiler knows, given a c or a d, where the b component begins. But given a b cannot know if it is inside a d or a c.
The reason why you used the object wrongly is that you did not care about their destruction. You allocate them with new, but never delete-ed them afterwards.
And you cannot just call delete baseptr since there is nothing in the b subcomponent that tells what the aggregate it is actually (at runtime) part of.
There are two programming style to come around it:
The classic OOP, assume the actual type is known at runtime, and pretends all your classes to have a virtual destructor: that gives to all the struct an extra "ghost" field (the v-table pointer, that point to a table in the "auxiliary descriptor", containing all the virtual functions' addresses) that makes the destructor call originated by delete to actually be dispatched to the most derived one (hence delete pbase will actually call c::~c or d::~d depending on the actual object)
The Generic programming style, assume you know in some other way (most likely from a template parameter) the actual derived type, so you will not delete pbase, but a static_cast<actual_derived_class*>(pbase)

Inheritance is the abstraction for a method to resuse functions from another class under it. The method can be called from the class if it's located in the class below it. A struct enables you to have variables as in a data structure similar to a class that uses a variable or a function.
class trait
{
//variable definition
//variable declaration
function function_name(variable_type variable_name, and more)
{
//operation on variables in function call
}
variable_name = function_name(variable_name);
struct struct_name
{
//variable definition
}
struct_name = {value_1, value_2, and more}
operation on struct_name.value_1
}

There is a distinction between compile time knowledge and runtime knowledge. Part of the job of the compiler is to make as much use of compile time information as possible to avoid having to do things at run time.
In this case, all the details of exactly where each piece of data is in a given type are known at compile time. So the compiler doesn't need to know it at runtime. Whenever you access a particular member, it just uses its compile time knowledge to compute the appropriate offset for the data you need.
The same thing goes for pointer conversions. It will adjust pointer values when they're converted to make sure the point at the appropriate sub-part.
Part of the reason this works is that the data values from an individual class or struct are never interleaved with any other that aren't mentioned in the class definition, even when that struct is a sub-component of another struct either through composition or inheritance. So the relative layout of any individual struct is always the same no matter where in memory it's found.

Related

Is casting Derived** → Base** wrong? What's the alternative?

Context
My goal is to have a base container class that contains and manipulates several base class objects, and then a derived container class that contains and manipulates several derived class objects. On the advice of this answer, I attempted to do this by having each contain an array of pointers (a Base** and a Derived**), and cast from Derived** to Base** when initializing the base container class.
However, I ran into a problem – despite compiling just fine, when manipulating the contained objects, I'd have segfaults or the wrong methods would be called.
Problem
I have boiled the problem down to the following minimal case:
#include <iostream>
class Base1 {
public:
virtual void doThing1() {std::cout << "Called Base1::doThing1" << std::endl;}
};
class Base2 {
public:
virtual void doThing2() {std::cout << "Called Base2::doThing2" << std::endl;}
};
// Whether this inherits "virtual public" or just "public" makes no difference.
class Derived : virtual public Base1, virtual public Base2 {};
int main() {
Derived derived;
Derived* derivedPtrs[] = {&derived};
((Base2**) derivedPtrs)[0]->doThing2();
}
You may expect this to print "Called Base2::doThing2", but…
$ g++ -Wall -Werror main.cpp -o test && ./test
Called Base1::doThing1
Indeed – the code calls Base2::doThing2, but Base1::doThing1 ends up being called. I've also had this segfault with more complex classes, so I assume it's address-related hijinks (perhaps vtable-related – the error doesn't seem to occur without virtual methods). You can run it here, and see the assembly it compiles to here.
You can see my actual structure here – it's more complex, but it ties it into the context and explains why I need something along the lines of this.
Why does the Derived**→Base** cast not work properly when Derived*→Base* does, and more importantly, what is the right way to handle an array of derived objects as an array of base objects (or, failing that, another way to make a container class that can contain multiple derived objects)?
I can't index before upcasting (((Base2*) derivedPtrs[0])->doThing2()), I'm afraid, since in the full code that array is a class member – and I'm not sure it's a good idea (or even possible) to cast manually in every place contained objects are used in the container classes. Correct me if that is the way to handle this, though.
(I don't think it makes a difference in this case, but I'm in an environment where std::vector is unavailable.)
Edit: Solution
Many of the answers suggested that casting each pointer individually is the only way to have an array that can contain derived objects – and that does indeed seem to be the case. For my particular use case, though, I managed to solve the problem using templates! By supplying a type parameter for what the container class is supposed to contain, instead of having to contain an array of derived objects, the type of the array can be set to the derived type at compile time (e.g. BaseContainer<Derived> container(length, arrayOfDerivedPtrs);).
Here's a version of the broken "actual structure" code above, fixed with templates.
There are many things that make this code pretty awful and contribute to this issue:
Why are we dealing with two-star types in the first place? If std::vector doesn't exist, why not write your own?
Don't use C-style casts. You can cast pointers to completely unrelated types into each other and the compiler is not allowed to stop you (coincidentally, this is exactly what is happening here). Use static_cast/dynamic_cast instead.
Let's assume we had std::vector, for ease of notation. You are trying to cast a std::vector<Derived*> to std::vector<Base*>. Those are unrelated types (the same is true for Derived** and Base**), and casting one to the other is not legal in any way.
Pointer casts from/to derived are not necessarily trivial. If you have struct X : A, B {}, then a pointer to the B base will be different from a pointer to the A base† (and with a vtable in play, possible also different from a pointer to X). They must be, because the (sub)objects cannot reside at the same memory address. The compiler will adjust the pointer value when you cast the pointer. This of course does not/cannot happen for each individual pointer if you (try to) cast an array of pointers.
If you have an array of pointers to Derived and want to get an array of those pointers to Base, then you have to manually cast each one. Since the pointer values will generally be different in both arrays, there is no way to "reuse" the same array.
†(Unless the conditions for empty base optimization are fulfilled, which is not the case for you).
Like others said the problem is that you are not letting the compiler doing its job in adjusting indices, suppose Derived layout in memory is something like (not guaranteed by the standard, just a possible implementation):
| vtable_Base1 | Base1 | vtable_Base2 | Base2 | vtable_Derived | Derived |
Then &derived points to the start of the object, when you normally do
Base2* base = static_cast<Derived*>(&derived)
the compiler knows the offset which Base2 structure has inside Derived type and adjusts the address to point to the start of it.
If, instead, you cast an array of pointers directly the compiler is coercing the type, assuming that your array is storing pointers to Base2 already, but without adjusting them.
A dirty hack which may work or not in your situation is having a method which returns the pointer to itself, eg:
class Base2 {
public:
Base2* base2() { return this; }
}
so that you can do derivedPtrs[0]->base2()->doThing2().
It doesn't work the same reason why this don't work:
struct Base {};
struct Derived : Base { int i; };
int main() {
Derived d[6];
Derived* d2 = d;
Base** b = &d2; // ERROR!
}
Your c-style cast is a bad practice because it didn't warn you of the error. Just don't do that to your code. Your c-style cast was actually a reinterpret_cast in disguise, which is tottaly wrong in the case.
But why cannot you cast an array to derived into an array to base? Simple: they have different layout.
You see, when you iterate on an array of a type, each elements in the array are contiguous in memory. The Derived class may have a size of let's say, 24 bytes, and Base a size of 8:
Derived d[4];
------------------------------------------------------
| D1 | D2 | D3 | D4 |
------------------------------------------------------
Base b[4];
---------------------
| B1 | B2 | B3 | B4 |
---------------------
As you can see, Derived[4] and Base[4] are different types with different layouts.
Then what can you do?
There's many solution in fact. The easiest would be to create a new array of pointer to base, and cast each derived to base pointers. You have to ajust the pointer of each objects anyway.
It would look like this:
std::vector<Base*> bases;
bases.reserve(std::size(derived_arr))
std::transform(
std::begin(derived_arr), std::end(derived_arr),
std::back_inserter(bases),
[](Dervied* d) {
// You must use dynamic cast because the
// pointer offset in only known at runtime
// when using virtual inheritance
return dynamic_cast<Base*>(d);
}
);
The other solution that is in place in memory would be to create you own type of iterator that would do the cast when calling the operator* and the operator->. This is a bit harder to do but can save you an allocation by making iteration a bit slower.
On top all that, this may be unrelated, but I would advise against virtual inheritance. This is not a good practice and in my experience resulted in more pain than anything. I would suggest using the adaptator pattern and wrap non-polymorphic types instead.
When you cast from Derived* to Base* the compiler will adjust the value. When you cast Derived** to Base** you defeat that.
This is a good reason to always use static_cast. Example error with your code changed:
test.cpp:19:5: error: static_cast from 'Derived **' to 'Base2 **' is not
allowed
static_cast<Base2**>(derivedPtrs)[0]->doThing2();
(Base2 **) is in fact a reinterpret_cast (You can confirm this by trying all four casts) and that expression causes UB. The fact that you can implicitly cast a pointer to derived to a pointer to base doesn't mean that they are the same, e.g. int and float. And here, you are referring an object by the type which it isn't, in this case, causes an UB.
Calling a virtual function this way results the final overrider of that object being called. How this is achieved depends on the compiler.
Assume that the compiler uses a "vtable", finding the vtable of Base2 (might be adding an offset to the memory. assembly) from the memory address of a Derived is not trivial.
Basically, you must perform a dynamic cast on every pointer, store them somewhere or dynamic cast it when in need.

C++ base class pointer calls child virtual function, why could base class pointer see child class member

I think I might be confusing myself. I know class with virtual functions in C++ has a vtable (one vtable per class type), so the vtable of Base class will have one element &Base::print(), while the vtable of Child class will have one element &Child::print().
When I declare my two class objects, base and child, base's vtable_ptr will pointer to Base class's vtable, while child's vtable_ptr will point to Child class's vtable. After I assign the address of base and child to an array of Base type pointer. I call base_array[0]->print() and base_array[1]->print(). My question is, both base_array[0] and base_array[1] is of type Base*, during run-time, although the v-table lookup will gives the correct function pointer, how could a Base* type see the element in Child class? (basically value 2?). When I call base_array[1]->print(), base_array[1] is of type Base*, but during run time it finds out it will use Child class print(). However, I am confused why value2 can be accessed during this time, because I am playing with type Base*..... I think I must miss something somewhere.
#include "iostream"
#include <string>
using namespace std;
class Base {
public:
int value;
string name;
Base(int _value, string _name) : value(_value),name(_name) {
}
virtual void print() {
cout << "name is " << name << " value is " << value << endl;
}
};
class Child : public Base{
public:
int value2;
Child(int _value, string _name, int _value2): Base(_value,_name), value2(_value2) {
}
virtual void print() {
cout << "name is " << name << " value is " << value << " value2 is " << value2 << endl;
}
};
int main()
{
Base base = Base(10,"base");
Child child = Child(11,"child",22);
Base* base_array[2];
base_array[0] = &base;
base_array[1] = &child;
base_array[0]->print();
base_array[1]->print();
return 0;
}
The call to print through the pointer does a vtable lookup to determine what actual function to call.
The function knows the actual type of the 'this' argument.
The compiler will also insert code to adjust to the actual type of the argument (say you have class child:
public base1, public base2 { void print(); };
where print is a virtual member inherited from base2. In that case the relevant vtable will not be at offset 0 in child so an adjustment will be needed to translate from the stored pointer value to the correct object location).
The data needed for that fix-up is generally stored as part of hidden run-time type information (RTTI) blocks.
I think I must miss something somewhere
Yes, and you got most of the stuff right until the end.
Here is a reminder of the really basic stuff in C/C++ (C and C++: same conceptual heritage so a lot of basic concepts are shared, even if the fine details diverge significantly at some point). (That may be really obvious and simple, but it's worth saying it loudly to feel it.)
Expressions are part of the compiled program, they exist at compile time; objects exist at run time. An object (the thing) is designated by expression (the word); they are conceptually different.
In traditional C/C++, an lvalue (short for left-value) is an expression whose runtime evaluation designates an object; dereferencing a pointer gives an lvalue, (f.ex. *this). It's called "left value" because the assignment operator on the left requires an object to assign to. (But not all lvalue can be at the left of the assignment operator: expressions designating const objects are lvalues, usually cannot be assigned to.) Lvalues always have a well defined identity and most of them has an address (only members of struct declared as bit-field cannot have their address taken, but the underlying storage object still has an address).
(In modern C++, the lvalue concept was renamed glvalue, and a new concept of lvalue was invented (instead of making a new term for the new concept and keeping the old term of concept of object with an identity that may or may not be modifiable. That was in my not so humble opinion a serious error.)
The behavior of a polymorphic object (an object of a class type with at least one virtual function) depends of its dynamic type, that its type of started construction (the name of the constructor of the object that started to construct data members, or entered the constructor body). During the execution of the body of the Child constructor, the dynamic type of the object designed by *this is Child (during the execution of the body of a base class constructor, the dynamic type is that of the base class constructor running).
Dynamic polymorphic means that you can use a polymorphic object with an lvalue whose declared type (type deduced at compile time from the rules of the language) isn't exactly the same type, but a related type (related by inheritance). That's the whole point of the virtual keyword in C++, without that it would be completely useless!
If base_array[i] contains an address of an object (so its value is well defined, not null), you can dereference it. That gives you an lvalue whose declared type is always Base * by definition: that's the declared type, the declaration of base_array being:
Base (*(base_array[2])); // extra, redundant parentheses
which can of course be written
Base* base_array[2];
if you want to write it that way, but the parse tree, the way the declaration is decomposed by the compiler is NOT
{ Base* } { base_array[2] }
(using bold face curly brace to symbolically represent parsing)
but instead
Base { * { { base_array } [2] } }
I hope you understand that the curly braces here are my choice of meta language NOT the curly braces used in the language grammar to define classes and functions (I don't know how to draw boxes around text here).
As a beginner, it's important that you "program" your intuition correctly, to always read declarations like the compiler does; if you ever declare two identifier on the same declaration, the difference is important int * a, b; means int (*a), b; AND NOT int (*a), (*b);
(Note: even if that might be clear to you the OP, as this is clearly a question of interest to beginners in C++, that reminder of C/C++ declaration syntax might be of use of someone else.)
So, going back to the issue of polymorphism: an object of a derived type (name of the most recently entered constructor) can be designated by an lvalue of a base class declared type. The behavior of virtual function calls is determined by the dynamic type (also called real type) of the object designated by the expression, unlike the behavior of non virtual function calls; that's the semantic defined by the C++ standard.
The way the compiler gets the semantic defined by a language standard is its own problem and not described in the language standard, but when there is only one efficient simple to do it, all compilers do it essentially the same way (the fine details are compiler-specific) with
one virtual function table ("vtable") per polymorphic class
one pointer to a vtable ("vptr") per polymorphic object
(Both vtable and vptr are obviously implementation concepts and not language concepts, but they are so common that every C++ programmer knows them.)
The vtable is a description of the polymorphic aspects of the class: the runtime operations on an expression of a given declared type whose behavior depends on the dynamic type. There is one entry for each runtime operation. A vtable is like a struct (record) with one member (entry) per operation (all entries are usually pointers of the same size, so many people describe the vtable as an array of pointers, but I don't, I describe it as a struct).
The vptr is a hidden data member (a data member without a name, not accessible by C++ code), whose position in an object is fixed like any other data member, that can be read by the runtime code when an lvalue of polymorphic class type (call it D for "declared type") is evaluated. Dereferencing a vptr in D gives you a vtable describing a D lvalue, with entries for each runtime aspect of an lvalue of type D. By definition, the location of the vptr and the interpretation of the vtable (layout and use of its entries) are completely determined by the declared type D. (Obviously no information necessary for the use and interpretation of the vptr can be a function of the runtime type of an object: the vptr is used when that type is not known.)
The semantics of the vptr is the set of guaranteed valid runtime operations on the vptr: how the vptr can be dereferenced (the vptr of an existing object always points to a valid vtable). It's the set of properties of the form: by adding offset off to vptr value, you get a value that can be used in "such way". These guarantees form a runtime contract.
The most obvious runtime aspect of a polymorphic object is calling virtual function, so there is an entry in a vtable for D lvalue for each virtual function that can be called on an lvalue of type D, that is an entry for each virtual function declared either in that class or in a base class (not counting overriders as they are the same). All non static member functions have a "hidden" or "implicit" argument, the this parameter; when compiling it becomes a normal pointer.
Any class X derived from D will have a vtable for D lvalues. For efficiency in common case of simple case of normal (non virtual) single inheritance, the semantics of the vptr of the base class (that we then call a primary base class) will be augmented with new properties, so the vtable for X will be augmented: the layout and semantics of the vtable for D will be augmented: any property of a vtable for D is also a property of the vtable for X, the semantic will be "inherited": there's an "inheritance" of the vtables in parallel with the inheritance inside the classes.
In logical terms, there's an increase of guarantees: the guarantees of the vptr of a derived class object are stronger than the guarantees of the vptr of a base class object. Because it's stronger contract, all code generated for a base lvalue is still valid.
[In more complex inheritance, that is either virtual inheritance, or non virtual
secondary inheritance (in multiple inheritance, inheritance from a secondary base, that is any base that is not defined as "primary base"), the augmentation of the semantics of the vtable of the base class is not so simple.]
[One way to explain C++ classes implementation is as a translation to C (indeed the first C++ compiler was compiling to C, not to assembly). The translation of a C++ member function is simply a C function where the implicit this parameter is explicit, a normal pointer parameter.]
The vtable entry for a virtual function for D lvalue is just a pointer to a function with as parameter the now explicit this parameter: that parameter is a pointer to D, it actually points to the D base subobject of an object of class derived from D, or an object of actual dynamic type D.
If D is a primary base of X, that is one that starts at the same address as the derived class, and where the vtable starts at the same address, so the vptr value is the same, and the vptr is shared between a primary base and the derived class. It means that virtual calls (calls on lvalue that go through vtable) to virtual functions in X that replace identically (that override with the same return type) just follow the same protocol.
(Virtual overriders can have a different, covariant return type and a different call convention might be used in that case.)
There are other special vtable entries:
Multiple virtual call entries for a given virtual function signature if the overrider has a covariant return type that requires and adjustment (that is not a primary base).
For special virtual functions: when the delete operator used on a polymorphic base with a virtual destructor, it is done through a deleting virtual destructor, to call the correct operator delete (replaced delete if there is one).
There is also a non deleting virtual destructor that is used for explicit destructor calls: l.~D();
The vtables stores the offsets for each virtual base subobjects, for implicit conversion to a virtual base pointer, or for accessing its data members.
There is the offset of the most derived object for dynamic_cast<void*>.
An entry for the typeid operator applied to a polymorphic object (notably, the name() of the class).
Enough information for dynamic_cast<X*> operators applied to a pointer to polymorphic object to navigate the class hierarchy at runtime, to locate a given base class or derived subobject (unless X isn't simply a base class of the cast type as that is without dynamically navigating the hierarchy).
This is just an overview of the information present in vtable and the kinds of vtable, there are other subtleties. (Virtual bases are notably more complex than non virtual bases at the implementation level.)
I think you might be confusing the way a pointer is declared with the type of the object it happens to be pointing to.
Forget about vtables for a moment. They're an implementation detail. They are just a means to an end. Let's look at what your code is actually doing.
So, with reference to the code you posted, this line:
base_array[0]->print();
calls into Bases implementation of print(), because the object pointed to is of type Base.
Whereas this line:
base_array[1]->print();
calls into Childs implementation of print(), because (yes, you guessed it) the object pointed to is of type Child. You don't need any fancy type casts to make this happen. It will just happen anyway, provided the method is declared virtual.
Now, inside the body of Base::print(), the compiler doesn't know (or care) whether this points to an object of type Base or an object of type Child (or any other class derived from Base, in the general case). It therefore follows that it can only access data members declared by Base (or any parent classes of Base, if there were any). Once you understand that, it's all simple enough.
But inside the body of Child::print(), the compiler does know a bit more about what this is pointing to - it has to be an instance of class Child (or some other class derived from Child). So now, the compiler can safely access value2 - inside the body of Child::print() - and your example therefore compiles correctly.
I think that's about it really. The vtable is only there to dispatch to the correct virtual method when you call that method through a pointer whose type is not known at compile time, as your example code is indeed doing. (*)
(*) Well, almost. Optimising compilers are getting pretty funky these days, there is actually enough information there for it you call the relevant method direct but please don't let that confuse the issue in any way.

RTTI and virtual functions in c++ . Is the implementation approach of gcc necessary?

While trying to understand the inner workings of virtual function and RTTI, I observed the subsequent fact by examining the gcc compiler:
When structs or classes have a virtual function than the space occupied by them get's expanded by preciding them with a pointer that determines their type.
More over and worse, when multiple inheritance is active, each substructure adds another pointer. So e.g. a structure as:
struct C: public A, public B{...}
creates a structure with the data of A and B plus two pointers.
That is memory and efficiency consuming.
The question is if that is needed.
I don't have experience in working with those arguments but the logic thougths that I would like to ask if are buggy are as follows:
a) when we declare a variable of a certain struct type we know it's type. So when we have to pass that variable by value to another function we know it's type and we may pass it as a plain old structure along with an extra value indicating it's type ( I mean the compiler should take care of that).
b) When we pass a struct by reference or by pointer we know again the starting type and we may pass a pointer along with an extra value indicating the type.
In order to assign a struct passed by value to a narrower type we have just to take the necessary subpart (offsets calculation).
In order to change to a narrower pointer we have to adjust the pointer and change the extra value indicating the type accordingly.
So along that line of thoughs, we need to have the extra type indication only when passing called arguments between functions on the stack and it isn't needed to save them in each struct.
What am am I missing ?
The only case that I see would need the type information stored along with the data is when we use a Union and we need to save any kind of a certain specific list of structures but tha is any case resolved by other means.
So the question is if dynamic type information is needed besides when passing arguments between functions, and if and where eventually that is mandated by the standard.
What am am I missing ?
rtti and dynamic_cast need to be able to deduce the layout of the object from any one of its interfaces.
How will you do that unless each interface encapsulates a pointer to this information?
example of RTTI:
struct A { virtual ~A() = default; }; // virtual base class
struct B : A {}; // derived polymorphic class
A* p = new B();
std::cout << typeid(*p).name() << std::endl;
Exercise for reader:
Which class's name gets printed?
Why?
I think, following code will explain it all:
// in header.h
struct Base {
virtual void foo() = 0;
};
// in impl.cpp
void work(Base* b) {
b->foo();
}
Assume the code above gets compiled into a C++ library. Never touched afterwards.
Now, some time later, we write following code:
#include <header.h>
struct D : Base {
void foo() { std::cout << "Foo!\n"; }
};
void my() {
D* d = new D;
work(d);
}
As you see, we can't possibly pass any type to work() - when it was compiled, D didn't even exist! The elegance of virtual functions through pointers to virtual function table - this is what those pointers are - is that above design is allowed.

How does typeid work and how do objects store class information?

http://en.wikipedia.org/wiki/Typeid
This seems to be a mystery to me: how does a compiler stores information about the type of an object ? Basically an empty class, once instantiated, has not a zero size in memory.
How it is stored is implementation-defined. There are many completely different ways to do it.
However, for non-polymorphic types nothing needs to be stored. For non-polymorphic types typeid returns information about the static type of the expression, i.e. its compile-time type. The type is always known at compile-time, so there's no need to associate any additional information with specific objects (just like for sizeof to work you don't really need to store the object size anywhere). "An empty object" that you mention in your question would be an object of non-polymorphic type, so there's no need to store anything in it and there's no problem with it having zero size. (Meanwhile, polymorphic objects are never really "empty" and never have "zero size in memory".)
For polymorphic types typeid does indeed return the information about the dynamic type of the expression, i.e. about its run-time type. To implement this something has to be stored inside the actual object at run-time. As I said above, different compilers implement it differently. In MSVC++, for one example, the VMT pointer stored in each polymorphic object points to a data structure that contains the so called RTTI - run-time type information about the object - in addition to the actual VMT.
The fact that you mention zero size objects in your question probably indicates that you have some misconceptions about what typeid can and cannot do. Remember, again, typeid is capable of determining the actual (i.e. dynamic) type of the object for polymorphic types only. For non-polymorphic types typeid cannot determine the actual type of the object and reverts to primitive compile-time functionality.
Imagine every class as if it has this virtual method, but only if it already has one other virtual, and one object is created for each type:
extern std::type_info __Example_info;
struct Example {
virtual std::type_info const& __typeid() const {
return __Example_info;
}
};
// "__" used to create reserved names in this pseudo-implementation
Then imagine any use of typeid on an object, typeid(obj), becomes obj.__typeid(). Use on pointers similarly becomes pointer->__typeid(). Except for use on null pointers (which throws bad_typeid), the pointer case is identical to the non-pointer case after dereferencing, and I won't mention it further. When applied directly on a type, imagine that the compiler inserts a reference directly to the required object: typeid(Example) becomes __Example_info.
If a class does not have RTTI (i.e. it has no virtuals; e.g. NoRTTI below), then imagine it with an identical __typeid method that is not virtual. This allows the same transformation into method calls as above, relying on virtual or non-virtual dispatch of those methods, as appropriate; it also allows some virtual method calls to be transformed into non-virtual dispatch, as can be performed for any virtual method.
struct NoRTTI {}; // a hierarchy can mix RTTI and no-RTTI, just as use of
// virtual methods can be in a derived class even if the base
// doesn't contain any
struct A : NoRTTI { virtual ~A(); }; // one virtual required for RTTI
struct B : A {}; // ~B is virtual through inheritance
void typeid_with_rtti(A &a, B &b) {
typeid(a); typeid(b);
A local_a; // no RTTI required: typeid(local_a);
B local_b; // no RTTI required: typeid(local_b);
A &ref = local_b;
// no RTTI required, if the compiler is smart enough: typeid(ref)
}
Here, typeid must use RTTI for both parameters (B could be a base class for a later type), but does not need RTTI for either local variable because the dynamic type (or "runtime type") is absolutely known. This matches, not coincidentally, how virtual calls can avoid virtual dispatch.
struct StillNoRTTI : NoRTTI {};
void typeid_without_rtti(NoRTTI &obj) {
typeid(obj);
StillNoRTTI derived; typeid(derived);
NoRTTI &ref = derived; typeid(ref);
// typeid on types never uses RTTI:
typeid(A); typeid(B); typeid(NoRTTI); typeid(StillNoRTTI);
}
Here, use on either obj or ref will correspond to NoRTTI! This is true even though the former may be of a derived class (obj could really be an instance of A or B) and even though ref is definitely of a derived class. All of the other uses (the last line of the function) will also be resolved statically.
Note that in these example functions, each typeid uses RTTI or not as the function name implies. (Hence the commented-out uses in with_rtti.)
Even when you do not use type information, an empty class will not have zero bytes, it always has something, if I remember correct the standard demands that.
I believe the typeid is implemented similar to a vtable pointer, the object will have a "hidden" pointer to its typeid.
There are several questions in your one question.
In C++ objects is something that occupies memory. If it does not occupy any memory - it is not an object (although base class sub-object can occupy no space). So, an object has to occupy at least 1 byte.
A compiler does not store any type information unless your class has a virtual function. In that case a pointer to type information is often stored at a negative offset in the virtual function table. Note that the standard does not mention any virtual tables or type information format so it is purely an implementation detail.

Why do we need "this pointer adjustor thunk"?

I read about adjustor thunk from here. Here's some quotation:
Now, there is only one QueryInterface
method, but there are two entries, one
for each vtable. Remember that each
function in a vtable receives the
corresponding interface pointer as its
"this" parameter. That's just fine for
QueryInterface (1); its interface
pointer is the same as the object's
interface pointer. But that's bad news
for QueryInterface (2), since its
interface pointer is q, not p.
This is where the adjustor thunks come
in.
I am wondering why "each function in a vtable receives the corresponding interface pointer as its "this" parameter"? Is it the only clue(base address) used by the interface method to locate data members within the object instance?
Update
Here is my latest understanding:
In fact, my question is not about the purpose of this parameter, but about why we have to use the corresponding interface pointer as the this parameter. Sorry for my vagueness.
Besides using the interface pointer as a locator/foothold within an object's layout. There're of course other means to do that, as long as you are the implementer of the component.
But this is not the case for the clients of our component.
When the component is built in COM way, clients of our component know nothing about the internals of our component. Clients can only take hold of the interface pointer, and this is the very pointer that will be passed into the interface method as the this parameter. Under this expectation, the compiler has no choice but to generate the interface method's code based on this specific this pointer.
So the above reasoning leads to the result that:
it must be assured that each function
in a vtable must recieve the
corresponding interface pointer as its
"this" parameter.
In the case of "this pointer adjustor thunk", 2 different entries exist for a single QueryInterface() method, in other words, 2 different interface pointers could be used to invoke the QueryInterface() method, but the compiler only generate 1 copy of QueryInterface() method. So if one of the interfaces is chosen by the compiler as the this pointer, we need to adjust the other to the chosen one. This is what the this adjustor thunk is born for.
BTW-1, what if the compiler can generate 2 different instances of QueryInterface() method? Each one based on the corresponding interface pointer. This won't need the adjustor thunk, but it would take more space to store the extra but similar code.
BTW-2: it seems that sometimes a question lacks a reasonable explanation from the implementer's point of view, but could be better understood from the user's pointer of view.
Taking away the COM part from the question, the this pointer adjustor thunk is a piece of code that makes sure that each function gets a this pointer pointing to the subobject of the concrete type. The issue comes up with multiple inheritance, where the base and derived objects are not aligned.
Consider the following code:
struct base {
int value;
virtual void foo() { std::cout << value << std::endl; }
virtual void bar() { std::cout << value << std::endl; }
};
struct offset {
char space[10];
};
struct derived : offset, base {
int dvalue;
virtual void foo() { std::cout << value << "," << dvalue << std::endl; }
};
(And disregard the lack of initialization). The base sub object in derived is not aligned with the start of the object, as there is a offset in between[1]. When a pointer to derived is casted to a pointer to base (including implicit casts, but not reinterpret casts that would cause UB and potential death) the value of the pointer is offsetted so that (void*)d != (void*)((base*)d) for an assumed object d of type derived.
Now condider the usage:
derived d;
base * b = &d; // This generates an offset
b->bar();
b->foo();
The issue comes when a function is called from a base pointer or reference. If the virtual dispatch mechanism finds that the final overrider is in base, then the pointer this must refer to the base object, as in b->bar, where the implicit this pointer is the same address stored in b. Now if the final overrider is in a derived class, as with b->foo() the this pointer has to be aligned with the beginning of the sub object of the type where the final overrider is found (in this case derived).
What the compiler does is creating an intermediate piece of code. When the virtual dispatch mechanism is called, and before dispatching to derived::foo the intermediate call takes the this pointer and substracts the offset to the beginning of the derived object. This operation is the same as a downcast static_cast<derived*>(this). Remember that at this point, the this pointer is of type base, so it was initially offsetted, and this effectively returns the original value &d.
[1]There is an offset even in the case of interfaces --in the Java/C# sense: classes defining only virtual methods-- as they need to store a table to that interface's vtable.
Is it the only clue(base address) used by the interface method to locate data members within the object instance?
Yes, that's really all there is to it.
Yes, this is essential for finding where the object start is. You write in your code:
variable = 10;
where variable is the member variable. First of all, which object does it belong to? It belongs to the object pointed to by this pointer. So it's actually
this->variable = 10;
now C++ needs to generate code that will actuall do the job - copy data. In order to do that it needs to know the offset between the object start and the member variable. The convention is that this always points onto the object start, so the offset can be constant:
*(reinterpret_cast<int*>( reinterpret_cast<char*>( this ) + variableOffset ) ) = 10; //assuming variable is of type int
I think that it is important to point that in C++ there is no such entity as "interface pointer" or anything close to that. It is an idiom at best built on the concept of a restricted abstract class but still remaining a class. As a such all rules applying to class members and handling 'this' still apply unchanged.
So principally an interface class must behave like stand-alone classes of given type regardless of their function and eventual inheritance hierarchy.
We can use the virtual method call mechanism to get to the actual (dynamic type) of the object exposed by an (interface) base class. How it is done is implementation specific including such concepts like the Virtual Method Table and "adjustor thunks". Generally compiler may use its initial 'this' pointer to locate the VMT and then the actual implementation of a given function and call it with eventual adjustment of the 'this' pointer. The thunk adjustment generally is needed to perform the final call if the memory layout of the base class is different than a derived one to which reference we hold like in the case of multiple inheritance.