Lua, C++ and disappearing metatables

Background
I work with Watusimoto on the game Bitfighter. We use a variation of LuaWrapper to connect our c++ objects with Lua objects in the game. We also use a variation of Lua called lua-vec to speed up vector operations.
We have been working to solve a bug for some time that has eluded us. Random crashes will occur that suggest corrupt metatables. See here for Watusimoto's post on the issue. I'm not sure it is because of a corrupt metatable and have seen some really odd behavior about which I wish to ask here.
The Problem Manifestation
As an example, we create an object and add it to a level like this:
t = TextItem.new()
t:setText("hello")
levelgen:addItem(t)
However, the game will sometimes (not always) crash with an error:
attempt to call missing or unknown method 'addItem' (a nil value)
Using a suggestion given in answer to Watusimoto's post mentioned above, I have changed the last line to the following:
local ok, res = pcall(function() levelgen:addItem(t) end)
if not ok then
local s = "Invalid levelgen value: "..tostring(levelgen).." "..type(levelgen).."\n"
for k, v in pairs(getmetatable(levelgen)) do
s = s.."meta "..tostring(k).." "..tostring(v).."\n"
end
error(res..s)
end
This prints out the metatable for levelgen if something went wrong calling a method from it.
However, and this is crazy, when it fails and prints out the metatable, the metatable is exactly how it should be (with the correct addItem call and everything). If I print the metatable for levelgen upon script load, and when it fails using pcall above, they are identical, every call and pointer to userdata is the same and as it should be.
It is as though the metatable for levelgen is spontaneously disappearing at random.
Would anyone have any idea what is going on?
Thank you
Note: This doesn't happen with only the levelgen object. For instance, it has happened on the TextItem object mentioned above as well. In fact, that same code crashes on my computer at the line levelgen:addItem(t) but crashes on another developer's computer at the line t:setText("hello") with the same error message: missing or unknown method 'setText' (a nil value).

As with any mystery, you will need to peel it off layer by layer. I recommend going through the same steps Lua goes through and trying to detect where the path taken diverges from your expectations:
What does getmetatable(levelgen).__index return? If it's a table, then check its content for addItem. If it's a function, then try to call it with (table, "addItem") and see what it returns.
Check if getmetatable returns reference to the same object before and after the call (or when it fails).
Are there several levels of metatable indirection that the call is going through? If so, try to follow the same path with explicit calls and see where the differences are.
Are you using weak keys that may cause values to disappear if there are no other references?
Can you provide a "default" value when you detect that it fails and continue to see if it "finds" this method again later? Or when it's broken, it's broken for every call after that?
What if you save a proper value for addItem and "fix" it when you detect it's broken?
What if you simply handle the error (as you do) and call it 10 times? Would it show valid results at least once (after it fails)? 100 times? If you keep calling the same method when it works, will it fail? This may help you to come up with a more reproducible error.
I'm not familiar with LuaWrapper to provide more specific questions, but these are the steps I'd take if I were you.

I strongly suspect the issue is that you have a class or struct similar to this:
struct Foo
{
Bar bar;
// Other fields follow
};
And that you've exposed both Foo and Bar to Lua via LuaWrapper. The important bit here is that bar is the first field on your Foo struct. Alternatively, you may have some class that inherits from some other base class and both the derived and base class are exposed to LuaWrapper.
LuaWrapper uses a function called an Identifier to uniquely track each object (for example, to tell whether a given object has already been added to the Lua state). By default it uses the object address as a key. In cases like the one posed above it is possible that both Foo and Bar have the same address in memory, and thus LuaWrapper can get confused.
This may result in grabbing the wrong object's metatable when attempting to look up a method. Clearly, since it's looking at the wrong metatable it won't find the method you want, and so it will appear as if your metatable has mysteriously lost entries.
I've checked in a change that tracks each object's data per-type rather than in one giant pile. If you update your copy of LuaWrapper to the latest one from the repository I'm fairly certain your problem will be fixed.
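To make the address collision concrete, here is a minimal sketch. It is not LuaWrapper's code: the two registries, track() and the lookup helpers are hypothetical stand-ins that only contrast a key made of the address alone with a key made of the type plus the address.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

struct Bar { int value = 0; };

struct Foo
{
    Bar bar;        // first member: shares its address with the enclosing Foo
    int other = 0;
};

// Hypothetical registry keyed only by address (the default scheme described above).
std::map<const void*, std::string> byAddress;

// Hypothetical registry keyed by (type, address), so two types at one address stay distinct.
std::map<std::pair<std::type_index, const void*>, std::string> byTypeAndAddress;

template <typename T>
void track(const T* obj, const std::string& metatableName)
{
    byAddress[obj] = metatableName;  // a later registration at the same address overwrites
    byTypeAndAddress[{std::type_index(typeid(T)), obj}] = metatableName;
}

template <typename T>
std::string lookupByAddress(const T* obj)
{
    return byAddress.at(obj);
}

template <typename T>
std::string lookupByTypeAndAddress(const T* obj)
{
    return byTypeAndAddress.at({std::type_index(typeid(T)), obj});
}

// Returns true when the address-only registry hands back the wrong name for foo
// while the per-type registry still returns the right one.
bool addressOnlyLookupIsConfused()
{
    Foo foo;
    track(&foo, "Foo");
    track(&foo.bar, "Bar");  // same address as &foo, so the "Foo" entry is replaced
    return lookupByAddress(&foo) != "Foo"
        && lookupByTypeAndAddress(&foo) == "Foo";
}
```

With the address-only key, looking up the Foo returns the entry registered for its Bar member, which matches the "method is missing" symptom even though every metatable is intact.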

After merging with upstream (commit 3c54015) LuaWrapper, this issue has disappeared. It appears to have been a bug in LuaWrapper.
Thanks Alex!

Related

Trouble breaking down a class and passing pointers

I have working code in a single class that looks like this
Mesh* player_; //
renderer_3d_->DrawSkinnedMesh(*player_, player_->bone_matrices());
It seems straightforward, but I'm having trouble introducing a vector of enemies that goes through other classes, thanks to the pointers.
I have two extra classes, enemy and manager. Enemy contains a SkinnedMeshInstance, and manager should worry about drawing it.
Manager
std::vector <Enemy> enemy_;
enemy_.push_back(Enemy(*platform_)); // Initialise (platform is required by the default constructor, not relevant to the pointer issue)
rend->DrawSkinnedMesh(enemy_[0].getSkinnedMesh(), enemy_[0].getSkinnedMesh().bone_matrices()); //render, intellisense only accepts it this way
Enemy
Mesh* mesh_instance_;
Mesh getSkinnedMesh() { return *mesh_instance_; };
What am I doing wrong here? Notice how rendering has changed how it's dereferenced. This way doesn't work as it throws some illegal access errors either at 'return *mesh_instance_;' or deeper in the framework itself, depending on how I try to change the communication. Might be simple for some but I feel like I tried everything possible.
Solved. I didn't initialise my pointer as NULL.
Dunde, I'd like to beg the indulgence of the group to pass along a couple of "lessons painfully learned" about code such as this excerpt:
Mesh getSkinnedMesh() { return *mesh_instance_; };
First, of course, you know that you need to be absolutely-certain that variables such as mesh_instance really are global, and that they really are initialized to NULL.
Now, since you know that a call to this "accessor function" is going to take place each and every time someone needs to "get a skinned mesh," you should always take full advantage of this opportunity. If you need to "automagically create a new object-instance," do so now. And (IMHO), if there's something that you can do to verify that the pointer-value is "good" (using a try construct to detect a runtime-error that occurs while doing so), do it. In this way, you now will be able to say: "if this method returns a value to you at all, then the value which it returns is good, because otherwise it's gonna blow up."
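As a rough sketch of that advice, and not the poster's actual classes: Mesh here is a hypothetical stand-in for the framework type, and setMesh() is an invented helper. The pointer starts out null, the accessor returns a reference instead of a copy, and a missing mesh is reported with an exception. Note that only a null pointer can be checked this way; a dangling or garbage pointer cannot be detected by a try block in C++.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the framework's mesh type.
struct Mesh
{
    std::vector<float> bones;
    const std::vector<float>& bone_matrices() const { return bones; }
};

class Enemy
{
public:
    Enemy() : mesh_instance_(nullptr) {}  // never leave the pointer indeterminate

    // Hypothetical helper; the Enemy does not own the mesh.
    void setMesh(Mesh* mesh) { mesh_instance_ = mesh; }

    // Return a reference, not a copy, and fail loudly if the mesh is missing.
    Mesh& getSkinnedMesh() const
    {
        if (mesh_instance_ == nullptr)
            throw std::logic_error("Enemy has no mesh assigned");
        return *mesh_instance_;
    }

private:
    Mesh* mesh_instance_;  // non-owning; the caller keeps the Mesh alive
};

// True when the accessor reports a missing mesh instead of dereferencing null.
bool missingMeshIsReported()
{
    Enemy enemy;
    try { enemy.getSkinnedMesh(); }
    catch (const std::logic_error&) { return true; }
    return false;
}
```

Returning Mesh& also avoids copying the mesh on every draw call, which the by-value accessor in the question does.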
In practice, the hardest thing about any programming bug is – knowing that it exists, and where it is. "If you've got a traceback, you're home free."

Why is QueryInterface looking at two different COM projects for the same line of code?

Let me start off by saying I am VERY inexperienced with the workings of COM, but I have been tasked with debugging an issue for someone else.
I have two COM projects named pvTaskCOM and pvFormsCOM and each has many Interfaces, but the two I am concerned with are:
ITaskActPtr which is in pvTaskCOM
IChartingObjectPtr which is in pvFormsCOM
The line of code causing my problem is:
ITaskActPtr pTaskAct = m_pChartObj;
Where m_pChartObj is an IChartingObjectPtr. The problem I was encountering was that pTaskAct was NULL after this assignment in one workflow, but fine in most other workflows. I dived into what is happening here using the debugger and found it is looking at the wrong COM entries during the QueryInterface. In the workflows that work fine, QueryInterface grabs entries from pvTaskCOM/pvTaskAct.h:
BEGIN_COM_MAP(CTaskAct)
COM_INTERFACE_ENTRY(ITaskAct)
.
.
.
END_COM_MAP()
Which contains the Interface I'm trying to cast to, and QueryInterface returns S_OK.
But in this other workflow m_pChartObj is instantiated in the same way, but QueryInterface for some strange reason looks inside pvFormsCOM/ChartingObject.h
BEGIN_COM_MAP(CChartingObject)
COM_INTERFACE_ENTRY(IChartingObject)
.
.
.
END_COM_MAP()
which does NOT contain the ITaskAct we are trying to cast to, and so QueryInterface returns E_NOINTERFACE.
The question I have is what could cause it to be looking at two different COM's for the same line of code? Is it some sort of inheritance issue? I just need a step in the right direction.
"In the workflows that work fine, QueryInterface grabs entries from pvTaskCOM/pvTaskAct.h"
It shouldn't be.
This line:
ITaskActPtr pTaskAct = m_pChartObj;
Is doing this under the hood:
ITaskAct *pTaskAct = NULL;
m_pChartObj->QueryInterface(IID_ITaskAct, (void**)&pTaskAct);
It is asking the IChartingObject's implementing object if it supports the ITaskAct interface, and if so to return a pointer to that implementation. So this code should only be looking at the entries of the COM_MAP for the CChartingObject class. It should not be looking at the CTaskAct class at all.
"But in this other workflow m_pChartObj is instantiated in the same way, but QueryInterface for some strange reason looks inside pvFormsCOM/ChartingObject.h"
That is the correct behavior, since that is where CChartingObject is actually implemented. If there is no entry for ITaskAct in the COM_MAP of CChartingObject, then the correct behavior is for CChartingObject::QueryInterface() to fail with an E_NOINTERFACE error.
So, the real problem is that your "working" workflows are actually flawed, and your "non working" workflow is doing the correct thing.
"what could cause it to be looking at two different COM's for the same line of code? Is it some sort of inheritance issue?"
No. The "working" workflows are corrupted, plain and simple. Calling QueryInterface() on an IChartingObject interface should be calling CChartingObject::QueryInterface(), but it is clearly calling CTaskAct::QueryInterface() instead. So either
the IChartingObject* pointer is actually pointing at a CTaskAct object instead of a CChartingObject object
something has corrupted memory and the IChartingObject's vtable is the unsuspecting victim.
I would suspect the former. So, in the "working" workflows, make sure the IChartingObject* pointer is actually pointing at the correct object. It sounds like someone took an ITaskAct* and type-casted it to a IChartingObject* without using QueryInterface(). Or they called QueryInterface() on some object and asked it for IID_ITaskAct instead of IID_IChartingObject but then saved the returned pointer in an IChartingObject* pointer instead of an ITaskAct* pointer.
You are probably getting a bit lost in the plumbing. This is C++ code that was meant to make COM a bit less draconian. An important aspect of COM is that client code only ever works with interfaces. It doesn't know anything about objects. An interface is a simple contract, a list of functions that you can call. IChartingObject would have, say, a Paint() function. ITaskAct would have, no real idea, something "tasky", a Schedule() function.
Note how m_pChartObj is a pretty misleading name. It stores an interface pointer, not an object. But not uncommon, it is easy to think of an interface pointer as an object pointer if the object implements only one interface or has a "dominant" interface that you'd use all the time. Hiding the object inside the server code is a very strong goal in COM, you can only ever make interface calls.
So the ITaskActPtr pTaskAct = m_pChartObj; basically announces, "I have a chart, I want to make task function calls next". Like Schedule(). That requires COM to ask the chart object implementation "do you know anything about the task interface contract?". Inevitably it has to consult back to the server, in the interface map for the CChartingObject where IChartingObject came from, to see if it also implements ITaskAct.
So what you see happening is entirely normal. The answer is "no".
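The point that the query is answered by whatever object sits behind the interface pointer can be sketched without any COM headers. Everything below is a simplified mock under that assumption: Unknown, ChartingObject, TaskAct and supportsTaskAct are hypothetical names, and real COM uses GUIDs, HRESULTs and reference counting rather than strings and bools.

```cpp
#include <cassert>
#include <set>
#include <string>

// Simplified stand-in for IUnknown.
struct Unknown
{
    virtual ~Unknown() = default;
    // Returns true when the object behind this pointer implements the named interface.
    virtual bool queryInterface(const std::string& iid) const = 0;
};

// Plays the role of CChartingObject: its "interface map" lists only IChartingObject.
struct ChartingObject : Unknown
{
    bool queryInterface(const std::string& iid) const override
    {
        static const std::set<std::string> interfaceMap = {"IChartingObject"};
        return interfaceMap.count(iid) != 0;
    }
};

// Plays the role of CTaskAct: its "interface map" lists ITaskAct.
struct TaskAct : Unknown
{
    bool queryInterface(const std::string& iid) const override
    {
        static const std::set<std::string> interfaceMap = {"ITaskAct"};
        return interfaceMap.count(iid) != 0;
    }
};

// The caller only holds an interface reference; the concrete object decides the answer.
bool supportsTaskAct(const Unknown& obj)
{
    return obj.queryInterface("ITaskAct");
}
```

If the same line of code gets different answers in different workflows, the object behind the pointer is of a different class in each of them.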

c++ Testing for a missing callback function

EDIT: added backquotes to callback template. The interface was reading the asterisks as markdown indicators, not just as asterisks!
In a Windows DLL/Linux SO I am writing, I give the user app a way to register a callback function. Works great. The callback prototype looks like void (*callback)(void*);
I was having a fit of paranoia while writing the docs and realized, I have no really good way to know if the registered address is valid. The only feedback is either a crash or call the callback inside a try/catch.
I have no idea what exception would be thrown if the callback did not exist and who-knows-what executed. Not even really sure that the call to "nowhere" could recover itself enough to generate the exception instead of a crash.
Yes, I know it's the user's problem. Just trying to be thoughtful and maybe help the user understand his bug.
So, what exception would this throw? Windows and Linux answers please if they differ.
Or, is there a better way to approach this without having to use an exception catch to detect the missing function?
There's no way to recover. Similarly, you cannot recover if the callback contains a line like *(int*)(0x1234) = 5;. Just live with it.
As a C++ library developer, you're not in the business of making sure that nothing ever crashes. You merely provide code that does what it promises when used the way you document.
A bit off-topic, but callbacks in the form of void(*)(), i.e. taking 0 arguments, are less than useful. A useful C-style callback accepts a user specified argument, so that the user can find the state corresponding to the callback. E.g.:
typedef void callback_fn(void* user_arg);
callback_id register_callback(callback_fn* callback, void* user_arg);
void unregister_callback(callback_id);
Without user_arg the user of your callback will be forced to use a global variable to store state corresponding to the callback function.
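For illustration, one possible implementation of those three declarations follows. The callback_id typedef, the Registration struct, fire_callbacks() and the Counter example are assumptions added for the sketch; only the register/unregister signatures come from the snippet above.

```cpp
#include <cassert>
#include <map>

typedef void callback_fn(void* user_arg);
typedef int callback_id;  // assumption: the snippet above leaves callback_id undefined

// Hypothetical bookkeeping record: the function plus the argument to hand back to it.
struct Registration
{
    callback_fn* callback;
    void* user_arg;
};

static std::map<callback_id, Registration> registrations;
static callback_id next_id = 1;

callback_id register_callback(callback_fn* callback, void* user_arg)
{
    callback_id id = next_id++;
    registrations[id] = Registration{callback, user_arg};
    return id;
}

void unregister_callback(callback_id id)
{
    registrations.erase(id);
}

// Library side (hypothetical): invoke every registered callback with its own user_arg.
void fire_callbacks()
{
    for (const auto& entry : registrations)
        entry.second.callback(entry.second.user_arg);
}

// User side: state travels through user_arg instead of a global variable.
struct Counter { int hits = 0; };

void on_event(void* user_arg)
{
    static_cast<Counter*>(user_arg)->hits++;
}
```

Two registrations of the same on_event function with different Counter objects update their own counters independently, which is exactly what a zero-argument callback cannot do without globals.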
The situation you describe is rather unlikely. I have never seen such handling anywhere. The program would just crash and ruin its user.
But your concern is valid, because the failure root cause (assigning wrong address) and its manifestation (call to invalid address) can be so far away from each other that it could be very hard to identify it.
All I could advise here is to "fail fast and loud". For instance, you could do test call of the callback whenever it is assigned. This will still lead to crash, but now in the stack trace user will see where it all started from.
Again, this is not something an ordinary library user would expect...
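A milder variant of "fail fast and loud" that does not require a test call is to reject, at registration time, the one invalid value that can be detected portably: a null pointer. The sketch below assumes a single-slot registration with hypothetical names (set_callback, registered, registered_arg); it does nothing about non-null garbage addresses, which cannot be detected.

```cpp
#include <cassert>
#include <stdexcept>

typedef void callback_fn(void* user_arg);

// Hypothetical single-slot storage for the registered callback.
static callback_fn* registered = nullptr;
static void* registered_arg = nullptr;

// Throws at the point of registration, so the stack trace points at the caller's mistake.
void set_callback(callback_fn* callback, void* user_arg)
{
    if (callback == nullptr)
        throw std::invalid_argument("set_callback: callback must not be null");
    registered = callback;
    registered_arg = user_arg;
}

// True when a null callback is refused instead of being stored for a later crash.
bool null_callback_is_rejected()
{
    try { set_callback(nullptr, nullptr); }
    catch (const std::invalid_argument&) { return true; }
    return false;
}
```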
As I answered here:
How to test if an instance is corrupted?
(completely different type of question, but same applies)
If the pointer is "not recognisable NULL or similar", then there is no way, in code, to tell if it's valid or not.
You also can't use try/catch to capture the failure, as "failure to execute the code" does not result in a throw.
Since this is a "programmer error", I don't believe it's a big issue. Programmers can do what they like with their own code anyway, so whatever mechanism you add, it's going to be possible to circumvent in some way or another.
As others have said, there's no way to check, but is it really necessary? I'm all in favor of defensive programming, but the only way you can possibly get a pointer to a function (other than a null pointer) is by taking the address of a function. Some compilers do allow explicitly converting a pointer to an object to a pointer to function, despite the fact that the standard requires a diagnostic in such cases. But even then, the client code needs an explicit cast to screw up. And unlike objects, functions live throughout the lifetime of the program, so you cannot have a problem with a dangling pointer (a pointer which was once valid, but isn't any more). There is, in fact, practically no way to get an invalid pointer to a function except intentionally (the only way that occurs to me is if they unload a DLL with the function), and if someone intentionally wants to screw up, there's no way you'll be able to prevent it.

Segfault calling virtual method on initialized object

I'm getting a seg fault that I do not understand. I'm using the Wt library and doing some fancy things with signals (which I only mention because it has enabled me to attempt to debug this).
I'm getting a pointer to one of my widgets from a vector and trying to call a method on the object it points to. Gdb shows that the pointer resolves, and if I examine the object it points to, it is exactly the one I need to modify. In this instance, the widget is broadcasting to itself, so it is registered as both the broadcaster and the listener; therefore, I was also able to verify that the 'broadcaster' pointer and the 'listener' pointer are accessing the same object. They do!
However, even though I can see that the object exists, and is initialized, and is in fact the correct object, when I try to call a method on the object, I get an immediate seg fault. I've tried a few different methods (including a few boolean returns that don't modify the object). I've tried calling them through the broadcaster pointer and the listener pointer, again, just to try to debug.
The debugger doesn't even enter the object; the segfault occurs immediately on attempting to call a method.
Code!
/* listeners is a vector of pointers to widgets to whom the broadcasting widget
* is trying to signal.
*/
unsigned int num_listeners = listeners.size();
for (unsigned int w = 0; w < num_listeners; w++)
{
// Moldable is an abstraction of another widget type
Moldable* widget = listeners.at(w);
/* Because in this case, the broadcaster and the listener are one and the same,
* these two point to the same location in memory; this part works. I know, therefore,
* that the object has been instantiated, exists, and is happy, or we wouldn't
* have gotten to this point to begin with. I can also examine the fields with gdb
* and can verify that all of this is correct.
*/
Moldable* broadcaster_debug = broadcast->getBroadcaster();
/* setStyle is a method I created, and have tested in other instances and it
* works just fine; I've also used native Wt methods for testing this problem and
* they are also met with segfaults.
*/
widget->setStyle(new_style); // segfault goes here!
}
I have read since researching that storing pointers in vectors is not the greatest idea and I should look into boost::shared_ptr. That may be so, and I will look into it, but it doesn't explain why calling a method on an object known to exist causes a segfault. I'd like to understand why this is happening.
Thanks for any assistance.
Edit:
I have created a gist with the vector operations detailed because it was more code than would comfortably fit in the post.
https://gist.github.com/3111137
I have not shown the code where the widgets are created because it's a recursive algorithm and in order to do that, I would have to show the entire class decision tree for creating widgets. Suffice to say that the widgets are being created; I can see them on the page when viewing the application in a browser. Everything works fine until I start playing with my fancy signals.
Moar Edit:
When I take a look at the disassembly in instruction stepping mode, I can see that just before the segfault occurs, the following operation takes place, the first argument of which is listed as 'void'. Admittedly, I know nothing about Assembly much to my chagrin, but this seems to be important. Can anyone explain what this instruction means and whether it might be the cause of my woes?
add $0x378,%rax //$0x378 is listed as 'void'
Another Edit:
At someone's suggestion, I created a non-virtual method that I am able to successfully call just before the seg fault, meaning the object is in fact there. If I take the same method and make it virtual, the seg fault occurs. So, why do only virtual methods create a seg fault?
I've discovered now that if in the calling class, I make sure to specify Moldable::debug_test (and Moldable::setStyle), the seg fault does not take place. However, this seems to have a similar effect as const bubbling -- every virtual method seems to want this specifier. I've never witnessed this behaviour before. While I'm willing to correct my code if that's REALLY how it's supposed to be, I'm not sure if the root problem is something else.
Getting there!
Well, I figured out the problem, though I'm sad to say it was a totally newbish mistake that due to the nature of the project was super difficult to find. I'll put the answer here, and I've also voted to close the question as too localized. Please feel free to do the same.
The BroadcastMessage class had a __broadcaster field (Moldable* __broadcaster;). When passing in the pointer to the broadcaster into the BroadcastMessage constructor, I forgot to assign the inbound pointer to that field, meaning __broadcaster was not a fully realised instance of the Moldable class.
Therefore, some methods were in fact working -- those that could be inlined, or my dummy functions that I created for testing (one of which returned a value of 1, for instance), so it was appearing that there was a full object there when in fact there was not. It wasn't until calling a more specialized method that tried to access some specific, dynamic property of the object that the segfault occurred.
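A reduced sketch of the fix, using hypothetical, stripped-down versions of the two classes (the member is named broadcaster_ here because identifiers containing a double underscore are reserved in C++). On typical implementations a virtual call has to read the vtable pointer through the object pointer, which is why a garbage pointer tends to crash on virtual calls while simple non-virtual ones may appear to work.

```cpp
#include <cassert>

// Hypothetical, heavily reduced stand-in for the widget class in the question.
class Moldable
{
public:
    virtual ~Moldable() = default;
    virtual void setStyle(int style) { style_ = style; }
    int style() const { return style_; }

private:
    int style_ = 0;
};

class BroadcastMessage
{
public:
    // The bug described above amounts to omitting this initializer, which leaves
    // the member holding a pointer that was never set to the broadcaster.
    explicit BroadcastMessage(Moldable* broadcaster) : broadcaster_(broadcaster) {}

    Moldable* getBroadcaster() const { return broadcaster_; }

private:
    Moldable* broadcaster_ = nullptr;  // default, so a forgotten assignment is at least null
};
```

With the default member initializer in place, a forgotten assignment shows up as a null pointer that can be checked, rather than as an indeterminate value.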
What's more, most of the broadcast message lifespan was in its constructor, meaning that most of its purpose was fulfilled without issue, because the broadcaster was available in the local scope of the constructor.
However, using Valgrind as suggested, I did uncover some other potential issues. I also pretty much stripped-down and re-built the entire project. I trashed tons of unnecessary code and it runs a lot faster now as a side effect.
Anyway, thanks for all the assistance. Sorry the solution wasn't more of a discovery.

Object having two different instances at method activation

The title is not that clear, and if anybody has a better suggestion please tell me.
Now to business:
I am activating a class' method.
m_someObject.Clear();
The problem is that when I look at the address of m_someObject before the call I get that it is located in a certain address, and when I enter the Clear method with the debugger I get that this variable is located in another address.
The result is that after returning from the Clear method it doesn't seem to have affected the m_someObject instance which called it.
Does anybody have any idea what could cause this kind of behavior?
Working on Microsoft Visual Studio 2010 64-bit.
Probably you pass m_someObject by value to some other function (and thus get a copy) and execute Clear() only on the copy. This way you will not notice a change on the original object.
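A small sketch of that scenario, with a hypothetical Container type standing in for whatever m_someObject really is:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the type of m_someObject.
struct Container
{
    std::vector<int> items;
    void Clear() { items.clear(); }
};

// Takes its argument by value: Clear() runs on a copy at a different address.
void clearCopy(Container c)
{
    c.Clear();
}

// Takes its argument by reference: Clear() runs on the caller's object.
void clearOriginal(Container& c)
{
    c.Clear();
}
```

After clearCopy the caller's container still holds its items; only clearOriginal empties it.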
Can you please check if you have two different variables with the same name? One defined in the immediate scope and another one, maybe in the global scope?
The most common reason is Multiple Inheritance. Unlike C# and Java, in C++ a class can have multiple base classes. Obviously, not all can be located at offset 0. This means that this has to be adjusted if you're using a method from a base class that's located at a non-zero offset.
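The adjustment can be observed directly. In this sketch A, B and C are hypothetical classes; since the two non-empty bases cannot both start at the address of the derived object, at least one of them sees a different this. On common ABIs that is the second base, B.

```cpp
#include <cassert>

struct A { int a = 0; void* selfA() { return this; } };
struct B { int b = 0; void* selfB() { return this; } };

// C contains an A subobject and a B subobject at different offsets.
struct C : A, B { };

// True when at least one base-class method sees a `this` that differs from &c.
bool someBaseIsOffset()
{
    C c;
    void* derived = &c;
    return c.selfA() != derived || c.selfB() != derived;
}

// True when the two base subobjects live at different addresses.
bool basesAreDistinct()
{
    C c;
    return c.selfA() != c.selfB();
}
```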
Well, apparently the debugger was lying. I wasn't aware of this, but apparently some of the code was compiled in release mode. Conclusion - Debugger No, printf - Yes.