MEX code in a MATLAB Wrapper

MEX code in a MATLAB Wrapper - c++

I have the following code:
for i=1:N,
some_mex_file();
end
My MEX file does the following:
Declares an object, of a class I defined, that has 2 large memory blocks, i.e., 32x2048x2 of type double.
Processes the data in this object.
Destroys the object.
I am wondering if it takes more time when I call a MEX file in a loop that allocates large memory blocks for its object. I was thinking of migrating to C++ so that I can declare the object only once and just reset its memory space so that it can be used again and again without new declaration. Is this going to make a difference or going to be a worthless effort? In other words, does it take more time to allocate a memory in MEX file than to declare it once and reuse it?

So, the usual advice here applies: Profile your code (both in Matlab and using a C/C++ profiler), or at least stop it in a debugger several times to see where it's spending its time. Stop "wondering" about where it's spending its time, and actually measure where it's spending its time.
However, I have run into problems like this, where allocating/deallocating memory in the MEX function is the major performance sink. You should verify this, however, by profiling (or stopping the code in a debugger).
The easiest solution to this kind of performance problem is twofold:
Move the loop into the MEX function. Call the MEX function with an iteration count, and let your fast C/C++ code actually perform the loop. This eliminates the cost of calling from Matlab into your MEX function (which can be substantial for large N), and facilitates the second optimization:
Have your MEX function cache its allocation/deallocation, which is much, much easier (and safer) to do if you move the loop into the MEX function. This can be done several ways, but the easiest is to just allocate the space once (outside the loop), and deallocate it once the loop is done.

Related

Execute Instructions From The Heap

Can I allocate a block on the heap, set its bytes to values that correspond to a function call and its parameters, then use the function call and dereference operators to execute that sequence?

So if I read you right you want to dynamically create CPU assembly instructions on the heap and execute them. A bit like self-modifying code. In theory that's possible, but in practice maybe not.
The problem is that the heap is in a data segment, and CPU's/operating systems nowadays have measures to prevent exactly this kind of behavior (it's called the NX bit, or No-eXecute bit for x86 CPUs). If a segement is marked as NX, you can't execute code from it. This was invented to stop computer virusses from using buffer overflows to place exectuable code in data/heap/stack memory and then try the calling program to execute such code.
Note that DLL's and libraries are loaded in the code segment, which of course allows code execution.

Yes. How else could Dynamic loading and Linking work? Remembering that some (most?) Operating Systems, and some (most?) Linkers are also written in C/C++. For example,
#include <dlfcn.h>
void* initializer = dlsym(sdl_library,"SDL_Init");
if (initializer == NULL) {
// report error ...
} else {
// cast initializer to its proper type and use
}
Also, I believe that a JIT (e.g. GNU lightning and others) in general performs those operations.

In windows, for example, this is now very hard to do when it was once very easy. I used to be able to take an array of bytes in C and then cast it to a function pointer type to execute it... but not any more.
Now, you can do this if you can call Global or VirtualAlloc functions and specifically ask for executable memory. On most platforms its either completely open or massively locked down. Doing this sort of thing on iOS, for example, is a massive headache and it will cause a submission fail on the app store if discovered.
here is some fantastically out of date and crusty code where i did the original thing you described:
https://code.google.com/p/fridgescript/source/browse/trunk/src/w32/Code/Platform_FSCompiledCode.cpp
using bytes from https://code.google.com/p/fsassembler
you may notice in there that i need to provide platform (windows) specific allocation functions to get some executable memory:
https://code.google.com/p/fridgescript/source/browse/trunk/src/w32/Core/Platform_FSExecutableAlloc.cpp

Yes, but you must ensure that the memory is marked executable. How you do that depends on the architecture.

Delay on call to MATLAB function

I call a MATLAB function (dll) from my C++ code. This function gets an array as a parameter.
Function does some calculation on each member of array.
I did two tests.
For the first time I called this function with an array with 24 elements.
For the second time I called this function three times with 8 elements.
The second test took twice more time.
Why ?
Does enter into MATLAB function and exit from it take a lot of time ?
If yes, why ?

What you've noticed is that it costs a fair amount of time to call into a MEX function. Consider the minimum that Matlab has to do:
Scan the Matlab path to make sure that the function maps to the MEX file (and that the MEX file hasn't changed)
Load the MEX function from its DLL or shared library, and then resolve its mexFunction symbol.
Allocate arrays of input and output parameters, and initialize them
Call your function
Look for and free any temporary variables that your MEX function loaded
Free the arrays of input and output parameters
In theory, Matlab can use caching to avoid the first two steps. I'm not sure if it does, though. None of the subsequent steps can be skipped, or even really optimized by the Matlab interpreter (or its JIT compiler). Basically, if your calculation is fast, then you'll spend a lot more time calling the MEX function than actually running it.
You've already hit on the way to maximize MEX performance, which is to have the MEX function do as much as possible with each call.
In addition to having it work on as much data as you can on each call, you should also push any simple outer loops into the MEX function. Simple loops are easy to implement in MEX functions. They're also faster than loops in Matlab (even JIT-compiled Matlab), and avoid the cost of repeatedly calling the MEX function.
You can also see if judicious use of the mexLock function will help. You should provide some way to unlock the MEX function with mexUnlock, or you may start leaking memory, and will also have to restart your Matlab session every time you change the MEX function.

Converting a string into a function in c++

I have been looking for a way to dynamically load functions into c++ for some time now, and I think I have finally figure it out. Here is the plan:
Pass the function as a string into C++ (via a socket connection, a file, or something).
Write the string into file.
Have the C++ program compile the file and execute it. If there are any errors, catch them and return it.
Have the newly executed program with the new function pass the memory location of the function to the currently running program.
Save the location of the function to a function pointer variable (the function will always have the same return type and arguments, so
this simplifies the declaration of the pointer).
Run the new function with the function pointer.
The issue is that after step 4, I do not want to keep the new program running since if I do this very often, many running programs will suck up threads. Is there some way to close the new program, but preserve the memory location where the new function is stored? I do not want it being overwritten or made available to other programs while it is still in use.
If you guys have any suggestions for the other steps as well, that would be appreciated as well. There might be other libraries that do things similar to this, and it is fine to recommend them, but this is the approach I want to look into — if not for the accomplishment of it, then for the knowledge of knowing how to do so.
Edit: I am aware of dynamically linked libraries. This is something I am largely looking into to gain a better understanding of how things work in C++.

I can't see how this can work. When you run the new program it'll be a separate process and so any addresses in its process space have no meaning in the original process.
And not just that, but the code you want to call doesn't even exist in the original process, so there's no way to call it in the original process.
As Nick says in his answer, you need either a DLL/shared library or you have to set up some form of interprocess communication so the original process can send data to the new process to be operated on by the function in question and then sent back to the original process.

How about a Dynamic Link Library?
These can be linked/unlinked/replaced at runtime.
Or, if you really want to communicated between processes, you could use a named pipe.
edit- you can also create named shared memory.

for the step 4. we can't directly pass the memory location(address) from one process to another process because the two process use the different virtual memory space. One process can't use memory in other process.
So you need create a shared memory through two processes. and copy your function to this memory, then you can close the newly process.
for shared memory, if in windows, looks Creating Named Shared Memory
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366551(v=vs.85).aspx
after that, you still create another memory space to copy function to it again.
The idea is that the normal memory allocated only has read/write properties, if execute the programmer on it, the CPU will generate the exception.
So, if in windows, you need use VirtualAlloc to allocate the memory with the flag,PAGE_EXECUTE_READWRITE (http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx)
void* address = NULL;
address= VirtualAlloc(NULL,
sizeof(emitcode),
MEM_COMMIT|MEM_RESERVE,
PAGE_EXECUTE_READWRITE);
After copy the function to address, you can call the function in address, but need be very careful to keep the stack balance.

Dynamic library are best suited for your problem. Also forget about launching a different process, it's another problem by itself, but in addition to the post above, provided that you did the virtual alloc correctly, just call your function within the same "loadder", then you shouldn't have to worry since you will be running the same RAM size bound stack.
The real problems are:
1 - Compiling the function you want to load, offline from the main program.
2 - Extract the relevant code from the binary produced by the compiler.
3 - Load the string.
1 and 2 require deep understanding of the entire compiler suite, including compiler flag options, linker, etc ... not just the IDE's push buttons ...
If you are OK, with 1 and 2, you should know why using a std::string or anything but pure char *, is an harmfull.
I could continue the entire story but it definitely deserve it's book, since this is Hacker/Cracker way of doing things I strongly recommand to the normal user the use of dynamic library, this is why they exists.

Usually we call this code injection ...
Basically it is forbidden by any modern operating system to access something for exceution after the initial loading has been done for sake of security, so we must fall back to OS wide validated dynamic libraries.
That's said, one you have valid compiled code, if you realy want to achieve that effect you must load your function into memory then define it as executable ( clear the NX bit ) in a system specific way.
But let's be clear, your function must be code position independant and you have no help from the dynamic linker in order to resolve symbol ... that's the hard part of the job.

Quick successful exit from C++ with lots of objects allocated

I'm looking for a way to quickly exit a C++ that has allocated a lot of structures in memory using C++ classes. The program finishes correctly, but after the final "return" in the program, all of the auto-destructors kick in. The problem is the program has allocated about 15GB of memory through lots of C++ class structures, and this auto-destruct process takes about 1 more hour itself to complete as it walks through all of the structures - even though I don't care about the results. The program only took 1 hour to complete the task up to this point. I would like to just return to the OS and let it do its normal wholesale process allocation deletion - which is very quick. I've been doing this by manually killing the process during the cleanup stage - but am looking for a better programic solution.
I would like to return a success to the OS, but don't care to keep any of the memory content. The program does perform a lot of dynamic allocation/deallocation during the normal processing, so it's not just simple heap management.
Any opinions?

In Standard C++ you only have abort(), but that has the process return failure to the OS.
On many platforms (Unix, MS Windows) you can use _exit() to exit the program without running cleanup and destructors.

C++0x std::quick_exit is what you are looking for if your compiler already supports it (g++-4.4.5 does).

If the 15 GB of memory is being allocated to a reasonably small number of classes, you could override operator delete for those classes. Just pass the call to the standard delete, but set up a global flag that, if set, will make the call to delete a no-op. Or, if the logic of your program is such that these objects are not deleted in the normal course of building your data structures, you could simply ignore delete in all cases for these classes.

As Naveen says, this can't be a matter of memory deallocation. I've written neural network simulations with evolutionary algorithms that where allocating and freed lots of memory in small and large chunks and this was never a major issue.

If you have a C99 compiler, you can use the _Exit function to end immediately without having global object destructors or any functions registered with atexit to be called; whether or not unwritten buffered file data is flushed, open streams are closed, or temporary files are removed is implementation-defined (C99 §7.20.4.4).
If you're on Windows, you can also use ExitProcess to achieve the same effect.
But, as others have said, your destructors should really not be taking an hour to run unless you're doing a fair amount of I/O (writing log files, etc.). I strongly, strongly recommend you profile your program to see where the time is spent.

The possible strategies depend on the number of objects that are directly visible in main through which you access the 15GB of data and if these are local to main or statically allocated.
If all access to the 15GB of data is through local objects in main, then you can simply replace the return 0; at the end of main with exit(0);.
exit will terminate your application and trigger cleanup of statically allocated variables, but not of local variables.
If the data is accessed through a handful of statically allocated variables, you could turn them into pointers (or references) to dynamically allocated memory and deliberately leak that.

Does it take time to deallocate memory?

I have a C++ program which, during execution, will allocate about 3-8Gb of memory to store a hash table (I use tr1/unordered_map) and various other data structures.
However, at the end of execution, there will be a long pause before returning to shell.
For example, at the very end of my main function I have
std::cout << "End of execution" << endl;
But the execution of my program will go something like
$ ./program
do stuff...
End of execution
[long pause of maybe 2 min]
$ -- returns to shell
Is this expected behavior or am I doing something wrong?
I'm guessing that the program is deallocating the memory at the end. But, commercial applications which use large amounts of memory (such as photoshop) do not exhibit this pause when you close the application.
Please advise :)
Edit: The biggest data structure is an unordered_map keyed with a string and stores a list of integers.
I am using g++ -O2 on linux, the computer I am using has 128GB of memory (with most of that free). There are a few giant objects
Solution: I ended up getting rid of the hashtable since it was almost full anyways. This solved my problem.

If the data structures are sufficiently complicated when your program finishes, freeing them might actually take a long time.
If your program actually must create such complicated structures (do some memory profiling to make sure), there probably is no clean way around this.
You can short cut that freeing of memory by a dirty hack - at least on those operating systems where all memory allocated by a process is automatically freed when the process terminates.
You would do that by directly calling the libc's exit(3) function or the operating system's _exit(2). However, I would be very careful about verifying this does not short-circuit any other (important) cleanups some C++ destructor code might be doing. And what this does or does not do is highly system dependent (operating system, compiler, libc, the APIs you were using, ...).

Yes the deallocation of memory can take some time, and also possibly you have code executing like destructors being called. Photoshop does not use 3-8GB of memory.
Also you should perhaps add profiling to your application to confirm it is the deallocation of memory and not something else.

(I started this as a reply to ndim, but it got to long)
As ndim already posted, termination can take a long time.
Likely reasons are:
you have lots of allocations, and parts of the heap are swapped to disk.
long running destructors
other atexit routines
OS specific cleanup, such as notifying DLL's of thread & process termination on Windows (don't know what exactly happens on Linux.)
exit is not the worst workaround here, however, actual behavior is system dependent. e.g. exit on WIndows / MSVC CRT will run global destructors / atexit routines, then call ExitProcess which does close handles (but not necessarily flush them - at least it's not guaranteed).
Downsides: Destructors of heap allocated objects don't run - if you rely on them (e.g. to save state), you are toast. Also, tracking down real memory leaks gets much harder.
Find the cause You should first analyze what is happening.
e.g. by manually freeing the root objects that are still allocated, you can separate the deallocation time from other process cleanup. Memory is the likely cause accordign to your description, but it's not the only possible one. Some cleanup code deadlocking before it runs into a timeout is possible, too. Monitoring stats (such as CPU/swap activity/disk use) can give clues.
Check the release build - debug builds usually use extra data on the heap that can immensely increase cleanup cost.
Different allocators
Ifdeallocation is the problem, you might benefit a lot from using custom allocation mechanisms. Example: if your map only grows (items are never removed), an arena allocator can help a lot. If your lists of integers have many nodes, switch to a vector, or use a rope if you need random insertion.

Certainly it's possible.
About 7 years ago I had a similar problem on a project, there was much less memory but computers were slower too I suppose.
We had to look at the assembly languge for free in the end to work out why it was so slow and it seemed that it was essentially keeping the freed blocks in a linked list so they could be reallocated and was also scanning that list looking for blocks to combine. Scanning the list was an O(n) operation but freeing 'n' objects turned it into O(n^2)
Our test data took about 5 seconds to free the memory but some customers had about 10 times as much data as we every used and it was taking 5-10 minutes to shut down the program on their systems.
We fixed it, as has been suggested by just terminating the process instead and letting the operating system clear up the mess (which we knew was safe to do on our application).
Perhaps you have a more sensible free function that we had several years ago, but I just wanted to post that it's entirely possible if you have many objects to free and an O(n) free operation.

I can't imagine how you'd use enough memory for it to matter, but one way I sped up a program was to use boost::object_pool to allocate memory for a binary tree. The major benefit for me was that I could just put the object pool as a member variable of the tree, and when the tree went out of scope or was deleted, the object pool would be deleted all at once (letting me not have to use a recursive deconstructor for the nodes). object_pool does call all of its objects decontructors at exit though. I'm not sure if it handles empty decontructors in a special way or not.
If you don't need your allocator to call a constructor, you can also use boost::pool, which I think may deallocate faster because it doesn't have to call deconstructors at all and just deleted the chunk of memory in one free().

Freeing memory may well take time - data structures are being updated. How much time depends on the allocator being used.
Also there might be more than just memory deallocation going on - if destructors are being executed, there may be a lot more than that going on.
2 minutes does sound like a lot of time though - you might want to step through the clean up code in a debugger (or use a profiler if that's more convenient) to see what's actually taking all the time.

The time is probably not entirely wasted deallocating memory, but calling all the destructors. You can provide your own allocator that does not call the destructor (if the object in the map doesn't need to be destructed, but only deallocated).
Also take a look at this other question: C++ STL-conforming Allocators

Normally, deallocating memory as a process ends is not taken care of as part of the process, but rather as an operating system cleanup function. You might try something like valgrind to make sure your memory is being dealt with properly. However, the compiler also does certain things to set up and tear down your program, so some sort of performance profiling, or using a debugger to step through what is taking place at teardown time might be useful.

when your program exits the destructors of all the global objects are called.
if one of them takes a long time, you will see this behavior.
look for global objects and investigate their destructors.

Sorry, but this is a terrible question. You need to show the source code showing the specific algorithms and data structures that you are using.
It could be de-allocating, but that's just a wild guess. What are your destructors doing? Maybe is paging like crazy. Just because your application allocates X amount of memory, that doesn't mean it will get it. Most likely it will be paging off virtual memory. Depending on how the specifics of your application and OS, you might be doing a lot of page faults.
In such cases, it might help to run iostat and vmstat on the background to see what the heck is going on. If you see a lot of I/O that's a sure sign you are page faulting. I/O operations will always be more expensive that memory ops.
I would be very surprised if indeed all that lapsed time at the end is purely due to de-allocation.
Run vmstat and iostat as soon as you get the "ending" message, and look for any indications of I/O going bananas.

The objects in memory are organized in a heap. They are not deleted at once, they are deleted one by one, and the cost of deleting an object is O(log n). Freeing them takes loooong.
The answer is then, yes, it takes so much time.

You can avoid free being called on an object by using a destructor call my_object->~my_class() instead of delete my_object. You can avoid free on all objects of a class by overriding and nullifying operator delete( void * ) {} inside the class. Derived classes with virtual destructors will inherit that delete, otherwise you can copy-paste (or maybe using base::operator delete;).
This is much cleaner than calling exit. Just be sure you don't need that memory back!

I guess your unordered map is a global variable, whose constructor is called at process startup, and destructor is called at process exit.
How could you know if the map is guilty?
You can test if your unordered_map is responsible (and I guess it is) by allocating it with a new, and, well, ahem... forget to delete it.
If your process' exit goes faster, then you have your culprit.
Why this is so sloooooow?
Now, just by reading your post, for your unordered map, I see potential allocations for:
strings allocated buffer
list items (each one being a string + other things)
unordered map items + the bucket array
If you have 3-8 Gb of data in this unordered map, this means that each item above will need some kind of new and delete. And if you free every item, one by one, it could take time.
Other reasons?
Note that if you add items to your map item by item while your process executing, the new are not exactly perceptible... But the moment you want to clean all, all your allocated items must be destroyed at the same time, which could explain the perceived difference between construction/use and destruction...
Now, the destructors could take time for an additional reason.
For example, on Visual C++ 2008 in debug mode, for example, upon destruction of STL iterators, the destructor verifies the iterators are still correct. This caused quite a slowdown upon my object destruction (which was basically a tree of nodes, each node having list of child nodes, with iterators everywhere).
You are working on gcc, so perhaps they have their own debug testing, or perhaps your destructors are doing additional work (e.g. logging?)...

In my experience, the calls to free or delete should not take a significant amount of time. That said, I have seen plenty of cases where it does take non-trivial time to destruct objects because of destructors that did non-trivial things. If you can't tell what's taking time during the destruction, use a debugger and/or a profiler to determine what's going on. If the profiler shows you that it really is calls to free() that take a lot of time, then you should improve your memory allocation scheme, because you must be creating an extremely large number of small objects.
As you noted plenty of applications allocate large amounts of memory, and incur no significant memory during shutdown, so there's no reason your program can't do the same.

I would recommend (as some others have) a simple forced process termination, if you're certain that you've nothing left to do but free memory (for example, no file i/o and such left to do).
The thing is that when you free memory, typically, it's not actually returned to the OS - it's held in a list to be reallocated, and this is obviously slow. However, if you terminate process, the OS will lump reclaim all your memory at once, which should be substantially faster. However, as others have said, if you have any destructors that need to run, you should ensure that they are run before force calling exit() or ExitProcess or anysuch function.
What you should be aware of is that deallocating memory that is spread out (e.g., two nodes in a map) is much slower due to cache effects than deallocating memory in a vector, because the CPU needs to access the memory to free it and run any destructors. If you deallocated a very large amount of memory that's very fragmented, you could be falling afoul of this, and should consider changing to some more contiguous structures.
I actually had a problem where allocating memory was faster than de-allocating it, and after allocating memory and then de-allocating it, I had a memory leak. Eventually, I worked out that this is why.

I am currently facing a similar issue, with a CPU & memory intensive research program of mine. It runs until a specified time limit, prints a solutions and exits. The destructor call of a single object (containing up to 10⁶ relatively small objects) was what unexpectedly took time at the end of execution (about 10sec. to free 5Gb of data).
I was not satisfied by the answers advising to avoid executing every destructor, so here is the solution I came up with:
Original code:
void process() {
vector<unordered_map<State, int>> large_obj(100);
// Processing...
} // Takes a few seconds to exit (destructor calls)
Solution:
void process(bool free_mem = false) {
auto * large_obj_ = new vector<unordered_map<State, int>>(100);
auto &large_obj = *large_obj;
// Processing...
// (No changes required here, 'large_obj' can be used exactly as before)
if(free_mem)
delete large_obj_;
}
It has the advantage of being completely transparent apart from a few lines to insert, and it can even be parametrized to take some time to free the memory if needed. It is explicit which object will intentionally not be freed to avoid leaving things in an "unstable" state. Memory is cleaned up instantly by the OS on exit when free_mem = false.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js