CPLEX: freeing model (resources) takes immense time - c++

I am solving a MIP and have built a corresponding CPLEX IloModel. My implementation follows this pseudo-code:
IloEnv env;
IloModel model( env );
// Build optimization model
// Configure CPLEX solver (IloCplex cplex( model ))
// Solve model (cplex.solve())
// Do some solution statistics
model.end();
Everything works fine and I get correct solutions. Now I would like to automate solving a lot of different instances sequentially.
However, here I ran into a problem: the bigger my instances, the longer model.end() takes to free resources. For my small instances (using up to 500 MB of RAM) it already takes dozens of minutes, for medium-sized instances (using up to 2 GB of RAM) it takes hours, and I never measured how long it takes for my large instances (using up to 32 GB of RAM), as I always killed the process manually after it had not finished over a whole night's wait. Freeing resources therefore takes significantly longer than building the model or solving it within my specified time limits. While model.end() runs, CPU usage stays at roughly 100%.
Is this expected behaviour? Have I missed something in how I implement my model or free its resources that makes this take such an excessive amount of time?
I really want to avoid automating the sequential runs by killing the CPLEX process after a specified time threshold.
Thank you!
EDIT:
I can circumvent the problem by calling env.end() (which takes less than a second even for large models) instead of model.end(). As I do not reuse the environment for now, that is fine for me. However, I wonder what is happening here: from what I gathered from the docs, freeing the resources allocated to the model is a sub-step of freeing the whole environment.

I'm guessing, but did you terminate the solver before terminating the model? The solver uses the model and is therefore notified about its changes. It could be that model.end() is not optimized: as it frees the constraints one by one, the solver is notified about each individual change, updates its own data structures, and so on.
In other words, I think that calling cplex.end() before model.end() may solve the issue.
If you can, then it is always best to call env.end() after each solve. As you noticed, it is faster: it is easier to free all the resources at once, since there is no need to check whether a particular resource is still needed (e.g. a variable could be used by multiple models). It is also safer, since each new model starts from scratch and the risk of a memory leak is minimized.
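A minimal sketch of that teardown order (this is an assumption about how the surrounding code might look, with a hypothetical buildModel() placeholder for the instance-specific part, not the asker's actual code):

#include <ilcplex/ilocplex.h>

// Hypothetical helper: add the instance's variables, constraints and objective.
void buildModel( IloModel& model )
{
    // ... build the optimization model here ...
}

void solveAllInstances( int numInstances )
{
    for ( int i = 0; i < numInstances; ++i )
    {
        IloEnv env;                     // fresh environment for each instance
        try
        {
            IloModel model( env );
            buildModel( model );
            IloCplex cplex( model );
            // configure the solver (time limit, etc.), then solve
            cplex.solve();
            // do some solution statistics
            cplex.end();                // end the solver before the model ...
            model.end();                // ... so it is not notified change by change
        }
        catch ( IloException& e )
        {
            env.out() << "Concert exception: " << e.getMessage() << "\n";
        }
        env.end();                      // frees everything at once, and fast
    }
}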

Related

Is there any way to isolate two thread memory space?

I have an online service which is single-threaded. Since the whole program shares one memory pool (using new or malloc), one module might corrupt memory and cause another module to work incorrectly. So I want to split the whole program into two parts, each running on its own thread. Is it possible to isolate thread memory the way multiple processes are isolated, so I can check where the problem is? (Splitting it into multiple processes would cost a lot of time and be risky, so I don't want to try that.)
As long as you use threads, memory can easily be corrupted since, BY DESIGN, threads share the same memory. Splitting your program across two threads won't help in ANY manner here - it can greatly help with CPU load, latency, performance, etc., but in no way does it act as an anti-memory-corruption mechanism.
So either you ensure proper development practices so that your code won't trample memory where it must not, or you use multiple processes - those are isolated from each other by the operating system.
You can try to sanitize your code by using tools designed for this purpose, but it depends on your platform - you didn't give it in your question.
It can range from a simple Debug compilation with MSVC under Windows up to a Valgrind analysis under Linux, and so on - the list can be huge, so please give additional information.
Also, if it's your threading code that contains the error, maybe rethink what you're doing: switching to multiple processes may be the cheapest solution in the end - don't be fooled by sunk cost, especially since threading can NOT protect part 1 against part 2 and vice versa...
Isolation like this is quite often done by using separate processes.
The downsides of doing that are
harder communication between the processes (but that's kind of the point)
the overhead of starting a process is typically a lot larger than starting a thread, so you would not typically spawn a new process for each request, for example.
A common model is a lead process that starts a child process to do the request serving; the lead process just monitors the health of the worker (a minimal sketch follows below).
Or you could fix your code so that it doesn't get corrupted (an easy thing to say, I know).
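A minimal sketch of that lead/worker model on POSIX systems; serve_one_request() is a hypothetical stand-in for the real request handling, and the point is only that a crash or memory corruption in the worker cannot touch the lead process's address space:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

// Placeholder for the real request handling (the part that might corrupt memory).
void serve_one_request()
{
    // ... parse the request, update module state, send a reply ...
}

int main()
{
    // A real service would loop forever; three iterations keep the sketch finite.
    for (int i = 0; i < 3; ++i)
    {
        pid_t pid = fork();               // the worker gets its own address space
        if (pid < 0) { perror("fork"); return EXIT_FAILURE; }
        if (pid == 0)
        {
            serve_one_request();          // a crash here cannot touch the lead process
            _exit(EXIT_SUCCESS);
        }
        int status = 0;
        waitpid(pid, &status, 0);         // the lead process just monitors worker health
        if (WIFSIGNALED(status))
            std::fprintf(stderr, "worker died with signal %d\n", WTERMSIG(status));
    }
    return EXIT_SUCCESS;
}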

Profiling a multiprocess system

I have a system that I need to profile.
It is made up of tens of processes, mostly C++, some consisting of several threads, that communicate with the network and with one another through various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in user-space code, in inefficient use of syscalls, or somewhere else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the round-trip times of various code sequences (for example, processing an incoming packet or a CLI command) and seeing which process takes the longest. After that, profiling that process, fixing the problem, and repeating.
This method seems somewhat hacky and guess-worky. I don't like it.
How would you suggest approaching this problem?
Are there tools that would help me out (a multi-process profiler?)?
What I'm looking for is more of a strategy than just specific tools.
Should I profile every process separately and look for problems? If so, how do I approach this?
Do I try to isolate the problematic processes and go from there? If so, how do I isolate them?
Are there other options?
I don't think there is a single answer to this sort of question, and every type of issue has its own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent. The next question, of course, is whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU time to do both, but only one of them actually achieves anything. However, knowing which program takes most of the time in a multi-program system can be a good starting point for figuring out IF that code is well written or can be improved.
If the system is I/O-bound, such as on network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - whether you contact Google to search for "kerflerp" or contact your local web server a metre away will have a dramatic impact on what counts as a reasonable response time.
There are lots of other issues - running two pieces of code in parallel that each use LOTS of memory can cause both to run slower than if they were run in sequence, because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications so that you can see WHERE they spend their time is another method that works reasonably well, particularly if you KNOW what the use-case is where it takes time.
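A minimal sketch of that kind of logging with a scoped timer (assuming C++11 or later; the label and output stream are placeholders you would adapt to your own logging):

#include <chrono>
#include <cstdio>

// Logs how long a scope took, from construction to destruction.
struct ScopedTimer
{
    const char* label;
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(const char* l)
        : label(l), start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        const auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start);
        std::fprintf(stderr, "%s took %lld us\n", label,
                     static_cast<long long>(elapsed.count()));
    }
};

void handle_packet(/* ... */)
{
    ScopedTimer t("handle_packet");   // round-trip time for this code sequence
    // ... actual packet processing ...
}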
If you have a use-case where you know "this should take no more than X seconds", running a regular pre- or post-commit test to check that the code behaves as expected, and that no-one has added a lot of code that slows it down, would also be useful.

Clueless on how to execute big tasks on C++ AMP

I have a task to see whether an algorithm I developed can run faster using computation on a GPU rather than the CPU. I'm new to computing on accelerators; I was given the book "C++ AMP", which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past, but nowadays it's mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have to compute some complicated function that takes a huge matrix as input (something like 50000 x 50000) plus some other data and outputs a matrix of the same size. The total calculation for the whole matrix takes several hours.
On the CPU, I'd just cut the task into several pieces (100 or so) and execute them using Parallel.For or a simple task-managing loop I wrote myself. Basically, keep several threads running (number of threads = number of cores), start a new part whenever a thread finishes, until all parts are done. And it worked well!
However, on the GPU I cannot use the same approach, not only because of memory constraints (that's OK, I can partition into several parts) but because if something runs for over 2 seconds it is considered a "timeout" and the GPU gets reset! So I must ensure that every part of my calculation takes less than 2 seconds to run.
But that limit is not per task (partitioning an hour-long job into 60 tasks of 1 second each would be easy enough), it is per batch of tasks: no matter which queuing mode I choose (immediate or automatic), if anything I run (via parallel_for_each) takes more than 2 seconds in total to execute, the GPU gets reset.
Not only that, but while a CPU program that hogs all CPU resources keeps the UI interactive as long as it runs at lower priority, when executing code on the GPU the screen seems to freeze until execution is finished!
So, what do I do? In the demonstrations in the book (the N-body problem), it is shown to be something like 100x as effective (the multicore calculation gives 2 GFLOPS, or whatever the number was, while AMP gives 200 GFLOPS), but in a real application I just don't see how to do it!
Do I have to partition my big task into billions of pieces, e.g. pieces that each take 10 ms to execute, and run 100 of them per parallel_for_each at a time?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!
TDRs (the 2-second timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects your application from completely locking up the display by enforcing a timeout. This also impacts applications that try to render to the screen. Moving your AMP code to a separate CPU thread will not help; this frees up your UI thread on the CPU, but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low-power system. The maximum value of N is limited in the application to prevent you from running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break your work up into chunks that each execute in under 2 seconds, or smaller ones if you want to hit a particular frame rate. You should also consider how your work is being queued. Remember that all AMP work is queued, and in automatic mode you have no control over when it runs. Using immediate mode is the way to get better control over how commands are batched.
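A minimal sketch of that chunking idea (an illustrative assumption about how one might slice the matrix into row bands, with a placeholder per-element computation; it is not taken from the book or the asker's code):

#include <amp.h>
#include <algorithm>

// Processes a large 2-D array_view in row bands, one parallel_for_each per band,
// so that no single GPU submission runs anywhere near the 2-second TDR limit.
void process_in_chunks(concurrency::array_view<float, 2> data, int rows_per_chunk)
{
    using namespace concurrency;
    const int total_rows = data.extent[0];
    const int cols       = data.extent[1];

    for (int start = 0; start < total_rows; start += rows_per_chunk)
    {
        const int rows = std::min(rows_per_chunk, total_rows - start);

        // Sub-view covering just this band of rows.
        array_view<float, 2> band =
            data.section(index<2>(start, 0), extent<2>(rows, cols));

        parallel_for_each(band.extent, [=](index<2> idx) restrict(amp)
        {
            band[idx] = band[idx] * band[idx] + 1.0f;   // placeholder for the real work
        });

        band.synchronize();   // wait for this chunk before queuing the next one
    }
}

How large rows_per_chunk can be depends on how heavy the per-element work is; the point is simply to keep each submitted batch comfortably below the timeout.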
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.

How do I prevent my parallel code using up all the available system memory?

I'm developing C++ code on Linux which can run out of memory, go into swap, slow down significantly, and sometimes crash. I'd like to prevent this from happening by allowing the user to specify a limit on the proportion of the total system memory that the process can use. If the program exceeds this limit, the code could output some intermediate results and terminate cleanly.
I can determine how much memory is being used by reading the resident set size from /proc/self/stat. I can then sum this up across all the parallel processes to give me a total memory usage for the program.
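A minimal sketch of that measurement (using /proc/self/statm rather than /proc/self/stat, purely because its second field is the resident set size in pages and is easy to parse; this is an illustrative assumption, not the only way to read RSS):

#include <cstdio>
#include <unistd.h>

// Returns the current resident set size of this process in bytes,
// or 0 if /proc/self/statm could not be read.
long resident_set_bytes()
{
    std::FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f)
        return 0;
    // statm fields: size resident shared text lib data dt (all in pages)
    long size = 0, resident = 0;
    if (std::fscanf(f, "%ld %ld", &size, &resident) != 2)
        resident = 0;
    std::fclose(f);
    return resident * sysconf(_SC_PAGESIZE);
}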
The total system memory available can be obtained via a call to sysconf(_SC_PHYS_PAGES) (see How to get available memory C++/g++?). However, if I'm running on a parallel cluster, then presumably this figure will only give me the total memory for the current cluster node. I may, for example, be running 48 processes across 4 cluster nodes (each with 12 cores).
So my real question is: how do I find out which processor a given process is running on? I could then sum the memory used by processes running on the same cluster node, compare this with the available memory on that node, and terminate the program if it exceeds a specified percentage on any of the nodes the program is running on. I would use sched_getcpu() for this, but unfortunately I'm compiling and running on a system with glibc 2.5, and sched_getcpu() was only introduced in glibc 2.6. Also, since the cluster is running an old Linux kernel (2.6.18), I can't use syscall() to call getcpu() either! Is there any other way to get the processor number, or any sort of identifier for the processor, so that I can sum the memory used on each processor separately?
Or is there a better way to solve the problem? I'm open to suggestions.
A competently run cluster will put your jobs under some form of resource limit (RLIMIT_AS or cgroups). You can do this yourself just by calling setrlimit(RLIMIT_AS, ...). I think you're overcomplicating things by worrying about sysconf, since on a shared cluster there's no reason to think that your code should be using even a fixed fraction of the physical memory size. Instead, you should choose a sensible memory requirement (if your cluster doesn't already provide one - most schedulers do memory scheduling reasonably well). Even if you insist on auto-sizing it yourself, you don't need to know which cores are in use: just figure out how many copies of your process are on the node and divide appropriately. (You will need to figure out which node (host) each process is running on, of course.)
It's worth pointing out that the kernel enforces RLIMIT_AS, not RLIMIT_RSS, and that when you hit this limit, new allocations will fail.
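A minimal sketch of that approach (the limit value and the hypothetical write_intermediate_results() checkpoint routine are placeholders; the point is that once RLIMIT_AS is hit, further allocations throw std::bad_alloc and you can shut down cleanly):

#include <sys/resource.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

// Cap this process's address space at max_bytes; allocations beyond that fail.
bool limit_address_space(std::size_t max_bytes)
{
    struct rlimit rl;
    rl.rlim_cur = max_bytes;
    rl.rlim_max = max_bytes;
    return setrlimit(RLIMIT_AS, &rl) == 0;
}

int main()
{
    limit_address_space(2UL * 1024 * 1024 * 1024);   // e.g. 2 GB per process
    try
    {
        // ... the actual computation, allocating as it goes ...
    }
    catch (const std::bad_alloc&)
    {
        // write_intermediate_results();   // hypothetical checkpoint routine
        std::fprintf(stderr, "memory limit reached, terminating cleanly\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}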
Finally, I question the design of a program which uses unbounded memory. Are you sure there isn't a better algorithm? Users are going to find your program pretty irritating to use if, after investing significant time in a computation, it decides to try allocating more and then fails. Sometimes people ask this sort of question when they mistakenly think that allocating as much memory as possible will give them better I/O buffering (which is naive with respect to the page cache, etc.).

Making more efficient use of fork() and copy-on-write memory sharing

I am a programmer developing a multiplayer online game using Linux based servers. We use an "instanced" architecture for our world. This means that each player entering a world area gets a copy of that area to play in with their party members, and independent of all the other players playing in the same area.
Internally we use a separate process for each instance. Initially each instance process would start up, load only the resources required for the given area, generate its random terrain, and then allow new connections from players. The amount of memory used by an instance was typically about 25 meg, including resources and the randomly generated level with entities.
In order to reduce the memory footprint of instances, and to speed up the spawn time, we changed to an approach where we create a single master instance that loads all the resources that any instance could possibly need (about 150 meg of memory) and then, when a new instance is required, use the fork() function to spawn a new instance and take advantage of copy-on-write memory sharing, so that the new instance only requires memory for its "unique" data set. The footprint of the randomly generated level and entities, which make up the unique data for each instance, is about 3-4 meg of memory.
Unfortunately the memory sharing is not working as well as I think it could. A lot of memory pages seem to become unshared.
At first, as we load more of our data set in the pre-fork instance, the memory required for each forked instance goes down, but eventually there is an inflection point where loading more assets in the pre-fork instance actually increases the memory used by each forked instance.
The best results we have had come from loading about 80 meg of the data set pre-fork and then having the fresh instances demand-load the rest. This results in about 7-10 extra meg per instance and an 80 meg fixed cost. Certainly a good improvement, but not the theoretical best.
If I load the entire 150 meg data set and then fork, each forked instance uses about 50 more meg of memory! Significantly worse than simply doing nothing.
My question is: how can I load all of my data set in the pre-fork instance and make sure that each instance's memory footprint only contains the truly unique per-instance data?
I have a theory as to what is happening here and I was wondering if someone would be able to help confirm for me that this is the case.
I think it's to do with the malloc free chain. Each memory page of the prefork instance probably has a few free spots of memory left in it. If, during random level generation, something is allocated that happens to fit in one of the free spots in a page, then that entire page would be copied into the forked process.
In Windows you can create alternate heaps and change the default heap used by the process. If something like this were possible on Linux, it would remove the problem. Is there any way to do such a thing on Linux? My investigations seem to indicate that you can't.
Another possible solution would be if I could somehow discard the existing malloc free chain, forcing malloc to allocate fresh memory from the operating system for subsequent calls. I attempted to look at the implementation of malloc to see if this would be easily possible, but it seemed like it might be somewhat complex. If anyone has any ideas around this area or a suggestion of where to start with this approach, I'd love to hear it.
And finally if anyone has any other ideas about what might be going wrong here, I'd really like to hear them. Thanks a lot!
"In Windows you can create alternate heaps, and change the default heap used by the process. If this were possible, it would remove the problem. Is there any way to do such a thing in Linux?"
In Unix you can simply mmap(2) memory and bypass malloc altogether.
I would also ditch the whole "rely on COW" approach. I would have the master process mmap(2) some memory (80 meg, 150 meg, whatever), write the shared data into it, mark it read-only via mprotect(2) for good measure, and take it from there. This would solve the real issue and wouldn't force you to change the code down the road.
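A minimal sketch of that idea (assuming the shared assets can be serialized into one flat region; load_resources_into() is a hypothetical placeholder for the real asset loading, and run_instance() for the per-instance work). Because the region lives outside the malloc heap, level generation after fork() cannot dirty its pages, and mprotect() enforces that:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical placeholder: serialize all shared assets into the given buffer.
void load_resources_into(void* buffer, std::size_t size)
{
    std::memset(buffer, 0, size);   // stand-in for real asset loading
}

int main()
{
    const std::size_t kResourceBytes = 150UL * 1024 * 1024;   // ~150 meg of shared assets

    // Map the resource region outside the malloc heap.
    void* resources = mmap(nullptr, kResourceBytes,
                           PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (resources == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    load_resources_into(resources, kResourceBytes);

    // Make the region read-only so nothing can accidentally dirty (and
    // therefore un-share) these pages after fork().
    if (mprotect(resources, kResourceBytes, PROT_READ) != 0)
        perror("mprotect");

    // Spawn one instance as an illustration; a real master would do this on demand.
    pid_t pid = fork();
    if (pid == 0)
    {
        // Child instance: generate its unique level data with regular malloc/new;
        // the mmap'd resources above stay shared copy-on-write with the master.
        // run_instance(resources, kResourceBytes);   // hypothetical
        _exit(EXIT_SUCCESS);
    }
    if (pid > 0)
        waitpid(pid, nullptr, 0);

    munmap(resources, kResourceBytes);
    return EXIT_SUCCESS;
}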