Monitoring memory usage of own process - C++

I have an application which runs through a wrapper and is submitted as a job to a grid (Linux).
My task is to monitor the RAM and virtual memory usage of the process and, if it fails due to a memory issue, resubmit it to the grid with a higher memory requirement (using some switch).
I think this can be achieved by invoking a separate thread from the application which watches the main application and relaunches it in case of failure.
I am seeking advice on a better solution to this problem.
Thanks
Ruchi

A thread will not work, since C and C++ mandate that returning from the main function kills all running threads (courtesy of "Do child threads exit when the parent thread terminates?").
You will need to make it another process, perhaps a script that starts the application and then manages it.
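As a rough illustration of that idea (not a grid-specific solution), a watchdog/launcher process could look something like the sketch below; my_app, the --memory switch, the doubling policy, and the exit-code convention are all placeholders for whatever your grid wrapper actually uses, and the execl() call would be replaced by your grid submission command.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    long mem_mb = 2048;                          // hypothetical initial memory request
    for (;;) {
        char arg[64];
        std::snprintf(arg, sizeof arg, "--memory=%ldM", mem_mb);
        pid_t pid = fork();
        if (pid == 0) {
            // Child: replace this exec with your grid submission command.
            execl("./my_app", "my_app", arg, (char *)nullptr);
            _exit(127);                          // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return 0;                            // application finished cleanly
        mem_mb *= 2;                             // assume a memory failure; retry bigger
        std::fprintf(stderr, "relaunching with %ld MB\n", mem_mb);
    }
}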

A usual way of doing it would be to check when memory allocation fails, e.g. malloc(). If malloc() fails, that's an indication that your system's memory is almost full, and in that particular case you can do whatever you like.
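For illustration, a minimal check might look like the following sketch. Note that on Linux with memory overcommit enabled, malloc/new may succeed and the process may instead be killed later by the OOM killer, so this is not a complete safeguard; the exit code 42 is an arbitrary convention for the wrapper to detect.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

// Returns a buffer of n bytes, or exits with a distinctive code that the
// wrapper/watchdog can test for before resubmitting with more memory.
char *allocate_or_exit(std::size_t n) {
    try {
        return new char[n];
    } catch (const std::bad_alloc &) {
        std::fprintf(stderr, "allocation of %zu bytes failed\n", n);
        std::exit(42);
    }
}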

Related

Is it correct to use std::async for background tasks inside an internal thread (not from main process's thread)

I would like to have your opinion on this general technical concept. (I am working on the Microsoft Windows OS.)
There is a process; this process creates multiple threads for different tasks.
Main process: it is a Windows service written in C#.
There are several threads that are created inside the main process: Thread_01, Thread_02, ...
Inside Thread_01: there is a wrapper DLL written in managed C++ to consume DLL_01. (DLL_01 is a DLL written by me in native C++ that provides some APIs: Add, Remove, Connect.)
Add and Remove can run very fast, but Connect may take more than 10 seconds and blocks the caller until it finishes.
I am thinking of using std::async to run the Connect function code and send the result through a callback to the caller (main process).
Is this a good approach? I have heard that we cannot, or that it is better not to, create threads inside inner threads; is that true? If so, what about std::async?
Any recommendation is appreciated.
Thanks in advance,
None of what you describe makes the use of threads unacceptable for your code.
As usual, threads have issues that need to be cared for:
Data races due to access to shared data.
Ownership of resources is now not just a question of "Who owns what?" but "Who owns what, and when?".
When a thread is blocked and you want to abort the operation, how do you cancel it without causing issues down the line? In your case, you must avoid calling the callback when the receiver no longer exists.
Concerning your approach of using a callback, consider std::future<> instead. This takes care of a few of the issues above, though some are only shifted to the caller instead.
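As a sketch of that suggestion, assuming the long-running Connect() from the question returns a bool success flag (adjust to the real signature):

#include <future>

bool Connect();   // the blocking native API from the question (signature assumed)

std::future<bool> connectAsync() {
    // std::launch::async guarantees a new thread rather than deferred execution.
    return std::async(std::launch::async, [] { return Connect(); });
}

// Caller side:
//   auto fut = connectAsync();
//   ... do other work, or store the future ...
//   bool ok = fut.get();   // blocks only when the result is actually needed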

Terminating QWebSocketServer with connected sockets

I am debugging a multithreaded console application written in C++/Qt 5.12.1, running on Linux Mint 18.3 x64.
The app has a SIGINT handler, a QWebSocketServer and a table of QWebSockets. To handle termination it calls close() on the QWebSocketServer and abort()/deleteLater() on the items in the QWebSocket table.
If a websocket client connects to this console app, termination fails because of some running thread (I suppose it's an internal QWebSocket thread).
Termination succeeds if there were no connections.
How can I fix this so that the app exits gracefully?
To gracefully quit the socket server we can attempt:
The most important part is to allow the main thread event loop to run and wait on QWebSocketServer::closed() so that the slot calls QCoreApplication::quit().
That can be done even with:
connect(webSocketServer, &QWebSocketServer::closed,
        QCoreApplication::instance(), &QCoreApplication::quit);
if we don't need a more detailed reaction.
After connecting that signal (do this first of all), proceed with pauseAccepting() to prevent more connections.
Then call QWebSocketServer::close().
The steps below may not be needed if the above is sufficient; try the above first, and only if you still have problems deal with existing and pending connections. In my experience the behavior varied across platforms and with some unique websocket implementations in the server environment (which is likely just Qt in your case).
As long as we have some array with QWebSocket instances, we can try to call QWebSocket::abort() on all of them to release them immediately. This step seems to be what the question author describes.
Try to iterate pending connections with QWebSocketServer::nextPendingConnection() and call abort() on them; call deleteLater() as well, if that works.
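A minimal sketch combining these steps might look like the following; it assumes the tracked client sockets are kept in a QList<QWebSocket*> (the question's "QWebSocket table"), and is meant as an outline rather than drop-in code:

#include <QCoreApplication>
#include <QList>
#include <QWebSocket>
#include <QWebSocketServer>

void shutdownServer(QWebSocketServer *server, QList<QWebSocket *> &clients)
{
    // 1. Quit the event loop once the server reports it has closed.
    QObject::connect(server, &QWebSocketServer::closed,
                     QCoreApplication::instance(), &QCoreApplication::quit);

    server->pauseAccepting();   // 2. stop accepting new connections
    server->close();            // 3. close the listening socket

    // 4. Abort connections that were accepted but never handed to us.
    while (QWebSocket *pending = server->nextPendingConnection()) {
        pending->abort();
        pending->deleteLater();
    }

    // 5. Abort the connections we already track.
    for (QWebSocket *ws : clients) {
        ws->abort();
        ws->deleteLater();
    }
    clients.clear();
}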
There is no need to do anything. What do you mean by "graceful exit"? As soon as there's a request to terminate your application, you should terminate it immediately using exit(0) or a similar mechanism. That's what "graceful exit" should be.
Note: I got reformed. I used to think that graceful exits were a good thing. They are usually a waste of CPU resources and usually indicate problems in the architecture of the application.
A good rationale for why it should be so is given in the kj framework (a part of capnproto).
Quoting Kenton Varda:
KJ_NORETURN(virtual void exit()) = 0;
Indicates program completion. The program is considered successful unless error() was
called. Typically this exits with _Exit(), meaning that the stack is not unwound, buffers
are not flushed, etc. -- it is the responsibility of the caller to flush any buffers that
matter. However, an alternate context implementation e.g. for unit testing purposes could
choose to throw an exception instead.
At first this approach may sound crazy. Isn't it much better to shut down cleanly? What if
you lose data? However, it turns out that if you look at each common class of program, _Exit()
is almost always preferable. Let's break it down:
Commands: A typical program you might run from the command line is single-threaded and
exits quickly and deterministically. Commands often use buffered I/O and need to flush
those buffers before exit. However, most of the work performed by destructors is not
flushing buffers, but rather freeing up memory, placing objects into freelists, and closing
file descriptors. All of this is irrelevant if the process is about to exit anyway, and
for a command that runs quickly, time wasted freeing heap space may make a real difference
in the overall runtime of a script. Meanwhile, it is usually easy to determine exactly what
resources need to be flushed before exit, and easy to tell if they are not being flushed
(because the command fails to produce the expected output). Therefore, it is reasonably
easy for commands to explicitly ensure all output is flushed before exiting, and it is
probably a good idea for them to do so anyway, because write failures should be detected
and handled. For commands, a good strategy is to allocate any objects that require clean
destruction on the stack, and allow them to go out of scope before the command exits.
Meanwhile, any resources which do not need to be cleaned up should be allocated as members
of the command's main class, whose destructor normally will not be called.
Interactive apps: Programs that interact with the user (whether they be graphical apps
with windows or console-based apps like emacs) generally exit only when the user asks them
to. Such applications may store large data structures in memory which need to be synced
to disk, such as documents or user preferences. However, relying on stack unwind or global
destructors as the mechanism for ensuring such syncing occurs is probably wrong. First of
all, it's 2013, and applications ought to be actively syncing changes to non-volatile
storage the moment those changes are made. Applications can crash at any time and a crash
should never lose data that is more than half a second old. Meanwhile, if a user actually
does try to close an application while unsaved changes exist, the application UI should
prompt the user to decide what to do. Such a UI mechanism is obviously too high level to
be implemented via destructors, so KJ's use of _Exit() shouldn't make a difference here.
Servers: A good server is fault-tolerant, prepared for the possibility that at any time
it could crash, the OS could decide to kill it off, or the machine it is running on could
just die. So, using _Exit() should be no problem. In fact, servers generally never even
call exit anyway; they are killed externally.
Batch jobs: A long-running batch job is something between a command and a server. It
probably knows exactly what needs to be flushed before exiting, and it probably should be
fault-tolerant.

Cgroup usage to limit resources

My goal: to provide the user a way to limit resources like CPU and memory for a given process (C++).
Someone suggested that I utilize cgroups, which look like an ideal utility for this.
After doing some research I have a concern:
When we use memory.limit_in_bytes to limit the memory usage of a given process, is there a way to handle the out-of-memory condition in the process? I see that control groups provide a parameter called "memory.oom_control" which, when enabled, kills the process that requests more memory than allowed; when disabled, it just pauses the process.
I want a way to let the process know that it is requesting more memory than expected and should throw an out-of-memory exception, so that the process exits gracefully.
Do cgroups provide this kind of behaviour?
Also, are cgroups available in all flavours of Linux? I am mainly interested in RHEL 5+, CentOS 6+ and Ubuntu 12+ machines.
Any help is appreciated.
Thanks
I want a way to let the process know that it is requesting more memory than expected and should throw an out-of-memory exception, so that the process exits gracefully.
Do cgroups provide this kind of behaviour?
All processes in recent releases already run inside a cgroup, the default one. If you create a new cgroup and then migrate the process into the new cgroup, everything works as before but using the constraints from the new cgroup. If your process allocates more memory than permitted, it gets an ENOSPC or a malloc failure just as it presently does.
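For illustration, a minimal sketch of creating such a cgroup and migrating a process into it, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (the path and file names differ under cgroup v2, and "mygroup" is an arbitrary name):

#include <fstream>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Create a memory cgroup, set its limit, and move a process into it.
void limit_process_memory(pid_t pid, long long limit_bytes) {
    const std::string grp = "/sys/fs/cgroup/memory/mygroup";   // assumed v1 mount point
    mkdir(grp.c_str(), 0755);                                  // create the cgroup

    std::ofstream(grp + "/memory.limit_in_bytes") << limit_bytes << '\n';
    std::ofstream(grp + "/tasks") << pid << '\n';              // migrate the process
}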

Fork and core dump with threads

Similar points to the one in this question have been raised before here and here, and I'm aware of the Google coredump library (which I've appraised and found lacking, though I might try and work on that if I understand the problem better).
I want to obtain a core dump of a running Linux process without interrupting the process. The natural approach is to say:
if (!fork()) { abort(); }
Since the forked process gets a fixed snapshot copy of the original process's memory, I should get a complete core dump, and since the copy uses copy-on-write, it should generally be cheap. However, a critical shortcoming of this approach is that fork() only forks the current thread, and all other threads of the original process won't exist in the forked copy.
My question is whether it is possible to somehow obtain the relevant data of the other, original threads. I'm not entirely sure how to approach this problem, but here are a couple of sub-questions I've come up with:
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Is it possible to (quickly) enumerate all the running threads in the original process and store the addresses of the bases of their stacks? As I understand it, the base of a thread stack on Linux contains a pointer to the kernel's thread bookkeeping data, so...
with the stored thread base addresses, could you read out the relevant data for each of the original threads in the forked process?
If that is possible, perhaps it would only be a matter of appending the data of the other threads to the core dump. However, if that data is lost at the point of the fork already, then there doesn't seem to be any hope for this approach.
Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.
I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.
Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.
Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
No. The fork() only retains the thread that does the actual call, and the stacks for the rest of the threads are lost.
Here is the procedure I'd use, assuming CRIU is unsuitable:
Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)
You can detect the stop/continue events using waitpid(child, &status, WUNTRACED | WCONTINUED).
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.
You can do this from the parent process, which only needs the CAP_SYS_NICE capability (which you can give it using setcap 'cap_sys_nice=pe' parent-binary to the parent binary, if you have filesystem capabilities enabled like most current Linux distributions do), in both the effective and permitted sets.
The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.
Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
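If you do want that optional restriction, a rough sketch of the parent-side calls might look like this (cpu_set_t, CPU_ZERO/CPU_SET and SCHED_IDLE are glibc/Linux specific; restoring the original affinity and priority afterwards is left out):

#include <sched.h>
#include <sys/types.h>

// Confine the child to one CPU and drop it to idle scheduling. Raising the
// priority back afterwards is what requires CAP_SYS_NICE, as described above.
void restrict_child(pid_t child) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                              // CPU 0 only
    sched_setaffinity(child, sizeof set, &set);

    sched_param sp = {};                           // sched_priority must be 0 for SCHED_IDLE
    sched_setscheduler(child, SCHED_IDLE, &sp);
}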
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.
The parent process will receive the notification that the child was stopped. It first examines /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps the /proc/PID/task/TID/ pseudofiles for other information), then attaches to each TID using ptrace(PTRACE_ATTACH, TID). Obviously, ptrace(PTRACE_GETREGS, TID, ...) will obtain the per-thread register state, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)
When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.
I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on AMD Athlon II X4 640 on Xubuntu and a 3.8.0-29-generic kernel indicates tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).
Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)
It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick off the other threads from running CPUs, and thus try to make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).
Do you need some example code to illustrate my ideas above?
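In that spirit, here is a rough, untested sketch of the parent side (not the answerer's actual code): it assumes an x86/x86_64 Linux target for user_regs_struct, and the signal/stop bookkeeping around attaching to an already group-stopped thread is simplified; on newer kernels PTRACE_SEIZE may be the better choice.

#include <dirent.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// List the thread IDs of a process from /proc/<pid>/task/.
static std::vector<pid_t> list_tids(pid_t pid) {
    std::vector<pid_t> tids;
    char path[64];
    std::snprintf(path, sizeof path, "/proc/%d/task", (int)pid);
    if (DIR *dir = opendir(path)) {
        while (struct dirent *e = readdir(dir))
            if (e->d_name[0] != '.')
                tids.push_back((pid_t)std::atoi(e->d_name));
        closedir(dir);
    }
    return tids;
}

// Parent side: wait for the child to SIGSTOP itself, inspect every thread,
// then let the whole process continue.
void snapshot_on_stop(pid_t child) {
    int status = 0;
    if (waitpid(child, &status, WUNTRACED) != child || !WIFSTOPPED(status))
        return;                                   // not a stop event

    for (pid_t tid : list_tids(child)) {
        if (ptrace(PTRACE_ATTACH, tid, nullptr, nullptr) == -1)
            continue;
        waitpid(tid, &status, __WALL);            // wait for the attach to land

        user_regs_struct regs;                    // x86/x86_64 register layout
        if (ptrace(PTRACE_GETREGS, tid, nullptr, &regs) == 0) {
            // Combine regs with /proc/<pid>/task/<tid>/mem and smaps here to
            // build a per-thread stack trace or a debugger-readable dump.
        }
        ptrace(PTRACE_DETACH, tid, nullptr, nullptr);
    }

    kill(child, SIGCONT);                         // resume the whole process
}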
When you fork you get a full copy of the running process's memory. This includes all threads' stacks (after all, you could have valid pointers into them). But only the calling thread continues to execute in the child.
You can easily test this. Make a multithreaded program and run:
pid_t parent_pid = getpid();
if (!fork()) {
    // Child: stop the parent so its memory map stays fixed, then compare
    // the two /proc/<pid>/maps files -- they should be identical.
    kill(parent_pid, SIGSTOP);
    char buffer[0x1000];
    pid_t child_pid = getpid();
    sprintf(buffer, "diff /proc/%d/maps /proc/%d/maps", parent_pid, child_pid);
    system(buffer);
    kill(parent_pid, SIGTERM);
    return 0;
} else {
    for (;;);  // parent: spin so it stays alive for the comparison
}
So all your memory is there, and when you create a core dump it will contain all the other threads' stacks (provided your maximum core file size permits it). The only pieces that will be missing are their register sets. If you need those, then you will have to ptrace your parent to obtain them.
You should keep in mind, though, that core dumps are not designed to contain runtime information of more than one thread - the one that caused the core dump.
To answer some of your other questions:
You can enumerate threads by going through /proc/[pid]/task, but you cannot identify their stack bases until you ptrace them.
Yes, you have full access to the other threads' stack snapshots (see above) from the forked process. It is not trivial to determine them, but they do get put into a core dump, provided the core file size permits it. Your best bet is to save them in some globally accessible structure upon creation, if you can.
If you intend to get the core file at a non-specific location, i.e. just get a core image of the running process without killing it, then you can use gcore.
If you intend to get the core file at a specific location (condition) and still continue running the process, a crude approach is to execute gcore programmatically from that location.
A more classical, clean approach would be to check the API which gcore uses and embed it in your application - but that would be too much effort compared to the need, most of the time.
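For example, a crude programmatic invocation of gcore might look like the sketch below. It assumes gdb's gcore utility is installed and /tmp/snapshot is a writable prefix, and it may require ptrace permissions (e.g. kernel.yama.ptrace_scope) that allow attaching to the calling process.

#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// Write a core image of the current process to /tmp/snapshot.<pid> and keep running.
void dump_core_here() {
    char cmd[64];
    std::snprintf(cmd, sizeof cmd, "gcore -o /tmp/snapshot %d", (int)getpid());
    std::system(cmd);   // gcore attaches with ptrace, writes the core, then detaches
}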
HTH!
If your goal is to snapshot the entire process in order to understand the exact state of all threads at a specific point then I can't see any way to do this that doesn't require some kind of interrupt service routine. You must halt all processors and record off the current state of each thread.
I don't know of any system that provides this kind of full process core dump. The rough outlines of the process would be:
issue an interrupt across all CPUs (both logical and physical cores).
busy wait for all cores to synchronize (this shouldn't take long).
clone the desired process's memory space: duplicate the page tables and mark all pages as copy on write.
have each processor check whether its current thread is in the target process. If so record the current stack pointer for that thread.
for every other thread examine the thread data block for the current stack pointer and record it.
create a kernel thread to save off the copied memory spaces and the thread stack pointers
resume all cores.
This should capture the entire process state, including a snapshot of any processes that were running at the moment the inter-processor interrupt was issued. Because all threads are interrupted (either through standard scheduler suspension process, or via our custom interrupt process) all register states will be on a stack somewhere in the process memory. You then only need to know where the top of each thread stack is. Using the copy on write mechanism to clone the page tables allows transparent save-off while the original process is allowed to resume.
This is a pretty heavyweight option, since its main functionality requires suspending all processors for a significant amount of time (synchronize, clone, walk all threads). However, this should allow you to exactly capture the status of all threads, as well as determine which threads were running (and on which CPUs) when the checkpoint was reached. I would assume some of the framework for doing this exists (in CRIU, for instance). Of course, resuming the process will result in a storm of page allocations as the copy-on-write mechanism protects the checkpointed system state.

Hibernating/restarting a thread

I'm looking for a way to restart a thread, either from inside that thread's context or from outside the thread, possibly from within another process. (Any of these options will work.) I am aware of the difficulty of hibernating entire processes, and I'm pretty sure that those same difficulties attend to threads. However, I'm asking anyway in the hopes that someone has some insight.
My goal is to pause, save to file, and restart a running thread from its exact context with no modification to that thread's code, or rather, modification in only a small area - i.e., I can't go writing serialization functions throughout the code. The main block of code must be unmodified, and will not have any global/system handles (file handles, sockets, mutexes, etc.). Really down-and-dirty details like CPU registers do not need to be saved; but basically the heap, stack, and program counter should be saved, and anything else required to get the thread running again logically correctly from its save point. The resulting state of the program should be no different whether it was saved or not.
This is for a debugging program for high-reliability software; the goal is to run simulations of the software with various scripts for input, and be able to pause a running simulation and then restart it again later - or get the sim to a branch point, save it, make lots of copies and then run further simulations from the common starting point. This is why the main program cannot be modified.
The main thread's code is in C++, and should run on Windows and Linux; however, if there is a way to do this on only one system, then that's acceptable too.
Thanks in advance.
I think what you're asking is much more complicated than you think. I am not too familiar with Windows programming but here are some of the difficulties you'll face in Linux.
A saved thread can only be restored from the root process that originally spawned it, otherwise the dynamic libraries would be broken. Because of this, saving to disk is essentially meaningless. The reason is that dynamic libraries are loaded at a different address each time they're loaded. The only way around this would be to take complete control of dynamic linking, no small feat. It's possible, but pretty scary.
The suspended thread will have variables in the heap. You'd need to be able to find all the globals 'owned' by the thread, but the 'owned' state of any piece of the heap cannot be determined. In the future it may be possible with C++0x's garbage collection ABI. You can't just assume the whole heap belongs to the thread to be paused: the main thread uses the heap when creating threads, so blowing away the heap when deserializing the paused thread would break the main thread.
You need to address the issues with globals, and not just the globals created in the threads: globals (or statics) can be, and often are, created in dynamic libraries.
There are more resources to a program than just memory. You have file handles, network sockets, database connections, etc. A file handle is just a number; serializing its memory is completely meaningless without the context of the process the file was opened in.
All that said. I don't think the core problem is impossible, just that you should consider a different approach.
Anyway, to try to implement this, the thread to be paused needs to be in a known state. I imagine the thread to be stopped would call a library function meant to halt the process so it could be resumed.
I think the Linux system call fork is your friend. Fork perfectly duplicates a process. Have the system run to the desired point and fork; one fork waits so it can fork others, and the second fork runs one set of input.
Once it completes, the first fork can fork again, and the new second fork can run another set of input.
Continue ad infinitum.
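A minimal sketch of that scheme, with InputSet and run_simulation() as stand-ins for the real simulator:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

struct InputSet { int scenario; };                    // stand-in for real input data
void run_simulation(const InputSet &) { /* ... */ }   // stand-in for the simulator

// At a branch point, fork once per input set; each child inherits an exact
// copy of the paused simulation state, and the parent waits before branching again.
void run_from_branch_point(const std::vector<InputSet> &inputs) {
    for (const InputSet &in : inputs) {
        pid_t pid = fork();
        if (pid == 0) {
            run_simulation(in);      // child: continue with this input set
            _exit(0);
        }
        waitpid(pid, nullptr, 0);    // parent: wait, then branch again
    }
}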
Threads run in the context of a process. So if you want to do anything like persist a thread state to disk, you need to "hibernate" the entire process.
You will need to serialise the entire set of the process's data, and you'll need to store the current thread execution point. I think serialising the process is doable (check out Boost.Serialization), but the thread stop point is a lot more difficult. I would put places where it can be stopped throughout the code, but as you say, you cannot modify the code.
Given that problem, you're looking at virtualising the platform the app is running on and using its suspend functionality to pause the entire thing. You might find more information about how to do this in the virtualisation vendor's features, e.g. Xen.
As the whole logical address space of the program is part of the thread's context, you would have to hibernate the whole process.
If you can guarantee that the thread only uses local variables, you could save its stack. It is easy to suspend a thread with pthreads, but I don't see how you could access its stack from outside then.
The way you would have to do this is via VM snapshots; get a copy of VMware Workstation, then you can write code to automate starting/stopping/snapshotting the machine at different points. Any other approach is pretty untenable: while you might be able to freeze and thaw a process, you can't reconstruct the system state it expects (all the stuff that Caspin mentions, like file handles et al.).