Parameters to watch while running an application on linux? - c++

I am running an application overnight on my Linux xscale device. I am looking for things,which would increase with the increase in the amount of time of running.
One thing,is the memory. If you observe the memory on the xscale systems,the free memory would start decreasing,but you will see an increase in the cached memory. What are the other parameters which we can observe ,e.g. can we observe the amount of stack or heap usage?

If the application is developed by you, i would recommend the following heap profiler to use for getting more deeper details.
http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools

vmstat is often a good thing to run; it gives detailed information about memory, swap, & cpu usage, as well as the average number of interrupts & context switches per second. Give it a number, n, and it will run continuously every n seconds.

Related

How do I prevent my parallel code using up all the available system memory?

I'm developing a c++ code on Linux which can run out of memory, go into swap and slow down significantly, and sometimes crash. I'd like to prevent this from happening by allowing the user to specify a limit on the proportion of the total system memory that the process can use up. If the program should exceed this limit, then the code could output some intermediate results, and terminate cleanly.
I can determine how much memory is being used by reading the resident set size from /proc/self/stat. I can then sum this up across all the parallel processes to give me a total memory usage for the program.
The total system memory available can be obtained via a call to sysconf(_SC_PHYS_PAGES) (see How to get available memory C++/g++?). However, if I'm running on a parallel cluster, then presumably this figure will only give me the total memory for the current cluster node. I may, for example, be running 48 processes across 4 cluster nodes (each with 12 cores).
So, my real question is how do I find out which processor a given process is running on? I could then sum up the memory used by processes running on the same cluster node, and compare this with the available memory on that node, and terminate the program if this exceeds a specified percentage on any of the nodes that the program is running on. I would use sched_getcpu() for this, but unfortunately I'm compiling and running on a system with version 2.5 of glibc, and sched_getcpu() was only introduced in glibc 2.6. Also, since the cluster is using on old linux OS (version 2.6.18), I can't use syscall() to call getcpu() either! Is there any other way to get the processor number, or any sort of identifier for the processor, so that I can sum memory used across each processor separately?
Or is there a better way to solve the problem? I'm open to suggestions.
A competently-run cluster will put your jobs under some form of resource limits (RLIMIT_AS or cgroups). You can do this yourself just by calling setrlimit(RLIMIT_AS,...). I think you're overcomplicating things by worrying about sysconf, since on a shared cluster, there's no reason to think that your code should be using even a fixed fraction of the physical memory size. Instead, you should choose a sensible memory requirement (if your cluster doesn't already provide one - most schedulers do memory scheduling reasonably well.) Even if you insist on doing it yourself, auto-sizing, you don't need to know about which cores are in use: just figure out how many copies of your process are on the node, and divide appropriately. (you will need to figure out which node (host) each process is running on, of course.)
It's worth pointing out that the kernel RLIMIT_AS, not RLIMIT_RSS. And that when you hit this limit, new allocations will fail.
Finally, I question the design of a program which uses unbounded memory. Are you sure there's not a better algorithm? Users are going to find your program pretty irritating to use if, after investing significant time in a computation, it decides it's going to try allocating more, then fails. Sometimes people ask this sort of question when they are mistakenly thinking that allocating as much memory as possible will give them better IO buffering (which is naive wrt pagecache, etc).

getting std::bad_alloc error ; How to cross-verify that OS is really running out of memory

I have a C++ program/Linux, which within 2-3 seconds of running starts spitting error std::bad alloc on a 32GB RAM (and gets restarted by wrapper caller). What I really care about is to solve this problem, but I would like to go step by step and build up my confidence in my understanding of the problem.
It looks like the system is not able to allocate memory for a new request (this would happen when the OS has run out of memory). While the program is running, on another terminal I run the sar command with the smallest interval possible (1 second), but I see that kbcached is ~24GB memory. Why is the OS not able to release the caching and make that memory available to the new request? Either 1 sec is too much time (in comparison to how fast programs run) or I am doing something wrong here.
Basically I would like to cross-verify and pin-point that the OS is indeed running out of memory and thus is not able to allocate memory, and then take things from this point on. How to do it?
Ideally, I would like to have the system statistics right at the point when memory allocation fails, like how much caching, total used up memory etc.
If you actually want to see how your process's memory is allocated, you could set a breakpoint with gdb for when the exception is thrown. When it is, inspect the process with a tool like pmap, which can show you additional information about how the process uses memory.
If that's too primitive (and it quickly will be, pmap is pretty primitive), valgrind includes Massif and many other utilities for diagnosing memory usage, CPU utilization, and other runtime problems.

Using Nsight to determine bank conflicts and coalescing

How can i know the number of non Coalesced read/write and bank conflicts using parallel nsight?
Moreover what should i look at when i use nsight is a profiler? what are the important fields that may cause my program to slow down?
I don't use NSight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll be careful to your GPU's occupancy.
Other interesting values are the way the compiler has set your local variables: in registers or in local memory.
Finally, you'll check the time spent to transfer data to and back from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescence <-- basically you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the Uncoalesced.
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
For the question on what are the important fields/ things to look at (when using the Nsight profiler) that may cause my program to slow down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring but your application threads (Thread State) is Green
b. Memory bound – kernels execution blocked on memory transfers to or from the device. You can see this by looking at the Memory Row. If you are spending a lot of time in Memory Copies then you should consider using CUDA streams to pipeline your application. This can allow you to overlap memory transfers and kernels. Before changing your code you should compare the duration of the transfers and kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.

How to write a C or C++ program to act as a memory and CPU cycle filler?

I want to test a program's memory management capabilities, for example (say, program name is director)
What happens if some other processes take up too much memory, and there is too less memory for director to run? How does director behave?
What happens if too many of the CPU cycles are used by some other program while director is running?
What happens if memory used by the other programs is freed after sometime? How does director claim the memory and start working at full capabilities.
etc.
I'll be doing these experiments on a Unix machine. One way is to limit the amount of memory available to the process using ulimit, but there is no good way to have control over the CPU cycle utilization.
I have another idea. What if I write some program in C or C++ that acts as a dynamic memory and CPU filler, i.e. does nothing useful but eats up memory and/or CPU cycles anyways?
I need some ideas on how such a program should be structured. I need to have dynamic(runtime) control over memory used and CPU used.
I think that creating a lot of threads would be a good way to clog up the CPU cycles. Is that right?
Is there a better approach that I can use?
Any ideas/suggestions/comments are welcome.
http://weather.ou.edu/~apw/projects/stress/
Stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPLv2.
The functionality you seek overlaps the feature set of "test tools". So also check out http://ltp.sourceforge.net/tooltable.php.
If you have a single core this is enough to put stress on a CPU:
while ( true ) {
x++;
}
If you have lots of cores then you need a thread per core.
You make it variably hungry by adding a few sleeps.
As for memory, just allocate lots.
There are several problems with such a design:
In a virtual memory system, memory size is effectively unlimited. (Well, it's limited by your hard disk...) In practice, systems usually run out of address space well before they run out of backing store -- and address space is a per-process resource.
Any reasonable (non realtime) operating system is going to limit how much CPU time and memory your process can use relative to other processes.
It's already been done.
More importantly, I don't see why you would ever want to do this.
Dynamic memory control, you could just allocate or free buffers of a certain size to use or free more or less memory. As for CPU utilization, you will have to get an OS function to check this and periodically check it and see if you need to do useful work.

Staying away from virtual memory in Windows\C++

I'm writing a performance critical application where its essential to store as much data as possible in the physical memory before dumping to disc.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved\free and how much memory my current process handles.
Using this data I can make sure to dump when ~90% of the physical memory is in use or ~90 of the maximum of 2GB per application limit is hit.
However, I would like a method for simply recieving how many bytes are actually left before the system will start using the virtual memory, especially as the application will be compiled for both 32bit and 64bit, whereas the 2 GB limit doesnt exist.
How about this function:
int
bytesLeftUntilVMUsed() {
return 0;
}
it should give the correct result in nearly all cases I think ;)
Imagine running Windows 7 in 256Mb of RAM (MS suggest 1GB minimum). That's effectively what you're asking the user to do by wanting to reseve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Update:
Using top of the range RAM (DDR3) would give you a theoretical transfer speed of 12GB/s which equates to reading one 32 bit value every clock cycle with some bandwidth to spare. I'm fairly sure that it is not possible to do anything useful with the data coming into the CPU at that speed - instruction processing stalls would interrupt this flow. The extra, unsued bandwidth can be used to page data to/from a hard disk. Using RAID this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without having any degradation of performance - 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily and so on. Locking memory to RAM would have adverse affects on the whole system, thus defeating the purpose of locing the memory.
If you explain what you're trying to acheive and the performance critria, there are many people here that will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.
Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging out other programs to disk and that might potentially affect your performance as well. Not to mention that another application might start up and consume memory that you're currently occupying and thus resulting in some of your applications memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not really recommended to use large chunks of it unless you want to make sure your system isn't that stable.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.
The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.
Ouch. Non-paged pool (the amount of RAM which cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2GB machine. If another 90% of physical RAM would be allocated to a process, that leaves either -2,5% for all other applications, services, the kernel and drivers. Even if you'd allocate only 85% for your app, that would still leave only 2,5% = 51 MB.