Non-voluntary context switches: How can I prevent them? - c++

I have a small application that I was running right now, and I wanted to check if I have any memory leaks in it, so I put in this piece of code:
for (unsigned int i = 0; i < 10000; i++) {
    for (unsigned int j = 0; j < 10000; j++) {
        std::ifstream &a = s->fhandle->open("test");
        char temp[30];
        a.getline(temp, 30);
        s->fhandle->close("test");
    }
}
While the application was running, I cat'ed /proc/<pid>/status to see if the memory use increases.
The output after about 2 minutes of runtime is the following:
Name: origin-test
State: R (running)
Tgid: 7267
Pid: 7267
PPid: 6619
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 256
Groups: 4 20 24 46 110 111 119 122 1000
VmPeak: 183848 kB
VmSize: 118308 kB
VmLck: 0 kB
VmHWM: 5116 kB
VmRSS: 5116 kB
VmData: 9560 kB
VmStk: 136 kB
VmExe: 28 kB
VmLib: 11496 kB
VmPTE: 240 kB
VmSwap: 0 kB
Threads: 2
SigQ: 0/16382
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000002004
SigCgt: 00000001800044c2
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed: 3f
Cpus_allowed_list: 0-5
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 120
nonvoluntary_ctxt_switches: 26475
None of the values changed except the last one, so does that mean there are no memory leaks?
But what's more important, and what I would really like to know, is whether it is bad that the last value is increasing so rapidly (about 26475 switches in about 2 minutes!).
I looked at some other applications to compare how many non-voluntary switches they have:
Firefox: about 200
Gdm: 2
Netbeans: 19
Then I googled and found some explanations, but they were too technical for me to understand.
What I got from them is that this happens when the application is moved to another processor, or something like that? (I have an AMD 6-core processor, by the way.)
How can I prevent my application from doing that, and to what extent could this be a problem when running the application?
Thanks in advance,
Robin.

A voluntary context switch occurs when your application is blocked in a system call and the kernel decides to give its time slice to another process.
A non-voluntary context switch occurs when your application has used up the whole time slice the scheduler attributed to it (the kernel tries to pretend that each application has the whole computer to itself and can use as much CPU as it wants, but it has to switch from one process to another so that the user has the illusion that they are all running in parallel).
In your case, since you're opening, closing and reading from the same file, it probably stays in the virtual file system cache during the whole execution of the process, so your program never blocks (thanks to the system and library caches) and gets preempted by the kernel instead. On the other hand, Firefox, Gdm and Netbeans are mostly waiting for input from the user or from the network, so they rarely need to be preempted.
Those context switches are not harmful. On the contrary, they allow your processor to be shared fairly by all applications even when one of them is waiting for some resource.
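If you want to watch those counters from inside your program rather than through /proc, here is a minimal sketch (my own illustration, not part of the question's code) using getrusage(), whose struct rusage exposes the two counts as ru_nvcsw and ru_nivcsw:

#include <sys/resource.h>
#include <cstdio>

// Print this process's own context-switch counters.
// ru_nvcsw  = voluntary switches (blocked and gave up the CPU)
// ru_nivcsw = non-voluntary switches (preempted by the scheduler)
int main() {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        std::perror("getrusage");
        return 1;
    }
    std::printf("voluntary: %ld, non-voluntary: %ld\n",
                ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}

Calling this before and after the test loop would show the same growth in the non-voluntary count that /proc reports.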
And by the way, to detect memory leaks, a better solution would be to use a tool dedicated to this, such as valgrind.
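For instance, a deliberately leaking toy program (leak.cpp here is just an illustrative name, not the asker's code) shows what valgrind reports when run with --leak-check=full:

// leak.cpp -- a deliberate leak to try valgrind on.
// Build and run:  g++ -g leak.cpp -o leak
//                 valgrind --leak-check=full ./leak
int main() {
    int *p = new int[100]; // allocated but never deleted
    p[0] = 42;             // touch it so the allocation is not optimised away
    return 0;              // valgrind reports this block as definitely lost
}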

To add to Sylvain's info, there is a nice background article on Linux scheduling here: "Inside the Linux scheduler" (developerWorks, June 2006).

To look for a memory leak it is much better to install and use valgrind, http://www.valgrind.org/. It will identify memory leaks in the heap and memory error conditions (use of uninitialized memory, and tons of other problems). I use it almost every day.

Related

How to get details of context switches happening in a multithreaded embedded application?

I am trying to profile a C++ application on an embedded device. Using VTune, I found out that the app is launching hundreds of threads, most of which are active for only a small percentage of the total time.
I want to get details of the context switches that are happening (preferably in some kind of timeline view). I have yet to come across a tool that can show the context switch information. Is there some kind of profiler that provides this? Or some other way to get this info?
Thanks.
On Linux you could use cat /proc/{PID}/status to get some information on threads, voluntary_ctxt_switches and nonvoluntary_ctxt_switches.
for example,
uname#hostname:/$ cat /proc/1357/status
Name: avahi-daemon
Umask: 0022
State: S (sleeping)
Tgid: 1357
Ngid: 0
Pid: 1357
PPid: 1
TracerPid: 0
Uid: 107 107 107 107
Gid: 114 114 114 114
FDSize: 128
Groups: 114
NStgid: 1357
NSpid: 1357
NSpgid: 1357
NSsid: 1357
VmPeak: 10500 kB
VmSize: 8288 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 3664 kB
VmRSS: 2700 kB
RssAnon: 328 kB
RssFile: 2372 kB
RssShmem: 0 kB
VmData: 468 kB
VmStk: 132 kB
VmExe: 92 kB
VmLib: 3720 kB
VmPTE: 52 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
Threads: 1
SigQ: 0/31668
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000180004203
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-127
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 1610
nonvoluntary_ctxt_switches: 25
This answer is specific to the Linux OS. It would be good if you specified which OS you are using, because otherwise you may get a solution you don't need.
If you have Linux perf events, you can get a visual timeline of the context switches in your application using perf timechart record followed by perf timechart. If the duration of the recording is long, it may take a while to process the result.
If you want to know which parts of your program are the culprit, it may be better to use perf record -e context-switches --call-graph XXX (where XXX is the unwinding method, e.g. fp or dwarf) to sample the backtrace whenever a context switch happens. Look into the perf manual to see more details of the command-line options. Once you have collected some trace data, you can visualise it with perf report. I believe Intel VTune is still able to open perf traces, but you need to rename the files from the default perf.data to a file name ending with the .perf extension.

Memory mapped file access is very slow

I am writing to a 930GB file (preallocated) on a Linux machine with 976 GB memory.
The application is written in C++ and I am memory mapping the file using Boost Interprocess. Before starting the code I set the stack size:
ulimit -s unlimited
The writing was very fast a week ago, but today it is running slowly. I don't think the code has changed, but I may have accidentally changed something in my environment (it is an AWS instance).
The application ("write_data") doesn't seem to be using all the available memory. "top" shows:
Tasks: 559 total, 1 running, 558 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 98.5%id, 1.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1007321952k total, 149232000k used, 858089952k free, 286496k buffers
Swap: 0k total, 0k used, 0k free, 142275392k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4904 root 20 0 2708m 37m 27m S 1.0 0.0 1:47.00 dockerd
56931 my_user 20 0 930g 29g 29g D 1.0 3.1 12:38.95 write_data
57179 root 20 0 0 0 0 D 1.0 0.0 0:25.55 kworker/u257:1
57512 my_user 20 0 15752 2664 1944 R 1.0 0.0 0:00.06 top
I thought the resident size (RES) should include the memory-mapped data, so shouldn't it be > 930 GB (the size of the file)?
Can someone suggest ways to diagnose the problem?
Memory mappings generally aren't eagerly populated. If some other program had forced the file into the page cache, you'd see good performance from the start; otherwise you'd see poor performance as the file was paged in.
Given that you have enough RAM to hold the whole file in memory, you may want to hint to the OS that it should prefetch the file, reducing the number of small reads triggered by page faults and substituting larger bulk reads. The posix_madvise API can be used to provide this hint, by passing POSIX_MADV_WILLNEED as the advice, indicating that the whole file should be prefetched.
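A minimal sketch of that hint, assuming the file is mapped with Boost Interprocess as described in the question (the file name data.bin is a placeholder):

#include <sys/mman.h>
#include <cstdio>
#include <cstring>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

int main() {
    namespace bip = boost::interprocess;
    // Map the whole preallocated file.
    bip::file_mapping file("data.bin", bip::read_write);
    bip::mapped_region region(file, bip::read_write);

    // Hint that the whole mapping will be needed soon, so the kernel
    // can read it in bulk instead of faulting pages in one at a time.
    int err = posix_madvise(region.get_address(), region.get_size(),
                            POSIX_MADV_WILLNEED);
    if (err != 0)  // posix_madvise returns the error number directly
        std::fprintf(stderr, "posix_madvise: %s\n", std::strerror(err));

    // ... write through region.get_address() as before ...
    return 0;
}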

Calculating used memory by a set of processes on Linux

I'm having trouble calculating the memory actually used (resident) by a set of processes.
The issue that just came up is a user with a set of processes that share memory among themselves, so a simple addition of the used memory ends up with a nonsense number (>60 GB when the machine only has 48 GB of memory).
Is there any simple way to approach this problem?
I can probably do some approximation: take (res mem - shared mem) * num procs + shared mem. But not all processes necessarily share the same memory block.
I'm looking for a POSIX or Linux solution to this problem for C/C++.
You will want to iterate through each process's /proc/[pid]/smaps.
It will contain an entry like the following for each VM mapping:
7ffffffe7000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
Size: 100 kB
Rss: 20 kB
Pss: 20 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 20 kB
Referenced: 20 kB
Anonymous: 20 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Private_Dirty memory is what you are most interested in.
If you have the Pss field in your smaps file, then that is the proportional set size: the mapping's resident memory with each shared page divided by the number of processes that share it, so the per-process totals can be added without double-counting.
Private_Clean could be copy-on-write mappings. Those are commonly used for shared libraries and are generally read/no-write/execute.
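As a rough sketch of that summation (assuming your kernel exposes the Pss field; the PIDs below are placeholders):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Sum the Pss fields (in kB) of one process's /proc/<pid>/smaps.
long pss_kb(int pid) {
    std::ifstream smaps("/proc/" + std::to_string(pid) + "/smaps");
    long total = 0;
    std::string line;
    while (std::getline(smaps, line)) {
        if (line.compare(0, 4, "Pss:") == 0) {
            std::istringstream fields(line.substr(4));
            long kb = 0;
            fields >> kb;
            total += kb;
        }
    }
    return total;
}

int main() {
    std::vector<int> pids = {1234, 1235}; // the process set (placeholder PIDs)
    long total = 0;
    for (int pid : pids) total += pss_kb(pid);
    std::cout << "total PSS: " << total << " kB\n";
}

Because each shared page contributes 1/n of its size to each of the n processes mapping it, the per-process Pss totals add up to a sensible whole.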

Need to get process memory using c++

I want to calculate my process's memory (RSS) at runtime in my application (C++/Unix/multithreaded). Do we have any API to use for that? Please note that I am aware of reading /proc/stat and of getrusage(), but I don't want to read/parse a system file from the application, and getrusage() does not work in my Linux distribution.
The whole intent is to check for memory leaks caused by my application. I have even tried tracking memory by overloading new/malloc/calloc/realloc to get the memory allocations tracked, but even with these I am not able to track the whole memory allocated by the process. It would also be helpful if you could suggest other probable areas where I should look for memory allocations/memory leaks besides the APIs stated above.
I am aware of Valgrind/mpatrol-type memory monitoring tools, but unfortunately they do not work with my application.
Thanks in advance
First, this kind of information is operating-system specific. It has to be done differently on Linux, on MacOSX, on FreeBSD...
On Linux, the blessed way is, as everyone told you, to use the /proc file system, which is how all the system utilities (e.g. top or ps) retrieve that information (perhaps by using libproc, which is just a wrapper around reads of /proc/ files).
Could you explain why reading e.g. /proc/self/statm or /proc/self/stat or /proc/self/status or /proc/self/maps is not possible for you?
Remember that these /proc/ files are pseudo-files, and no actual slow I/O operation to disk is involved in reading them. And you have to read them sequentially; seeking (or stat-ing) them does not work.
It seems to me that
long process_size_in_pages(void)
{
    long s = -1;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) return -1;
    // if for any reason the fscanf fails, s is still -1,
    // with errno appropriately set.
    fscanf(f, "%ld", &s);
    fclose(f);
    return s;
}
is the fastest way to retrieve that information. Why can't you do that?
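One detail worth noting: statm counts in pages, so if you want bytes you multiply by the page size from sysconf. A tiny helper on top of the function above:

#include <unistd.h>

// Convert the page count from process_size_in_pages() to bytes.
long process_size_in_bytes(void)
{
    long pages = process_size_in_pages();
    if (pages < 0) return -1;
    return pages * sysconf(_SC_PAGESIZE);
}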
You could use valgrind. By running your program under its gdbserver mode and issuing the monitor command leak_check full from gdb, it will give you the total allocated memory at run time. See this page for more information.
You can read /proc/${pid}/status, it looks like
Name: nginx
State: S (sleeping)
SleepAVG: 98%
Tgid: 11884
Pid: 11884
PPid: 11883
TracerPid: 0
Uid: 99 99 99 99
Gid: 99 99 99 99
FDSize: 64
Groups: 99
VmPeak: 23932 kB
VmSize: 23932 kB
VmLck: 0 kB
VmHWM: 4276 kB
VmRSS: 4276 kB
VmData: 3744 kB
VmStk: 88 kB
VmExe: 452 kB
VmLib: 3024 kB
VmPTE: 88 kB
StaBrk: 1a931000 kB
Brk: 1a974000 kB
StaStk: 7fffc224d560 kB
Threads: 1
SigQ: 0/73712
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000040001000
SigCgt: 0000000198016a07
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000f
Mems_allowed: 00000000,00000001
You can parse the VmRSS value.
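A minimal sketch of that parsing, reading the calling process's own status file:

#include <fstream>
#include <sstream>
#include <string>

// Return the current resident set size in kB, or -1 on failure,
// by scanning /proc/self/status for the VmRSS line.
long vm_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            fields >> kb;
            return kb;
        }
    }
    return -1;
}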

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using `ofstream`. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The "_write$UNIX2003" (where can I find out what this is?) appears to be coming from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except that I see no visible increase in performance (0.1 seconds on a 10-second calculation). The code is essentially:
ofstream outfile("out.txt");
for (/* each step */) {
    SliceStep::DoStep();
    outfile << result;
}
outfile.close();
Update: I am timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive, you need a better test case:
Use a larger problem.
Do a warmup before measuring: do some loops and start any measurement afterwards (in the same process). A minimal sketch of such a harness follows below.
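Here is that sketch (the workload function is a hypothetical stand-in, not the asker's code):

#include <chrono>
#include <cmath>
#include <cstdio>

// Stand-in workload (hypothetical; the real code would call SliceStep::DoStep).
double do_work() {
    double s = 0.0;
    for (int i = 0; i < 10000000; ++i) s += std::exp(-i * 1e-7);
    return s;
}

int main() {
    do_work(); // warmup: fills caches, faults in pages, warms the allocator
    auto t0 = std::chrono::steady_clock::now();
    double r = do_work(); // the measured run
    auto t1 = std::chrono::steady_clock::now();
    std::printf("result %g in %.3f s\n", r,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}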
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines to the console using python results in something like:
for i in xrange(100000):
    print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 is probably referring to the write POSIX system call, which performs the actual output. I/O is very slow compared to almost anything else, so it makes sense that your program would spend a lot of time there if you are writing a fair bit of output.
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess based only on the information you've given. It would be nice to see some of the code, or even the perftools output when the output statements are removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the docs, you can display the call graph at statement or address granularity. That should tell you what you need to know.
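For example, assuming the program is linked against gperftools' profiler library, a sketch like the following produces a profile that pprof can then break down per line or per address (the file names are placeholders):

#include <gperftools/profiler.h>  // build with -lprofiler
#include <cmath>

// Stand-in workload (hypothetical, in place of the real calculation).
double work() {
    double s = 0.0;
    for (int i = 0; i < 50000000; ++i) s += std::exp(-i * 1e-8);
    return s;
}

int main() {
    ProfilerStart("app.prof");  // start sampling, write to app.prof
    work();
    ProfilerStop();             // flush the profile
    return 0;
}

// Inspect afterwards with:
//   pprof --lines ./app app.prof      (statement granularity)
//   pprof --addresses ./app app.prof  (address granularity)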