Measure the peak stack pointer value and its PC location - gdb

For an analysis of different binaries, I need to measure the peak actual stack memory usage (not just the stack pages reserved, but the memory actually used). I was trying the following with gdb:
# track the lowest SP seen so far and the PC where it occurred
set $spnow = $sp
watch $sp
commands
silent
if $sp < $spnow
set $spnow = $sp
set $pcnow = $pc
print $spnow
print $pcnow
end
continue
end
c
This appears to "work" when applied to ls, except that even for a program as short-running as ls, it doesn't actually appear to make progress; it gets stuck in functions like "in strcoll_l () from /usr/lib/libc.so.6". It is probably just too slow with this methodology.
I also looked into the valgrind massif tool. It can profile stack usage, but unfortunately it doesn't seem to be able to report in what part of the program the peak usage was encountered.

For an analysis of different binaries, I need to measure the peak actual stack memory usage
Your GDB approach works only for single-threaded programs, and is too slow to be practical (the watch $sp command forces GDB to single-step your program).
If you only care about stack usage at page granularity (and I think you should -- does it really matter whether the program used 1024 or 2000 bytes of stack?), then a much faster approach is to run the program in a loop, reducing its ulimit -s while the program still runs successfully. You could also binary search: start with the default 8MB, then try 4MB, 2MB, 1MB, 512KB, etc. until it fails, then raise the limit again to narrow down the exact value.
For /bin/ls:
bash -c 'x=4096; while /bin/ls > /dev/null; do
echo $x; x=$(($x/2)); ulimit -s $x || break; done'
4096
2048
1024
512
256
128
64
32
bash: line 1: 109951 Segmentation fault (core dumped) /bin/ls > /dev/null
You can then find the $PC at the point of failure by loading the core dump into GDB (e.g. gdb /bin/ls core) and looking at the backtrace.
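If you want to automate the bisection, a small driver along these lines could work (a sketch only: /bin/ls and the 4 KB to 8 MB search range are placeholders, and error handling is minimal):
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Returns true if 'target' runs to completion with its stack limited to 'kbytes' KB.
static bool runs_with_stack_kb(const char* target, long kbytes) {
    pid_t pid = fork();
    if (pid == 0) {
        rlimit rl{static_cast<rlim_t>(kbytes) * 1024,
                  static_cast<rlim_t>(kbytes) * 1024};
        setrlimit(RLIMIT_STACK, &rl);            // lower the child's stack limit
        std::freopen("/dev/null", "w", stdout);  // silence the target's output
        execl(target, target, static_cast<char*>(nullptr));
        _exit(127);                              // exec failed
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(int argc, char** argv) {
    const char* target = argc > 1 ? argv[1] : "/bin/ls";  // placeholder target
    long lo = 4, hi = 8192;   // search range in KB; assumes 8 MB always works
    while (lo < hi) {         // binary search for the smallest working limit
        long mid = (lo + hi) / 2;
        if (runs_with_stack_kb(target, mid)) hi = mid; else lo = mid + 1;
    }
    std::printf("%s needs roughly %ld KB of stack\n", target, lo);
    return 0;
}
The answer is still only page-granular, of course: the kernel enforces the limit in whole pages.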
I need the precise limits because I want to figure out which compiler optimizations cause which micro-changes to stack usage (even in the byte range), along with .data and .text sizes.
I believe it's a fool's errand to attempt that.
In my experience, stack use is most affected by compiler inlining decisions. These in turn are most affected by precise compiler version and tuning, presence of runtime information (for profile-guided optimization), and precise source of the program being optimized.
A yes/no change to an inlining decision can increase stack use by hundreds of KBs in recursive programs, and minuscule changes to any of the above factors can change that decision.

Related

Looking for a way to detect valgrind/memcheck at runtime without including valgrind headers

Valgrind/Memcheck can be intensive and causes runtime performance to drop significantly. I need a way to detect it at runtime so I can disable all auxiliary services and features and keep the checks under 24 hours. I would prefer not to pass any explicit flags to the program, but that would be one way.
I explored searching the symbol table (via abi calls) for valgrind or memcheck symbols, but there were none.
I explored checking the stack (via boost::stacktrace), but nothing was there either.
I'm not sure it's a good idea to have different behaviour when running under Valgrind, since the goal of Valgrind is to exercise your software in its expected usage.
Anyway, Valgrind does not change the stack or symbols, since it (kind of) emulates a CPU running your program. The only way to detect whether you're being run under Valgrind is to observe its effects, namely that everything is slow and effectively single-threaded under Valgrind.
So, for example, run a test that spawns 3 threads consuming a common FIFO (with a mutex/lock) and observe the number of items each one received. On a real CPU, you'd expect the 3 threads to have processed close to the same number of items in time T, but when run under Valgrind, one thread will have consumed almost all the items in time >> T.
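A rough C++ sketch of such a test might look like this (the thread count, item count, and thresholds are illustrative guesses that would need calibrating against your real workload):
#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Heuristic: under Valgrind threads are serialized (and everything is much
// slower), so one worker tends to drain almost the whole work queue and the
// wall-clock time balloons.
static bool probably_under_valgrind() {
    constexpr int kThreads = 3;
    constexpr int kItems = 100000;            // illustrative workload size
    std::atomic<bool> go{false};
    std::atomic<int> remaining{kItems};
    std::vector<int> consumed(kThreads, 0);

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            while (!go.load(std::memory_order_acquire))
                std::this_thread::yield();    // wait for the common start signal
            while (remaining.fetch_sub(1, std::memory_order_relaxed) > 0)
                ++consumed[t];                // "process" one item
        });
    }

    auto start = std::chrono::steady_clock::now();
    go.store(true, std::memory_order_release);
    for (auto& w : workers) w.join();
    auto elapsed = std::chrono::steady_clock::now() - start;

    int max_share = *std::max_element(consumed.begin(), consumed.end());
    bool skewed = max_share > (kItems * 9) / 10;           // one thread did >90%
    bool slow = elapsed > std::chrono::milliseconds(200);  // illustrative cutoff
    return skewed && slow;
}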
Another possibility is to call some known syscall. Valgrind has rules for observing syscalls. For example, if you are allocating memory, Valgrind will intercept that block of memory and fill the area with some data. Well-behaved software should not read that data before first writing to it (thus overwriting what Valgrind set). If you do read it and observe a non-zero value, you'll get a Valgrind "invalid read of size XXX" style message, but your code will know it's being instrumented.
Finally (and I think it's much simpler), you could move the code you need to instrument into a library and have 2 frontends: the "official" frontend, and a test frontend with all the bells and whistles disabled that is meant to be run under Valgrind.

How can I get the number of instructions executed by a program?

I have written and cross-compiled a small C++ program, and I can run it on an ARM board or a PC. Since ARM and a PC have different instruction set architectures, I want to compare them. Is it possible to get the number of executed instructions in this C++ program for both ISAs?
What you need is a profiler; perf is an easy one to use. It will give you the number of instructions executed, which is the best metric if you want to compare ISA efficiency.
Check the tutorial here.
You need to use: perf stat ./your_binary
Look for the instructions metric. This approach uses a register in your CPU's performance monitoring unit (PMU) that counts the number of instructions.
Are you trying to get the number of static instructions or dynamic instructions? For instance, take the following loop:
for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i];
The static instruction count will be just under 10 instructions, give or take based on your ISA, but the dynamic count would depend on N, on the branch prediction implementation, and so on.
So for static count I would recommend using objdump, as per recommendations in the comments. You can find the entry and exit labels of your subroutine and count the number of instructions in between.
For dynamic instruction count, I would recommend one of two things:
You can simulate running that code using an instruction set simulator (there are open-source ISA simulators for both ARM and x86 out there -- gem5, for instance, implements both of them, and there are others that support one or the other).
Your second option is to run the code natively on the target system and set up performance counters in the CPU to report the dynamic instruction count. You would reset the counter before executing your code and read it afterwards (there may be some noise associated with calling your subroutine and returning, but you should be able to isolate that).
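On Linux, that second option could be sketched with the perf_event_open(2) syscall, roughly as below (minimal error handling; work_under_test is a placeholder for your own code, and access to the counter may require a permissive perf_event_paranoid setting):
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Thin wrapper: glibc provides no perf_event_open() function, only the syscall.
static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void work_under_test() {
    volatile long sum = 0;                     // placeholder for your own code
    for (int i = 0; i < 1000000; ++i) sum += i;
}

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // retired instructions
    attr.disabled = 1;                         // created stopped
    attr.exclude_kernel = 1;                   // count user-space only
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /*this process*/, -1 /*any cpu*/, -1, 0);
    if (fd == -1) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);        // reset before the measured region
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work_under_test();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);      // stop counting, then read

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        std::perror("read");
        return 1;
    }
    std::printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}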
Hope this helps :)
objdump -dw mybinary | wc -l
On Linux and friends, this gives a good approximation of the number of instructions in an executable, library, or object file. This is a static count, which is of course completely different from runtime behaviour.
Linux:
valgrind --tool=callgrind ./program 1 > /dev/null
Callgrind then reports the total number of executed instructions (the Ir, "instruction read", count).

how to limit cache memory ubuntu?

I have written some C++ code which I have to run on many low-configuration computers, while my own PC is very high-end. I am using Ubuntu 10.04, and I set hard limits on some resources, i.e. memory and virtual memory. Now my questions are:
1) How do I set a limit on the cache size and cache line size?
2) What other limits should I set to check whether my code is OK or not?
I am using command:
ulimit -H -m 1000000
ulimit -H -v 500000
You can't limit cache size, that's a (mostly transparent) hardware feature.
The good news is this shouldn't matter, since you can't run out of cache - it just spills and your program runs more slowly.
If your concern is avoiding spills, you could investigate valgrind --tool=cachegrind -- it simulates the cache and lets you specify the cache parameters of your target hardware, so you can examine the likely behaviour there.
To PROPERLY simulate running on low-end machines (although not with small cache limits), you can run the code in a virtual machine rather than on your real hardware. This shows you what happens on a machine with little memory far better than limiting with ulimit does, because ulimit only limits what YOUR application will get: it shows that your application doesn't run out of memory for a particular set of tests, but not how the application and the system behave together when there isn't much memory in the first place.
A machine with a small amount of physical memory behaves quite differently with respect to, for example, swapping and filesystem caching -- just a couple of the things that differ between "large memory, but the application is limited" and "small memory in the first place".
I'm not sure whether Ubuntu comes with any flavour of virtual machine setup, but VirtualBox, for example, is pretty easy to configure and set up on any Linux/Windows machine, as long as you have a modern enough processor with hardware virtualization support.
As Useless not at all uselessly stated, cache memory will not "run out" or in any other way cause a failure. The program will just run a little slower -- not massively so (about 10x for any given operation, and that is averaged over a large number of other instructions in most cases, unless you are really working hard at showing how important cache is, such as with very large matrix multiplications).
One tip might also be to look around for some old hardware. There are usually computers for sale that are several years old, for next to nothing at a "computer recycling shop" or similar. Set such a system up, install your choice of OS, and see what happens.

GDB hardware watchpoint very slow - why?

On a large C application, I have set a hardware watchpoint on a memory address as follows:
(gdb) watch *0x12F5D58
Hardware watchpoint 3: *0x12F5D58
As you can see, it's a hardware watchpoint, not a software one (which would have explained the slowness).
Now the application's running time under the debugger has gone from less than ten seconds to one hour and counting. The watchpoint has triggered three times so far, the first time after 15 minutes, when the memory page containing the address was made readable by sbrk. Surely during those 15 minutes the watchpoint should have been efficient, since the memory page was inaccessible? And that still does not explain why it is so slow afterwards.
The platform is x86_64 and the GDB versions are Ubuntu 9.10 package:
$ gdb --version
GNU gdb (GDB) 7.0-ubuntu
[...]
and stock GDB 7.1 built from sources:
$ gdb-7.1 --version
GNU gdb (GDB) 7.1
Thanks in advance for any ideas as what might be the cause or how to fix/work around it.
EDIT: removed cast
EDIT: gdb 7.1
I discovered that watching a large character buffer was very slow, whereas watching a character in that buffer was very fast.
e.g.
static char buf[1024];
static char *buf_address = buf;
watch buf - excruciatingly slow (the whole 1024-byte array cannot fit in a hardware debug register, so GDB falls back to a software watchpoint).
watch *buf_address - very fast (a single byte fits in a hardware watchpoint).
I've actually had trouble with hardware watchpoints in GDB 7.x.x., which is not acceptable since watchpoints are a necessity in my job.
On advice from a co-worker, I downloaded the source for 6.7.1 and built it locally. Watchpoints work much better now.
Might be worth a try.
It's most likely because you're casting it each time. Try this:
(gdb) watch *0x12F5D58
Another option is that you have too many hardware watchpoints set, so gdb is forced to use software watchpoints. Try checking how many watchpoints you have using:
(gdb) info break
and see if you can disable some watchpoints.
On x86 you have the following limitation: your hardware watchpoints can cover no more than four memory addresses, and each address can watch at most one memory word. This is because hardware watchpoints (the fast ones) use the processor's debug registers, and you have four of them, therefore four locations to watch.

Stack allocation limit for programs on a Linux 32 bit machine

In C++, how much can the stack segment grow before the compiler gives up and says that it cannot allocate more memory for the stack?
Using gcc on a linux (fedora) 32 bit machine.
Under UNIX, if you are running bash, run
$ ulimit -a
It will list various limits, including the stack size. Mine is 8192 KB. You can use ulimit to change the limits.
You can also use the ulimit() function to set various limits from within your program:
$ man 3 ulimit
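For example, a minimal sketch using setrlimit(2), the modern interface underlying ulimit (the 4 MB value is just an example; the soft limit can only be raised up to the hard limit):
#include <sys/resource.h>
#include <cstdio>

int main() {
    rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) { std::perror("getrlimit"); return 1; }
    if (rl.rlim_cur == RLIM_INFINITY)
        std::printf("current stack soft limit: unlimited\n");
    else
        std::printf("current stack soft limit: %llu bytes\n",
                    (unsigned long long)rl.rlim_cur);

    rl.rlim_cur = 4 * 1024 * 1024;   // example: lower the soft limit to 4 MB
    if (setrlimit(RLIMIT_STACK, &rl) != 0) { std::perror("setrlimit"); return 1; }
    return 0;
}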
Under Windows see StackReserveSize and StackCommitSize
In practice, the stack starts at a high address (on a 32-bit platform, close to the 3 GB limit) and grows downward, while heap allocation starts at low addresses and grows upward. This allows the stack and the heap to grow toward each other until the whole address space is exhausted.
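A quick way to see this layout is to compare a stack address with a heap address; a tiny sketch (the exact numbers vary with platform and ASLR):
#include <cstdio>
#include <memory>

int main() {
    int on_stack = 0;
    auto on_heap = std::make_unique<int>(0);
    // Typically the stack address is far higher than the heap address, and
    // deeper stack frames appear at successively lower addresses.
    std::printf("stack variable at %p\n", static_cast<void*>(&on_stack));
    std::printf("heap  variable at %p\n", static_cast<void*>(on_heap.get()));
    return 0;
}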
On my 32-bit Linux it's 8192 KB, so it should be the same on your machine.
$ uname -a
Linux TomsterInc 2.6.28-14-generic #46-Ubuntu SMP Wed Jul 8 07:21:34 UTC 2009 i686 GNU/Linux
$ ulimit -s
8192
Windows and (I think) Linux both operate on the "big stack model" assumption: there is one stack (per thread) whose space is preallocated before the thread starts.
I suspect the OS simply assigns virtual memory space of the preallocated size to that stack area, and adds real memory pages underneath as the end of the stack is advanced beyond a page boundary, until the upper limit ("ulimit") is reached.
Since OSes often place stacks well away from other structures, when the ulimit is reached it is just possible that the OS might be able to expand the stack, if nothing else has shown up next to the stack by the time the overflow occurs. In general, if you are building a program complex enough to overflow the stack, you are likely allocating memory dynamically, and there is no guarantee that the area next to the stack didn't get allocated. If such memory is allocated, of course the OS can't expand the stack where it is.
This means the application cannot count on the stack being expanded automatically by the OS. In effect, the stack can't grow.
In theory, an application exhausting its stack might be able to start a new thread with a larger stack, copy the existing stack, and continue, but as a practical matter I doubt this can be done, if for no other reason than that pointers to stack-allocated local variables would need adjusting, and C/C++ compilers don't make it possible to find such pointers and adjust them.
Consequence: ulimit has to be declared before the program starts, and once exceeded, the program dies.
If one wants a stack that can expand arbitrarily, it is better to switch to a language that uses heap-allocated activation records. Then you simply don't run out until your address space is used up; 32- or 64-bit virtual address spaces ensure you can do a lot of recursion with this technique.
We have a parallel programming language called PARLANSE that heap-allocates activation records, enabling thousands of parallel computational grains (in practice) to recurse arbitrarily this way.