When I try to experiment with buffer overflows, I set randomize_va_space to 0 and compile with the -fno-stack-protector flag, but my experiments still don't work on newer kernels. Why?
Don't work how? This question is extremely lacking in detail.
First, you can disable ASLR for a given process with setarch -R. Second, I suspect you have shellcode that executes from the stack, which is mapped non-executable. This can be remedied with -zexecstack.
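To make that concrete, here is a minimal sketch (file name, buffer size, and input are placeholders, not the asker's code) of the kind of deliberately vulnerable program and build line this implies; on distros that build position-independent executables by default you may also need -no-pie:

// vuln.cpp -- deliberately vulnerable test program (illustration only).
// Suggested build with mitigations off:
//   g++ -g -fno-stack-protector -z execstack -no-pie -o vuln vuln.cpp
// Run with ASLR disabled for just this process:
//   setarch -R ./vuln <long input>
#include <cstdio>
#include <cstring>

void vulnerable(const char *input) {
    char buf[64];
    std::strcpy(buf, input);             // unchecked copy: overflows buf for long input
    std::printf("you said: %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1)
        vulnerable(argv[1]);
    return 0;
}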
I strongly suggest you get a distro prepared for beginners. https://exploit.education/ has some excellent overflow examples that can be set up in a virtual environment.
Referring to utilities such as:
Electric Fence (or DUMA for Windows)
gflags.exe -p /enable C:\Path\To\Your.exe /full (Windows)
mpatrol
DrMemory
Valgrind's memcheck (for Linux)
g++ ... -lmcheck
g++ ... -fmudflap (Linux)
etc.
Is there one that is guaranteed to catch these accesses 100% of the time? (In the sense that every single time an actual illegal read/write occurs, it will be detected.)
I get the sense that most of them use probabilistic approaches. But is a deterministic tool of this kind theoretically possible, or would it require special hardware?
I realise it's a tricky question requiring a lot of research and deep understanding, so I don't expect quick answers. But if anyone knows the answer already, please share!
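To be concrete, here is a minimal sketch of the kind of illegal access I mean: an off-by-one heap write that a checker such as Valgrind's memcheck is designed to flag. The question is whether any tool is guaranteed to report every access like this.

// Off-by-one heap write; a memory checker should report the store to p[10],
// which is one element past the end of the allocation.
#include <cstdlib>

int main() {
    int *p = static_cast<int *>(std::malloc(10 * sizeof(int)));
    for (int i = 0; i <= 10; ++i)    // bug: i == 10 writes past the block
        p[i] = i;
    std::free(p);
    return 0;
}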
As some of you might know, Google provides, for free, a great collection of tools for analyzing C++ code:
http://code.google.com/p/google-perftools/
The problem is that there is apparently a libunwind issue on 64-bit systems, and the authors can't do anything on their side to fix it:

"But I don't expect a fix anytime soon: it depends on the libc folks and the libunwind folks working out some locking issues. There's unfortunately not much we ourselves can do."

So I'm searching for a replacement. Is there any similar tool that provides a nice graphical representation of profiling data?
EDIT: here is the part of the README that explains the problem:
2) On x86-64 64-bit systems, while tcmalloc itself works fine, the
cpu-profiler tool is unreliable: it will sometimes work, but sometimes
cause a segfault. I'll explain the problem first, and then some
workarounds.
Note that this only affects the cpu-profiler, which is a
google-perftools feature you must turn on manually by setting the
CPUPROFILE environment variable. If you do not turn on cpu-profiling,
you shouldn't see any crashes due to perftools.
The gory details: The underlying problem is in the backtrace()
function, which is a built-in function in libc. Backtracing is fairly
straightforward in the normal case, but can run into problems when
having to backtrace across a signal frame. Unfortunately, the
cpu-profiler uses signals in order to register a profiling event, so
every backtrace that the profiler does crosses a signal frame.
In our experience, the only time there is trouble is when the signal
fires in the middle of pthread_mutex_lock. pthread_mutex_lock is
called quite a bit from system libraries, particularly at program
startup and when creating a new thread.
The solution: The dwarf debugging format has support for 'cfi
annotations', which make it easy to recognize a signal frame. Some OS
distributions, such as Fedora and gentoo 2007.0, already have added
cfi annotations to their libc. A future version of libunwind should
recognize these annotations; these systems should not see any
crashes.
Workarounds: If you see problems with crashes when running the
cpu-profiler, consider inserting ProfilerStart()/ProfilerStop() into
your code, rather than setting CPUPROFILE. This will profile only
those sections of the codebase. Though we haven't done much testing,
in theory this should reduce the chance of crashes by limiting the
signal generation to only a small part of the codebase. Ideally, you
would not use ProfilerStart()/ProfilerStop() around code that spawns
new threads, or is otherwise likely to cause a call to
pthread_mutex_lock!
--- 17 May 2011
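For reference, here is a minimal sketch of the ProfilerStart()/ProfilerStop() workaround described in the README above. The header is <google/profiler.h> in the release quoted here and <gperftools/profiler.h> in newer ones; link with -lprofiler. The hot_loop function is just a stand-in for real work.

#include <google/profiler.h>       // <gperftools/profiler.h> in newer releases

// Stand-in for the CPU-bound code we actually want to profile.
static volatile double sink;
static void hot_loop() {
    double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += static_cast<double>(i) * 1e-9;
    sink = x;
}

int main() {
    ProfilerStart("cpu.prof");     // write samples to cpu.prof
    hot_loop();                    // keep thread creation and lock-heavy code outside this window
    ProfilerStop();                // flush samples and stop the signal-driven sampling
    return 0;
}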
Valgrind has a collection of great tools, including Callgrind for profiling code. The GUI front-end for Callgrind and Cachegrind is KCachegrind.
I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, its output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago, and in my gprof output, gprof does appear to list functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?
Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications, as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications).
Have a look at poor man's profiler. Surprisingly, few other tools do both CPU profiling and mutex-contention profiling for multithreaded applications; PMP does both, and it doesn't even require you to install anything (as long as you have gdb).
Try the modern Linux profiling tool, perf (perf_events): https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report
Have a look at Valgrind.
As Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.
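Here is a rough sketch of that alarm-clock idea on Linux (render_frame is a stand-in for your per-frame work; the 35 ms value matches the example above): arm a one-shot timer at the start of each frame, disarm it at the end, and let the handler trap into the debugger only when a frame runs long.

#include <csignal>
#include <sys/time.h>

static void on_slow_frame(int) {
    std::raise(SIGTRAP);                  // under gdb this stops right inside the slow frame
}

static void arm_frame_timer(long usec) {
    itimerval t = {};                     // one-shot: it_interval left at zero
    t.it_value.tv_sec  = usec / 1000000;
    t.it_value.tv_usec = usec % 1000000;
    setitimer(ITIMER_REAL, &t, nullptr);
}

static void disarm_frame_timer() {
    itimerval t = {};                     // all zeros cancels the pending timer
    setitimer(ITIMER_REAL, &t, nullptr);
}

void render_frame() { /* stand-in for the real per-frame work */ }

int main() {
    std::signal(SIGALRM, on_slow_frame);
    for (;;) {
        arm_frame_timer(35000);           // 35 ms budget for a 33 ms frame
        render_frame();
        disarm_frame_timer();
    }
}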
I tried valgrind and gprof. It is a crying shame that none of them work well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.
You can randomly run pstack to find out the stack at a given point. E.g. 10 or 20 times.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you have also sar to measure cpu, memory, i/o, etc. to profile the overall system.
Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).
Putting a slightly different twist on matters, you can actually get a pretty good idea of what's going on in a multithreaded application using ftrace and kernelshark. Collect the right trace and press the right buttons, and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).
Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.
Could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every several hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is definitely a sign of memory being corrupted somewhere.
Actually I have already tried some tools without much luck. Here is my experience so far:
Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a staging server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my staging server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the staging server, in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.
DUMA: same story as ElectricFence but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is definitely built with the -g flag).
dmalloc: I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere inside dmalloc.
I'm gradually going crazy and simply don't know what to do next. I still have the following tools to try: mtrace and mpatrol. But maybe someone has a better idea?
I'd greatly appreciate any help on this issue.
Update: I managed to find the source of the bug. However, I found it on the staging server (not the production one) using Helgrind/DRD/TSan: there was a data race between several threads that resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. I still don't really know how this could be discovered on the production server without a significant slowdown...
Yes, C/C++ memory corruption problems are tough.
I have also used Valgrind several times; sometimes it revealed the problem and sometimes it didn't.
While examining Valgrind output, don't be too quick to dismiss its results. Sometimes, after considerable time has been spent, you'll realize that Valgrind gave you the clue in the first place, but you ignored it.
Another piece of advice is to compare the code changes against the last known stable release. That's easy if you use some sort of version control system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).
Compile your program with gcc 4.1 and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing this should be able to detect it. You might need to play with some of the additional parameters of SSP.
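As an illustration (a sketch, not your code), this is the sort of stack smashing SSP catches at run time. Built with g++ -fstack-protector-all, the overflow below clobbers the canary and glibc aborts with a "stack smashing detected" message instead of corrupting memory silently.

#include <cstring>

static void smash(const char *src) {
    char buf[8];
    std::strcpy(buf, src);   // writes far past the end of buf
}

int main() {
    smash("this string is much longer than eight bytes");
    return 0;
}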
Folks, I managed to find the source of the bug. However, I found it on the staging server, using Helgrind/DRD/TSan: there was a data race between several threads that resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. I still don't really know how this could be discovered on the production server without a significant slowdown...
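For anyone hitting the same thing, this is roughly the shape of the bug those tools report: a minimal data race sketch (run it under valgrind --tool=helgrind or valgrind --tool=drd, or build it with g++ -fsanitize=thread -pthread for TSan).

#include <thread>

long counter = 0;                 // shared state with no synchronization

static void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                // racy read-modify-write from two threads
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    return 0;                     // the final value of counter is unpredictable
}

The fix is a std::mutex around the increment, or making counter a std::atomic<long>.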
Have you tried -fmudflap? (scroll up a few lines to see the options available).
You can try IBM Rational Purify, but I'm afraid it is not open source.
Google Perftools, which is open source, may be of help; see the heap-checker documentation.
Try this one:
http://www.hexco.de/rmdebug/
I used it extensively; it has a low impact on performance (it mostly increases RAM usage), and the allocation algorithm stays the same. It has always proven enough to find any allocation bug. Your program will crash as soon as the bug occurs, and you will have a detailed log.
I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.
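To illustrate, here is a minimal sketch of the kind of heap overrun those checks are aimed at. Run it with MALLOC_CHECK_=3 set in the environment and glibc should abort in free() with a diagnostic rather than silently corrupting its bookkeeping (on recent glibc versions the debugging hooks may additionally need libc_malloc_debug.so preloaded).

#include <cstdlib>
#include <cstring>

int main() {
    char *p = static_cast<char *>(std::malloc(16));
    std::memset(p, 'A', 17);   // one-byte heap overrun (off-by-one)
    std::free(p);              // the checked free should notice the clobbered boundary
    return 0;
}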
I have some old code written in C for 16-bit using Borland C++ that switches between multiple stacks, using longjmps. It creates a new stack by doing a malloc, and then sets the SS and SP registers to the segment and offset, respectively, of the address of the malloc'd area, using inline assembler. I would like to convert it to Win32, and it looks like the two instructions should be replaced by a single one setting ESP. The two instructions were surrounded by a CLI/STI pair, but in Win32 these give "privileged instruction" faults, so I have cut them out for now.

I am a real innocent when it comes to Windows, so I was rather surprised that my first test case worked! So my rather vague question for the experts here is: is what I am doing (a) too dangerous to continue with, or (b) workable if I add some code, take certain precautions, etc.? If the latter, what should be added, and where can I find out about it? Do I have to worry about any other registers, like SS, EBX, etc.? I am using no optimization... Thanks for any tips people can give me.
Removing CLI/STI still works due to the differences in the operating environment.
On 16-bit DOS, an interrupt could occur at any moment, and it would initially run on the same stack. If you got interrupted in the middle of the operation, the interrupt handler could crash because you had only updated SS and not SP.
On Windows, and in any other modern environment, each user-mode thread gets its own stack. If your thread is interrupted for whatever reason, its stack and context are safely preserved; you don't have to worry about something else running on your thread and your stack. CLI/STI in this case would be protecting against something you're already protected against by the OS.
As Greg mentioned, the safe, supported way of swapping stacks like this on Windows is CreateFiber/SwitchToFiber. This does have the side-effect of changing your entire context, so it is not like just switching the stack.
This really raises the question of what you want to do. A lot of the time, switching stacks is done to get around limited stack space, which was 64 KB on 16-bit DOS. On Windows, you get a 1 MB stack by default and can allocate an even larger one. Why are you trying to switch stacks?
By far the safest way to do this is to port the code to official Win32 multiprogramming structures, such as threads or fibers. Fibers provide a very lightweight multi-stack paradigm that sounds like it might be suitable for your application.
The Why does Win32 even have fibers? article is an interesting read too.
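To give a feel for the fiber API (a minimal sketch, not a drop-in port of the longjmp code): the main thread becomes a fiber, creates a worker fiber with its own stack, and the two hand control back and forth explicitly with SwitchToFiber.

#include <windows.h>
#include <cstdio>

static LPVOID g_mainFiber = nullptr;

static VOID CALLBACK WorkerFiber(LPVOID) {
    std::printf("running on the worker fiber's stack\n");
    SwitchToFiber(g_mainFiber);          // yield back to the main fiber
    std::printf("worker resumed\n");
    SwitchToFiber(g_mainFiber);          // a fiber routine should never just return
}

int main() {
    g_mainFiber = ConvertThreadToFiber(nullptr);        // make this thread a fiber
    LPVOID worker = CreateFiber(64 * 1024, WorkerFiber, nullptr);

    SwitchToFiber(worker);               // run the worker until it yields
    std::printf("back on the main fiber\n");
    SwitchToFiber(worker);               // resume the worker after its yield

    DeleteFiber(worker);
    ConvertFiberToThread();              // optional cleanup
    return 0;
}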
I have done this in user mode, and it appears to have no problems. You do not need CLI/STI; these instructions merely prevent interrupts at that point in the code, which should be unnecessary, judging from the limited information you have given us.
Have a look at MTasker by Bert Hubert. It does simple cooperative multitasking, and it might be easy for you to port your code using it.
Don't forget that jumping stacks is going to hose any arguments or stack-resident variables.