How can I prevent a suspected infinite loop in the following scenario?
The entire C++ codebase is instrumented by clang at build time, using an LLVM pass that searches for llvm.memcpy intrinsics and inserts a call to the instrumentation runtime after each one
The instrumentation runtime contains a std::map structure
The underlying libc++ code that implements std::map has been instrumented, and in turn calls the instrumentation runtime again
When I run the program, it freezes once the first instrumentation call is made. The suspected loop is trace_memcpy > std::map::operator[] > trace_memcpy > and so forth
Is there a way to short-circuit this loop, e.g. can the instrumentation library inspect the call stack to see that it is already in the call stack and return early from the trace_memcpy function?
Thanks :)
Quick & dirty & probably not bulletproof - add a static variable to the implementation of trace_memcpy to avoid nesting.
void trace_memcpy(void)
{
    static int nested;   // non-zero while a trace is already in progress

    if (nested)
    {
        return;   // re-entered from instrumented runtime code; bail out
    }

    nested = 1;
    // whatever your actual trace logic is
    nested = 0;
}
If you need something more sophisticated, use the appropriate concurrency object as provided by your system.
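For example, here is a minimal sketch of a sturdier variant, assuming C++11 (the names are illustrative, not part of any runtime): a thread_local flag makes the guard per-thread, and a small RAII helper restores it even on early return or an exception.
// Sketch only: per-thread re-entrancy guard for the tracing runtime.
struct ReentryGuard
{
    bool &flag;
    bool owned;   // true if this guard set the flag (we are the outermost call)
    explicit ReentryGuard(bool &f) : flag(f), owned(!f) { if (owned) flag = true; }
    ~ReentryGuard() { if (owned) flag = false; }
};

void trace_memcpy(void)
{
    thread_local bool in_trace = false;   // one flag per thread
    ReentryGuard guard(in_trace);
    if (!guard.owned)
        return;   // re-entered via instrumented std::map code; bail out
    // actual trace logic (may now safely touch the instrumented std::map)
}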
Related
Let's say I have a function like:
template<typename It, typename Cmp>
void mysort( It begin, It end, Cmp cmp )
{
    std::sort( begin, end, cmp );
}
When I compile this using -finstrument-functions-after-inlining with clang++ (clang++ --version reports):
clang version 11.0.0 (...)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: ...
The instrumentation explodes the execution time, because my entry and exit functions are called for every call of
void std::__introsort_loop<...>(...)
void std::__move_median_to_first<...>(...)
I'm sorting a really big array, so my program doesn't finish: without instrumentation it takes around 10 seconds, with instrumentation I've cancelled it at 10 minutes.
I've tried adding __attribute__((no_instrument_function)) to mysort (and the function that calls mysort), but this doesn't seem to have an effect as far as these standard library calls are concerned.
Does anyone know if it is possible to ignore function instrumentation for the internals of a standard library function like std::sort? Ideally, I would only have mysort instrumented, so a single entry and a single exit!
I see that clang++ sadly does not yet support anything like -finstrument-functions-exclude-function-list or -finstrument-functions-exclude-file-list, but g++ does not yet support -finstrument-functions-after-inlining which I would ideally have, so I'm stuck!
EDIT: After playing more, it would appear the effect on execution-time is actually less than that described, so this isn't the end of the world. The problem still remains however, because most people who are doing function instrumentation in clang will only care about the application code, and not those functions linked from (for example) the standard library.
EDIT2: To further highlight the problem now that I've got it running in a reasonable time frame: the resulting trace that I produce from the instrumented code with those two standard library functions is 15GB. When I hard code my tracing to ignore the two function addresses, the resulting trace is 3.7MB!
I've run into the same problem. It looks like support for these flags was once proposed, but never merged into the main branch.
https://reviews.llvm.org/D37622
This is not a direct answer, since the tool doesn't support what you want to do, but I think I have a decent work-around: what I wound up doing was creating a "skip list" of sorts, checked inside the profiling callbacks (__cyg_profile_func_enter and __cyg_profile_func_exit). I would guess the part contributing most to your execution time is the printing, so short-circuiting the profile functions for functions you don't care about should help, even if it's not ideal. At the very least it will limit the size of the output file.
Something like
#include <stdint.h>
#include <stddef.h>

uintptr_t skipAddrs[] = {
    // assuming 64-bit addresses; placeholder values for the real function addresses
    0x123456789abcdef, 0x2468ace2468ace24
};
size_t arrSize = 0;

int main(void)
{
    // ...
    arrSize = sizeof(skipAddrs) / sizeof(skipAddrs[0]);
    // https://stackoverflow.com/a/37539/12940429
    // ...
}
// avoid instrumenting the profiling callback itself
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    for (size_t idx = 0; idx < arrSize; idx++) {
        if ((uintptr_t) this_fn == skipAddrs[idx]) {
            return;   // this function is on the skip list; trace nothing
        }
    }
    // ... normal (expensive) tracing/printing for everything else ...
}
I use something like objdump -t binaryFile to examine the symbol table and find what the addresses are for each function.
If you specifically want to ignore library calls, something that might work is examining the symbol table of your object file(s) before linking against libraries, then ignoring all the ones that appear new in the final binary.
All this should be possible with things like grep, awk, or python.
You have to add the attribute __attribute__((no_instrument_function)) to the functions that should not be instrumented. Unfortunately it is not easy to make this work with C/C++ standard library functions, because it would require editing all of the library headers.
There are some hacks you can do, like redefining existing macros from libc++'s include/__config to add this attribute as well, e.g.
-D_LIBCPP_INLINE_VISIBILITY=__attribute__((no_instrument_function,internal_linkage))
Make sure to append no_instrument_function to the macro's existing definition to avoid unexpected errors.
I have three threads in an application I'm building, all of which remain open for the lifetime of the application. Several variables and functions should only be accessed from specific threads. In my debug compile, I'd like a check to be run and an error to be thrown if one of these functions or variables is accessed from an illegal thread, but I don't want this as overhead in my final compilation. I really just want this so that I, the programmer, don't make stupid mistakes, not to protect my executing program from making mistakes.
Originally, I had a 'thread protected' class template that would wrap around return types for functions, and run a check on construction before implicitly converting to the intended return type, but this didn't seem to work for void return types without disabling important warnings, and it didn't resolve my issue for protected variables.
Is there a method of doing this, or is it outside the scope of the language? 'If you need this solution, you're doing it wrong' comments not appreciated; I managed to nearly halve my program's execution time with this methodology, but it's just too likely I'm going to make a mistake that results in a silent race condition and ultimately undefined behavior.
What you described is exactly what the assert macro is for.
assert(condition)
In a debug build, condition is checked; if it is false, the program prints a diagnostic and aborts at that line. In a release build (one compiled with NDEBUG defined), the assert and whatever is inside the parentheses aren't compiled at all.
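A minimal illustration of that behavior, using only standard <cassert>:
#include <cassert>

int main()
{
    int x = 1;
    // Checked in a debug build; compiled out entirely (side effects included)
    // when NDEBUG is defined, e.g. with -DNDEBUG on the compiler command line.
    assert(x == 1);
}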
Without being harsh, it would have been more helpful if you had explained the variables you are trying to protect. What type are they? Where do they come from? What's their lifetime? Are they global? Why do you need to protect a returned type if it's void? How did you end up in a situation where one thread might accidentally access something? I kinda have to guess, but I'll throw out some ideas here:
#include <thread>
#include <cassert>
#include <string>

extern std::thread g_thread1;   // the thread that owns these functions

// protect a free function
void protectedFunction()
{
    assert(std::this_thread::get_id() == g_thread1.get_id());
}

// protect a global singleton (full program lifetime)
std::string& protectedGlobalString()
{
    static std::string inst;
    assert(std::this_thread::get_id() == g_thread1.get_id());
    return inst;
}

// protect a class member
int SomeClass::protectedInt()
{
    assert(std::this_thread::get_id() == g_thread1.get_id());
    return this->m_theVar;
}
// thread protected wrapper
template <typename T>
class ThreadProtected
{
    std::thread::id m_expected;
    T m_val;
public:
    ThreadProtected(T init, std::thread::id expected)
        : m_expected(expected), m_val(init)
    { }

    T Get()
    {
        assert(std::this_thread::get_id() == m_expected);
        return m_val;
    }
};

// specialization for void
template <>
class ThreadProtected<void>
{
public:
    ThreadProtected(std::thread::id expected)
    {
        assert(std::this_thread::get_id() == expected);
    }
};
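Hypothetical usage of the wrapper above (again assuming g_thread1 is the owning thread):
ThreadProtected<int> guarded(42, g_thread1.get_id());

int v = guarded.Get();   // asserts in a debug build if called from the wrong thread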
assert is old-school. We were actually told to stop using it at work because it was causing resource leaks (the exception our assert implementation threw was being caught high up in the stack). It has the potential to cause debugging headaches because the debug behavior is different from the release behavior. A lot of the time, if the asserted condition is false, there isn't really a good choice of what to do; you usually don't want to continue running the function, but you also don't know what value to return. assert is still very useful when developing code, and I personally use it all the time.
static_assert will not help here because the condition you are checking for (e.g. "Which thread is running this code?") is a runtime condition.
Another note:
Don't put things that you want to be compiled inside an assert. It seems obvious now, but it's easy to do something dumb like
int* p;
assert(p = new (std::nothrow) int); // check that `new` returned a value -- BAD!!
It's good to check the allocation of new, but the allocation won't happen in a release build, and you won't even notice until you start release testing!
int* p;
p = new (std::nothrow) int;
assert(p); // check that `new` returned a value -- BETTER...
Lastly, if you write the protected accessor functions in a class body or in a .h, you can goad the compiler into inlining them.
Update to address the question:
The real question though is where do I PUT an assert macro? Is a
requirement that I write setters and getters for all my thread
protected variables then declare them as inline and hope they get
optimised out in the final release?
You said there are variables that should be checked (in the debug build only) when accessed, to make sure the correct thread is accessing them. So, theoretically, you would want an assert macro before every such access. This is easy if there are only a few places (if this is the case, you can ignore everything I'm about to say). However, if there are so many places that it starts to violate the DRY principle, I suggest writing getters/setters and putting the assert inside (this is what I've casually given examples of above). But while the assert won't add overhead in release mode (since it's conditionally compiled), using extra functions (probably) adds function call overhead. However, if you write them in the .h, there's a good chance they'll be inlined.
Your requirement was a way to do this without release overhead. Now that I've mentioned inlining, I'm obligated to say that the compiler knows best. There are usually compiler-specific ways to force inlining (since the compiler is allowed to ignore the inline keyword), and you should profile the code before trying to inline things. See the answer to this question: Is it good practice to make getters and setters inline?. You can easily check whether the compiler is inlining the function by looking at the assembly. Don't worry, you don't have to be good at assembly: just find the calling function and look for a call to the getter/setter. If the function was inlined, you won't see a call; you'll probably see a mov instead.
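For instance, here is a sketch of the kind of header-defined accessor this describes (the class and member names are made up):
#include <cassert>
#include <thread>

// In the .h: the definition is visible at every call site, so the compiler
// can inline it; in a release build the assert vanishes and the accessor
// typically collapses to a single mov.
class Settings
{
    std::thread::id m_owner;
    int m_value = 0;
public:
    int value() const
    {
        assert(std::this_thread::get_id() == m_owner);
        return m_value;
    }
};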
I'm fairly new to C++ and am really interested in learning more. I have been reading quite a bit, and recently discovered the init/fini ELF sections. I started to wonder if and how one would use the init section to prepopulate objects that would be used at runtime. Say, for example, you wanted to add performance measurements to your code, recording the time, filename, line number, and maybe some ID (a monotonically increasing int, for example) or name.
You would place for example:
PROBE(0,"EventProcessing",__FILE__,__LINE__)
...... //process event
PROBE(1,"EventProcessing",__FILE__,__LINE__)
......//different processing on same event
PROBE(2,"EventProcessing",__FILE__,__LINE__)
The PROBE could be some macro that populates a struct containing this data (maybe in an array/list, etc., using the id as an index).
Would it be possible to have code in the init section that could prepopulate all of this data for each PROBE (except for the time of course), so only the time would need to be retrieved/copied at runtime?
As far as I know, __attribute__((constructor)) cannot be applied to member functions? My initial idea was to create some kind of linked list, with each node pointing to each probe, that code in the init section could iterate over, populating the id, file, line, etc. But that idea assumed I could use a member function that would run in the "init" section, which does not seem possible. Any tips appreciated!
As far as I understand it, you do not actually need an ELF constructor here. Instead, you could emit descriptors for your probes using extended asm statements (emitting data, instead of code). This also involves switching to a dedicated ELF section for the probe descriptors, say __probes.
The linker will concatenate all the probes into an array and generate the special symbols __start___probes and __stop___probes, which you can use from your program to access these probes. See the last paragraph in Input Section Example.
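A minimal sketch of that approach on GCC/Clang with ELF (the probe layout and the PROBE macro here are illustrative, and a section attribute stands in for hand-written extended asm):
#include <cstdio>

struct probe { int id; const char *name; const char *file; int line; };

// Each PROBE drops one descriptor (data, not code) into the "__probes" section.
#define PROBE(id, name) \
    static const probe probe_##id \
        __attribute__((section("__probes"), used)) = { id, name, __FILE__, __LINE__ }

PROBE(0, "EventProcessing");
PROBE(1, "EventProcessing");

// Because "__probes" is a valid C identifier, the linker defines these
// bracketing symbols around the concatenated descriptors.
extern const probe __start___probes[], __stop___probes[];

int main()
{
    for (const probe *p = __start___probes; p != __stop___probes; ++p)
        std::printf("probe %d: %s (%s:%d)\n", p->id, p->name, p->file, p->line);
}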
Systemtap implements something quite similar for its userspace probes:
User Space Probe Implementation
Adding User Space Probing to an Application (heapsort example)
Similar constructs are also used within the Linux kernel for its self-patching mechanism.
There's a pretty simple way to have code run at module load time: use the constructor of a global variable:
struct RunMeSomeCode
{
    RunMeSomeCode()
    {
        // your code goes here
    }
} do_it;
The .init/.fini sections basically exist to implement global constructors/destructors as part of the ABI on some platforms. Other platforms may use different mechanisms, such as _start and _init functions or .init_array/.fini_array and .preinit_array. There are lots of subtle differences between all these methods, and which one to use for what is a question that can really only be answered by the documentation of your target platform. Not all platforms use ELF to begin with…
The main point to understand is that things like the .init/.fini sections in an ELF binary happen way below the level of C++ as a language. A C++ compiler may use these things to implement certain behavior on a certain target platform. On a different platform, a C++ compiler will probably have to use different mechanisms to implement that same behavior. Many compilers will give you tools in the form of language extensions like __attributes__ or #pragmas to control such platform-specific details. But those generally only make sense and will only work with that particular compiler on that particular platform.
You don't need a member function (which gets a this pointer passed as an arg); instead you can simply create constructor-like functions that reference a global array, like
#define PROBE(id, stuff, more_stuff) \
    __attribute__((constructor)) void \
    probeinit##id() { probes[id] = {id, stuff, 0 /*to be written later*/, more_stuff}; }
The trick is having this macro work in the middle of another function. GNU C allows nested functions (GNU C++ does not), but IDK if you can make them constructors.
You don't want to declare a static int dummy##id = something, because then you're adding overhead to the function you profile. (For a non-constant initializer, gcc has to emit a thread-safe run-once locking mechanism.)
Really what you'd like is some kind of separate pass over the source that identifies all the PROBE macros and collects up their args to declare
struct probe global_probes[] = {
    {0, "EventName", 0 /*placeholder*/, filename, linenum},
    {1, "EventName", 0 /*placeholder*/, filename, linenum},
    ...
};
I'm not confident you can make that happen with CPP macros; I don't think it's possible to #define PROBE such that every time it expands, it redefines another macro to tack on more stuff.
But you could easily do that with an awk/perl/python / your fave scripting language program that scans your program and constructs a .c that declares an array with static storage.
Or better (for a single-threaded program): keep the runtime timestamps in one array, and the names and other metadata in a separate array, so the cache footprint of the probes is smaller. For a multi-threaded program, stores to the same cache line from different threads are called false sharing, and create cache-line ping-pong.
So you'd have #define PROBE(id, evname, blah blah) do { probe_times[id] = now(); }while(0)
and leave the handling of the later args to your separate preprocessing.
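A sketch of that split-array layout (all names and file/line values are hypothetical; POSIX clock_gettime provides the timestamp):
#include <time.h>

enum { N_PROBES = 3 };

// Cold metadata: generated once by the separate preprocessing pass.
struct probe_meta { int id; const char *name; const char *file; int line; };
static const probe_meta probe_metas[N_PROBES] = {
    {0, "EventProcessing", "event.cpp", 10},
    {1, "EventProcessing", "event.cpp", 25},
    {2, "EventProcessing", "event.cpp", 40},
};

// Hot data: the only thing touched at runtime, so a probe hit stays
// within one small, dense array.
static timespec probe_times[N_PROBES];

#define PROBE(id) \
    do { clock_gettime(CLOCK_MONOTONIC, &probe_times[id]); } while (0)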
When I design a function in a class, I want to balance the information I can extract from it. Some information may be useful for debug but not necessary as the output of the function. I give the following example:
class A
{
    bool my_func(int arg1, int &output, std::vector<int> &intermediate_vec)
    {
        // do something
    }
};
In the function my_func, std::vector<int> &intermediate_vec is not necessary, as the only information I am interested in is stored in the variable output. However, for debug purposes I am also interested in obtaining intermediate_vec, as it is not convenient to check this variable inside the function for some reason. Therefore, I am considering designing two functions inside class A: one used for debug, and the other for the real application.
class A
{
    // for debug
    bool my_func(int arg1, int &output, std::vector<int> &intermediate_vec)
    {
        // do something
    }

    // invoked by other programs
    bool my_func(int arg1, int &output)
    {
        std::vector<int> intermediate_vec;
        return my_func(arg1, output, intermediate_vec);
    }
};
I am just wondering whether there are better ways to do the job. Thanks.
Use a logging library and log those intermediate values at debug log level instead of collecting them as output.
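For example, a minimal sketch of the idea (not tied to any particular library), with a debug-level macro that compiles away in release builds:
#include <iostream>

#ifndef NDEBUG
#define LOG_DEBUG(expr) (std::cerr << "[debug] " << expr << '\n')
#else
#define LOG_DEBUG(expr) ((void)0)
#endif

// Inside my_func, instead of exposing intermediate_vec to every caller:
// LOG_DEBUG("intermediate_vec has " << intermediate_vec.size() << " entries");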
If you plan on using intermediate_vec in some debug post-processing, it can be tricky. However, if you only plan on using it to print the results, it's easier.
The main thing I dislike in your idea is having // do something, which seems to be exactly the same logic, in two different places. This is very error-prone and grows into a real PIA when you have to maintain a dozen classes with a dozen methods, half of which have some debug copy-cat. Every change in logic has to be done twice, in a coherent manner.
When I came upon a similar problem, I considered the following things to avoid duplicating logic while performing conditional logging and/or additional instrumentation:
#define DEBUG / NDEBUG: you just have one copy of the code, with some pre-processor conditionals.
template <int DEBUG>: basically the same effect, but different semantics.
The template method might complicate the coding a little bit, but it allows using both versions at run time, which might come in handy. The #define method does not alter the API at all, but you really need to think at design time about whether you want some fancy selective or multilevel debugging.
The two-function method was OK in my use cases when I had to have a safe version and a fast version of a routine. The safe one did some checks and then called the fast number cruncher. This was useful when the number cruncher was used in loops or internally, where it was safe to assume the checks could be skipped.
If the debug version is slower (e.g. because you need to initialize and fill a long vector), then you probably do not want to call it in release code. The same goes for logging: if you really need to output one number, but in the debug version you would end up printing megabytes of data (e.g. calculating the norm of a vector and printing the vector itself), you will want conditional logging.
So in overall this would look more like this:
class A
{
    bool my_func(int arg1, int &output, std::vector<int> &intermediate_vec)
    {
        if (DEBUG) { /* fill in the vector */ }
        // do something
        if (DEBUG) { /* print out some fancy stuff */ }
    }

    // invoked by other programs
    bool my_func(int arg1, int &output)
    {
        std::vector<int> intermediate_vec;
        return my_func(arg1, output, intermediate_vec);
    }
};
Of course you can then use the short call in debug mode, but you won't get the vector back; or the full call in no-debug mode, but then intermediate_vec will not be meaningful.
Anything to avoid copy-pasting application logic stuff. I did it and I was very miserable when it came to changing the logic.
Having used gprof and callgrind many times, I have reached the (obvious) conclusion that I cannot use them efficiently when dealing with large (as in a CAD program that loads a whole car) programs. I was thinking that maybe, I could use some C/C++ MACRO magic and somehow build a simple (but nice) logging mechanism. For example, one can call a function using the following macro:
#define CALL_FUN(fun_name, ...) \
    fun_name(__VA_ARGS__);
We could add some clocking/timing stuff before and after the function call, so that every function called with CALL_FUN gets timed, e.g.:
#define CALL_FUN(fun_name, ...) \
    time(&t0); \
    fun_name(__VA_ARGS__); \
    time(&t1);
The variables t0, t1 could be found in a global logging object. That logging object can also hold the calling graph for each function called through CALL_FUN. Afterwards, that object can be written in a (specifically formatted) file, and be parsed from some other program.
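As a sketch, here is the same idea with std::chrono, which gives much better than one-second resolution (GlobalLog::record is a hypothetical stand-in for that global logging object):
#include <chrono>

#define CALL_FUN(fun_name, ...)                             \
    do {                                                    \
        auto t0 = std::chrono::steady_clock::now();         \
        fun_name(__VA_ARGS__);                              \
        auto t1 = std::chrono::steady_clock::now();         \
        /* as in the macro above, any return value */       \
        /* of fun_name is discarded */                      \
        GlobalLog::record(#fun_name, t1 - t0);              \
    } while (0)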
So here comes my (first) question: do you find this approach tractable? If yes, how can it be enhanced, and if not, can you propose a better way to measure time and log call graphs?
A colleague proposed another approach to deal with this problem: annotating each function we care to log with a specific comment. Then, during the make process, a special preprocessor runs, parses each source file, adds logging logic for each function we care to log, creates a new source file with the newly added code, and builds that instead. I guess that reading CALL_FUN... macros (my proposal) all over the place is not the best approach, and his approach would solve this problem. So what is your opinion about this approach?
PS: I am not well versed in the pitfalls of C/C++ macros, so if this can be done using another approach, please say so.
Thank you.
Well, you could do some C++ magic to embed a logging object. Something like
class CDebug
{
public:
    CDebug() { /* ... log somehow ... */ }
    ~CDebug() { /* ... log somehow ... */ }
};
In your functions you then simply write
void foo()
{
    CDebug dbg;
    // ...
    // you could add some debug info
    dbg.heythishappened();
    // ...
}   // now the dtor is called, even if the function is interrupted or returns from elsewhere
I am a bit late, but here is what I am doing for this:
On Windows there is a /Gh compiler switch which makes the compiler insert a call to a hidden _penter function at the start of each function. There is also a switch for getting a _pexit call at the end of each function.
You can utilize this to get callbacks on each function call. Here is an article with more details and sample source code:
http://www.johnpanzer.com/aci_cuj/index.html
I am using this approach in my custom logging system for storing the last few thousand function calls in a ring buffer. This turned out to be useful for crash debugging (in combination with MiniDumps).
Some notes on this:
The performance impact very much depends on your callback code. You need to keep it as simple as possible.
You just need to store the function address and module base address in the log file. You can then later use the Debug Interface Access SDK to get the function name from the address (via the PDB file).
All this works surprisingly well for me.
Many nice industrial libraries have their functions' declarations and definitions wrapped in (normally empty) macros, just in case. If your code is already like that, go ahead and debug your performance problems with some simple asynchronous trace logger. If not, post-insertion of such macros can be unacceptably time-consuming.
I can understand the pain of running a 1Mx1M matrix solver under valgrind, so I would suggest starting with the so-called "Monte Carlo profiling method": start the process and, in parallel, run pstack repeatedly, say once per second. As a result you will have N stack dumps (N can be quite significant). Then the mathematical approach would be to count the relative frequency of each stack and draw conclusions about the most frequent ones. In practice you either immediately see the bottleneck or, if not, you switch to bisection, gprof, and finally to valgrind's toolset.
Let me assume the reason you are doing this is that you want to locate any performance problems (bottlenecks) so you can fix them to get higher performance, as opposed to measuring speed or getting coverage info.

It seems you're thinking the way to do this is to log the history of function calls and measure how long each call takes. There's a different approach. It's based on the idea that the program mainly walks a big call tree. If time is being wasted, it is because the call tree is more bushy than necessary, and during the time that's being wasted, the code that's doing the wasting is visible on the stack. It can be terminal instructions, but is more likely function calls, at almost any level of the stack. Simply pausing the program under a debugger a few times will eventually display it. Anything you see it doing, on more than one stack sample, that you can improve will speed up the program. It works whether the time is being spent in CPU, I/O, or anything else that consumes wall clock time.

What it doesn't show you is tons of stuff you don't need to know. The only way it can fail to show you bottlenecks is if they are very small, in which case the code is pretty near optimal. Here's more of an explanation.
Although I think it will be hard to do anything better than gprof, you can create a special class, LOG for instance, and instantiate it at the beginning of each function you want to log.
class LOG {
public:
    LOG(const char* fn /* , ... */) {
        // log time_t of the beginning of the call
    }
    ~LOG() {   // a destructor cannot take arguments, so save
               // whatever you need to report in the constructor
        // calculate the total time spent, by difference between
        // current time and that saved in the constructor
    }
};

void somefunction() {
    LOG log(__FUNCTION__ /* , __FILE__, ... */);
    // .. do other things
}
Now you can integrate this approach with the preprocessing one you mentioned. Just add something like this at the beginning of each function you want to log:
// ### LOG
and then replace the string automatically in debug builds (shouldn't be hard).
Maybe you should use a profiler. AQTime is a relatively good one for Visual Studio. (If you have VS2010 Ultimate, you already have a profiler.)