How to flush a range of addresses in CPU cache? - c++

I want to test the performance of a userspace program in linux running on x86. To calculate the performance, it is necessary for me to flush specific cache lines to memory (make sure those lines are invalidated and upon the next request there will be a cache miss).
I've already seen suggestions to use cacheflush(2), which is supposed to be a system call, yet g++ complains that it is not declared. Also, I cannot use clflush_cache_range, which apparently can be invoked only from kernel code.
Right now what I tried to do is to use the following code:
static inline void clflush(volatile void *__p)
{
asm volatile("clflush %0" : "+m" (*(volatile char __force *)__p));
}
But this gives the following error upon compilation:
error: expected primary-expression before ‘volatile’
Then I changed it as follows:
static inline void clflush(volatile void *__p)
{
asm volatile("clflush %0" :: "m" (__p));
}
It compiled successfully, but the timing results did not change. I suspect the compiler removed it as an optimization.
Does anyone have any idea how I can solve this problem?

The second one flushes the memory containing the pointer __p, which is on the stack, which is why it doesn’t have the effect you want.
The problem with the first one is that it uses the macro __force, which is defined in the Linux kernel and is unneeded here. (What does the __attribute__((force)) do?)
If you remove __force, it will do what you want.
(You should also change it to avoid using the variable name __p, which is a reserved identifier.)
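A minimal sketch of the corrected function, assuming GCC or Clang on x86 and a 64-byte cache line (an assumption; check your CPU). The names clflush_line, flush_fence, flush_range and kAssumedLineSize are hypothetical and not from the question. The mfence after the loop is an addition so the flushes are ordered before whatever you time next.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
// Flush the cache line that contains *p. The "+m" operand is the pointed-to
// byte, so the instruction receives the address held in p, not the address
// of the pointer variable itself.
static inline void clflush_line(volatile void *p)
{
    asm volatile("clflush %0" : "+m"(*(volatile char *)p));
}

// Order the flushes before later loads and stores.
static inline void flush_fence()
{
    asm volatile("mfence" ::: "memory");
}
#else
// Placeholders so the sketch still compiles on other targets; they flush nothing.
static inline void clflush_line(volatile void *) {}
static inline void flush_fence() {}
#endif

// Hypothetical helper: flush every line overlapping [start, start + len).
// Returns the number of lines it issued a flush for.
static inline std::size_t flush_range(volatile void *start, std::size_t len)
{
    const std::uintptr_t kAssumedLineSize = 64;  // assumption, not queried from the CPU
    if (len == 0) {
        return 0;
    }
    std::uintptr_t addr =
        reinterpret_cast<std::uintptr_t>(start) & ~(kAssumedLineSize - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(start) + len;
    std::size_t lines = 0;
    for (; addr < end; addr += kAssumedLineSize) {
        clflush_line(reinterpret_cast<volatile void *>(addr));
        ++lines;
    }
    flush_fence();
    return lines;
}
```

Calling flush_range(buffer, size) just before the timed access should make that access miss in the cache, provided nothing touches the buffer in between.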

Related

MSVC optimizer saves and restores XMM SIMD registers on an early-out path through a function. Why? [duplicate]

In C, if I have a function call that looks like
// main.c
...
do_work_on_object(object, arg1, arg2);
...
// object.c
void do_work_on_object(struct object_t *object, int arg1, int arg2)
{
if(object == NULL)
{
return;
}
// do lots of work
}
then the compiler will generate a lot of stuff in main.o to save state, pass parameters (hopefully in registers in this case), and restore state.
However, at link time it can be observed that arg1 and arg2 are not used in the quick-return path, so the clean-up and state restoration can be short-circuited. Do linkers tend to do this kind of thing automatically, or would one need to turn on link-time optimization (LTO) to get that kind of thing to work?
(Yes, I could inspect the disassembled code, but I'm interested in the behaviours of compilers and linkers in general, and on multiple architectures, so hoping to learn from others' experience.)
Assuming that profiling shows this function call is worth optimizing, should we expect the following code to be noticeably faster (e.g. without the need to use LTO)?
// main.c
...
if(object != NULL)
{
do_work_on_object(object, arg1, arg2);
}
...
// object.c
void do_work_on_object(struct object_t *object, int arg1, int arg2)
{
assert(object != NULL); // generates no code in release build
// do lots of work
}
Some compilers (like GCC and clang) are able to do "shrink-wrap" optimization to delay saving call-preserved regs until after a possible early-out, if they're able to spot the pattern. But some don't, e.g. apparently MSVC 16.11 still doesn't.
I don't think any do partial inlining of just the early-out check into the caller, to avoid even the overhead of arg-passing and the call / ret itself.
Since compiler/linker support for this is not universal and not always successful even for shrink-wrapping, you can write your code in a way that gets much of the benefit, at the cost of splitting the logic of your function into two places.
If you have a fast-path that takes hardly any code, but happens often enough to matter, put that part in a header so it gets inlined, with a fallback to calling the rest of the function (which you make private, so it can assume that any checks in the inlined part are already done).
e.g. par2's routine that processes a block of data has a fast-path for when the galois16 factor is zero. (dst[i] += 0 * src[i] is a no-op, even when * is a multiply in Galois16, and += is a GF16 add (i.e. a bitwise XOR)).
Note how the commit in question renames the old function to InternalProcess, and adds a new template<class g> inline bool ReedSolomon<g>::Process that checks for the fast-path, and otherwise calls InternalProcess. (as well as making a bunch of unrelated whitespace changes, and some ifdefs... It was originally a 2006 CVS commit.)
The comment in the commit claims an overall 8% speed gain for repairing.
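Applied to the question's function, the split described above might look like the following sketch. The name do_work_on_object_slow and the total field are hypothetical, chosen only for illustration; in a real project the inline wrapper would sit in the header and the slow path in its own source file.

```cpp
#include <cassert>

struct object_t {
    int total = 0;  // hypothetical field, only here to make the work observable
};

// "Private" slow path. It may assume that the checks in the inline wrapper
// have already been done.
void do_work_on_object_slow(object_t *object, int arg1, int arg2);

// Header part: the early-out check is inlined into every caller, so the
// NULL case costs a compare and branch instead of a full call.
inline void do_work_on_object(object_t *object, int arg1, int arg2)
{
    if (object == nullptr) {
        return;
    }
    do_work_on_object_slow(object, arg1, arg2);
}

// Source-file part.
void do_work_on_object_slow(object_t *object, int arg1, int arg2)
{
    assert(object != nullptr);     // compiled out when NDEBUG is defined
    object->total += arg1 * arg2;  // stand-in for "do lots of work"
}
```

Callers keep writing do_work_on_object(object, arg1, arg2); the logic is split across two places, which is the cost mentioned above.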
Neither the setup nor the cleanup code can be short-circuited, because the resulting compiled code is static, and it doesn't know what will happen when the program gets executed. So the compiler will always have to set up the whole parameter stack.
Think of two situations: in one, object is NULL; in the other, it is not. How would the assembly code know whether to put the rest of the arguments on the stack? Especially as the caller is the one responsible for placing the arguments at their proper location (stack or registers).

inside naked function - how to do simple assignment

This is the beginning of a function that already exists and works; the commented line is my addition and its purpose is to toggle a pin.
inline __attribute__((naked))
void CScheduler::SwapToThread(void* pNew, void* pPrev)
{
//*(volatile DWORD*)0x400FF08C = (1 << 14);
if (pPrev != NULL)
{
if (pPrev == this) // Special case to save scheduler stack on startup
{
asm("mov lr,%0"::"p"(&CScheduler_Run_Exit)); // load r1 with schedulers End thread
asm("orr lr, 1");
When I uncomment my addition, my hard fault handler executes. I get it has something to do with this being a naked function but I don't understand why a simple assignment causes a problem.
Two questions:
Why does this line trigger the hard fault?
How can I perform this assignment inside this function?
It was only luck that your previous version of the function happened to work without crashing.
The only thing that can safely be put inside a naked function is a pure Basic Asm statement. https://gcc.gnu.org/onlinedocs/gcc/ARM-Function-Attributes.html. You can split it up into multiple Basic Asm statements, instead of asm("insn \n\t" / "insn2 \n\t" / ...);, but you have to write the entire function in asm yourself.
While using extended asm or a mixture of basic asm and C code may appear to work, they cannot be depended upon to work reliably and are not supported.
If you want to run C++ code from a naked function, you could call a regular function from the asm (call on x86, bl on ARM, jal on MIPS, etc.), following the standard calling convention.
As for the specific reason in this case? Maybe creating that address in a register stepped on the function args, leading to the branches going wrong? Inspect the generated asm if you want, but it's 100% unsupported.
Or maybe it ended up using more registers, and since it's naked didn't properly save/restore call-preserved registers? I haven't looked at the code-gen myself for naked functions.
Are you sure this function needs to be naked? I guess that's because you manipulate lr to return to the new context.
If you don't want to just write more logic in asm, maybe have this function's caller do more work (and maybe pass it pointer and/or boolean args telling it more simply what it needs to do, so your inputs are already in registers, and you don't need to access globals).
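As a rough illustration of calling a regular function from a naked one, here is a sketch for x86-64 Linux with GCC or Clang rather than the ARM target in the question, so that it can be run on a desktop. On ARM the call instruction would be bl and the stack handling would differ. The names toggle_pin, swap_entry and fake_pin_register are hypothetical; fake_pin_register stands in for the memory-mapped register at 0x400FF08C.

```cpp
#include <cstdint>

// Hypothetical stand-in for the memory-mapped GPIO register.
volatile std::uint32_t fake_pin_register = 0;

// Ordinary function: the compiler emits a normal prologue and epilogue, so
// plain C++ assignments are safe here. extern "C" keeps the symbol unmangled
// and "used" keeps it alive, since the compiler cannot see the asm reference.
extern "C" __attribute__((noinline, used)) void toggle_pin()
{
    fake_pin_register = (1u << 14);
}

#if defined(__x86_64__) && defined(__linux__)
// Naked function: nothing but Basic Asm. The stack pointer is adjusted by 8
// so it is 16-byte aligned at the call, as the x86-64 System V ABI requires.
extern "C" __attribute__((naked)) void swap_entry()
{
    asm("sub $8, %rsp\n\t"
        "call toggle_pin\n\t"
        "add $8, %rsp\n\t"
        "ret");
}
#else
// Fallback for other targets, which need their own asm; not naked.
extern "C" void swap_entry()
{
    toggle_pin();
}
#endif
```

The called function may clobber any call-clobbered register, so asm that needs to keep values in such registers across the call has to save and restore them itself.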

Patch C/C++ function to just return without execution

I want to avoid one system function executing in a large project. It is impossible to redefine it or add some ifdef logic. So I want to patch the code to just the ret operation.
The functions are:
void __cdecl _wassert(const wchar_t *, const wchar_t *, unsigned);
and:
void __dj_assert(const char *, const char *, int, const char *) __attribute__((__noreturn__));
So I need to patch the first one on Visual C++ compiler, and the second one on GCC compiler.
Can I just write the ret instruction directly at the address of the _wassert/__dj_assert function, for x86/x64?
UPDATE:
I just wanna modify function body like this:
*_wassert = `ret`;
Or maybe copy another function body like this:
void __cdecl _wassert_empty(const wchar_t *, const wchar_t *, unsigned)
{
}
for (int i = 0; i < sizeof(void*); i++) {
((char*)_wassert)[i] = ((char*)_wassert_empty)[i];
}
UPDATE 2:
I really don't understand why there are so many objections to silencing asserts. In fact, there are no asserts in RELEASE mode, but nobody cares. I just want to be able to turn asserts on and off in DEBUG mode.
You need to understand the calling conventions for your particular processor ISA and system ABI, in this case the x86 and x86-64 calling conventions.
Some calling conventions require more than a single ret machine instruction in the epilogue, and you have to account for that. Also, a function's code usually resides in a read-only code segment, so you'll need some dirty tricks to make it writable before you can patch it.
You could compile a no-op function of the same signature, and ask the compiler to show the emitted assembler code (e.g. with gcc -O -Wall -fverbose-asm -S if using GCC....)
On Linux you might use dynamic linker LD_PRELOAD tricks. If using a recent GCC you might perhaps consider customizing it with MELT, but I don't think it is worthwhile in your particular case...
However, you apparently have some assert failure. It is very unlikely that your program could continue without any undefined behavior. So practically speaking, your program will very likely crash elsewhere with your proposed "fix", and you'll lose more of your time with it.
Better take enough time to correct the original bug, and improve your development process. Your way is postponing a critical bug correction, and you are extremely likely to spend more time avoiding that bug fix than dealing with it properly (and finding it now, not later) as you should. Avoid increasing your technical debt and making your code base even more buggy and rotten.
My feeling is that you are going nowhere (except to a big failure) with your approach of patching the binary to avoid asserts. You should find out why they are violated, and improve the code (either remove the obsolete assert, or improve it, or correct the bug elsewhere that the assert has detected).
On GNU/Linux you can use the --wrap option like this:
gcc source.c -Wl,--wrap,functionToPatch -o prog
and your source must add the wrapper function:
void __wrap_functionToPatch(void) {} // simply returns
Parameters and return values as needed for your function.

gdb - re-setting a const

I have
const int MAX_CONNECTIONS = 500;
//...
if(clients.size() < MAX_CONNECTIONS) {
//...
}
I'm trying to find the "right" choice for MAX_CONNECTIONS. So I fire up gdb and set MAX_CONNECTIONS = 750. But it seems my code isn't responding to this change. I wonder if it's because the const int was resolved at compile time even though it wound up getting bumped at runtime. Does this sound right, and, using GDB is there any way I can bypass this effect without having to edit the code in my program? It takes a while just to warm up to 500.
I suspect that the compiler, seeing that the variable is const, is inlining the constant into the assembly and not having the generated code actually read the value of the MAX_CONNECTIONS variable. The C++ spec is worded in a way where if a variable of primitive type is explicitly marked const, the compiler can make certain assumptions about it for the purposes of optimization, since any attempt to change that constant is either (1) illegal or (2) results in undefined behavior.
If you want to use GDB to do things like this, consider marking the variable volatile rather than const to indicate to the compiler that it shouldn't optimize it. Alternatively, have this information controlled by some other data source (say, a configuration option inside a file) so that you aren't blasting the program's memory out from underneath it in order to change the value.
Hope this helps!
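A small sketch of the volatile suggestion above. The helper can_accept is hypothetical and stands in for the clients.size() < MAX_CONNECTIONS check from the question.

```cpp
#include <cstddef>

// volatile instead of const: the compiler must perform a real load from
// memory at every use, so a value written by a debugger is picked up.
volatile int MAX_CONNECTIONS = 500;

// Hypothetical helper standing in for the check in the question.
bool can_accept(std::size_t current_clients)
{
    const int limit = MAX_CONNECTIONS;  // one volatile read per call
    return limit > 0 && current_clients < static_cast<std::size_t>(limit);
}
```

With this in place, a gdb command such as set var MAX_CONNECTIONS = 750 should take effect the next time can_accept runs.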
By telling it it's const, you're telling the compiler it has freedom to not load the value, but to build it directly into the code when possible. An allocated copy may still exist for those times when the particular instructions chosen need to load a value rather than having an immediate value, or it could be omitted by the compiler as well. That's a bit of a loose answer short on standardese, but that's the basic idea.
As this post is quite old, my answer is more like a reference to my future self. Assuming you compiled in debug mode, running the following expression in the debugger (lldb in my case) works:
const_cast<int&>(MAX_CONNECTIONS) = 750
In case you have to change the constant often, e.g. in a loop, set a breakpoint and evaluate the expression each time the breakpoint is hit:
breakpoint set <location>
breakpoint command add <breakpoint_id>
const_cast<int&>(MAX_CONNECTIONS) = 750
DONE

Does an arbitrary instruction pointer reside in a specific function?

I have a very difficult problem I'm trying to solve: Let's say I have an arbitrary instruction pointer. I need to find out if that instruction pointer resides in a specific function (let's call it "Foo").
One approach to this would be to try to find the start and ending bounds of the function and see if the IP resides in it. The starting bound is easy to find:
void *start = &Foo;
The problem is, I don't know how to get the ending address of the function (or how "long" the function is, in bytes of assembly).
Does anyone have any ideas how you would get the "length" of a function, or a completely different way of doing this?
Let's assume that there is no SEH or C++ exception handling in the function. Also note that I am on a win32 platform, and have full access to the win32 api.
This won't work. You're presuming functions are contiguous in memory and that one address will map to one function. The optimizer has a lot of leeway here and can move code from functions around the image.
If you have PDB files, you can use something like the dbghelp or DIA APIs to figure this out, for instance SymFromAddr. There may be some ambiguity here, as a single address can map to multiple functions.
I've seen code that tries to do this before with something like:
#pragma optimize("", off)
void Foo()
{
}
void FooEnd()
{
}
#pragma optimize("", on)
And then FooEnd-Foo was used to compute the length of function Foo. This approach is incredibly error prone and still makes a lot of assumptions about exactly how the code is generated.
Look at the *.map file which can optionally be generated by the linker when it links the program, or at the program's debug (*.pdb) file.
OK, I haven't done assembly in about 15 years. Back then, I didn't do very much. Also, it was 680x0 asm. BUT...
Don't you just need to put a label before and after the function, take their addresses, subtract them for the function length, and then just compare the IP? I've seen the former done. The latter seems obvious.
If you're doing this in C, look first for debugging support --- ChrisW is spot on with map files, but also see if your C compiler's standard library provides anything for this low-level stuff -- most compilers provide tools for analysing the stack etc., for instance, even though it's not standard. Otherwise, try just using inline assembly, or wrapping the C function with an assembly file and an empty wrapper function with those labels.
The simplest solution is to maintain a state variable:
volatile int FOO_is_running = 0;
int Foo( int par ){
FOO_is_running = 1;
/* do the work */
FOO_is_running = 0;
return 0;
}
Here's how I do it, but it's using gcc/gdb.
$ gdb ImageWithSymbols
gdb> info line * 0xYourEIPhere