Emitting a pseudo-instruction during relaxation - llvm

In an LLVM backend, I'd like to emit a pseudo-instruction as the relaxation of a real (non-pseudo) instruction. The problem is that I can't find a way to add my pseudo-instruction expander pass to my TargetPassConfig such that it runs on the output of my AsmBackend::relaxInstruction.
Looking at MCAssembler::relaxInstruction, which seems to be the driver for relaxation, it passes the relaxation's result directly to the instruction encoder:
bool MCAssembler::relaxInstruction(MCAsmLayout &Layout,
                                   MCRelaxableFragment &F) {
  if (!fragmentNeedsRelaxation(&F, Layout))
    return false;

  ++stats::RelaxedInstructions;

  // FIXME-PERF: We could immediately lower out instructions if we can tell
  // they are fully resolved, to avoid retesting on later passes.

  // Relax the fragment.
  MCInst Relaxed;
  getBackend().relaxInstruction(F.getInst(), F.getSubtargetInfo(), Relaxed);

  // Encode the new instruction.
  //
  // FIXME-PERF: If it matters, we could let the target do this. It can
  // probably do so more efficiently in many cases.
  SmallVector<MCFixup, 4> Fixups;
  SmallString<256> Code;
  raw_svector_ostream VecOS(Code);
  getEmitter().encodeInstruction(Relaxed, VecOS, Fixups, F.getSubtargetInfo());

  // Update the fragment.
  F.setInst(Relaxed);
  F.getContents() = Code;
  F.getFixups() = Fixups;

  return true;
}
To me, this implies that I'm "on my own" in ensuring that my backend's relaxInstruction doesn't emit any pseudo-instructions. So how do I hook my pseudo-instruction expander pass into my relaxInstruction?

Inside your target's CodeEmitter::encodeInstruction it usually ends up with a call to getBinaryCodeForInstr. Before that call, could you expand your pseudo-instruction into the two real ones?
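To make that concrete, here is a rough sketch of what that could look like. The target name, the opcodes, the operand layout, and the 4-byte instruction size are all made up; the real expansion logic depends entirely on your backend, so treat this as illustrative only:

void MyTargetMCCodeEmitter::encodeInstruction(const MCInst &MI, raw_ostream &OS,
                                              SmallVectorImpl<MCFixup> &Fixups,
                                              const MCSubtargetInfo &STI) const {
  // Hypothetical: catch the pseudo that relaxInstruction produced and expand
  // it here, since getBinaryCodeForInstr has no encoding for pseudos.
  if (MI.getOpcode() == MyTarget::LONG_BRANCH_PSEUDO) {
    MCInst First, Second;
    First.setOpcode(MyTarget::REAL_OPCODE_A);   // made-up real opcodes
    Second.setOpcode(MyTarget::REAL_OPCODE_B);
    First.addOperand(MI.getOperand(0));
    Second.addOperand(MI.getOperand(0));

    // Encode both real instructions in sequence into the fragment's buffer.
    for (const MCInst *Part : {&First, &Second}) {
      uint64_t Bits = getBinaryCodeForInstr(*Part, Fixups, STI);
      support::endian::write(OS, static_cast<uint32_t>(Bits), support::little);
    }
    return;
  }

  // Every real instruction takes the usual path.
  uint64_t Bits = getBinaryCodeForInstr(MI, Fixups, STI);
  support::endian::write(OS, static_cast<uint32_t>(Bits), support::little);
}

One thing to watch: fixup offsets are relative to the start of the fragment, so any fixups created while encoding the second instruction would need their offsets adjusted accordingly.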

Related

Optimal place to call __syncthreads()

Given that the code is correct, is there some potential performance benefit in calling __syncthreads as late as possible, as early as possible, or does it not matter? Here's an example with comments that demonstrate the question:
__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    if (threadIdx.x < 64) {
        shared_data[threadIdx.x] = data[threadIdx.x];
    }
    // Option #1: Place the call to `__syncthreads()` here?

    // Here is a lot of code that doesn't use `shared_data`.

    // Option #2: Place the call to `__syncthreads()` here?

    // Here is some code that uses `shared_data`.
}
What you are facing is a split between where the writes are made and where they should be visible to the entire block.
NVIDIA has recently introduced a mechanism for just that: arrive + wait.
You start with initializing a barrier:
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
Then, at your "option #1" position, you arrive at the barrier you initialized:
__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
then you have your unrelated code, and then finally, wait for everyone to arrive at your "option 2" position:
bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
... but note that this call doesn't block, i.e. you'll have to actively "wait".
Alternatively, you can use NVIDIA's C++ wrappers for this mechanism, presented here.
Note that this functionality is relatively new, with Compute Capability at least 7.0 required, and 8.0 or later recommended.
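Applied to the kernel in the question, and using the C++ wrapper mentioned above (cuda::barrier from libcu++) rather than the raw __mbarrier primitives, a sketch could look like this. It assumes one arrival per thread in the block and a recent toolkit (CUDA 11 or later); treat it as illustrative rather than tested:

#include <cuda/barrier>

__global__ void kernel(const float* data) {
    __shared__ float shared_data[64];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0)
        init(&bar, blockDim.x);   // expected_count = one arrival per thread
    __syncthreads();              // make the initialized barrier visible to the block

    if (threadIdx.x < 64)
        shared_data[threadIdx.x] = data[threadIdx.x];

    // "Option #1" position: announce that this thread's writes are done.
    auto token = bar.arrive();

    // ... a lot of code that doesn't use `shared_data` ...

    // "Option #2" position: wait until every thread in the block has arrived.
    bar.wait(std::move(token));

    // ... code that uses `shared_data` ...
}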

Clean-up a timed-out future

I need to run a function with a timeout. If it didn't return within the given timeout, I need to discard it and fallback to a different method.
The following is a (greatly) simplified code sample that highlights the problem. (In reality, this is an always-running, highly available application. There I first read from the cache and try to read from the database only if the cache has stale data. However, if the database query takes too long, I need to continue with the stale data.)
My question is: in the case where the future read timed out, do I have to handle the clean-up of the future separately (i.e., keep a copy and check from time to time whether it is ready), or can I simply ignore it (i.e., keep the code as is)?
/* DB query can be time-consuming, but the result is fresh */
string readFromDatabase(){
    // ...
    // auto dbValue = db.query("select name from users where id=" + _id);
    // ...
    return dbValue;
}

/* Cache query is instant, but the result could be stale */
string readFromLocalCache(){
    // ...
    // auto cachedVal = _cache[_id];
    // ...
    return cachedVal;
}

string getValue(){
    // Idea:
    // - Try reading from the database.
    // - If the db query didn't return within 1 second, fall back to the other method.
    using namespace std::chrono_literals;
    auto fut = std::async(std::launch::async, [&](){ return readFromDatabase(); });
    switch (fut.wait_for(1s)){
        case std::future_status::ready: // query returned within allotted time
        {
            auto freshVal = fut.get();
            // update cache
            return freshVal;
        }
        case std::future_status::timeout: // timed out, fallback ------ (*)
        {
            break;
        }
        case std::future_status::deferred: // should not be reached
        {
            break;
        }
    }
    return readFromLocalCache();
    // question: what happens to `fut`?
}
My question is: in the case where the future read timed out, do I have to handle the clean-up of the future separately (i.e., keep a copy and check from time to time whether it is ready), or can I simply ignore it (i.e., keep the code as is)?
From my personal perspective, it depends on what you want. Under your current (minimal) implementation, the getValue function will be blocked by the future's destructor (see the cppreference page and some related SO questions).
If you do not want the blocking behavior, there are some solutions, as proposed in this question, like:
move the future to some outside scope (see the sketch after this list)
use a detached executor and some handy code/data structure to handle the return status
see if you can replace the future with some timeout support I/O operations like select/poll
etc.
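For the first option, a minimal sketch (names made up) of "moving the future to some outside scope" could look like this: getValue() hands the still-running future off to longer-lived storage instead of letting its destructor run at the end of the function, and some housekeeping code reaps finished entries later.

#include <algorithm>
#include <chrono>
#include <future>
#include <mutex>
#include <string>
#include <vector>

std::vector<std::future<std::string>> pending; // outlives getValue()
std::mutex pending_mtx;

// Called from the timeout branch instead of just breaking out of the switch.
void park(std::future<std::string>&& fut) {
    std::lock_guard<std::mutex> lk(pending_mtx);
    pending.push_back(std::move(fut));
}

// Called from time to time (e.g. once per update cycle) to drop finished queries.
void reap_finished() {
    using namespace std::chrono_literals;
    std::lock_guard<std::mutex> lk(pending_mtx);
    pending.erase(std::remove_if(pending.begin(), pending.end(),
                                 [](std::future<std::string>& f) {
                                     return f.wait_for(0s) == std::future_status::ready;
                                 }),
                  pending.end());
}

In the timeout case you would then do park(std::move(fut)); break; — the blocking wait moves out of getValue(), while the async task still runs to completion in the background.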

Pointer passed to function changes unexpectedly

I'm designing a preloader-based lock tracing utility that attaches to Pthreads, and I've run into a weird issue. The program works by providing wrappers that replace relevant Pthreads functions at runtime; these do some logging, and then pass the args to the real Pthreads function to do the work. They do not modify the arguments passed to them, obviously. However, when testing, I discovered that the condition variable pointer passed to my pthread_cond_wait() wrapper does not match the one that gets passed to the underlying Pthreads function, which promptly crashes with "futex facility returned an unexpected error code," which, from what I've gathered, usually indicates an invalid sync object passed in. Relevant stack trace from GDB:
#8 __pthread_cond_wait (cond=0x7f1b14000d12, mutex=0x55a2b961eec0) at pthread_cond_wait.c:638
#9 0x00007f1b1a47b6ae in pthread_cond_wait (cond=0x55a2b961f290, lk=0x55a2b961eec0)
at pthread_trace.cpp:56
I'm pretty mystified. Here's the code for my pthread_cond_wait() wrapper:
int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
    // log arrival at wait
    the_tracer.add_event(lktrace::event::COND_WAIT, (size_t) cond);
    // run pthreads function
    GET_REAL_FN(pthread_cond_wait, int, pthread_cond_t*, pthread_mutex_t*);
    int e = REAL_FN(cond, lk);
    if (e == 0) the_tracer.add_event(lktrace::event::COND_LEAVE, (size_t) cond);
    else {
        the_tracer.add_event(lktrace::event::COND_ERR, (size_t) cond);
    }
    return e;
}

// GET_REAL_FN is defined as:
#define GET_REAL_FN(name, rtn, params...) \
    typedef rtn (*real_fn_t)(params); \
    static const real_fn_t REAL_FN = (real_fn_t) dlsym(RTLD_NEXT, #name); \
    assert(REAL_FN != NULL) // semicolon absence intentional
And here's the code for __pthread_cond_wait in glibc 2.31 (this is the function that gets called if you call pthread_cond_wait normally; it has a different name because of versioning. The stack trace above confirms that this is the function REAL_FN points to):
int
__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  /* clockid is unused when abstime is NULL. */
  return __pthread_cond_wait_common (cond, mutex, 0, NULL);
}
As you can see, neither of these functions modifies cond, yet it is not the same in the two frames. Examining the two different pointers in a core dump shows that they point to different contents, as well. I can also see in the core dump that cond does not appear to change in my wrapper function (i.e. it's still equal to 0x5... in frame 9 at the crash point, which is the call to REAL_FN). I can't really tell which pointer is correct by looking at their contents, but I'd assume it's the one passed in to my wrapper from the target application. Both pointers point to valid segments for program data (marked ALLOC, LOAD, HAS_CONTENTS).
My tool is definitely causing the error somehow, the target application runs fine if it is not attached. What am I missing?
UPDATE: Actually, this doesn't appear to be what's causing the error, because calls to my pthread_cond_wait() wrapper succeed many times before the error occurs, and exhibit similar behavior (pointer value changing between frames without explanation) each time. I'm leaving the question open, though, because I still don't understand what's going on here and I'd like to learn.
UPDATE 2: As requested, here's the code for tracer.add_event():
// add an event to the calling thread's history
// hist_entry ctor gets timestamp & stack trace
void tracer::add_event(event e, size_t obj_addr) {
    size_t tid = get_tid();
    hist_map::iterator hist = histories.contains(tid);
    assert(hist != histories.end());
    hist_entry ev (e, obj_addr);
    hist->second.push_back(ev);
}

// hist_entry ctor:
hist_entry::hist_entry(event e, size_t obj_addr) :
    ts(chrono::steady_clock::now()), ev(e), addr(obj_addr) {

    // these are set in the tracer ctor
    assert(start_addr && end_addr);
    void* buf[TRACE_DEPTH];
    int v = backtrace(buf, TRACE_DEPTH);
    int a = 0;
    // find first frame outside of our own code
    while (a < v && start_addr < (size_t) buf[a] &&
           end_addr > (size_t) buf[a]) ++a;
    // skip requested amount of frames
    a += TRACE_SKIP;
    if (a >= v) a = v-1;
    caller = buf[a];
}
histories is a lock-free concurrent hashmap from libcds (mapping tid -> per-thread vectors of hist_entry), and its iterators are guaranteed to be thread-safe as well. The GNU docs say backtrace() is thread-safe, and there are no data races mentioned in the cppreference docs for steady_clock::now(). get_tid() just calls pthread_self() using the same method as the wrapper functions, and casts its result to size_t.
Hah, figured it out! The issue is that Glibc exposes multiple versions of pthread_cond_wait(), for backwards compatibility. The version I reproduce in my question is the current version, the one we want to call. The version that dlsym() was finding is the backwards-compatible version:
int
__pthread_cond_wait_2_0 (pthread_cond_2_0_t *cond, pthread_mutex_t *mutex)
{
  if (cond->cond == NULL)
    {
      pthread_cond_t *newcond;

      newcond = (pthread_cond_t *) calloc (sizeof (pthread_cond_t), 1);
      if (newcond == NULL)
        return ENOMEM;

      if (atomic_compare_and_exchange_bool_acq (&cond->cond, newcond, NULL))
        /* Somebody else just initialized the condvar.  */
        free (newcond);
    }

  return __pthread_cond_wait (cond->cond, mutex);
}
As you can see, this version tail-calls the current one, which is probably why this took so long to detect: GDB is normally pretty good at detecting frames elided by tail calls, but I'm guessing it didn't detect this one because the functions have the "same" name (and the error doesn't affect the mutex functions because they don't expose multiple versions). This blog post goes into much more detail, coincidentally specifically about pthread_cond_wait(). I stepped through this function many times while debugging and sort of tuned it out, because every call into glibc is wrapped in multiple layers of indirection; I only realized what was going on when I set a breakpoint on the pthread_cond_wait symbol, instead of a line number, and it stopped at this function.
Anyway, this explains the changing pointer phenomenon: what happens is that the old, incorrect function gets called, reinterprets the pthread_cond_t object as a struct containing a pointer to a pthread_cond_t object, allocates a new pthread_cond_t for that pointer, and then passes the newly allocated one to the new, correct function. The frame of the old function gets elided by the tail call, and in a GDB backtrace taken after leaving the old function it looks like the correct function was called directly from my wrapper, with a mysteriously changed argument.
The fix for this was simple: GNU provides the libdl extension dlvsym(), which is like dlsym() but also takes a version string. Looking for pthread_cond_wait with version string "GLIBC_2.3.2" solves the problem. Note that these versions do not usually correspond to the current glibc version (e.g. pthread_create()/pthread_exit() have version string "GLIBC_2.2.5"), so they need to be looked up on a per-function basis. The correct string can be determined either by looking at the compat_symbol() or versioned_symbol() macros that are somewhere near the function definition in the glibc source, or by using readelf to see the names of the symbols in the compiled library (mine has "pthread_cond_wait@@GLIBC_2.3.2" and "pthread_cond_wait@GLIBC_2.2.5").
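For reference, a dlvsym()-based variant of the wrapper from the question might look like this (a sketch only; the logging calls are left out, and _GNU_SOURCE must be defined before the headers are pulled in):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <cassert>

// Like GET_REAL_FN, but asks for a specific symbol version.
#define GET_REAL_VERSIONED_FN(name, ver, rtn, params...) \
    typedef rtn (*real_fn_t)(params); \
    static const real_fn_t REAL_FN = (real_fn_t) dlvsym(RTLD_NEXT, #name, ver); \
    assert(REAL_FN != NULL) // semicolon absence intentional

int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
    // "GLIBC_2.3.2" is the default version of pthread_cond_wait in this glibc;
    // other functions need their own version strings, looked up as described above.
    GET_REAL_VERSIONED_FN(pthread_cond_wait, "GLIBC_2.3.2",
                          int, pthread_cond_t*, pthread_mutex_t*);
    return REAL_FN(cond, lk);
}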

mutex lock is not unlocking

I use a mutex to lock and unlock a variable: the getter is called continuously from the main thread's update cycle, and the setter is called from another thread. The code for the setter and getter is below.
Definition:

bool _flag;
System::Mutex m_flag;

Calls:

#define LOCK(MUTEX_VAR) MUTEX_VAR.Lock();
#define UNLOCK(MUTEX_VAR) MUTEX_VAR.Unlock();

void LoadingScreen::SetFlag(bool value)
{
    LOCK(m_flag);
    _flag = value;
    UNLOCK(m_flag);
}

bool LoadingScreen::GetFlag()
{
    LOCK(m_flag);
    bool value = _flag;
    UNLOCK(m_flag);
    return value;
}
This works well half the time, but at times the variable gets locked on calling SetFlag and hence is never set, thereby disturbing the flow of the code.
Can anyone tell me how to solve this issue?
EDIT:
This is the workaround I finally used. It is just a temporary solution; if anyone has a better answer, please let me know.
bool _flag;
bool accessingFlag = false;

void LoadingScreen::SetFlag(bool value)
{
    if(!accessingFlag)
    {
        _flag = value;
    }
}

bool LoadingScreen::GetFlag()
{
    accessingFlag = true;
    bool value = _flag;
    accessingFlag = false;
    return value;
}
The issue you have (which user1192878 alludes to) is due to delayed compiler loads/stores. You need to use memory barriers to implement the code. You may declare _flag as volatile bool, but this is not needed if compiler memory barriers are used, at least on a single-CPU system. Hardware barriers (covered just below in the Wikipedia link) are needed for multi-CPU solutions; they ensure the local processor's memory/cache is seen by all CPUs. The mutex and other interlocks are not needed in this case; what exactly do they accomplish? They just create the possibility of deadlock and are not needed.
bool _flag;

#define memory_barrier() __asm__ __volatile__ ("" ::: "memory") /* GCC */

void LoadingScreen::SetFlag(bool value)
{
    _flag = value;
    memory_barrier(); /* Ensure write happens immediately, even for in-lines */
}

bool LoadingScreen::GetFlag()
{
    bool value = _flag;
    memory_barrier(); /* Ensure read happens immediately, even for in-lines */
    return value;
}
Mutexes are only needed when multiple values are being set at the same time. You may also change the bool type to sig_atomic_t or LLVM atomics. However, this is rather pedantic, as bool will work on almost every practical CPU architecture. Cocoa's concurrency pages also have some information on alternative APIs for doing the same thing. I believe gcc's inline assembler uses the same syntax as Apple's compilers, but that could be wrong.
There are some limitations to this API. The instant GetFlag() returns, something can call SetFlag(), so GetFlag()'s return value is already stale. If you have multiple writers, then you can easily miss one SetFlag(). This may be important if the higher-level logic is prone to ABA problems. However, all of these issues exist with or without mutexes. The memory barrier only ensures that the compiler/CPU does not cache the SetFlag() value for a prolonged time and that GetFlag() re-reads the value. Declaring volatile bool flag will generally result in the same behavior, but with extra side effects, and it does not solve multi-CPU issues.
std::atomic<bool> (as per stefan's answer) and atomic_set(&accessing_flag, true); will generally do the same thing as described above in their implementations. You may wish to use them if they are available on your platform.
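For completeness, a minimal C++11 sketch of the std::atomic<bool> variant (no mutex, no hand-rolled barrier):

#include <atomic>

std::atomic<bool> _flag{false};

void LoadingScreen::SetFlag(bool value)
{
    _flag.store(value, std::memory_order_release);
}

bool LoadingScreen::GetFlag()
{
    return _flag.load(std::memory_order_acquire);
}

The default (sequentially consistent) ordering is also fine here if you don't want to think about memory orders.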
First of all, you should use RAII for mutex lock/unlock. Second, either you are not showing other code that uses _flag directly, or there is something wrong with the mutex you are using (unlikely). What library provides System::Mutex?
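To illustrate the RAII point, and assuming System::Mutex only exposes Lock()/Unlock() as in the question, a small scoped guard (the same role std::lock_guard plays for std::mutex) could look like:

// Minimal RAII guard for the System::Mutex shown in the question.
class ScopedLock
{
public:
    explicit ScopedLock(System::Mutex& m) : m_mutex(m) { m_mutex.Lock(); }
    ~ScopedLock() { m_mutex.Unlock(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
private:
    System::Mutex& m_mutex;
};

bool LoadingScreen::GetFlag()
{
    ScopedLock guard(m_flag);   // unlocks even on early return or exception
    return _flag;
}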
The code looks right if System::Mutex is correctly implemented.
Something to be mentioned:
As others pointed out, RAII is better than macro.
It might be better to define accessingFlag and _flag as volatile.
I think the temporary solution you have is not correct if you compile with optimization:

bool LoadingScreen::GetFlag()
{
    accessingFlag = true;  // might be reordered or deleted
    bool value = _flag;    // might be optimized away
    accessingFlag = false; // might be reordered before value is set
    return value;          // might be optimized to directly return _flag or a register
}
In the above code, the optimizer could do nasty things. For example, there is nothing to prevent the compiler from eliminating the first assignment accessingFlag = true, or it could be reordered or cached. From the compiler's point of view, in a single-threaded program the first assignment to accessingFlag is useless because the value true is never used.
Using a mutex to protect a single bool variable is expensive, since most of the time is spent switching OS modes (from kernel to user and back). It might not be bad to use a spinlock (the detailed code depends on your target platform). It should be something like: spinlock_lock(&lock); _flag = value; spinlock_unlock(&lock);
An atomic variable is also a good fit here. It might look like:
atomic_set(&accessing_flag, true);
Have you considered using CRITICAL_SECTION? This is only available on Windows, so you lose some portability, but it is an effective user-level mutex.
The second block of code that you provided may modify the flag while it is being read, even in uniprocessor settings.
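A rough Windows-only sketch of that suggestion (the critical section must be initialized once before first use and deleted at shutdown):

#include <windows.h>

CRITICAL_SECTION g_flagCS;   // InitializeCriticalSection(&g_flagCS) once at startup,
                             // DeleteCriticalSection(&g_flagCS) at shutdown
bool _flag;

void LoadingScreen::SetFlag(bool value)
{
    EnterCriticalSection(&g_flagCS);
    _flag = value;
    LeaveCriticalSection(&g_flagCS);
}

bool LoadingScreen::GetFlag()
{
    EnterCriticalSection(&g_flagCS);
    bool value = _flag;
    LeaveCriticalSection(&g_flagCS);
    return value;
}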
The original code that you posted is correct, and cannot lead to deadlocks under two assumptions:
The m_flag lock is correctly initialized, and not modified by any other code.
The lock implementation is correct.
If you want a portable lock implementation, I would suggest using OpenMP:
How to use lock in openMP?
From your description, it seems like you want to busy-wait for a thread to process some input. In this case, stefan's solution (declare the flag std::atomic) is probably best. On semi-sane x86 systems, you could also declare the flag volatile int. Just don't do this for unaligned fields (packed structures).
You can avoid busy waiting with two locks. The first lock is unlocked by the slave when it finishes processing and locked by the main thread when waiting for the slave to finish. The second lock is unlocked by the main thread when providing input, and locked by the slave when waiting for input.
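Note that a std::mutex can't legally be unlocked by a thread other than the one that locked it, so the two-lock handshake described above maps more naturally onto a pair of binary semaphores. A C++20 sketch (names made up):

#include <semaphore>

std::binary_semaphore input_ready{0};   // released by main when input is available
std::binary_semaphore work_done{0};     // released by the worker when it finishes

void worker()
{
    for (;;) {
        input_ready.acquire();   // wait for main to provide input
        // ... process the input ...
        work_done.release();     // tell main the result is ready
    }
}

void main_thread_step()
{
    // ... prepare input ...
    input_ready.release();       // hand the input to the worker
    work_done.acquire();         // wait (without spinning) for it to finish
    // ... consume the result ...
}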
Here's a technique I've seen somewhere, but couldn't find the source anymore. If I find it, I will edit the answer. Basically, the writer will just write, but the reader will read the value of the set variable more than once, and only when all copies are consistent, it would use it. And I've changed the writer so that it will try to keep writing the value as long as it does not match the value it expects.
bool _flag;

void LoadingScreen::SetFlag(bool value)
{
    do
    {
        _flag = value;
    } while (_flag != value);
}

bool LoadingScreen::GetFlag()
{
    bool value;
    do
    {
        value = _flag;
    } while (value != _flag);
    return value;
}

optimizing branching by re-ordering

I have this sort of C function, which is being called a zillion times:
void foo ()
{
    if (/*condition*/)
    {
    }
    else if(/*another_condition*/)
    {
    }
    else if (/*another_condition_2*/)
    {
    }
    /* And so on, I have 4 of them, but we can generalize it */
    else
    {
    }
}
I have a good test-case that calls this function, causing certain if-branches to be called more than the others.
My goal is to figure out the best way to arrange the if statements to minimize the branching.
The only way I can think of is to write to a file for every if condition branched to, thereby creating a histogram. That seems tedious. Is there a better way, or better tools?
I am building it on AS3 Linux, using gcc 3.4; using oprofile (opcontrol) for profiling.
It's not portable, but many versions of GCC support a function called __builtin_expect() that can be used to tell the compiler what we expect a value to be:
if(__builtin_expect(condition, 0)) {
    // We expect condition to be false (0), so we're less likely to get here
} else {
    // We expect to get here more often, so GCC produces better code
}
The Linux kernel uses these as macros to make them more intuitive, cleaner, and more portable (i.e. redefine the macros on non-GCC systems):
#ifdef __GNUC__
# define likely(x) __builtin_expect((x), 1)
# define unlikely(x) __builtin_expect((x), 0)
#else
# define likely(x) (x)
# define unlikely(x) (x)
#endif
With this, we can rewrite the above:
if(unlikely(condition)) {
    // we're less likely to get here
} else {
    // we expect to get here more often
}
Of course, this is probably unnecessary unless you're aiming for raw speed and/or you've profiled and found that this is a problem.
Try a profiler (gprof?) - it will tell you how much time is spent. I don't recall if gprof counts branches, but if not, just call a separate empty method in each branch.
Running your program under Callgrind will give you branch information. Also, I hope you profiled and actually determined this piece of code is problematic, as this seems like a micro-optimization at best. The compiler will generate a branch table from the if/else if/else chain if it's able to, which would require no branching (this depends on what the conditionals are, obviously), and even failing that, the branch predictor on your processor (assuming this is not for embedded work; if it is, feel free to ignore me) is pretty good at determining the targets of branches.
It doesn't actually matter what order you change them round to, IMO. The branch predictor will store the most common branch and auto take it anyway.
That said, there are some things you could try... You could maintain a set of job queues and then, based on the if statements, assign the work to the correct job queue before executing the queues one after another at the end.
This could be further optimised by using conditional moves and so forth (this does require assembler, though, AFAIK). It could be done by conditionally moving a 1 into a register, initialised to 0, on condition A. Place the pointer value at the end of the queue and then decide whether to increment the queue counter by adding that conditional 1 or 0 to the counter.
Suddenly you have eliminated all branches and it becomes immaterial how many branch mispredictions there are. Of course, as with any of these things, you are best off profiling because, though it seems like it would provide a win ... it may not.
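A rough C++ rendering of that idea, without dropping to assembler (names are made up; compilers will usually lower the final addition to a conditional move, though that isn't guaranteed):

// The write always happens; whether it "counts" is decided by adding 0 or 1,
// so the next enqueue simply overwrites the slot if the condition was false.
void enqueue_if(void* job, bool condition, void** queue, int& count)
{
    queue[count] = job;          // unconditionally place the job at the tail
    count += (condition ? 1 : 0);// advance the tail only if the condition held
}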
We use a mechanism like this:

// pseudocode
class ProfileNode
{
public:
    inline ProfileNode( const char * name ) : m_name(name)
    { }
    inline ~ProfileNode()
    {
        s_ProfileDict.Find(m_name).Value() += 1; // as if Value returns a nonconst ref
    }

    static DictionaryOfNodesByName_t s_ProfileDict;
    const char * m_name;
};
And then in your code
void foo ()
{
    if (/*condition*/)
    {
        ProfileNode("Condition A");
        // ...
    }
    else if(/*another_condition*/)
    {
        ProfileNode("Condition B");
        // ...
    } // etc..
    else
    {
        ProfileNode("Condition C");
        // ...
    }
}
void dumpinfo()
{
    ProfileNode::s_ProfileDict.PrintEverything();
}
And you can see how it's easy to put a stopwatch timer in those nodes too and see which branches are consuming the most time.
Some counters may help. After you see the counters, if there are large differences, you can sort the conditions in decreasing order.
static int cond_1, cond_2, cond_3, ...

void foo (){
    if (condition){
        cond_1 ++;
        ...
    }
    else if(/*another_condition*/){
        cond_2 ++;
        ...
    }
    else if (/*another_condition*/){
        cond_3 ++;
        ...
    }
    else{
        cond_N ++;
        ...
    }
}
EDIT: a "destructor" can print the counters at the end of a test run:
void cond_print(void) __attribute__((destructor));

void cond_print(void){
    printf( "cond_1: %6i\n", cond_1 );
    printf( "cond_2: %6i\n", cond_2 );
    printf( "cond_3: %6i\n", cond_3 );
    printf( "cond_4: %6i\n", cond_4 );
}
I think it is enough to modify only the file that contains the foo() function.
Wrap the code in each branch into a function and use a profiler to see how many times each function is called.
Line-by-line profiling gives you an idea which branches are called more often.
Using something like LLVM could make this optimization automatic.
As a profiling technique, this is what I rely on: take random stack samples of the running program.
What you want to know is: is the time spent in evaluating those conditions a significant fraction of execution time?
The samples will tell you that, and if it isn't, it just doesn't matter.
If it does matter, for example if the conditions include function calls that are on the stack a significant part of the time, what you want to avoid is spending much time in comparisons that come out false. The way you tell this is: if you often see it calling a comparison function from, say, the first or second if statement, catch it in such a sample and step out of it to see whether it returns false or true. If it typically returns false, it should probably go farther down the list.