Initializing a shared resource before starting multithreaded code - C++

int main() {
    // X is a shared resource
    initSharedResourceX();
    startMultithreadingServer(); // handles requests concurrently via handle() below;
                                 // all handlers (run concurrently) access X **read-only**
}

int handle() {
    return X.get(); // get() is read-only
}
I would like to avoid synchronizing access to X by initializing the shared resource before the server starts. Do I need a compiler barrier? I can imagine the compiler doing something like:
int main() {
    startMultithreadingServer();
}

int handle() {
    if (X is not initialized) {
        initSharedResourceX();
    }
    return X.get();
}
and, as we can see, that would break our program.
I know the compiler would have to be super-smart to do this. In particular, it would have to know what it means to initialize X. So it would have to be really, really smart.
But can we assume that it is not?
What do you think?

If the compiler does not see the code of the startMultithreadingServer function, then it is prohibited (by the language specification) from moving any code across the invocation of that function.
If the compiler does see the code of startMultithreadingServer, then it will find a memory barrier (or an operation with the same effect) inside it. (Any thread-starting function must contain a memory barrier; this should be stated in its contract/description.) And again, the compiler cannot move code (at least, not forward) across that barrier.
So in either case the compiler cannot move code that precedes the thread-creation call to a point after that call.
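To make this concrete, here is a minimal sketch of the pattern under discussion, with std::thread standing in for the server machinery (the names follow the question; the bodies are assumptions made up for the sketch). The std::thread constructor synchronizes-with the start of each thread function, so the initialization is guaranteed to be visible to every handler without any further synchronization:

#include <thread>
#include <vector>

struct Resource { int get() const { return value; } int value = 0; };
Resource X;                                // the shared resource

void initSharedResourceX() { X.value = 42; }

int handle() { return X.get(); }           // read-only access, no locking needed

int main() {
    initSharedResourceX();                 // happens-before every thread started below
    std::vector<std::thread> workers;      // stand-in for startMultithreadingServer()
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] { (void)handle(); });
    for (auto &t : workers) t.join();
}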

Related

C++ compiler reordering read and write across thread start

I have a piece of multithreaded code that I'm not sure is free of data races caused by compiler reordering.
Here is a minimal example:
#include <thread>

int main()
{
    int x = 0;
    x = 5;
    auto t = std::thread([&x]()
    {
        ++x;
    });
    t.join();
    return 0;
}
Is the assignment x = 5 guaranteed to happen before the thread starts?
Short answer: The code will work as expected; no reordering will take place.
Long answer:
Compile time reordering
Let's consider what's going on.
1. You put a variable in automatic storage (x).
2. You create an object that holds a reference to this variable (the lambda).
3. You pass that object to an external function (the thread constructor).
The compiler does escape analysis during optimization. Given this sequence of events, the variable x has escaped once point 3 is reached. That means, from the compiler's point of view, any external function (except those marked as pure) may read or modify the variable. Therefore its value has to be stored to the stack before each function call and loaded back from the stack after the call.
You did not make x an atomic variable, so the compiler is free to ignore any potential multithreading effects. Therefore the value need not be reloaded from memory between calls to external functions. It may still be reloaded if the compiler decides not to keep the value in a register between uses.
Let's annotate and expand your source code to show it:
#include <thread>

int main()
{
    int x = 0;
    x = 5; // stored to the stack for use by the external function in the next line
    auto t = std::thread([&x]()
    {
        ++x;
    });
    int x1 = x; // loaded from the stack: the thread constructor may (in theory) have modified x
    int x2 = x; // probably no reload, because x is not an atomic variable
    x = 7;      // new value stored to the stack because join() could access it (in theory)
    t.join();
    int x3 = x; // reloaded from the stack because join() could have changed it
    return 0;
}
Again, this has nothing to do with multithreading. Escape analysis and external function calls are sufficient.
Any access from main() between thread creation and joining would also be undefined behavior because it would be a data-race on a non-atomic variable. But that's just a side-note.
This takes care of the compiler behavior. But what about the CPU? May it reorder instructions?
Run time reordering
For this, we can look at the C++ standard, section 32.4.2.2 [thread.thread.constr], clause 7:
Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.
The constructor means the thread constructor. f is the thread function, meaning the lambda in your case. So this means that any memory effects are synchronized properly.
The join() call also synchronizes:
The completion of the thread represented by *this synchronizes with (6.9.2) the corresponding successful join() return.
Therefore access to x after the join cannot suffer from run-time reordering.
Side note
Contrary to what some comments suggest, the compiler will not optimize the thread creation away, for two reasons: 1. No compiler is sufficiently magical to figure this out. 2. Thread creation may fail, which is defined behavior, so it has to be present in the runtime.

Can unused function arguments be optimized away?

I'm trying to ensure an object, wrapped in a shared_ptr, stays alive as long as a function executes, by passing it by value. However, inside the function the object is not used at all; I just want to use it for 'pinning':
#include <memory>

struct Foo {}; // placeholder type for the example

void doSomething(std::shared_ptr<Foo>) {
    // Perform some operations unrelated to the passed shared_ptr.
}

int main() {
    auto myFoo{std::make_shared<Foo>()};
    doSomething(std::move(myFoo)); // Is the Foo object kept alive until doSomething returns?
    return 0;
}
I checked the behavior at different optimization levels (GCC) and it seems to work as intended, but I don't know whether the compiler may still optimize it away in certain scenarios.
You don't need to worry: the lifetime of the function argument at the call site is guaranteed to extend through the function call. (This is why things like foo(s.c_str()) for a std::string s work.)
A compiler is not allowed to break that rule, subject to the flexibility the as-if rule grants.
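For instance, here is a minimal sketch of the c_str() case mentioned above (print is a hypothetical stand-in function): the temporary std::string lives until the end of the full-expression, so the pointer stays valid for the whole call:

#include <cstdio>
#include <string>

void print(const char *s) { std::puts(s); }

int main() {
    std::string s = "hello";
    print((s + ", world").c_str()); // the temporary outlives the call to print
}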
This very much depends on what the bodies of doSomething and Foo actually look like. For instance, consider the following example:
#include <iostream>
#include <memory>

struct X
{
    ~X() { std::cout << "2"; }
};

void f(std::shared_ptr<X>) { std::cout << "1"; }

int main()
{
    auto p = std::make_shared<X>();
    f(std::move(p));
}
This program has the very same observable effect as:
int main()
{
    std::cout << "12";
}
and the order "12" is guaranteed. So the generated assembly may contain no shared pointer at all. However, most compilers will likely not perform such aggressive optimizations, since there are dynamic memory allocations and virtual function calls involved internally, which are not that easy to optimize away.
The compiler could optimise away the copying of an object into a function argument if the function is being inlined and the copying has no side effects.
Copying a shared_ptr increments its reference count, so it does have side effects, and the compiler can't optimise the copy away (unless it can prove that not modifying the reference count has no observable effect on the program).
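A small sketch that makes the reference count observable (Foo is a hypothetical empty type): the by-value parameter owns a reference for the entire call, and moving in avoids the increment altogether:

#include <iostream>
#include <memory>

struct Foo {};

void doSomething(std::shared_ptr<Foo> p) {
    std::cout << p.use_count() << '\n'; // the parameter itself counts as an owner
}

int main() {
    auto myFoo = std::make_shared<Foo>();
    doSomething(myFoo);            // copied: prints 2
    doSomething(std::move(myFoo)); // moved: prints 1, no refcount traffic
}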

Declaring a variable non-aliased in clang?

Is there a way to declare that a variable is not aliased in clang, to allow more optimizations where the variable is used?
I understand restrict can be used to declare pointers as non-aliasing.
However, I'm also wondering about variables that can be pointed into. I guess (perhaps wrongly) that the compiler has to be careful about assumptions here, which can prevent it from caching a variable's value instead of re-fetching it each time.
Example:
class Data
{
public:
    void updateVal() {
        // Updates m_val with some value each time it's called
        // (the value may differ across calls).
        ...
    }
    int complicatedCalculation() const {
        return 3 * m_val + 2;
    }
    int m_val;
};

class User
{
public:
    User(Data& data) : m_data{data} {}

    void f()
    {
        m_data.updateVal();
        for (int i = 0; i < 1000; ++i)
            g();
    }

    void g()
    {
        // Will the optimizer be able to cache calc's value
        // for use in all the calls to g() from f()?
        int calc = m_data.complicatedCalculation();
        // Do more work
        ...
    }

    Data& m_data;
};
Even if the answer to the question in the sample code is "yes", might it not change to "no" if the code were more complicated (e.g. real work under // Do more work), because some pointer that is written through might point into m_data.m_val? Or is this something the compiler assumes never happens unless it sees the address of m_val being taken somewhere in the code?
If it doesn't assume that, or if it does but the address of m_val does get taken somewhere (while we know the contents won't be modified through it), then it would be nice to be able to mark m_val as "safe" from aliasing concerns, so that its value can be assumed not to change through pointer access.
The compiler will allocate a register to store calc in g, unless it determines that there are other, hotter variables that would be better kept in registers.
Even if calc is stored in a register, computing it may still require a call to complicatedCalculation and a memory access to m_val. The compiler may inline complicatedCalculation and eliminate the call, but it cannot eliminate the memory access unless it can determine that m_val is effectively constant.
What you really want is to eliminate the unnecessary memory accesses to m_val in the loop in f, rather than in g. For that to happen, the compiler has to deem g eligible for inlining into f; only when g is inlined can the compiler eliminate those accesses. Even if g directly modifies m_val, the compiler can still keep calc in a register and update it accordingly. The one caveat is exceptions: if g may ever throw, the in-memory copy of m_val has to hold the latest value before the exception is allowed to propagate, and the compiler has to emit code to ensure this; without such code, it has to update the in-memory copy on every iteration. I don't know which version of clang uses which approach; you have to examine the generated assembly.
If the address of m_val is taken anywhere in the code, the compiler may not be able to eliminate any memory accesses to it. In that case, using restrict may help. m_val must not then be modified through any other pointer, because that violates the standard and results in undefined behavior; it's your responsibility to ensure it doesn't.
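If restructuring is an option, a portable alternative is to take the aliasing question out of the compiler's hands entirely: compute the value once in f and pass it down. This sketch reuses the question's Data class; g(int) is a hypothetical variant of g introduced here:

class User
{
public:
    User(Data& data) : m_data{data} {}

    void f()
    {
        m_data.updateVal();
        const int calc = m_data.complicatedCalculation(); // m_val read exactly once
        for (int i = 0; i < 1000; ++i)
            g(calc);                                      // no re-fetch, whatever g does
    }

    void g(int calc)
    {
        // Work with the cached value; g no longer reads m_val at all.
    }

    Data& m_data;
};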
I hope that you care about this either because you have experimentally determined that this is a performance bottleneck in the code or you are just curious, rather than for any other reason.

How to effect a return of a value from the _calling_ function?

I would like to be able to force a 'double-return', i.e. to have a function which forces a return from its calling function (yes, I know there isn't always a real calling function etc.) Obviously I expect to be able to do this by manipulating the stack, and I assume it's possible at least in some non-portable machine-language way. The question is whether this can be done relatively cleanly and portably.
To give a concrete piece of code to fill in, I want to write the function
void foo(int x) {
    /* magic */
}
so that the following function
int bar(int x) {
    foo(x);
    /* long computation here */
    return 0;
}
returns, say, 1, and the long computation is not performed. Assume that foo() can assume it is only ever called by a function with bar's signature, i.e. an int(int) (and thus specifically knows its caller's return type).
Notes:
Please do not lecture me about how this is bad practice, I'm asking out of curiosity.
The calling function (in the example, bar()) must not be modified. It will not be aware of what the called function is up to. (Again in the example, only the /* magic */ bit can be modified).
If it helps, you may assume no inlining is taking place (an unrealistic assumption perhaps).
The question is whether this can be done relatively cleanly and portably.
The answer is that it cannot.
Aside from all the non-portable details of how the call stack is implemented on different systems, suppose foo gets inlined into bar. Then (generally) it won't have its own stack frame. You can't cleanly or portably talk about reverse-engineering a "double" or "n-times" return because the actual call stack doesn't necessarily look like what you'd expect based on the calls made by the C or C++ abstract machine.
The information you need to hack this is probably (no guarantees) available with debug info. If a debugger is going to present the "logical" call stack to its user, including inlined calls, then there must be sufficient information available to locate the "two levels up" caller. Then you need to imitate the platform-specific function exit code to avoid breaking anything. That requires restoring anything that the intermediate function would normally restore, which might not be easy to figure out even with debug info, because the code to do it is in bar somewhere. But I suspect that since the debugger can show the state of that calling function, then at least in principle the debug info probably contains enough information to restore it. Then get back to that original caller's location (which might be achieved with an explicit jump, or by manipulating wherever it is your platform keeps its return address and doing a normal return). All of this is very dirty and very non-portable, hence my "no" answer.
I assume you already know that you could portably use exceptions or setjmp / longjmp. Either bar or the caller of bar (or both) would need to cooperate with that and agree with foo on how the "return value" is stored, so I assume that's not what you want. But if modifying the caller of bar is acceptable, you could do something like the following. It's not pretty, but it just about works (in C++11, using exceptions). I'll leave it to you to figure out how to do it in C using setjmp / longjmp, and with a fixed function signature instead of a template:
// Definition of EarlyReturnException added so the example compiles.
template <typename T>
struct EarlyReturnException {
    T value;
    explicit EarlyReturnException(T v) : value(v) {}
};

template <typename T, typename FUNC, typename ...ARGS>
T callstub(FUNC f, ARGS ...args) {
    try {
        return f(args...);
    }
    catch (EarlyReturnException<T> &e) {
        return e.value;
    }
}

void foo(int x) {
    // to return early:
    throw EarlyReturnException<int>(1);
    // to return normally through bar:
    return;
}

// bar is unchanged
int bar(int x) {
    foo(x);
    /* long computation here */
    return 0;
}

// the caller of bar does this:
int a = callstub<int>(bar, 0);
Finally, not a "bad-practice lecture" but a practical warning: using any trick to return early does not, in general, go well with code written in C, or with C++ code that doesn't expect an exception to leave foo. The reason is that bar might have allocated some resource, or put some structure into a state that violates its invariants, before calling foo, intending to free that resource or restore the invariant in the code following the call. So, for a general bar, skipping code in bar might cause a memory leak or an invalid data state. The only way to avoid this in general, regardless of what is in bar, is to let the rest of bar run. Of course, if bar is written in C++ with the expectation that foo might throw, then it will have used RAII for the cleanup code, and that code will run when you throw. longjmp-ing over a destructor has undefined behavior, though, so you have to decide before you start whether you're dealing with C++ or with C.
There are two portable ways to do this, but both require the caller's assistance. For C, it's setjmp + longjmp. For C++, it's exceptions (try + catch + throw). The two are quite similar in implementation (in essence, some early exception implementations were based on setjmp). And there is definitely no portable way to do this without the caller being aware of it...
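For completeness, here is a minimal sketch of the setjmp / longjmp variant of the same idea (the jmp_buf and the value slot are globals introduced for this sketch). As noted above, a longjmp that bypasses non-trivial destructors is undefined behavior, so this is only safe in C-like C++:

#include <csetjmp>
#include <cstdio>

static std::jmp_buf early_return;
static int early_value;

void foo(int) {
    early_value = 1;
    std::longjmp(early_return, 1); // unwinds straight past bar
}

int bar(int x) {
    foo(x);
    /* long computation here */
    return 0;
}

int main() {
    int a;
    if (setjmp(early_return) == 0)
        a = bar(0);      // normal path
    else
        a = early_value; // foo forced the "return"
    std::printf("%d\n", a); // prints 1
}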
The only clean way is to modify your functions:
bool foo(int x) {
    if (skip) return true; // 'skip' stands for whatever condition triggers the early return
    return false;
}

int bar(int x) {
    if (foo(x)) return 1;
    /* long computation here */
    return 0;
}
This can also be done with setjmp() / longjmp(), but you'd have to modify your caller as well, and then you might as well do it cleanly.
Modify your void function foo() to return a boolean yes/no, and then wrap it in a macro of the same name:
#define foo(x) do {if (!foo(x)) return 1;} while (0)
The do .. while (0) is, as I'm sure you know, the standard swallow-the-semicolon trick.
You may also need, in the header file where you declare foo(), to add extra parentheses, as in:
extern bool (foo)(int);
This prevents the macro (if already defined) from being used. Ditto for the foo() implementation.
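Putting the pieces together, here is a complete sketch of the macro trick, following this answer's convention that foo returns false to request the early return (the condition x >= 0 is made up for the demo):

#include <cstdio>

bool (foo)(int x) {       // parentheses keep the name safe from the macro
    return x >= 0;        // false means: make the caller return 1
}

#define foo(x) do { if (!foo(x)) return 1; } while (0)

int bar(int x) {          // bar's source text is unchanged
    foo(x);               // expands to the early-return wrapper
    /* long computation here */
    return 0;
}

int main() {
    std::printf("%d %d\n", bar(-1), bar(3)); // prints "1 0"
}

Note that the foo inside the macro body is not re-expanded (a macro never expands recursively), so it resolves to the real function.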

Can code reordering affect my test?

I am writing a unit test for a class, testing insertion when no memory is available. It relies on the fact that nbElementInserted is incremented AFTER insert_edge has returned.
void test()
{
    adjacency_list a(true);
    MemoryVacuum no_memory_after_this_line;
    bool signalReceived = false;
    size_t nbElementInserted = 0;

    do
    {
        try
        {
            a.insert_edge( 0, 1, true ); // this should throw
            nbElementInserted++;
        }
        catch (std::bad_alloc &)
        {
            signalReceived = true;
        }
    }
    while (!signalReceived); // This loop is necessary because the memory vacuum
                             // only prevents new memory pages from being mapped,
                             // so the first allocations may succeed.

    CHECK_EQUAL( nbElementInserted, a.nb_edges() );
}
Now I am wondering which of the two statements is true:
1. Reordering can happen, in which case nbElementInserted can be incremented before insert_edge throws, and that invalidates my test. Reordering would be allowed because the visible result for the user is the same if the two lines are permuted.
2. Reordering cannot happen, because insert_edge is a function and all of the function's side effects must be complete before execution moves to the next line. Throwing is a side effect.
Bonus point: if the correct answer is "yes, reordering can happen", is a memory barrier between the two lines sufficient to fix it?
No. Reordering only comes into play in multithreaded or multiprocessing scenarios. Within a single thread, the compiler cannot reorder instructions in a way that would change the observable behavior of the program. Exceptions are no exception to this rule.
Reordering becomes visible when two threads read and write to shared state. If thread A makes modifications to shared variables thread B can see those modifications out-of-order, or even not at all if it has the shared state cached. This can be due to optimizations in either thread A or thread B or both.
Thread A will always see its own modifications in-order, though. Each sequence point must happen in order, at least as far as the local thread is aware.
Let's say thread A executed this code:
a = foo() + bar();
b = baz;
Each ; introduces a sequence point. The compiler is allowed to call either foo() or bar() first, whichever it likes, since + does not introduce a sequence point. If you put printouts you might see foo() called first, or you might see bar() called first. Either one would be correct. It must call them before it assigns baz to b, though. If either foo() or bar() throws an exception b must retain its existing value.
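A quick illustration of that unspecified order (hypothetical foo/bar with printouts):

#include <cstdio>

int foo() { std::puts("foo"); return 1; }
int bar() { std::puts("bar"); return 2; }

int main() {
    int b = 0;
    int a = foo() + bar(); // "foo" and "bar" may print in either order
    b = 3;                 // but b is only assigned after both calls complete
    std::printf("%d %d\n", a, b);
}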
However, if the compiler knew that foo() and bar() never throw, and their execution in no way depends on the value of b, it could reorder the two statements. It'd be a valid optimization. There would be no way for the thread A to know that statements had been reordered.
Thread B, on the other hand, would know. The problem in multithreaded programming is that sequence points don't apply to other threads. That's where memory barriers come in. Memory barriers are cross-thread sequence points, in a sense.
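To close the loop on the bonus question: in standard C++, the portable way to get such a cross-thread ordering point is an atomic with release/acquire semantics, as in this sketch:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic shared data
std::atomic<bool> ready{false};

int main() {
    std::thread writer([] {
        payload = 42;                                  // written before...
        ready.store(true, std::memory_order_release);  // ...the release store
    });
    std::thread reader([] {
        while (!ready.load(std::memory_order_acquire)) // acquire pairs with release
            ;
        assert(payload == 42); // guaranteed visible; no reordering across the barrier
    });
    writer.join();
    reader.join();
}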