Unit tests fail only on ARM - C++

I am working on a multithreaded program for a Raspberry Pi, and I have noticed that our current code runs perfectly on my computer and on my colleagues' computers, but it fails when running on ARM.
We are using C++11 for our project, and this is the output on our computers:
............
Success!
Test run complete. 12 tests run. 12 succeeded.
But when we try to run it on ARM, as you can see here: https://travis-ci.org/OpenStratos/server/builds/49297710
It says the following:
....
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
After some debugging, I have narrowed the issue down to this code: https://github.com/OpenStratos/server/blob/feature/temperature/serial/Serial.cpp#L91
this->open = false;
while( ! this->stopped);
And there is another thread doing the opposite:
while(this->open)
{
    // Do stuff
}
this->stopped = true;
The first snippet is called when I need to stop the thread, and the double flag is used so that the thread can keep updating the current object even while it's stopping. Both variables are of type std::atomic_bool, but it seems that the while (!this->stopped); never sees the change, and it behaves like a while (true);.
Is this the case? How can it be solved? Why does it work differently on my x86_64 processor than on ARM?
Thanks in advance.

The core guarantee made by std::atomic<T> is that you can always read a valid value; how quickly you see any particular write is not guaranteed.
Now, in this case you're relying on operator bool, which is equivalent to .load(std::memory_order_seq_cst), and on operator=(x), which is .store(x, std::memory_order_seq_cst). This gives you sequentially consistent memory ordering.
The order you're observing on ARM appears sequentially consistent to me. The fact that you're not yet seeing stopped == true is OK; there is no time limit on that. The compiler cannot swap the memory operation with another memory operation, but it may delay it indefinitely.
The main question is why this thread should be stopped at all. If there were any real, observable work done in the loop body of that thread, that loop body could not be reordered relative to the stopped == true check.
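To spell out that equivalence, here is a minimal sketch of my own (based on the snippets above, not the project's actual code), writing the loads and stores explicitly:

#include <atomic>

std::atomic_bool open{true};
std::atomic_bool stopped{false};

// Stopping side: same as "open = false; while (!stopped);"
void stop_worker()
{
    open.store(false, std::memory_order_seq_cst);
    while (!stopped.load(std::memory_order_seq_cst))
        ; // spin until the worker acknowledges
}

// Worker side: same as "while (open) { ... } stopped = true;"
void worker()
{
    while (open.load(std::memory_order_seq_cst)) {
        // Do stuff
    }
    stopped.store(true, std::memory_order_seq_cst);
}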

In the end, the issue was that the environment I was creating on Travis CI was not working properly. On real ARM hardware it works correctly.

Related

printing variables in GDB seems to change code behavior?

I'm debugging a large, ancient piece of code that we just upgraded the OS/driver for; the entire thing is running 32-bit. The original developers of the code are long gone and much of it is still a black box to me.
I'm running it under the debugger. I narrowed down on a particular if statement within a larger loop; I need the 'else' branch to run to update some variables, but it was never running, implying that the condition checked in the 'if' statement is always true.
Eventually I stepped into the method call (a simple getter on a private boolean) and printed the contents of the variable. When I print the variable it is false, and the 'else' branch is entered when I return.
To experiment, I've tried allowing the loop to run for 10 minutes; the 'else' branch is never entered (as indicated by a breakpoint not being hit). Then, when I print the variable being checked, it's false and the 'else' branch is entered. It doesn't matter how long I let it run, or how many times I break and continue before printing the variable, the same pattern holds: I enter the 'else' branch if and only if I first print the contents of the variable being checked.
To rule out some sort of data race, I've tried sitting at the breakpoint in question for the length of time it takes to do a print statement; a delay without a print doesn't result in entering the 'else' branch.
What could cause such odd behavior? We did have issues with different architectures: this is a 32-bit program running on a 64-bit OS and, more importantly, the driver it uses had not been tested on 32-bit for years, until the driver was recompiled for me as 32-bit. That would make me suspect the driver, except that the particular line of code that is misbehaving doesn't touch the driver in any way. Still, I suspect some sort of overflow or underflow may be happening because of forcing an old 32-bit program to run in this environment.
However, even assuming this could cause such odd behavior, I don't know how to confirm whether that is happening, or otherwise debug a program where the act of looking at it changes its behavior. I'd love any tips on what could cause such a problem or how I could move forward with debugging it.
Dammit Jim I'm a programmer, not a Quantum Mechanic!

Are atomic types necessary in multi-threading? (OS X, clang, c++11)

I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads, and one of them does:
{
    foobar = false;
}
and the other:
{
    if (foobar) {
        // ...
    }
}
The type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite, and I even tried using this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops, etc., and in any case there's no observable difference in execution. I ended up inspecting the generated assembly with clang -std=c++11 -lstdc++ -O3 -S test.cpp, and I see that the asm differences on the read side are minor (without atomic on the left, with it on the right):
No mfence or anything that "dramatic". On the write side, something more "dramatic" happens:
As you can see, the atomic<> version uses xchgb, which uses an implicit lock. When I compile with a relatively old version of gcc (v4.5.2), I can see all sorts of mfences being added, which also indicates there's a serious concern.
I kind of understand that "x86 implements a very strong memory model" (ref) and that mfences might not be necessary, but does this mean that, unless I want to write cross-platform code that e.g. supports ARM, I don't really need to put in any atomic<>s unless I care about consistency at the nanosecond level?
I've watched "atomic<> Weapons" by Herb Sutter, but I'm still impressed by how difficult it is to create a simple example that reproduces those problems.
The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. This, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.
A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.
Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.
bool operation_done;

void thread1() {
    while (!operation_done) {
        sleep();
    }
    // do something that depends on the operation being done
}

void thread2() {
    // do the operation
    operation_done = true;
}
This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.
The problem was that the optimizer realized that under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true once, it would be true forever.
Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
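A rough sketch of the fixed version (assuming C++11 std::atomic is available; the actual fix used a pre-C++11, compiler-specific equivalent, and sleep_for stands in for whatever sleep() was in the original):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> operation_done{false}; // atomic: the load can no longer be hoisted out of the loop

void thread1() {
    while (!operation_done.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // do something that depends on the operation being done
}

void thread2() {
    // do the operation
    operation_done.store(true);
}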
This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for the credit, plus kudos to @HansPassant, whose comment brought my attention back to the writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, it turned out that the problem was that it wasn't optimizing the bool case as much as one would expect.
I was able to write a trivial program that reproduces the problem:
#include <atomic>
#include <iostream>
#include <unistd.h>

std::atomic_bool foobar(true);
//bool foobar = true;
long long cnt = 0;
long long loops = 400000000ll;

void thread_1() {
    usleep(200000);
    foobar = false;
}

void thread_2() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}
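For reference, a driver along these lines can run the two functions (my own sketch appended to the snippet above; judging by the later mention of pthread_join(), the original harness presumably used pthreads directly):

#include <thread>

int main() {
    std::thread t1(thread_1);
    std::thread t2(thread_2);
    t1.join();
    t2.join();
    return 0;
}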
The main difference from my original code was that I used to have a usleep() inside the while loop; that was enough to prevent any optimizations within the loop. The cleaner code above yields the same asm for the write:
but quite different for read:
We can see that in the bool case (left), clang hoisted the if (foobar) out of the loop. Thus, when I run the bool case I get:
400000000
real 0m1.044s
user 0m1.032s
sys 0m0.005s
while when I run the atomic_bool case I get:
95393578
real 0m0.420s
user 0m0.414s
sys 0m0.003s
It's interesting that the atomic_bool case is faster - I guess because it does just 95 million increments of the counter, compared to 400 million in the bool case.
What is even more crazy-interesting, though, is this: if I move the std::cout << cnt << std::endl; out of the threaded code, to after pthread_join(), the loop in the non-atomic case becomes just this:
i.e. there's no loop at all. It's just if (foobar != 0) cnt = loops;! Clever clang. The execution then yields:
400000000
real 0m0.206s
user 0m0.001s
sys 0m0.002s
while the atomic_bool case remains the same.
So there is more than enough evidence that we should use atomics. The only thing to remember is: don't put any usleep() in your benchmarks, because even if it's small, it will prevent quite a few compiler optimizations.
In general, it is very rare that the use of atomic types actually does anything useful for you in multithreaded situations. It is more useful to implement things like mutexes, semaphores and so on.
One reason why it's not very useful: As soon as you have two values that both need to be changed in an atomic way, you are absolutely stuck. You can't do it with atomic values. And it's quite rare that I want to change a single value in an atomic way.
On iOS and Mac OS X, the three methods to use are: protecting the change with @synchronized; avoiding multi-threaded access by running the code on a sequential queue (possibly the main queue); and using mutexes.
I hope you are aware that atomicity for boolean values is rather pointless. What you have is a race condition: one thread stores a value, another reads it. Atomicity doesn't make a difference here. It makes (or might make) a difference when two threads accessing a variable at exactly the same time causes problems. For example, if a variable is incremented on two threads at exactly the same time, is it guaranteed that the final result is increased by two? That requires atomicity (or one of the methods mentioned earlier).
Sebastian makes the ridiculous claim that atomicity fixes the data race. That's nonsense. In a data race, a reader will read a value either before or after it is changed; whether that value is atomic or not doesn't make any difference whatsoever. The reader will read either the old value or the new value, so the behaviour is unpredictable. All that atomicity does is prevent the situation where the reader would read some in-between state, which doesn't fix the data race.
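To make the increment example concrete, here is a minimal C++11 sketch (my own illustration, not from either answer): with a plain int the two threads' increments constitute a data race and updates can be lost, whereas std::atomic<int>::fetch_add guarantees that no increment is lost.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> counter{0};       // swap in a plain int to (possibly) see lost updates
    auto work = [&counter] {
        for (int i = 0; i < 100000; ++i)
            counter.fetch_add(1);      // atomic read-modify-write
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter << std::endl; // always 200000 with the atomic counter
    return 0;
}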

Design elements for inline asm in concurrent usage

I can't find a clear explanation of how I'm supposed to write a piece of inline asm, and of the problems that can possibly arise from concurrent use of a foo function that contains asm code.
The problem that I see is that in asm the registers are uniquely named, so one name is strictly tied to one precise part of your CPU, and that's a big problem if you are writing one piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention; you simply use registers and/or values, and sometimes an instruction implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the input and the output go, so each function can use its own parameters "as registers", and the pattern is the following:
asm ( assembler template
    : output operands
    : input operands
    : clobbered registers
);
Assuming that my main target for now is mathematical operations, and my function is only supposed to provide a certain functionality and perform some computation (no internal locking), is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to apply to this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling: every thread or process is allocated a certain amount of time in which it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched out, the operating system asks the processor to perform something called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When that task/thread/process's time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slices are usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description; refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, except that there is more than one unit that can execute instructions. This is also true for multi-core processors: every one of the cores has its own set of registers. The basics stay the same - the scheduler in your OS decides whether the code being executed actually runs at the same time on multiple cores of one processor.
Thus, your concerns in this case are not valid. However, they were raised for very valid reasons. Remember that the only thing threads share is the main memory: each thread has its own registers and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you write; that's why you need to specify them. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions - assuming we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"
        : "=r"(rc)
        :
        :
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but that is not important right now. See how I put %0 in the instruction and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify in the input/output lists. They are numbered starting from zero, as you can see.
I can't really remember which instructions use registers that are not encoded as operands, so I can't give you an example right now. In that case, you would need to put them on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
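For what it's worth, one commonly cited instruction of that kind is cpuid: it names no operands in its encoding, yet it reads EAX/ECX and writes EAX, EBX, ECX and EDX. The sketch below is my own addition (x86 only, GCC/Clang extended asm), declaring those registers as outputs so the compiler knows they change; any of them you don't capture as an output would go on the clobber list instead.

#include <cstdio>

static void cpuid(unsigned leaf, unsigned &a, unsigned &b, unsigned &c, unsigned &d)
{
    __asm__ volatile (
        "cpuid"
        : "=a"(a), "=b"(b), "=c"(c), "=d"(d) // outputs: the registers cpuid writes
        : "a"(leaf), "c"(0)                  // inputs: leaf in EAX, subleaf 0 in ECX
    );
}

int main()
{
    unsigned a, b, c, d;
    cpuid(0, a, b, c, d);
    std::printf("highest basic CPUID leaf: %u\n", a);
    return 0;
}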
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support inline asm at all, for example.

Simulating CPU Load In C++

I am currently writing an application for Windows in C++, and I would like to simulate CPU load.
I have the following code:
void task1(void *param) {
    unsigned elapsed = 0;
    unsigned t0;
    while (1) {
        if ((t0 = clock()) >= 50 + elapsed) { // if 50 ms have elapsed
            elapsed = t0;
            Sleep(50);
        }
    }
}

int main() {
    int ThreadNr;
    for (int i = 0; i < 4; i++) { // for each core (i.e. 4 cores)
        _beginthread(task1, 0, &ThreadNr); // create a new thread running "task1"
    }
    while (1) {}
}
I wrote this code using the same methodology as in the answers given in this thread: Simulate steady CPU load and spikes
My questions are:
Have I translated the C# code from the other post correctly over to C++?
Will this code generate an average CPU load of 50% on a quad-core processor?
How can I, within reasonable accuracy, find out the load percentage of the CPU? (is task manager my only option?)
EDIT: The reason I ask this question is that I eventually want to be able to generate CPU loads of 10, 20, 30, ..., 90% within a reasonable tolerance. This code seems to work well for generating loads above 70%, but it is very inaccurate at any load below that (as measured by the Task Manager CPU load readings).
Would anyone have any ideas as to how I could generate those loads and still be able to use my program on different computers (i.e. with different CPUs)?
At first sight, this looks like not-pretty-but-correct C++ or C (an easy way to be sure is to compile it). The includes are missing (<windows.h>, <process.h>, and <time.h>), but otherwise it compiles fine.
Note that clock and Sleep are not terribly accurate, and Sleep is not terribly reliable either. On the average, the thread function should kind of work as intended, though (give or take a few percent of variation).
However, regarding question 2, you should replace the last while(1){} with something that blocks rather than spins (e.g. WaitForSingleObject or Sleep if you will). Otherwise the program as a whole will not produce a 50% load on a quad core: you will have 100% load on one core from the main thread, plus the 4x 50% from your four workers. This will obviously sum up to more than 50% per core (and will cause threads to bounce from one core to another, resulting in nasty side effects).
Using Task Manager or a similar utility to verify whether you get the load you want is a good option (and since it's the easiest solution, it's also the best one).
Also do note that simulating load in such a way will probably kind of work, but is not 100% reliable.
There might be effects (memory, execution units) that are hard to predict. Assume, for example, that you're using 100% of the CPU's integer execution units with this loop (a reasonable assumption) but none of its floating point or SSE units. Modern CPUs may share resources between real or logical cores, and you might not be able to predict exactly what effects you get. Or another thread may be memory bound or taking significant page faults, so taking away CPU time won't affect it nearly as much as you think (it might in fact give it enough time to make prefetching work better). Or it might block on AGP transfers. Or something else you can't tell.
EDIT:
Improved version: shorter code that fixes a few issues and works as intended.
It uses clock_t for the value returned by clock (which is technically "more correct" than using an integer that isn't specifically typedef'd for it). Incidentally, that's probably the very reason why the original code does not work as intended: since clock_t is a signed integer under Win32, the condition in the if() always evaluates to true, so the workers sleep almost all the time, consuming no CPU.
It uses less code and less complicated math when spinning: it computes a wakeup time 50 ticks in the future and spins until that time is reached.
It uses getchar to block the program at the end. This does not burn CPU time, and it allows you to end the program by pressing Enter. The threads are not ended properly, as one normally would, but in this simple case it's probably OK to just let the OS terminate them when the process exits.
Like the original code, it assumes that clock and Sleep use the same ticks. That is admittedly a bold assumption, but it holds true under Win32, which you used in the original code (both "ticks" are milliseconds). C++ doesn't have anything like Sleep (without boost::thread or C++11 std::thread), so if non-Windows portability is intended, you'd have to rethink this anyway.
Like the original code, it relies on functions (clock and Sleep) which are imprecise and unreliable. Sleep(50) equals Sleep(63) on my system without using timeBeginPeriod (see the sketch after the code below). Nevertheless, the program works "almost perfectly", resulting in a 50% +/- 0.5% load on my machine.
Like the original code, it does not take thread priorities into account. A process that has a higher-than-normal priority class will be entirely unimpressed by this throttling code, because that is how the Windows scheduler works.
#include <windows.h>
#include <process.h>
#include <time.h>
#include <stdio.h>

void task1(void *)
{
    while (1)
    {
        clock_t wakeup = clock() + 50; // target time 50 ticks (ms) from now
        while (clock() < wakeup) {}    // spin (busy) for ~50 ms
        Sleep(50);                     // then sleep (idle) for ~50 ms
    }
}

int main(int, char**)
{
    int ThreadNr;
    for (int i = 0; i < 4; i++) _beginthread(task1, 0, &ThreadNr);

    (void) getchar(); // block without burning CPU; press Enter to exit
    return 0;
}
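On the timer granularity point above: one option (a hedged sketch of my own, not part of the original answer) is to request 1 ms timer resolution via timeBeginPeriod from the multimedia timer API, which typically brings Sleep(50) much closer to an actual 50 ms. Link against winmm.lib.

#include <windows.h>
#include <timeapi.h> // timeBeginPeriod / timeEndPeriod (mmsystem.h on older SDKs)

// RAII helper: request 1 ms timer resolution for this object's lifetime.
struct TimerResolution {
    TimerResolution()  { timeBeginPeriod(1); }
    ~TimerResolution() { timeEndPeriod(1); }
};

int main()
{
    TimerResolution res; // finer resolution for the whole run
    Sleep(50);           // now much closer to 50 ms
    return 0;
}

Instantiating one TimerResolution at the top of main keeps the finer resolution until the program exits and restores the previous setting on the way out.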
Here is a code sample which loaded my CPU to 100% on Windows.
#include "windows.h"
DWORD WINAPI thread_function(void* data)
{
float number = 1.5;
while(true)
{
number*=number;
}
return 0;
}
void main()
{
while (true)
{
CreateThread(NULL, 0, &thread_function, NULL, 0, NULL);
}
}
Once you have built and run the app, press Ctrl-C to kill it.
You can use the Windows performance counter API to get the CPU load, either for the entire system or for your process.
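For the performance counter route, a minimal sketch (my own, using the PDH helper API; the counter path and polling interval are illustrative and error checking is omitted) that reads the total CPU load could look like this. Link against pdh.lib.

#include <windows.h>
#include <pdh.h>
#include <stdio.h>

int main()
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);
    PdhCollectQueryData(query);      // first sample only establishes a baseline

    for (int i = 0; i < 10; ++i) {
        Sleep(1000);                 // the counter needs two samples spaced apart
        PdhCollectQueryData(query);
        PDH_FMT_COUNTERVALUE value;
        PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
        printf("CPU load: %.1f%%\n", value.doubleValue);
    }

    PdhCloseQuery(query);
    return 0;
}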

Strange behavior of go routine

I just tried the following code, but the result seems a little strange. It prints the odd numbers first, and then the even numbers. I'm really confused by it. I had hoped it would output an odd number and an even number alternately, like 1, 2, 3, 4, ... . Can anyone help me?
package main

import (
    "fmt"
    "time"
)

func main() {
    go sheep(1)
    go sheep(2)
    time.Sleep(100000)
}

func sheep(i int) {
    for ; ; i += 2 {
        fmt.Println(i, "sheeps")
    }
}
More than likely you are only running with one CPU thread, so it runs the first goroutine and then the second. If you tell Go it can run on multiple threads, then both will be able to run simultaneously, provided the OS has spare time on a CPU to do so. You can demonstrate this by setting GOMAXPROCS=2 before running your binary. Or you could try adding a runtime.Gosched() call in your sheep function and see if that triggers the runtime to allow the other goroutine to run.
In general, though, it's better not to assume ordering semantics between operations in two goroutines unless you enforce specific synchronization points using a sync.Mutex or by communicating between them on channels.
Unsynchronized goroutines execute in a completely undefined order. If you want to print out something like
1 sheeps
2 sheeps
3 sheeps
....
in that exact order, then goroutines are the wrong way to do it. Concurrency works well when you don't care so much about the order in which things occur.
You could impose an order on your program through synchronization (locking a mutex around the fmt.Println calls or using a channel), but it's pointless, since you could more easily just write code that uses a single goroutine.