C++ multiple threads using std::locale to separate by thousands: weird results

I want to thousand-separate numbers (e.g. 6,000,000). After doing some research I wrote the below method:
template<class TYPE>
std::string thousands(const TYPE num)
{
std::ostringstream ss;
ss.imbue(std::locale(""));
ss << num;
return ss.str();
}
To be used like:
std::cout << thousands<uint64_t>(value) << std::endl;
Initially this seemed to work in a single threaded environment. However, I have started using multiple threads and even though I am passing in a stack variable (not shared across threads) the value of num was getting incremented by the formatting. When I removed the formatting the incrementing stopped.
Is there some oddity with using the above technique to format strings from multiple threads?
After some more testing, does imbue cache anything? The problem seems to trigger when both threads call it at the same time.
UPDATE
Whilst testing I was able to safely conclude the above code is definitely causing problems and some sort of caching/static is occurring. When I replace the above with return std::to_string(num) the problem never occurs. I revert back to the above code, problem occurs every time. I use one thread- no problem. Two threads calling imbue(), problem reappears.
Using Clang compiler, CentOS 7 OS.
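Since the update above reports that std::to_string(num) never shows the problem, one way to keep the grouping without touching std::locale at all is to insert the separators by hand. This is only a minimal sketch under that assumption: group_thousands and its sep parameter are hypothetical names, it handles unsigned integers only, and it deliberately ignores the user's locale.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: groups the decimal digits of num in threes.
// It uses no locale and no shared state, only local strings.
std::string group_thousands(std::uint64_t num, char sep = ',')
{
    const std::string digits = std::to_string(num);
    std::string out;
    out.reserve(digits.size() + digits.size() / 3);
    for (std::size_t i = 0; i < digits.size(); ++i) {
        // Emit a separator whenever the number of digits still to be
        // written is a non-zero multiple of three.
        if (i != 0 && (digits.size() - i) % 3 == 0) {
            out.push_back(sep);
        }
        out.push_back(digits[i]);
    }
    return out;
}
```

Called as group_thousands(value), it can be dropped in wherever thousands<uint64_t>(value) was used.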

Related

Multi threaded program on multi-core platform not printing output (std::cout) if I run a tight poll loop

As the title states, I have multiple threads, two of them being tight poll loops (I need polling) with regular sleeps: 1 second of sleep after every 10 seconds.
Program has multiple interim updates to be printed with:
std::cout << "progress report text" << std::endl;
Body of thread that polls, pretty much looks like:
void PollHardwareFunction ()
{
lastTimeSlept = std::chrono::HighResClock::now();
while (!stopSignal)
{
poll_hardware();
// Process the data received from hardware
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
}
}
Other threads are pretty normal that do few logical steps and prints status after each step.
void LongRunningFunction ()
{
int dataCounter = 0;
while (wait_for_data_from_hardware_in_concurrent_queue)
{
std::cout << "Data received: " << dataCounter++ << std::endl;
// Process the data received from hardware
std::cout << "STEP1 done." << std::endl;
std::cout << "STEP2 done." << std::endl;
std::cout << "STEP3 done." << std::endl;
}
}
This prints all messages as expected but only in bulk after 10 seconds. Making it look non responsive/stuck during this 10 seconds.
Program is run on following environment:
Compiled with GCC 6.2, run on RHEL 7, an 8 core CPU.
I notice that the program prints on the console only when the spinning threads go to sleep/idle. Once the busy threads go to sleep, all of the prints appear on my output console together. To add to it, data received from hardware is regular - say every 100 milliseconds.
With several CPU cores sitting free, why does the program stay in a non-responsive state until the spinning threads stop or pause?
From your comments:
My program is bit better structured - it uses atomic variables and some of lockfree data structures I have implemented.
and
poll_hardware is a function from the hardware vendor's API that reads hardware buffer and pushes data into a concurrent queue.
That sounds dubious. Did you write your own data structure or did you use an existing one? Regardless, please post the code for the queue. If the queue was provided by the vendor, please post the API.
my perspective here is to understand what can cause the programs output remain (feels) stuck where as std::cout << operator is executed (completed execution) with std::endl?
You don't call cout from PollHardwareFunction(), so the issue MUST be from wait_for_data_from_hardware_in_concurrent_queue blocking when it's not supposed to. (If you want to be sure, switch cout to cerr to avoid buffering writes.)
The first thing I would check is if poll_hardware() is dominating a lock by re-locking as soon as it releases. You may have created what is effectively a spin-lock. This is why user Snps suggested sleeping for 1ms in the comments. 1 yield is not enough. I understand that your data is time critical, but you said 100ms, so theoretically you could poll every 50ms and be fine. A few ms should be totally OK for debugging purposes.
Lock dominating can be both caused by and solved with a reader/writer lock. Reader/writer locks need to be custom designed with the characteristics of the situation in mind. (how many threads are reading vs writing? how often do reads vs writes occur?)
The second thing I would check is your assumptions about sequential programming and memory caching in your lock-free data structures. Loads and stores can be delayed, rearranged, buffered, etc. as an optimization. Everyone is your "frienemy"--the compiler will do this, then the OS will do it, the CPU will take its turn, and then hardware will do it too.
To prevent this, you have to use a memory barrier (aka memory fence) to keep any of your frienemies from optimizing memory accesses. FYI, mutexes use memory barriers in their implementation. A quick way to see if this fixes your problem is to make your shared variables volatile. HOWEVER, don't trust volatile. It only keeps the compiler from reordering your commands, not necessarily the OS or CPU (depending on compiler, naturally).
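As a rough illustration of the memory barrier idea, here is a minimal sketch that hands one plain int from one thread to another using std::atomic_thread_fence. The names payload, ready, producer and consumer are hypothetical and are not taken from your program; in practice a release store paired with an acquire load on the flag would do the same job.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                   // plain, non-atomic data being handed over
std::atomic<bool> ready{false};    // flag that publishes the payload

void producer()
{
    payload = 42;
    // Release fence: the write to payload may not be reordered after
    // the flag store that follows.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer()
{
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the read of payload may not be reordered before
    // the flag load that observed true.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}
```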
It would be good to know about some of your other atomic variables, because there could be a logic bug there.
Lastly, your use of auto here defines a new block-scoped variable lastTimeSlept that shadows the "actual" lastTimeSlept, so the outer one is never updated:
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
yikes! I don't think that's causing your issue, though.
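For completeness, a minimal sketch of the timing check without the shadowing: the timestamp is assigned to, not redeclared. PauseSchedule and due() are hypothetical names, and std::chrono::steady_clock stands in for whatever clock HighResClock refers to in your code.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper that remembers when the polling thread last paused.
struct PauseSchedule {
    Clock::time_point last_pause;
    Clock::duration interval;

    // Returns true when more than interval has passed since the last pause
    // and records now as the new reference point.
    bool due(Clock::time_point now)
    {
        if (now - last_pause > interval) {
            last_pause = now;   // plain assignment, no auto, so nothing is shadowed
            return true;
        }
        return false;
    }
};
```

In the poll loop this would read: if (schedule.due(Clock::now())) std::this_thread::sleep_for(std::chrono::seconds(1));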

Dealing with heavy profiling of execution times in C++

I'm currently working on a scientific computing project involving huge data and complex algorithms, so I need to do a lot of code profiling. I'm currently relying on <ctime> and clock_t to time the execution of my code. I'm perfectly happy with this solution... except that I'm basically timing everything and thus for every line of real code I have to call start_time_function123 = clock(), end_time_function123 = clock() and cout << "function123 execution time: " << (end_time_function123-start_time_function123) / CLOCKS_PER_SEC << endl. This leads to heavy code bloating and quickly makes my code unreadable. How would you deal with that?
The only solution I can think of would be to find an IDE allowing me to mark portions of my code (at different locations, even in different files) and to toggle hide/show all marked code with one button. This would allow me to hide the part of my code related to profiling most of the time and display it only whenever I want to.
Have a RAII type that marks code as timed.
struct timed {
char const* name;
clock_t start;
timed( char const* name_to_record):
name(name_to_record),
start(clock())
{}
~timed(){
auto end=clock();
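// Convert to double first: clock_t is usually an integer type, so plain division would truncate to whole seconds.
std::cout << name << " execution time: " << static_cast<double>(end-start) / CLOCKS_PER_SEC << std::endl;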
}
};
Then use it:
void foo(){
timed timer(__func__);
// code
}
Far less noise.
You can augment with non-scope based finish operations. When doing heavy profiling sometimes I like to include unique ids. Using cout, especially with endl, could result in the output dominating the timing; fast recording to a large buffer that is dumped out in an async manner may be optimal. If you need to time at the ms level, even allocation, locks and string manipulation should be avoided.
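To make the "record to a buffer, dump later" idea concrete, here is a minimal sketch. TimingRecord, timing_log() and ScopedTimer are hypothetical names, std::chrono::steady_clock replaces clock() as an assumption about what you want to measure (wall time, not CPU time), and the buffer is a thread_local vector so that no lock is needed. Call reserve() on it up front if you also want to keep allocation out of the timed region.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

struct TimingRecord {
    const char* name;   // expected to point at a string literal or __func__
    double seconds;
};

// Hypothetical per-thread buffer: no lock needed, dump it whenever convenient.
inline std::vector<TimingRecord>& timing_log()
{
    thread_local std::vector<TimingRecord> log;
    return log;
}

class ScopedTimer {
public:
    explicit ScopedTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}

    // On scope exit, store the elapsed time instead of printing it.
    ~ScopedTimer()
    {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        timing_log().push_back({name_, elapsed.count()});
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage is the same as with timed: put ScopedTimer timer(__func__); at the top of a scope, then print or write out timing_log() once the hot section has finished.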
You don't say so explicitly, but I assume you are looking for possible speedups - ways to reduce the time it takes.
You think you need to do this by measuring how much time different parts of it take. If you're interested, there's an orthogonal way to approach it.
Just get it running under a debugger (using a non-optimized debug build).
Manually interrupt it at random, by Ctrl-C, Ctrl-Break, or the IDE's "pause" button.
Display the call stack and carefully examine what the program is doing, at all levels.
Do this with a suspicion that whatever it's doing could be something wasteful that you could find a better way to do.
Then if you start it up again, and halt it again, and see it doing the same thing or something similar, you know you will get a substantial speedup if you fix it.
The fewer samples you took to see that thing twice, the more speedup you will get.
That's the random pausing technique, and the statistical rationale is here.
The reason you do it on a debug build is here.
After you've cut out the fat using this method, you can switch to an optimized build and get the extra margin it gives you.

Are atomic types necessary in multi-threading? (OS X, clang, c++11)

I'm trying to demonstrate that it's very bad idea to not use std::atomic<>s but I can't manage to create an example that reproduces the failure. I have two threads and one of them does:
{
foobar = false;
}
and the other:
{
if (foobar) {
// ...
}
}
the type of foobar is either bool or std::atomic_bool and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc. and in any case, there's no observable difference in execution. I end up inspecting generated assembly with clang clang -std=c++11 -lstdc++ -O3 -S test.cpp and I see that the asm differences on read are minor (without atomic on left, with on right):
No mfence or something that "dramatic". On the write side, something more "dramatic" happens:
As you can see, the atomic<> version uses xchgb which uses an implicit lock. When I compile with a relatively old version of gcc (v4.5.2) I can see all sorts of mfences being added which also indicates there's a serious concern.
I kind of understand that "X86 implements a very strong memory model" (ref) and that mfences might not be necessary but does it mean that unless I want to write cross-platform code that e.g. supports ARM, I don't really need to put any atomic<>s unless I care for consistency at ns-level?
I've watched "atomic<> Weapons" from Herb Sutter but I'm still impressed with how difficult it is to create a simple example that reproduces those problems.
The big problem of data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.
A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.
Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.
bool operation_done;
void thread1() {
while (!operation_done) {
sleep();
}
// do something that depends on operation being done
}
void thread2() {
// do the operation
operation_done = true;
}
This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.
The problem was that the optimizer realized that under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true, it would stay true forever.
Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
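A minimal sketch of what that fix looks like with C++11's std::atomic<bool>. The operation_result payload and the one millisecond sleep are assumptions added for illustration; the original code is not shown in more detail than above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

std::atomic<bool> operation_done{false};
int operation_result = 0;   // hypothetical payload written before the flag is set

int thread1()
{
    // The atomic load has to be performed on every iteration, so the
    // optimizer cannot treat the condition as loop-invariant.
    while (!operation_done.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // The default (sequentially consistent) load that saw true also makes
    // the earlier write to operation_result visible here.
    return operation_result;
}

void thread2()
{
    operation_result = 7;          // "do the operation"
    operation_done.store(true);    // publish completion
}
```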
This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant for his comment, which brought my attention back to writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, the problem turned out to be that it wasn't optimizing bool as much as one would expect.
I was able to have a trivial program that reproduces the problem:
std::atomic_bool foobar(true);
//bool foobar = true;
long long cnt = 0;
long long loops = 400000000ll;
void thread_1() {
usleep(200000);
foobar = false;
}
void thread_2() {
while (loops--) {
if (foobar) {
++cnt;
}
}
std::cout << cnt << std::endl;
}
The main difference with my original code was that I used to have a usleep() inside the while loop. It was enough to prevent any optimizations within the while loop. The cleaner code above, yields the same asm for write:
but quite different for read:
We can see that in the bool case (left) clang brought the if (foobar) outside the loop. Thus when I run the bool case I get:
400000000
real 0m1.044s
user 0m1.032s
sys 0m0.005s
while when I run the atomic_bool case I get:
95393578
real 0m0.420s
user 0m0.414s
sys 0m0.003s
It's interesting that the atomic_bool case is faster - I guess because it does just 95 million incs on the counter contrary to 400 million in the bool case.
What is even more crazy-interesting though is this. If I move the std::cout << cnt << std::endl; out of the threaded code, after pthread_join(), the loop in the non-atomic case becomes just this:
i.e. there's no loop. It's just if (foobar!=0) cnt = loops;! Clever clang. Then the execution yields:
400000000
real 0m0.206s
user 0m0.001s
sys 0m0.002s
while the atomic_bool remains the same.
So more than enough evidence that we should use atomics. The only thing to remember is - don't put any usleep() on your benchmarks because even if it's small, it will prevent quite a few compiler optimizations.
In general, it is very rare that the use of atomic types actually does anything useful for you in multithreaded situations. It is more useful to implement things like mutexes, semaphores and so on.
One reason why it's not very useful: As soon as you have two values that both need to be changed in an atomic way, you are absolutely stuck. You can't do it with atomic values. And it's quite rare that I want to change a single value in an atomic way.
In iOS and MacOS X, the three methods to use are: Protecting the change using @synchronized. Avoiding multi-threaded access by running code on a sequential queue (may be the main queue). Using mutexes.
I hope you are aware that atomicity for boolean values is rather pointless. What you have is a race condition: One thread stores a value, another reads it. Atomicity doesn't make a difference here. It makes (or might make) a difference if two threads accessing a variable at exactly the same time causes problems. For example, if a variable is incremented on two threads at exactly the same time, is it guaranteed that the final result is increased by two? That requires atomicity (or one of the methods mentioned earlier).
Sebastian makes the ridiculous claim that atomicity fixes the data race: That's nonsense. In a data race, a reader will read a value either before or after it is changed, whether that value is atomic or not doesn't make any difference whatsoever. The reader will read the old value or the new value, so the behaviour is unpredictable. All that atomicity does is prevent the situation that the reader would read some in-between state. Which doesn't fix the data race.

C++ multiple threads and processes in vector

While going through a C++ tutorial book (it's in Spanish, so I apologize if my translation to English is not as proper as it should be) I have come across a particular code snippet that I do not fully understand in terms of the different processes that are happening in the background. For example, in terms of multiple address spaces, how would I determine if these are all within the context of a single process (given that a new thread is added with each push to the vector)? How would I determine if each thread is different from the others if they all perform the exact same computation?
#include <iostream>
#include <vector>
#include <thread>
using namespace std;
int addthreads = 0;
void squarenum(int x) {
addthreads += x * x * x;
}
int main() {
vector<thread> septhread;
for (int i = 1; i <= 9; i++){
septhread.push_back(thread(&squarenum, i));
}
for (auto& th : septhread){
th.join();
}
cout << "Your answer = " << addthreads << endl;
system("pause");
return 0;
}
Every answer defaults to 2025, that much I understand. My basic issue is understanding the first part of my question.
By the way, the compile command required (if you are on Linux):
g++ -std=gnu++ -pthread threadExample.cpp -o threadExample
A thread is a "thread of execution" within a process, sharing the same address space, resources, etc. Depending on the operating system, hardware, etc, they may or may not run on the same CPU or CPU Thread.
A major issue with thread programming, as a result, is managing access to resources. If two threads access the same resource at the same time, Undefined Behavior can occur. If they are both reading, it may be fine, but if one is writing at the same moment the other is reading, numerous outcomes ensue. The simplest is that both threads are running on separate CPUs or cores and so the reader does not see the change made by the writer due to cache. Another is that the reader sees only a portion of the write (if it's a 64-bit value they might only see 32-bits changed).
Your code performs a read-modify-store operation, so the first thread to come along sees the value '0', calculates the result of x*x*x, adds it to 0 and stores the result.
Meanwhile the next thread comes along and does the same thing, it also sees 0 before performing its calculation, so it writes 0 + x*x*x to the value, overwriting the first thread.
These threads might not run in the order that you launched them; it's possible for a later thread (say, #9) to get the first execution cycle rather than thread #1.
You may need to consider looking at std::atomic or std::mutex.
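As a minimal sketch of the std::atomic route, the shared total below is updated with fetch_add, so each read-modify-write happens as one indivisible step. The names total, add_cube and sum_cubes are hypothetical; the cube computation and the 1..9 range follow the code in the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> total{0};

void add_cube(int x)
{
    // Atomic read-modify-write: no other thread can interleave between
    // the read of the old value and the write of the new one.
    total.fetch_add(x * x * x);
}

int sum_cubes(int n)
{
    total = 0;
    std::vector<std::thread> workers;
    for (int i = 1; i <= n; ++i) {
        workers.emplace_back(add_cube, i);
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total.load();
}
```

With this change sum_cubes(9) returns 2025 regardless of the order in which the threads happen to run.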

integer producing hex values error

something weird is happening to my program. I am currently using lots of threads in my program, and will not be feasible to paste everything here.
However this is my problem:
int value = 1000;
std::cout << value << std::endl;
//output: 3e8
Any idea why is my output 3e8?
Whats the command to fix it back to print decimal values?
Thanks in advance! :)
Some other thread changed the default output radix of the std::cout stream to hexadecimal. Note that 1000 in base 10 is 3e8 in base 16, i.e. 1000 == 0x3e8.
Somewhere in your program a call such as:
std::cout << std::hex << value;
has been used. To revert output to normal (decimal) use:
std::cout << std::dec;
Here is a relevant link to the different ways numbers can be output on std::cout.
Also, as pointed out in the comments below, the standard method of modifying cout flags safely appears to be the following:
ios::fmtflags cout_flag_backup(cout.flags()); // store the current cout flags
cout.flags ( ios::hex ); // change the flags to what you want
cout.flags(cout_flag_backup); // restore cout to its original state
Link to IO base flags
As stated in the comments below, it would also be wise to point out that when using IO Streams it is a good idea to have some form of synchronisation between the threads and the streams, that is, make sure no two threads can use the same stream at one time.
Doing this will probably also centralise your stream calls, meaning that it will be far easier to debug something such as this in the future.
Here's an SO question that may help you
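The two suggestions above (restore the flags afterwards, and let only one thread use the stream at a time) can be combined in a small RAII guard. This is just a sketch: FlagsGuard, output_mutex, print_hex and demo are hypothetical names, and the mutex only helps if every thread that writes to the stream goes through code that locks it.

```cpp
#include <cassert>
#include <ios>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Saves a stream's format flags and restores them on scope exit.
class FlagsGuard {
public:
    explicit FlagsGuard(std::ios_base& stream)
        : stream_(stream), saved_(stream.flags()) {}
    ~FlagsGuard() { stream_.flags(saved_); }

    FlagsGuard(const FlagsGuard&) = delete;
    FlagsGuard& operator=(const FlagsGuard&) = delete;

private:
    std::ios_base& stream_;
    std::ios_base::fmtflags saved_;
};

std::mutex output_mutex;   // hypothetical: one mutex shared by all writers

void print_hex(std::ostream& os, int value)
{
    std::lock_guard<std::mutex> lock(output_mutex);  // one writer at a time
    FlagsGuard guard(os);                            // hex does not leak out
    os << std::hex << value;
}

// Small demonstration on a string stream: hex inside, decimal afterwards.
std::string demo()
{
    std::ostringstream os;
    print_hex(os, 1000);
    os << ' ' << 1000;
    return os.str();
}
```

Here demo() returns "3e8 1000": the first number is printed in hex inside print_hex, and the second is back to decimal because the guard restored the flags.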
Chances are that another thread has changed the output to hex on cout. I doubt that these streams are thread-safe.