char instead of std::atomic_flag or std::atminc_bool - c++

I am wondering if one can use
char Flag;
instead of
std::atomic_flag Flag;
I know that C++ fundamental types, generally speaking, are not atomic/thread safe (that's why std::atomic exists), but also I know that size of char is always 1 byte. And I cannot imagine situation in which read/write of single byte is not thread safe.
Also I cannot find anything about thread safefy of char variable.
Consider following example (Win32, Visual Studio 2015, Release, optimisation disabled):
// Can be any integral type
using mytype_t = unsigned char;
#define VAL1 static_cast<mytype_t>(0x5555555555555555ULL)
#define VAL2 static_cast<mytype_t>(0xAAAAAAAAAAAAAAAAULL)
#define CYCLES (50 * 1000 * 1000)
void runtest_mytype()
{
// Just to stop checking thread
std::atomic_bool Stop = false;
const auto Started = ::GetTickCount64();
auto Val = VAL1;
std::thread threadCheck([&]()
{
// Checking values
while (!Stop)
{
const auto Val_ = Val;
if (VAL1 != Val_ && VAL2 != Val_)
std::cout << "Error! " << std::to_string(Val_) << std::endl;
}
});
std::thread thread1([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Val = VAL1;
});
std::thread thread2([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Val = VAL2;
});
thread1.join();
thread2.join();
std::cout << "mytype: finished in " << std::to_string(::GetTickCount64() - Started) << " ms" << std::endl;
Stop = true;
threadCheck.join();
}
void runtest_atomic_flag()
{
std::atomic_flag Flag;
const auto Started = ::GetTickCount64();
std::thread thread1([&]()
{
for (auto I = 0; I < CYCLES; ++I)
auto Val_ = Flag.test_and_set(std::memory_order_acquire);
});
std::thread thread2([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Flag.clear(std::memory_order_release);
});
thread1.join();
thread2.join();
std::cout << "atomic_flag: finished in " << std::to_string(::GetTickCount64() - Started) << " ms" << std::endl;
}
int _tmain(int argc, _TCHAR* argv[])
{
runtest_mytype();
runtest_atomic_flag();
std::getchar();
return 0;
}
It outputs something like this (during several tests, the values did not change much):
mytype: finished in 312 ms
atomic_flag: finished in 1669 ms
So, char instead of atomic_flag, works significantly faster, which can play role in some cases.
But I am far from the idea that std::atomiс_flag was invented in vain.
Please, help me figure it out. At least, can I use char, when I use only Windows, only Visual Studio, and I don't have to think about compatibility?

changes of atomic variables are also visible in other thread.
When using char, the modification might be not visible to other threads (that's why some people wrongly use volatile for synchronization).
Btw, modifying char concurrently without synchronization is UB.

Related

Chrono time counter is inaccurate when machine goes to sleep in the middle of execution?

I have the code sample bellow to measure the execution time of some piece of code:
int main()
{
auto before = chrono::steady_clock::now();
Sleep(30000);
auto after = chrono::steady_clock::now();
int duration = (std::chrono::duration_cast<std::chrono::seconds> ((after - before)).count());
cout << duration << endl;
return 0;
}
Normally it works fine and prints out 30 in the cout statement.
However, during testing I observed that if the computer were to go to sleep in between the auto before = ... statement and the auto after = ... statement (due to inactivity or whatever other reason), then the printed out time also counts the entire time the machine was asleep. This makes perfect sense since we are comparing a timepoint from before the machine going to sleep and one with after.
So my question is how can I make it so that the duration the machine was asleep is not counted in my final duration? Probably will need a ticker that doesn't increment while machine is asleep rather than timepoint measurements but I'm not aware of such a ticker.
This is a Windows specific question. As I understand, MacOS has mach_absolute_time which is exactly what I'm looking for in windows. I'm using MSVC 19.29.30147.0 as my compiler.
After looking around and testing it out, the solution is to use QueryUnbiasedInterruptTime
Running the following code snippet, I manually put my machine to sleep while the program was stuck on the sleep statement and I observed that the second print out consistently outputs 15 seconds regardless of how long I leave my machine in a sleeping state. However, the first print-out that uses GetTickCount64 will include the amount of time the machine was asleep.
int main()
{
ULONGLONG before_query, after_query= 0;
QueryUnbiasedInterruptTime(&before_query);
auto beforeticks = GetTickCount64();
Sleep(15000);
QueryUnbiasedInterruptTime(&after_query);
auto afterticks = GetTickCount64();
cout << "Ticks from gettickcount64 is " << (double (afterticks-beforeticks))/1000 << endl;
cout << "Unbiased time measure is " << double((after_query - before_query)/10000000) << endl;
return 0;
}
You are correct that the easiest way is to use a counter that is incremented each second. This is easily implemented with threads:
#include <thread>
#include <atomic>
#include <chrono>
using namespace std::literals::chrono_literals;
class ellapsed_counter {
std::atomic<bool> finished = false;
std::atomic<unsigned int> value = 0;
std::thread worker { [this] {
while(!finished) {
value++;
std::this_thread::sleep_for(1s);
}
} };
public:
void finish() noexcept {
finished = true;
if(worker.joinable()) worker.join();
}
unsigned int ellapsed() const noexcept { return value; }
};
This will keep incrementing on 1s intervals (probably with some error) as long as the process is running and should cease so when it is sleeping.
You can use it like this:
#include <iostream>
int main(int argc, const char *argv[]) {
ellapsed_counter counter;
unsigned int last = 0, count = 0;
while(count < 10) {
count = counter.ellapsed();
if(count != last) {
last = count;
std::cout << count << std::endl;
}
}
counter.finish();
return 0;
}
This will count from 1 to 10 seconds and exit.

Fast process std::bitset<65536> in parallel

Once there was a deleted question, that I wrote a huge answer to, but this question was deleted and author refused to undelete it.
So posting here a short summary of this question. And immediately answering this question myself, just to share my results.
Question was that if we're given std::bitset<65536> that is processed (by some formula) inside inner loop bit-by-bit, then how can we boost this computation?
Outer loop just called inner loop many times (lets say 50 000 times), and outer loop can't be processed in parallel, because each next iteration depends on results of previous iteration.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
// Process Bits
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
hash = Hash(bits, hash);
}
Code above is just one sample way of processing, it is not a real case. The real case is such that many times we process std::bitset<65536> somehow in such a way that all bits can be processed independently.
The question is how we can process bits in parallel as fast as possible inside inner loop.
One important Note that formula that modifies bits is generic, meaning that we don't know it in advance and can't make SIMD instructions out of it.
But what we know is that all bits can be processed independently. And that we need to parallelize this processing. Also we can't parallelize outer loop as each its iteration depends on results of previous iteration.
Another Note is that std::bitset<65536> is quite small, just 1K of 64-bit words. So it means that directly using pool of std::thread of std::async threads will not work as each thread's work will be just around 50-200 nano-seconds, very tiny time to start and stop threads and send work to them. Even std::mutex takes 75 nano-seconds on my Windows machine (although 20 nano-seconds on Linux), so using std::mutex is also a big overhead.
One may assume that ModifyBit() function above takes around same time for each bit, otherwise there is no understanding on how to schedule balanced parallelization of a loop, only by slicing it into very many tiny tasks hoping that longer tasks will be balanced out by several shorter one.
Implemented quite large and complex solution for your task, but which works very fast. On my 4-core (8 hardware threads) laptop I have 6x times multi-core speedup compared to single threaded version (your version of code).
Main idea of solution below is to implement very fast multi core Thread-Pool for running arbitrary tasks that has small overhead. My implementation can handle up to 1-10 Million tasks per second (depending on CPU speed and cores count).
Regular way of asynchronously starting multiple tasks is through usage of std::async or just by creating std::thread. Both these ways are considerably slower than my own implementation. They can't give throughput of 5 Million tasks per second like my implementation gives. And your code needs millions of tasks per second to be run for good speed. That's why I implemented everything from scratch.
After fast thread pool is implemented now we can slice your 64K bitset into smaller sub-sets and process these sub-sets in parallel. I sliced 64K bitset into 16 equal parts (see BitSize / 16 in code), you can set other amount of parts equal to power of two, but not too many, otherwise thread pool overhead will be too large. Usually it is good to slice into amount of parts that is equal to twice the amount of hardware threads (or 4 times amount of cores).
I implemented several classes in C++ code. AtomicMutex class uses std::atomic_flag in order to implement very fast replacement for mutex that is based on spin-locking approach. This AtomicMutex is used to protect queue of tasks submitted for running on thread pool.
RingBuffer class is based on std::vector and implements simple and fast queue to store any objects. It is implemented using two pointers (head and tail), pointing into vector. When new element is added to queue then tail pointer is advanced to the right, if this pointer reaches end of vector then it wraps around to 0-th position. Same way when element is taken out from queue then head pointer also advances to the right with wrap around. RingBuffer is used to store thread pool tasks.
Queue class is a wrapper around RingBuffer, but with AtomicMutex protection. This spin-lock mutex is used to protect simultaneous adding/taking elements to/from queue from multiple workers' threads.
Pool implements multi-core pool of tasks itself. It creates as many worker threads as there are CPU hardware threads (double amount of cores) minus one. Each worker thread just polls new tasks from queue and executes them immediately. Main thread adds new tasks to queue. Pool also has Wait() capability to wait till all current tasks are finished, this waiting is used as barrier to wait till whole 64K bitset is processed (all sub-parts are processed). Pool accepts any lambdas (function closures) to be run. You can see that 64K bitset sliced into smaller parts is processed by doing pool.Emplace(lambda) and later pool.Wait() is used to wait till all sub-parts are finished. Exceptions from pool workers are collected and reported to user if there is any error. When doing Wait() pool runs tasks also inside main thread not to waste one core for just waiting of tasks to finish.
Timings reported in console are done by std::chrono module.
There is an ability to run both versions - single-threaded (your original version) and multi-threaded using all cores. Switch between single/multi is done by passing MultiThreaded = true template parameter to function ProcessBitset().
Try it online!
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>
class AtomicMutex {
class LockerC;
public:
void lock() {
while (f_.test_and_set(std::memory_order_acquire))
//f_.wait(true, std::memory_order_acquire)
;
}
void unlock() {
f_.clear(std::memory_order_release);
//f_.notify_all();
}
LockerC Locker() { return LockerC(*this); }
private:
class LockerC {
public:
LockerC() = delete;
LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
LockerC(LockerC const & other) = delete;
LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
~LockerC() { if (pmux_) pmux_->unlock(); }
LockerC & operator = (LockerC const & other) = delete;
LockerC & operator = (LockerC && other) = delete;
private:
AtomicMutex * pmux_ = nullptr;
};
std::atomic_flag f_ = ATOMIC_FLAG_INIT;
};
template <typename T>
class RingBuffer {
public:
RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
T & front() { return buf_[first_]; }
T const & front() const { return buf_[first_]; }
T & back() { return buf_[last_]; }
T const & back() const { return buf_[last_]; }
size_t size() const { return size_; }
bool empty() const { return size_ == 0; }
template <typename ... Args>
void emplace(Args && ... args) {
while (size_ >= buf_.size()) {
std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
first_ = 0;
last_ = buf_.size() - 1;
buf_.resize(buf_.size() * 2);
}
++size_;
++last_;
if (last_ >= buf_.size())
last_ = 0;
buf_[last_] = T(std::forward<Args>(args)...);
}
void pop() {
if (size_ == 0)
return;
--size_;
++first_;
if (first_ >= buf_.size())
first_ = 0;
}
private:
std::vector<T> buf_;
size_t first_ = 0, last_ = 0, size_ = 0;
};
template <typename T>
class Queue {
public:
size_t Size() const { return q_.size(); }
bool Empty() const { return q_.size() == 0; }
template <typename ... Args>
void Emplace(Args && ... args) {
auto lock = m_.Locker();
q_.emplace(std::forward<Args>(args)...);
}
T Pop(std::function<void()> const & on_empty = []{},
std::function<void()> const & on_full = []{}) {
while (true) {
if (q_.empty()) {
on_empty();
continue;
}
auto lock = m_.Locker();
if (q_.empty()) {
on_empty();
continue;
}
on_full();
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
}
std::optional<T> TryPop() {
auto lock = m_.Locker();
if (q_.empty())
return std::nullopt;
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
private:
AtomicMutex m_;
RingBuffer<T> q_;
};
class RunInDestr {
public:
RunInDestr(std::function<void()> const & f) : f_(f) {}
~RunInDestr() { f_(); }
private:
std::function<void()> const & f_;
};
class Pool {
public:
struct FinishExc {};
struct Worker {
std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
std::unique_ptr<std::thread> thr;
};
Pool(size_t nthreads = size_t(-1)) {
if (nthreads == size_t(-1))
nthreads = std::thread::hardware_concurrency() - 1;
std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
for (size_t i = 0; i < nthreads; ++i) {
workers_.emplace_back(Worker{});
workers_.back().thr = std::make_unique<std::thread>(
[&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
try {
std::function<void()> f_done = [pdone]{
pdone->store(true, std::memory_order_relaxed);
}, f_empty = [this]{
CheckFinish();
}, f_full = [pdone]{
pdone->store(false, std::memory_order_relaxed);
};
while (true) {
RunInDestr set_done(f_done);
tasks_.Pop(f_empty, f_full)();
}
} catch (...) {
exc_.store(true, std::memory_order_relaxed);
*pexc = std::current_exception();
}
});
}
}
~Pool() {
Wait();
Finish();
}
void CheckExc() {
if (!exc_.load(std::memory_order_relaxed))
return;
Finish();
throw std::runtime_error("Pool: Exception occured!");
}
void Finish() {
finish_ = true;
for (auto & w: workers_)
try {
w.thr->join();
if (*w.pexc)
std::rethrow_exception(*w.pexc);
} catch (FinishExc const &) {}
workers_.clear();
}
template <typename ... Args>
void Emplace(Args && ... args) {
CheckExc();
tasks_.Emplace(std::forward<Args>(args)...);
}
void Wait() {
while (true) {
auto task = tasks_.TryPop();
if (!task)
break;
(*task)();
}
while (true) {
bool done = true;
for (auto & w: workers_)
if (!w.pdone->load(std::memory_order_relaxed)) {
done = false;
break;
}
if (done)
break;
}
CheckExc();
}
private:
void CheckFinish() {
if (finish_)
throw FinishExc{};
}
Queue<std::function<void()>> tasks_;
std::vector<Worker> workers_;
bool finish_ = false;
std::atomic<bool> exc_ = false;
};
template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
std::string const & businessLogicCriteria) {
static size_t constexpr block = BitSize / 16;
for (int j = 0; j < BitSize; j += block) {
auto task = [&bset, j]{
int const hi = std::min(j + block, BitSize);
for (int i = j; i < hi; ++i) {
if (i % 2 == 0)
bset[i] = 0;
else
bset[i] = 1;
}
};
if constexpr(MultiThreaded)
pool.Emplace(std::move(task));
else
task();
}
if constexpr(MultiThreaded)
pool.Wait();
}
static auto const gtb = std::chrono::high_resolution_clock::now();
double Time() {
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
void Compute() {
Pool pool;
std::bitset<65536> bset;
std::string businessLogicCriteria;
int const hi = 50000;
for (int j = 0; j < hi; ++j) {
if ((j & 0x1FFF) == 0 || j + 1 >= hi)
std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3) << Time() << " sec), " << std::flush;
ProcessBitset(pool, bset, businessLogicCriteria);
businessLogicCriteria = "...";
}
}
void TimeMeasure() {
size_t constexpr A = 1 << 16, B = 1 << 5;
{
Pool pool;
auto const tb = Time();
int64_t volatile x = 0;
for (size_t i = 0; i < A; ++i) {
for (size_t j = 0; j < B; ++j)
pool.Emplace([&]{ x = x + 1; });
pool.Wait();
}
std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (A * B) << std::endl;
}
{
auto const tb = Time();
//size_t const nthr = std::thread::hardware_concurrency();
size_t constexpr C = A / 8;
std::vector<std::future<void>> asyncs;
int64_t volatile x = 0;
for (size_t i = 0; i < C; ++i) {
asyncs.clear();
for (size_t j = 0; j < B; ++j)
asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
asyncs.clear();
}
std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (C * B) << std::endl;
}
}
int main() {
try {
TimeMeasure();
Compute();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
} catch (...) {
std::cout << "Unknown Exception!" << std::endl;
return -1;
}
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison below is single-threaded version timings, that is 6x times slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),
You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks, and have multiple threads do each chunk in parallel. You can submit chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop before starting the next with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8]; // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32; // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;
std::barrier barrier(std::size(pool), [&]() {
task_number = 0;
++i;
hash = Hash(bits, hash);
});
for (std::thread& t : pool) {
t = std::thread([&]{
while (i < 50000) {
for (int t; (t = task_number++) < tasks_per_loop;) {
int block_start = t * block_size;
int block_end = std::min(block_start + block_size, bits.size());
for (int j = block_start; j < block_end; ++j) {
bits[j] = ModifyBit(i, j, hash, bits[j]);
}
}
// Wait for other threads to finish and hash
// to be calculated before starting next loop
barrier.arrive_and_wait();
}
});
}
for (std::thread& t : pool) t.join();
(The seemingly easy way of parallelizing the for loop with OpenMP #pragma omp parallel for seemed slower with some testing, perhaps because the tasks were so small)
Here it is against your implementation running similar code: https://godbolt.org/z/en76Kv4nn
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted this is much less general because it can't execute arbitrary std::function<void()> tasks).

If statement passes only when preceded by debug cout line (multi-threading in C)

I created this code to use for solving CPU intensive tasks real-time and potentially as a base for a game engine in the future. For it I created a system where there is an array of ints each thread modifies to signal whether they are done with their current task.
The problem occurs when running it with more than 4 threads. When using 6 threads or more, the "if (threadone_private == threadcount)" stops working UNLESS I add this debug line "cout << threadone_private << endl;" before it.
I cannot comprehend why this debug line makes any difference on whether the if conditional functions as expected, neither why it works without it when using 4 threads or less.
For this code I'm using:
#include <GL/glew.h>
#include <GLFW/glfw3.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>
using namespace std;
Right now this code only counts up to 60 trillion, in asynchronous steps of 3 billion, really fast.
Here are the relevant parts of the code:
int thread_done[6] = { 0,0,0,0,0,0 };
atomic<long long int> testvar1 = 0;
atomic<long long int> testvar2 = 0;
atomic<long long int> testvar3 = 0;
atomic<long long int> testvar4 = 0;
atomic<long long int> testvar5 = 0;
atomic<long long int> testvar6 = 0;
void task1(long long int testvar, int thread_number)
{
int continue_work = 1;
for (; ; ) {
while (continue_work == 1) {
for (int i = 1; i < 3000000001; i++) {
testvar++;
}
thread_done[thread_number] = 1;
if (thread_number==0) {
testvar1 = testvar;
}
if (thread_number == 1) {
testvar2 = testvar;
}
if (thread_number == 2) {
testvar3 = testvar;
}
if (thread_number == 3) {
testvar4 = testvar;
}
if (thread_number == 4) {
testvar5 = testvar;
}
if (thread_number == 5) {
testvar6 = testvar;
}
continue_work = 0;
}
if (thread_done[thread_number] == 0) {
continue_work = 1;
}
}
}
And here is the relevant part of the main thread:
int main() {
long long int testvar = 0;
int threadcount = 6;
int threadone_private = 0;
thread thread_1(task1, testvar, 0);
thread thread_2(task1, testvar, 1);
thread thread_3(task1, testvar, 2);
thread thread_4(task1, testvar, 3);
thread thread_5(task1, testvar, 4);
thread thread_6(task1, testvar, 5);
for (; ; ) {
if (threadcount == 0) {
for (int i = 1; i < 3000001; i++) {
testvar++;
}
cout << testvar << endl;
}
else {
while (testvar < 60000000000000) {
threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
cout << threadone_private << endl;
if (threadone_private == threadcount) {
testvar = testvar1 + testvar2 + testvar3 + testvar4 + testvar5 + testvar6;
cout << testvar << endl;
thread_done[0] = 0;
thread_done[1] = 0;
thread_done[2] = 0;
thread_done[3] = 0;
thread_done[4] = 0;
thread_done[5] = 0;
}
}
}
}
}
I expected that since each worker thread only modifies one int out of the array threadone_private, and since the main thread only ever reads it until all worker threads are waiting, that this if (threadone_private == threadcount) should be bulletproof... Apparently I'm missing something important that goes wrong whenever I change this:
threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
cout << threadone_private << endl;
if (threadone_private == threadcount) {
To this:
threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
//cout << threadone_private << endl;
if (threadone_private == threadcount) {
Disclaimer: Concurrent code is quite complicated and easy to get wrong, so it's generally a good idea to use higher level abstractions. There are a whole lot of details that are easy to get wrong without ever noticing. You should think very carefully about doing such low-level programming if you're not an expert. Sadly C++ lacks good built-in high level concurrent constructs, but there are libraries out there that handle this.
It's unclear what the whole code is supposed to do anyhow to me. As far as I can see whether the code ever stops relies purely on timing - even if you did the synchronization correctly - which is completely non deterministic. Your threads could execute in such a way that thread_done is never all true.
But apart from that there is at least one correctness issue: You're reading and writing to int thread_done[6] = { 0,0,0,0,0,0 }; without synchronization. This is undefined behavior so the compiler can do what it wants.
What probably happens is that the compiler sees that it can cache the value of threadone_private since the thread never writes to it so the value cannot change (legally). The external call to std::cout means it can't be sure that the value isn't change behind its back so it has to read the value each iteration new (also std::cout uses locks which causes synchronization in most implementations which again limits what the compiler can assume).
I cannot see any std::mutex, std::condition_variable or variants of std::lock in your code. Doing multithreading without any of those will never succeed reliably. Because whenever multiple threads modify the same data, you need to make sure only one thread (including your main thread) has access to that data at any given time.
Edit: I noticed you use atomic. I do not have any experience with this, however I know using mutexes works reliably.
Therefore, you need to lock every access (read or write) to that data with a mutex like this:
//somewhere
std::mutex myMutex;
std::condition_variable myCondition;
int workersDone = 0;
/* main thread */
createWorkerThread1();
createWorkerThread2();
{
std::unique_lock<std::mutex> lock(myMutex); //waits until mutex is locked.
while(workersDone != 2) {
myCondition.wait(lock); //the mutex is unlocked while waiting
}
std::cout << "the data is ready now" << std::endl;
} //the lock is destroyed, unlocking the mutex
/* Worker thread */
while(true) {
{
std::unique_lock<std::mutex> lock(myMutex); //waits until mutex is locked
if(read_or_modify_a_piece_of_shared_data() == DATA_FINISHED) {
break; //lock leaves the scope, unlocks the mutex
}
}
prepare_everything_for_the_next_piece_of_shared_data(); //DO NOT access data here
}
//data is processed
++workersDone;
myCondition.notify_one(); //no mutex here. This wakes up the waiting thread
I hope this gives you an idea on how to use mutexes and condition variables to gain thread safety.
Disclaimer: 100% pseudo code ;)

c++ threading, duplicate/missing threads

I'm trying to write a program that concurrently add and removes items from a "storehouse". I have a "Monitor" class that handles the "storehouse" operations:
class Monitor
{
private:
mutex m;
condition_variable cv;
vector<Storage> S;
int counter = 0;
bool busy = false;;
public:
void add(Computer c, int index) {
unique_lock <mutex> lock(m);
if (busy)
cout << "Thread " << index << ": waiting for !busy " << endl;
cv.wait(lock, [&] { return !busy; });
busy = true;
cout << "Thread " << index << ": Request: add " << c.CPUFrequency << endl;
for (int i = 0; i < counter; i++) {
if (S[i].f == c.CPUFrequency) {
S[i].n++;
busy = false; cv.notify_one();
return;
}
}
Storage s;
s.f = c.CPUFrequency;
s.n = 1;
// put the new item in a sorted position
S.push_back(s);
counter++;
busy = false; cv.notify_one();
}
}
The threads are created like this:
void doThreadStuff(vector<Computer> P, vector <Storage> R, Monitor &S)
{
int Pcount = P.size();
vector<thread> myThreads;
myThreads.reserve(Pcount);
for (atomic<size_t> i = 0; i < Pcount; i++)
{
int index = i;
Computer c = P[index];
myThreads.emplace_back([&] { S.add(c, index); });
}
for (size_t i = 0; i < Pcount; i++)
{
myThreads[i].join();
}
// printing results
}
Running the program produced the following results:
I'm familiar with race conditions, but this doesn't look like one to me. My bet would be on something reference related, because in the results we can see that for every "missing thread" (threads 1, 3, 10, 25) I get "duplicate threads" (threads 2, 9, 24, 28).
I have tried to create local variables in functions and loops but it changed nothing.
I have heard about threads sharing memory regions, but my previous work should have produced similar results, so I don't think that's the case here, but feel free to prove me wrong.
I'm using Visual Studio 2017
Here you catch local variables by reference in a loop, they will be destroyed in every turn, causing undefined behavior:
for (atomic<size_t> i = 0; i < Pcount; i++)
{
int index = i;
Computer c = P[index];
myThreads.emplace_back([&] { S.add(c, index); });
}
You should catch index and c by value:
myThreads.emplace_back([&S, index, c] { S.add(c, index); });
Another approach would be to pass S, i and c as arguments instead of capturing them by defining the following non-capturing lambda, th_func:
auto th_func = [](Monitor &S, int index, Computer c){ S.add(c, index); };
This way you have to explicitly wrap the arguments that must be passed by reference to the thread's callable object with std::reference_wrapper by means of the function template std::ref(). In your case, only S:
for (atomic<size_t> i = 0; i < Pcount; i++) {
int index = i;
Computer c = P[index];
myThreads.emplace_back(th_func, std::ref(S), index, c);
}
Failing to wrap with std::reference_wrapper the arguments that must be passed by reference will result in a compile-time error. That is, the following won't compile:
myThreads.emplace_back(th_func, S, index, c); // <-- it should be std::ref(S)
See also this question.

Multi-threaded performance std::string

We are running some code on a project that uses OpenMP and I've run into something strange. I've included parts of some play code that demonstrates what I see.
The tests compare calling a function with a const char* argument with a std::string argument in a multi-threaded loop. The functions essentially do nothing and so have no overhead.
What I do see is a major difference in the time it takes to complete the loops. For the const char* version doing 100,000,000 iterations the code takes 0.075 seconds to complete compared with 5.08 seconds for the std::string version. These tests were done on Ubuntu-10.04-x64 with gcc-4.4.
My question is basically whether this is solely due the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?
Code below and many thanks for your responses.
Compiled with: g++ -Wall -Wextra -O3 -fopenmp string_args.cpp -o string_args
#include <iostream>
#include <map>
#include <string>
#include <stdint.h>
// For wall time
#ifdef _WIN32
#include <time.h>
#else
#include <sys/time.h>
#endif
namespace
{
const int64_t g_max_iter = 100000000;
std::map<const char*, int> g_charIndex = std::map<const char*,int>();
std::map<std::string, int> g_strIndex = std::map<std::string,int>();
class Timer
{
public:
Timer()
{
#ifdef _WIN32
m_start = clock();
#else /* linux & mac */
gettimeofday(&m_start,0);
#endif
}
float elapsed()
{
#ifdef _WIN32
clock_t now = clock();
const float retval = float(now - m_start)/CLOCKS_PER_SEC;
m_start = now;
#else /* linux & mac */
timeval now;
gettimeofday(&now,0);
const float retval = float(now.tv_sec - m_start.tv_sec) + float((now.tv_usec - m_start.tv_usec)/1E6);
m_start = now;
#endif
return retval;
}
private:
// The type of this variable is different depending on the platform
#ifdef _WIN32
clock_t
#else
timeval
#endif
m_start; ///< The starting time (implementation dependent format)
};
}
bool contains_char(const char * id)
{
if( g_charIndex.empty() ) return false;
return (g_charIndex.find(id) != g_charIndex.end());
}
bool contains_str(const std::string & name)
{
if( g_strIndex.empty() ) return false;
return (g_strIndex.find(name) != g_strIndex.end());
}
void do_serial_char()
{
int found(0);
Timer clock;
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_char("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_parallel_char()
{
int found(0);
Timer clock;
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_char("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_serial_str()
{
int found(0);
Timer clock;
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_str("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_parallel_str()
{
int found(0);
Timer clock;
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter ; ++i )
{
if( contains_str("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
int main()
{
std::cout << "Starting single-threaded loop using std::string\n";
do_serial_str();
std::cout << "\nStarting multi-threaded loop using std::string\n";
do_parallel_str();
std::cout << "\nStarting single-threaded loop using char *\n";
do_serial_char();
std::cout << "\nStarting multi-threaded loop using const char*\n";
do_parallel_char();
}
My question is basically whether this is solely due the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?
Yes, it is due to the allocation and copying for std::string on every iteration.
A sufficiently smart compiler could potentially optimize this, but it is unlikely to happen with current optimizers. Instead, you can hoist the string yourself:
void do_parallel_str()
{
int found(0);
Timer clock;
std::string const str = "pos"; // you can even make it static, if desired
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_str(str) )
{
++found;
}
}
//clock.stop(); // Or use something to that affect, so you don't include
// any of the below expression (such as outputing "Loop time: ") in the timing.
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
Does changing:
if( contains_str("pos") )
to:
static const std::string str = "pos";
if( str )
Change things much? My current best guess is that the implicit constructor call for std::string every loop would introduce a fair bit of overhead and optimising it away whilst possible is still a sufficiently hard problem I suspect.
std::string (in your case temporary) requires dynamic allocation, which is a very slow operation, compared to everything else in your loop. There are also old implementations of standard library that did COW, which also slow in multi-threaded environment. Having said that, there is no reason why compiler cannot optimize temporary string creation and optimize away the whole contains_str function call, unless you have some side effects there. Since you didn't provide implementation for that function, it's impossible to say if it could be completely optimized away.