atomic compare and conditionally subtract if less - c++

I manage some memory that is used by concurrent threads, and I have a variable
unsigned int freeBytes
When I request some memory from a task
unsigned int bytesNeeded
I must check if
bytesNeeded<=freeBytes
and, if so, keep the old value of freeBytes and atomically subtract bytesNeeded from freeBytes.
Does the atomic library or the x86 architecture offer such an operation?

Use an atomic compare-and-swap operation. In pseudo-code:
unsigned int n, new_n;
do {
    n = load(freeBytes);
    if (n < bytesNeeded) { return NOT_ENOUGH_MEMORY; }
    new_n = n - bytesNeeded;
} while (!compare_and_swap(&freeBytes, n, new_n));
With real C++ <atomic> variables the actual code looks quite similar:
#include <atomic>

// Global counter for the amount of available bytes
std::atomic<unsigned int> freeBytes;

// Attempt to decrement the counter by bytesNeeded; returns whether
// decrementing succeeded.
bool allocate(unsigned int bytesNeeded)
{
    for (unsigned int n = freeBytes.load(); ; )
    {
        if (n < bytesNeeded) { return false; }
        unsigned int new_n = n - bytesNeeded;
        // On failure, compare_exchange_weak reloads n with the current value
        if (freeBytes.compare_exchange_weak(n, new_n)) { return true; }
    }
}
(Note that the final compare_exchange_weak takes the first argument by reference and updates it with the current value of the atomic variable in the event that the exchange fails.)
By contrast, incrementing the value ("deallocate") can be done with a simple atomic addition (unless you want to check for overflow). This is to some extent symptomatic of lock-free containers: creating something is relatively easy, assuming infinite resources, while removing requires retrying in a loop.
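For completeness, a sketch of that increment side (no overflow check); the name deallocate is illustrative, not from the question, and it reuses the freeBytes counter declared above:

// Return bytesFreed to the pool. A plain atomic addition suffices,
// because freeing never needs to fail.
void deallocate(unsigned int bytesFreed)
{
    freeBytes.fetch_add(bytesFreed);
}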


How to run all threads in sequence as static without using OpenMP for?

I'm new to OpenMP and multi-threading.
I have been given a task to run a method with static, dynamic, and guided scheduling without using the OpenMP for construct, which means I can't use the schedule clauses.
I could create parallel threads with parallel and could assign loop iterations to threads equally, but how do I make it static, dynamic (with a block size of 1000), and guided?
#include <omp.h>

void static_scheduling_function(const int start_count,
                                const int upper_bound,
                                int *results)
{
    int i, tid, numt;
    #pragma omp parallel private(i, tid)
    {
        int from, to;
        tid = omp_get_thread_num();
        numt = omp_get_num_threads();
        from = (upper_bound / numt) * tid;
        to = (upper_bound / numt) * (tid + 1) - 1;
        if (tid == numt - 1)
            to = upper_bound - 1;
        for (i = from; i <= to; i++) // 'to' is an inclusive bound
        {
            // compute one iteration (i)
            int start = i;
            int end = i + 1;
            compute_iterations(start, end, results);
        }
    }
}
======================================
For dynamic I have tried something like this:
void chunk_scheduling_function(const int start_count, const int upper_bound, int *results)
{
    for (int shared_lower_iteration_counter = start_count;
         shared_lower_iteration_counter < upper_bound;)
    {
        #pragma omp parallel shared(shared_lower_iteration_counter)
        {
            int from, to;
            int chunk = 1000;
            // critical is important while incrementing the shared variable
            // which decides the next iteration
            #pragma omp critical
            {
                from = shared_lower_iteration_counter;       // 10, 1010
                to = shared_lower_iteration_counter + chunk;  // 1010, 2010
                shared_lower_iteration_counter += chunk;
            }
            // i < upper_bound prevents threads from executing past the end
            for (int i = from; i < to && i < upper_bound; i++)
            {
                int start = i;
                int end = i + 1;
                compute_iterations(start, end, results);
            }
        }
    }
}
This looks like a university assignment (and a very good one, IMO), so I will not provide the complete solution; instead, I will point out what you should be looking for.
The static scheduler looks okay; nevertheless, it can be improved by taking the chunk size into account as well.
The dynamic and guided schedulers can be implemented by using a variable (let us name it shared_iteration_counter) that marks the next loop iteration to be picked up by a thread. So whenever a thread needs a new task to work on (i.e., a new loop iteration), it queries that variable. In pseudo code it would look like the following:
int thread_current_iteration = shared_iteration_counter++;
while (thread_current_iteration < MAX_SIZE)
{
    // do work
    thread_current_iteration = shared_iteration_counter++;
}
The pseudo code assumes a chunk size of 1 (i.e., shared_iteration_counter++); you will have to adapt it to your use-case. Now, because that variable is shared among the threads, and every thread updates it, you need to ensure mutual exclusion during the updates. Fortunately, OpenMP offers several means to achieve that, for instance #pragma omp critical, explicit locks, and atomic operations. The latter is the better option for your use-case:
#pragma omp atomic
shared_iteration_counter = shared_iteration_counter + 1;
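Note that the pseudo code needs to read the old value and advance the counter in one atomic step; the plain atomic above only covers the update, whereas #pragma omp atomic capture covers both. A minimal sketch of the dynamic case built on it, assuming the question's compute_iterations(start, end, results) helper; this illustrates the idiom and is not a complete solution to the assignment:

#include <omp.h>

void compute_iterations(int start, int end, int *results); // from the question

void dynamic_scheduling_sketch(const int start_count, const int upper_bound,
                               int *results)
{
    const int chunk = 1000;
    int shared_iteration_counter = start_count;
    #pragma omp parallel shared(shared_iteration_counter)
    {
        for (;;)
        {
            int from;
            // Fetch the current value and advance it by one chunk, atomically
            #pragma omp atomic capture
            { from = shared_iteration_counter; shared_iteration_counter += chunk; }

            if (from >= upper_bound)
                break; // no iterations left

            // Clamp the last chunk so we never run past upper_bound
            int to = (from + chunk < upper_bound) ? from + chunk : upper_bound;
            for (int i = from; i < to; i++)
                compute_iterations(i, i + 1, results);
        }
    }
}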
For the guided scheduler:
Similar to dynamic scheduling, but the chunk size starts off large and
decreases to better handle load imbalance between iterations. The
optional chunk parameter specifies the minimum chunk size to use. By
default the chunk size is approximately loop_count/number_of_threads.
In this case, not only do you have to guarantee mutual exclusion on the variable that counts the current loop iteration to be picked up by threads, but you also have to guarantee mutual exclusion on the chunk-size variable, since it changes as well.
Without giving too much away, bear in mind that you may need to consider how to deal with edge cases such as thread_current_iteration = 1000 and chunk_size = 1000 with MAX_SIZE = 1500. In that case thread_current_iteration + chunk_size > MAX_SIZE, yet there are still 500 iterations to be computed.
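For the guided case, the chunk computation could look like the sketch below; the division scheme mirrors the quoted description, and the helper name and minimum-chunk parameter are made up for illustration:

#include <omp.h>

// Chunk shrinks as fewer iterations remain, mimicking schedule(guided).
// Call this inside a parallel region so omp_get_num_threads() reports
// the team size; protect the surrounding counter updates as described.
int next_guided_chunk(int remaining_iterations, int min_chunk)
{
    int chunk = remaining_iterations / omp_get_num_threads();
    return chunk > min_chunk ? chunk : min_chunk;
}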

Spurious underflow in C++ lock-free queue implementation

I'm trying to implement a lock-free queue that uses a linear circular buffer to store data. In contrast to a general-purpose lock-free queue I have the following relaxing conditions:
I know the worst-case number of elements that will ever be stored in the queue. The queue is part of a system that operates on a fixed set of elements. The code will never attempt to store more elements in the queue than there are elements in this fixed set.
No multi-producer/multi-consumer. The queue will either be used in a multi-producer/single-consumer or a single-producer/multi-consumer setting.
Conceptually, the queue is implemented as follows
Standard power-of-two ring buffer. The underlying data-structure is a standard ring-buffer using the power-of-two trick. Read and write indices are only ever incremented. They are clamped to the size of the underlying array when indexing into the array using a simple bitmask. The read pointer is atomically incremented in pop(), the write pointer is atomically incremented in push().
Size variable gates access to pop(). An additional "size" variable tracks the number of elements in the queue. This eliminates the need to perform arithmetic on the read and write indices. The size variable is atomically incremented after the entire write operation has taken place, i.e. the data has been written to the backing storage and the write cursor has been incremented. I'm using a compare-and-swap (CAS) operation to atomically decrement size in pop(), and only continue if size is non-zero. This way pop() should be guaranteed to return valid data.
My queue implementation is as follows. Note the debug code that halts execution whenever pop() attempts to read past the memory that has previously been written by push(). This should never happen, since ‒ at least conceptually ‒ pop() may only proceed if there are elements on the queue (there should be no underflows).
#include <atomic>
#include <cstdint>
#include <csignal> // XXX for debugging

template <typename T>
class Queue {
private:
    uint32_t m_data_size;            // Number of elements allocated
    std::atomic<T> *m_data;          // Queue data, size is power of two
    uint32_t m_mask;                 // Bitwise AND mask for m_rd_ptr and m_wr_ptr
    std::atomic<uint32_t> m_rd_ptr;  // Circular buffer read pointer
    std::atomic<uint32_t> m_wr_ptr;  // Circular buffer write pointer
    std::atomic<uint32_t> m_size;    // Number of elements in the queue

    static uint32_t upper_power_of_two(uint32_t v) {
        v--; // https://graphics.stanford.edu/~seander/bithacks.html
        v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
        v++;
        return v;
    }

public:
    struct Optional { // Minimal replacement for std::optional
        bool good;
        T value;
        Optional() : good(false) {}
        Optional(T value) : good(true), value(std::move(value)) {}
        explicit operator bool() const { return good; }
    };

    Queue(uint32_t max_size)
        : // XXX Allocate 1 MiB of additional memory for debugging purposes
          m_data_size(upper_power_of_two(1024 * 1024 + max_size)),
          m_data(new std::atomic<T>[m_data_size]),
          m_mask(m_data_size - 1),
          m_rd_ptr(0),
          m_wr_ptr(0),
          m_size(0) {
        // XXX Debug code begin
        // Fill the memory with a marker so we can detect invalid reads
        for (uint32_t i = 0; i < m_data_size; i++) {
            m_data[i] = 0xDEADBEAF;
        }
        // XXX Debug code end
    }

    ~Queue() { delete[] m_data; }

    Optional pop() {
        // Atomically decrement the size variable
        uint32_t size = m_size.load();
        while (size != 0 && !m_size.compare_exchange_weak(size, size - 1)) {
        }

        // The queue is empty, abort
        if (size == 0) {
            return Optional();
        }

        // Read the actual element, atomically increase the read pointer
        T res = m_data[(m_rd_ptr++) & m_mask].load();

        // XXX Debug code begin
        if (res == T(0xDEADBEAF)) {
            std::raise(SIGTRAP);
        }
        // XXX Debug code end
        return res;
    }

    void push(T t) {
        m_data[(m_wr_ptr++) & m_mask].store(t);
        m_size++;
    }

    bool empty() const { return m_size == 0; }
};
However, underflows do occur and can easily be triggered in a multi-threaded stress-test. In this particular test I maintain two queues q1 and q2. In the main thread I feed a fixed number of elements into q1. Two worker threads read from q1 and push onto q2 in a tight loop. The main thread reads data from q2 and feeds it back to q1.
This works fine if there is only one worker-thread (single-producer/single-consumer) or as long as all worker-threads are on the same CPU as the main thread. However, it fails as soon as there are two worker threads that are explicitly scheduled onto a different CPU than the main thread.
The following code implements this test
#include <pthread.h>

#include <thread>
#include <vector>

static void queue_stress_test_main(std::atomic<uint32_t> &done_count,
                                   Queue<int> &queue_rd, Queue<int> &queue_wr) {
    for (size_t i = 0; i < (1UL << 24); i++) {
        auto res = queue_rd.pop();
        if (res) {
            queue_wr.push(res.value);
        }
    }
    done_count++;
}

static void set_thread_affinity(pthread_t thread, int cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    if (pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset) != 0) {
        throw "Error while calling pthread_setaffinity_np";
    }
}

int main() {
    static constexpr uint32_t n_threads{2U}; // Number of worker threads
    //static constexpr uint32_t n_threads{1U}; // < Works fine
    static constexpr uint32_t max_size{16U}; // Elements in the queue
    std::atomic<uint32_t> done_count{0};     // Number of finished threads
    Queue<int> queue1(max_size), queue2(max_size);

    // Launch n_threads threads, make sure the main thread and the two worker
    // threads are on different CPUs.
    std::vector<std::thread> threads;
    for (uint32_t i = 0; i < n_threads; i++) {
        threads.emplace_back(queue_stress_test_main, std::ref(done_count),
                             std::ref(queue1), std::ref(queue2));
        set_thread_affinity(threads.back().native_handle(), 0);
    }
    set_thread_affinity(pthread_self(), 1);
    //set_thread_affinity(pthread_self(), 0); // < Works fine

    // Pump data from queue2 into queue1
    uint32_t elems_written = 0;
    while (done_count < n_threads || !queue2.empty()) {
        // Initially fill queue1 with all values from 0..max_size-1
        if (elems_written < max_size) {
            queue1.push(elems_written++);
        }
        // Read elements from queue2 and put them into queue1
        auto res = queue2.pop();
        if (res) {
            queue1.push(res.value);
        }
    }

    // Wait for all threads to finish
    for (uint32_t i = 0; i < n_threads; i++) {
        threads[i].join();
    }
}
Most of the time this program triggers the trap in the queue code, which means that pop() attempts to read memory that has never been touched by push() ‒ although pop() should only succeed if push() has been called at least as often as pop().
You can compile and run the above program with GCC/clang on Linux using
c++ -std=c++11 queue.cpp -o queue -lpthread && ./queue
Either just concatenate the above two code blocks or download the complete program here.
Note that I'm a complete novice when it comes to lock-free data structures. I'm perfectly aware that there are plenty of battle-tested lock-free queue implementations for C++. However, I simply can't figure out why the above code does not work as intended.
You have two bugs, one of which can cause the failure you observe.
Let's look at your push code, except we'll allow only one operation per statement:
void push(T t)
{
    auto const claimed_index = m_wr_ptr++;              /* 1 */
    auto const claimed_offset = claimed_index & m_mask; /* 2 */
    auto &claimed_data = m_data[claimed_offset];        /* 3 */
    claimed_data.store(t);                              /* 4 */
    m_size++;                                           /* 5 */
}
Now, for a queue with two producers, there is a window of vulnerability to a race condition between operations 1 and 4:
Before:
    m_rd_ptr == 1
    m_wr_ptr == 1
    m_size   == 0

Producer A:
    /* 1 */ claimed_index = 1; m_wr_ptr = 2;
    /* 2 */ claimed_offset = 1;
    --- scheduler puts Producer A to sleep here ---

Producer B:
    /* 1 */ claimed_index = 2; m_wr_ptr = 3;
    /* 2 */ claimed_offset = 2;
    /* 3 */ claimed_data = m_data[2];
    /* 4 */ claimed_data.store(t);
    /* 5 */ m_size = 1;

After:
    m_size    == 1
    m_rd_ptr  == 1
    m_wr_ptr  == 3
    m_data[1] == 0xDEADBEAF
    m_data[2] == value_produced_by_B
The consumer now runs, sees m_size > 0, and reads from m_data[1] while increasing m_rd_ptr from 1 to 2. But m_data[1] hasn't been written by Producer A yet, and Producer B wrote to m_data[2].
The second bug is the complementary case in pop() when a consumer thread is interrupted between the m_rd_ptr++ action and the .load() call. It can result in reading values out of order, potentially so far out of order that the queue has completely circled and overwritten the original value.
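Decomposed the same way, pop() shows the matching window (a sketch mirroring the push() breakdown above; the size CAS loop is elided because it does not protect this window):

Optional pop()
{
    /* size CAS loop elided */
    auto const claimed_index = m_rd_ptr++;              /* 1 */
    auto const claimed_offset = claimed_index & m_mask; /* 2 */
    auto &claimed_data = m_data[claimed_offset];        /* 3 */
    T res = claimed_data.load();                        /* 4 */
    return res;
}

A consumer interrupted between operations 1 and 4 still owns slot claimed_offset, but other consumers move past it, and producers can lap the ring buffer and overwrite the slot before the sleeping consumer finally loads it.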
Just because two operations in a single source statement are atomic does not make the entire statement atomic.

bit test and set (BTS) on a tbb atomic variable

I want to do bitTestAndSet on a tbb atomic variable.
atomic.h from tbb does not seem to have any bit operations.
If I treat the tbb atomic variable as a normal pointer and do __sync_or_and_fetch, the gcc compiler doesn't allow that.
Is there a workaround for this?
Related question:
assembly intrinsic for bit test and set (BTS)
A compare_and_swap loop can be used, like this:
// Atomically perform i |= j. Returns the previous value of i.
int bitTestAndSet(tbb::atomic<int> &i, int j) {
    int o = i;              // atomic read (o = "old value")
    while ((o | j) != o) {  // loop exits if another thread sets the bits
        int k = o;
        o = i.compare_and_swap(k | j, k);
        if (o == k) break;  // successful swap
    }
    return o;
}
Note that if the while condition succeeds on the first try, there will be only an acquire fence, not a full fence. Whether that matters depends on context.
If there is a risk of high contention, then some sort of backoff scheme should be used in the loop. TBB uses a class atomic_backoff for contention management internally, but it's not currently part of the public TBB API.
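For illustration, a minimal exponential-backoff sketch (an assumption of what such a helper might look like on x86, not TBB's internal atomic_backoff) that a retry loop could call between CAS attempts:

#include <immintrin.h> // _mm_pause

// Spin with a geometrically growing number of pause instructions to
// reduce cache-line contention between CAS retries.
class Backoff {
    int m_count = 1;
public:
    void pause() {
        for (int i = 0; i < m_count; ++i) _mm_pause();
        if (m_count < 16) m_count *= 2; // cap the spin length
    }
};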
There is a second way, if portability is not a concern and you are willing to exploit the undocumented fact that the layout of a tbb::atomic<T> and a T are the same on x86 platforms. In that case, just operate on the tbb::atomic using assembly code. The program below demonstrates this technique:
#include <tbb/tbb.h>
#include <cstdio>

// Set bit 'bit' in 'array' and return the previous value of that bit.
// BTS leaves the bit's old value in the carry flag; CMOVC then copies
// x (1) into y only if the carry flag was set. The lock prefix makes
// the read-modify-write atomic with respect to other processors.
inline int SetBit(int array[], int bit) {
    int x = 1, y = 0;
    asm("lock bts %2,%0\n\tcmovc %3,%1"
        : "+m" (*array), "+r" (y)
        : "r" (bit), "r" (x));
    return y;
}

tbb::atomic<int> Flags;
volatile int Result;

int main() {
    for (int i = 0; i < 16; ++i) {
        int k = i * i % 32;
        std::printf("bit at %2d was %d. Flags=%8x\n",
                    k, SetBit((int *)&Flags, k), +Flags);
    }
}

prevent std::atomic from overflowing

I have an atomic counter (std::atomic<uint32_t> count) which deals out sequentially incrementing values to multiple threads.
uint32_t my_val = ++count;
Before I get my_val I want to ensure that the increment won't overflow (i.e., wrap back to 0):
if (count == std::numeric_limits<uint32_t>::max())
    throw std::runtime_error("count overflow");
I'm thinking this is a naive check because if the check is performed by two threads before either increments the counter, the second thread to increment will get 0 back
if (count == std::numeric_limits<uint32_t>::max()) // if 2 threads execute this
    throw std::runtime_error("count overflow");
uint32_t my_val = ++count; // before either gets here - possible overflow
As such I guess I need to use a CAS operation to make sure that when I increment my counter, I am indeed preventing a possible overflow.
So my questions are:
Is my implementation correct?
Is it as efficient as it can be (specifically do I need to check against max twice)?
My code (with working exemplar) follows:
#include <iostream>
#include <atomic>
#include <cstdint>   // uint16_t
#include <cstdlib>   // exit
#include <limits>
#include <stdexcept>
#include <thread>

std::atomic<uint16_t> count;

uint16_t get_val() // called by multiple threads
{
    uint16_t my_val;
    do
    {
        my_val = count;
        // make sure I get the next value
        if (count.compare_exchange_strong(my_val, my_val + 1))
        {
            // if I got the next value, make sure we don't overflow
            if (my_val == std::numeric_limits<uint16_t>::max())
            {
                count = std::numeric_limits<uint16_t>::max() - 1;
                throw std::runtime_error("count overflow");
            }
            break;
        }
        // if I didn't, then check if there are still numbers available
        if (my_val == std::numeric_limits<uint16_t>::max())
        {
            count = std::numeric_limits<uint16_t>::max() - 1;
            throw std::runtime_error("count overflow");
        }
        // there are still numbers available, so try again
    } while (1);
    return my_val + 1;
}

void run()
try
{
    while (1)
    {
        if (get_val() == 0)
            exit(1);
    }
}
catch (const std::runtime_error &e)
{
    // overflow
}

int main()
{
    while (1)
    {
        count = 1;
        std::thread a(run);
        std::thread b(run);
        std::thread c(run);
        std::thread d(run);
        a.join();
        b.join();
        c.join();
        d.join();
        std::cout << ".";
    }
    return 0;
}
Yes, you need to use a CAS operation.
#include <atomic>
#include <cstdint>
#include <limits>
#include <stdexcept>

std::atomic<uint16_t> g_count;

uint16_t get_next() {
    uint16_t cur_val = g_count; // 1
    uint16_t new_val;
    do {
        if (cur_val == std::numeric_limits<uint16_t>::max()) { // 2
            throw std::runtime_error("count overflow");
        }
        new_val = cur_val + 1; // 3
    } while (!std::atomic_compare_exchange_weak(&g_count, &cur_val, new_val)); // 4
    return new_val;
}
The idea is the following: once g_count == std::numeric_limits<uint16_t>::max(), get_next() function will always throw an exception.
Steps:
Get the current value of the counter.
If it is the maximum value, throw an exception (no numbers are available anymore).
Compute the new value as an increment of the current value.
Try to atomically set the new value. If that fails (another thread changed the counter first), the CAS call refreshes cur_val, so try again.
If efficiency is a big concern then I'd suggest not being so strict on the check. I'm guessing that under normal use overflow won't be an issue, but do you really need the full 65K range (your example uses uint16)?
It would be easier if you assume a maximum on the number of threads you have running. This is a reasonable limit, since no program has an unlimited amount of concurrency. So if you have N threads you can simply reduce your overflow limit to 65K - N. To check whether you would overflow you don't need a CAS:
uint16_t current = count.load(std::memory_order_relaxed);
if (current >= (std::numeric_limits<uint16_t>::max() - num_threads - 1))
    throw std::runtime_error("count overflow");
count.fetch_add(1, std::memory_order_relaxed);
This creates a soft-overflow condition. If two threads come here at once both of them will potentially pass, but that's okay since the count variable itself never overflows. Any future arrivals at this point will logically overflow (until count is reduced again).
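Put together, a compilable sketch of this scheme (num_threads as a fixed constant is an assumption for illustration):

#include <atomic>
#include <cstdint>
#include <limits>
#include <stdexcept>

constexpr uint16_t num_threads = 4; // assumed upper bound on concurrent callers

std::atomic<uint16_t> count;

uint16_t get_val() {
    uint16_t current = count.load(std::memory_order_relaxed);
    // Soft overflow: up to num_threads callers may pass this check at
    // once, but count itself still cannot wrap.
    if (current >= std::numeric_limits<uint16_t>::max() - num_threads - 1)
        throw std::runtime_error("count overflow");
    return count.fetch_add(1, std::memory_order_relaxed) + 1;
}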
It seems to me that there's still a race condition where count will be set to 0 momentarily such that another thread will see the 0 value.
Assume that count is at std::numeric_limits<uint16_t>::max() and two threads try to get the incremented value. At the moment that Thread 1 performs the count.compare_exchange_strong(my_val, my_val + 1), count wraps to 0, and that's what Thread 2 will see if it happens to call and complete get_val() before Thread 1 has a chance to restore count to max() - 1.

Finding the minimum of concurrent counters

I have several counters that keep increasing (never decreasing) by concurrent threads. Each thread is responsible for one counter. Occasionally, one of the threads needs to find the minimum of all counters. I do this with a simple iteration over all counters, selecting the minimum. I need to ensure that this minimum is no greater than any of the counters. Currently, I don't use any concurrency mechanisms. Is there any chance that I get a wrong answer (i.e., end up with a minimum that is greater than one of the counters)? The code works most of the time, but occasionally (less than 0.1% of the time) it breaks by finding a minimum that is larger than one of the counters. The C++ code looks like this:
unsigned long int counters[NUM_COUNTERS];

void *WorkerThread(void *arg) {
    int i_counter = *((int *) arg);
    // Do some work
    counters[i_counter]++;
    occasionally {
        unsigned long int min = counters[i_counter];
        for (int i = 0; i < NUM_COUNTERS; i++) {
            if (counters[i] < min)
                min = counters[i];
        }
        // The minimum is now stored in min
    }
}
Update:
After employing the fix suggested by @JerryCoffin, the code looks like this:
unsigned long int counters[NUM_COUNTERS];

void *WorkerThread(void *arg) {
    int i_counter = *((int *) arg);
    // Do some work
    counters[i_counter]++;
    occasionally {
        unsigned long int min = counters[i_counter];
        for (int i = 0; i < NUM_COUNTERS; i++) {
            unsigned long int counter_i = counters[i];
            if (counter_i < min)
                min = counter_i;
        }
        // The minimum is now stored in min
    }
}
Yes, it's broken -- it has a race condition.
In other words, when you pick out the smallest value, it's undoubtedly smaller than any other you look at -- but if another thread increments it after you do the comparison, it could end up larger than some other counter by the time you try to use it.
if (counters[i] < min)
    // could change between the comparison above and the assignment below
    min = counters[i];
The relatively short interval between comparing and saving the value explains why the answer you're getting is right most of the time -- it'll only go wrong if there's a context switch immediately after the comparison, and the other thread increments that counter often enough before control switches back that it's no longer the smallest counter by the time it gets saved.
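Beyond that staleness issue, note that plain unsigned long counters written by one thread and read by others are formally a data race in C++. A sketch of the same scan with std::atomic counters and relaxed ordering (an addition beyond this answer's point; the NUM_COUNTERS value is assumed, since the question leaves it open):

#include <atomic>

constexpr int NUM_COUNTERS = 8; // assumed value; the question leaves it open

std::atomic<unsigned long> counters[NUM_COUNTERS];

unsigned long current_minimum(int i_counter) {
    unsigned long min = counters[i_counter].load(std::memory_order_relaxed);
    for (int i = 0; i < NUM_COUNTERS; i++) {
        // Load each counter exactly once, as in the fixed version above
        unsigned long counter_i = counters[i].load(std::memory_order_relaxed);
        if (counter_i < min)
            min = counter_i;
    }
    return min;
}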