I'm trying to implement a readers writers solution in C++ with std::thread.
I create several reader threads that run in an infinite loop, pausing for some time between read accesses. I tried to recreate the algorithm presented in Tanenbaum's Operating Systems book:
rc_mtx.lock(); // lock for incrementing readcount
read_count += 1;
if (read_count == 1) // if this is the first reader
    db_mtx.lock(); // then make a lock on the database
rc_mtx.unlock();

cell_value = data_base[cell_number]; // read data from database

rc_mtx.lock();
read_count -= 1; // when finished 'sign this reader off'
if (read_count == 0) // if this was the last one
    db_mtx.unlock(); // release the lock on the database mutex
rc_mtx.unlock();
Of course, the problem is that the thread that happens to satisfy the condition of being the last reader (and therefore has to do the unlock) never acquired db_mtx itself.
I tried to spawn another 'mother' thread for the readers that would take care of acquiring and releasing the mutex, but I got lost during the process.
If there is an elegant way to overcome this issue (a thread trying to release a mutex it never acquired), I'd love to hear it!
You can use a condition variable to pause writers if readers are in progress, instead of using a separate lock.
// --- read code
rw_mtx.lock(); // will block if there is a write in progress
read_count += 1; // announce intention to read
rw_mtx.unlock();
cell_value = data_base[cell_number];
rw_mtx.lock();
read_count -= 1; // this reader is done
if (read_count == 0) rw_write_q.notify_one();
rw_mtx.unlock();
// --- write code
std::unique_lock<std::mutex> rw_lock(rw_mtx);
write_count += 1;
rw_write_q.wait(rw_lock, []{return read_count == 0;});
data_base[cell_number] = cell_value;
write_count -= 1;
if (write_count > 0) rw_write_q.notify_one();
This implementation has a fairness issue, because new readers can cut in front of waiting writers. A completely fair implementation would probably involve a proper queue that would allow new readers to wait behind waiting writers, and new writers to wait behind any waiting readers.
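For reference, here is the same scheme as a self-contained sketch (the globals and the function wrappers are my additions so the fragment compiles; the logic is unchanged):

#include <condition_variable>
#include <mutex>

std::mutex rw_mtx;
std::condition_variable rw_write_q;
int read_count = 0;
int write_count = 0;

void reader()
{
    {
        std::lock_guard<std::mutex> lk(rw_mtx); // blocks while a writer holds rw_mtx
        read_count += 1;                        // announce intention to read
    }
    // ... read from the shared data here, without holding any lock ...
    {
        std::lock_guard<std::mutex> lk(rw_mtx);
        read_count -= 1;                        // this reader is done
        if (read_count == 0) rw_write_q.notify_one(); // last reader wakes a writer
    }
}

void writer()
{
    std::unique_lock<std::mutex> rw_lock(rw_mtx);
    write_count += 1;
    rw_write_q.wait(rw_lock, []{ return read_count == 0; }); // sleep until no readers
    // ... write to the shared data here, still holding rw_mtx ...
    write_count -= 1;
    if (write_count > 0) rw_write_q.notify_one(); // hand off to the next waiting writer
}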
In C++14, you can use a shared_timed_mutex instead of mutex to achieve multiple readers/single writer access.
// --- read code
std::shared_lock<std::shared_timed_mutex> read_lock(rw_mtx);
cell_value = data_base[cell_number];
// --- write code
std::unique_lock<std::shared_timed_mutex> write_lock(rw_mtx);
data_base[cell_number] = cell_value;
There will likely be a plain shared_mutex implementation in the next C++ standard (probably C++17).
3 Consumers 2 producers. Reading and writing to one buffer.
Producer A pushes 1 element at a time into the buffer (length N) and producer B pushes 2 elements at a time. No busy waiting. I can't use System V semaphores.
Sample code for producer A:
void producerA(){
    while(1){
        sem_wait(full);
        sem_wait(mutex);
        Data * newData = (Data*) malloc(sizeof(Data));
        newData->val = generateRandomletter();
        newData->A = false;
        newData->B = false;
        newData->C = false;
        *((Data*) mem+tail) = *newData;
        free(newData); // the element was copied into the buffer, so release it
        ++elements;
        tail = (tail + 1) % N;
        sem_post(mutex);
        sem_post(empty);
    }
}
The consumers look similar, except that they read or consume instead, but that's irrelevant here.
I am having a lot of trouble with Producer B. Obviously I can't do things like
sem_wait(full); sem_wait(full);
I also tried having a separate semaphore for producer B that would be raised the first time there are 2 or more free slots in the buffer, but that didn't work out because I still needed to correctly decrement and increment the full and empty semaphores.
In what ways can I solve this problem?
https://gist.github.com/RobPiwowarek/65cb9896c109699c70217ba014b9ed20
That is the solution to the entire problem I had.
TLDR:
The easiest synchronisation I could come up with uses the semaphores full and empty to track the number of elements pushed to the buffer. However, that kind of solution does not work with POSIX semaphores if one producer creates 2 elements at a time.
My solution is a different concept.
The outline of a process comes down to:
while(1){
    down(mutex);
    size = get size
    if (condition related to size based on what process this is)
    {
        do your job;
        updateSize(int diff); // this can up() specific semaphores
                              // based on size;
                              // each process has its own semaphore
        up(mutex);
    }
    else
    {
        up(mutex);
        down(process's own semaphore);
        continue;
    }
}
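As a concrete illustration of this outline (not the gist itself), here is a hedged sketch of producer B, the two-element producer; semB, elements, pushRandomLetter() and updateSize() are assumed names for the shared state and helpers:

#include <semaphore.h>

// assumed shared state, mirroring the question's setup
extern sem_t *mutex;       // binary semaphore guarding the buffer
extern sem_t *semB;        // producer B's own semaphore
extern int elements, N;    // current fill level and buffer capacity
void pushRandomLetter();   // hypothetical: push one element at tail, ++elements
void updateSize(int diff); // hypothetical: sem_post() each process whose size
                           // condition now holds (this is what wakes B again)

void producerB(){
    while(1){
        sem_wait(mutex);              // exclusive access to the buffer state
        if (elements <= N - 2) {      // is there room for BOTH of B's elements?
            pushRandomLetter();
            pushRandomLetter();
            updateSize(+2);
            sem_post(mutex);
        } else {
            sem_post(mutex);          // give the buffer back...
            sem_wait(semB);           // ...and sleep until updateSize() wakes us
        }
    }
}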
I hope this will be useful to someone in the future.
I have a big file and I want to read and also process all the even lines of the file with multiple threads.
One suggestion is to read the whole file and split it into multiple files (as many as there are threads), then let every thread process one file. Since this reads the whole file, writes it out again, and reads the pieces back (3x the I/O), it seems slow, and I think there must be better scenarios.
I myself thought this could be a better scenario:
One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:
One thread reads the main file, running function func1, and puts each even line into a buffer line1Buffer with a maximum size MAX_BUFFER_SIZE, while the other threads pop their data from the buffer and process it, running function func2. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool keepGoing = true; // set to false to end threads 2..N ("continue" is a C++ keyword and can't be used as a name)
string file = "reads.fq";
Function func1 : (thread 1)
void func1(){
    ifstream ifstr(file.c_str());
    string ReadSeq;
    for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
        getline(ifstr, ReadSeq); // skip the odd line
        getline(ifstr, ReadSeq); // keep the even line
        while (line1Buffer.size() == MAX_BUFFER_SIZE)
            ; // busy-wait while the buffer is full
        line1Buffer.push_back(ReadSeq);
    }
    keepGoing = false;
    return;
}
And function func2 : (other threads)
void func2(){
    string ReadSeq;
    while (keepGoing) {
        if (line1Buffer.size() > 0) {
            ReadSeq = line1Buffer.back();
            line1Buffer.pop_back();
            // do the processing....
        }
    }
}
About the speed:
If the reading part is slower, the total time will equal the time to read the file once (the buffer will hold at most one item at a time, so only one other thread can keep up with thread 1). If the processing part is slower, the total time will equal the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file, writing it into multiple files with one thread, and then reading those files back with multiple threads to process them...
And so there are two questions:
1- How do I launch the threads so that thread 1 runs func1 and the others run func2?
2- Is there any faster scenario?
3- [Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously we know M + N == numberOfThreads.
Edit: the 3rd question is not right, as multiple threads can't help in reading a single file.
Thanks All
Another approach could be interleaved threads.
Reading is done by every thread, but only one at a time. Because of the waiting in the very first iteration, the threads will become interleaved.
But this only scales if work() is the bottleneck (otherwise any non-parallel execution would be better):
Thread:
while (!end) {
    // should be fair!
    lock();
    read();
    unlock();
    work();
}
Basic example (you should probably add some error handling):
void thread_exec(ifstream* file, std::mutex* mutex, int* global_line_counter) {
    std::string line;
    std::vector<std::string> data;
    int i;
    do {
        i = 0;
        // only 1 concurrent reader
        mutex->lock();
        // try to read the maximum number of lines
        while (i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file, line)) {
            // we only want to process the even lines
            if (*global_line_counter % 2 == 0) {
                data.push_back(line);
                i++;
            }
            (*global_line_counter)++;
        }
        mutex->unlock();
        // execute work for every line
        for (size_t j = 0; j < data.size(); j++) {
            work(data[j]);
        }
        // free old data
        data.clear();
    // repeat until EOF is reached (a short batch means the file is exhausted)
    } while (i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
    // counter for checking if a line is even
    int global_line_counter = 0;
    // open file
    ifstream ifstr(file.c_str());
    // mutex for synchronization
    // maybe a fair lock would be a better solution
    std::mutex mutex;
    // create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
    std::vector<std::thread> threads(NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
    }
    // wait until all threads have finished
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }
}
What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than trying to jump around the file, so having multiple threads trying to read your file will almost certainly reduce the overall speed as it increases disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.
Assume we have #p threads. Two scenarios were mentioned in the post and the answers:
1) Reading with one thread and processing with the others: #p-1 threads process while only one thread reads. Let jobTime be the time for the full operation and pTime(n) the processing time with n threads.
The worst case occurs when reading is much slower than processing, giving jobTime = pTime(1) + readTime; the best case is when processing is slower than reading, in which case jobTime = pTime(#p-1) + readTime.
2) Read and process with all #p threads. In this scenario every thread performs two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE, which is sequential: no two threads can read at the same time. The second step is processing the read data, which can run in parallel. This way, in the worst case jobTime is pTime(1) + readTime as before (but*), while the best case is pTime(#p) + readTime, which beats the previous scenario.
*: In the 2nd approach's worst case, even though reading dominates, you can find an optimized MAX_BUFFER_SIZE for which (in the worst case) some reading by one thread overlaps with some processing by another. With this optimized MAX_BUFFER_SIZE the jobTime will be less than pTime(1) + readTime and can approach readTime.
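To put rough, invented numbers on this: with #p = 4 threads, readTime = 8 s and pTime(n) = 12/n s, scenario 1 at best takes pTime(3) + readTime = 4 + 8 = 12 s, while scenario 2 at best takes pTime(4) + readTime = 3 + 8 = 11 s; with the overlap from a well-chosen MAX_BUFFER_SIZE the total can fall further toward the 8 s readTime floor.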
First off, reading a file is a slow operation so unless you are doing some superheavy processing, the file reading will be limiting.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push in front and pop out back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
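A hedged sketch of such a bounded queue (the class name and the capacity are mine; push in front, pop out back, one mutex, two condition variables):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

class BoundedQueue {
    std::deque<std::string> q;
    std::mutex m;
    std::condition_variable not_full, not_empty;
    const size_t capacity = 100; // tune to taste
public:
    void push(std::string s) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [this]{ return q.size() < capacity; }); // block while full
        q.push_front(std::move(s));
        not_empty.notify_one();
    }
    std::string pop() {
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [this]{ return !q.empty(); }); // block while empty
        std::string s = std::move(q.back());
        q.pop_back();
        not_full.notify_one();
        return s;
    }
};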
I'm working on a program that simulates a gas station. Each car at the station is its own thread. Each car must loop through a single bitmask to check if a pump is open, and if it is, update the bitmask, fill up, and notify other cars that the pump is now open. My current code works, but there are some issues with load balancing. Ideally all the pumps are used the same amount and all cars get equal fill-ups.
EDIT: My program basically takes a number of cars, pumps, and a length of time to run the test for. During that time, cars will check for an open pump by constantly calling this function.
int Station::fillUp()
{
    // loop through the pumps using the bitmask to check if they are available
    for (int i = 0; i < pumpsInStation; i++)
    {
        // Check bitmask to see if pump is open
        stationMutex->lock();
        if ((freeMask & (1 << i)) == 0)
        {
            // Turning the bit on
            freeMask |= (1 << i);
            stationMutex->unlock();
            // Sleeps thread for 30ms and increments counts
            pumps[i].fillTankUp();
            // Turning the bit back off
            stationMutex->lock();
            freeMask &= ~(1 << i);
            stationCondition->notify_one();
            stationMutex->unlock();
            // Sleep long enough for all cars to have a chance to fill up first.
            this_thread::sleep_for(std::chrono::milliseconds((((carsInStation - 1) * 30) / pumpsInStation) - 30));
            return 1;
        }
        stationMutex->unlock();
    }
    // If no pumps are available, wait until one becomes available.
    stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
    return -1;
}
I feel the issue has something to do with locking the bitmask when I read it. Do I need to have some sort of mutex or lock around the if check?
It looks like every car checks the availability of pump #0 first, and if that pump is busy it then checks pump #1, and so on. Given that, it seems expected to me that pump #0 would service the most cars, followed by pump #1 serving the second-most cars, all the way down to pump #(pumpsInStation-1) which only ever gets used in the (relatively rare) situation where all of the pumps are in use simultaneously at the time a new car pulls in.
If you'd like to get better load-balancing, you should probably have each car choose a different random ordering to iterate over the pumps, rather than having them all check the pumps' availability in the same order.
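A hedged sketch of that suggestion; makeProbeOrder() is a hypothetical helper that gives each car its own random probe order over the pumps:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// hypothetical helper: a per-car random probe order over the pumps
std::vector<int> makeProbeOrder(int pumpsInStation) {
    std::vector<int> order(pumpsInStation);
    std::iota(order.begin(), order.end(), 0); // 0, 1, ..., pumpsInStation-1
    static thread_local std::mt19937 rng(std::random_device{}());
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// in fillUp(), iterate "for (int i : makeProbeOrder(pumpsInStation))"
// instead of "for (int i = 0; i < pumpsInStation; i++)".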
Normally I wouldn't suggest refactoring as it's kind of rude and doesn't go straight to the answer, but here I think it would help you a bit to break your logic into three parts, like so, to better show where the contention lies:
int Station::acquirePump()
{
    // loop through the pumps using the bitmask to check if they are available
    ScopedLocker locker(&stationMutex);
    for (int i = 0; i < pumpsInStation; i++)
    {
        // Check bitmask to see if pump is open
        if ((freeMask & (1 << i)) == 0)
        {
            // Turning the bit on
            freeMask |= (1 << i);
            return i;
        }
    }
    return -1;
}

void Station::releasePump(int n)
{
    ScopedLocker locker(&stationMutex);
    freeMask &= ~(1 << n);
    stationCondition->notify_one();
}

bool Station::fillUp()
{
    // If a pump is available:
    int i = acquirePump();
    if (i != -1)
    {
        // Sleeps thread for 30ms and increments counts
        pumps[i].fillTankUp();
        releasePump(i);
        // Sleep long enough for all cars to have a chance to fill up first.
        this_thread::sleep_for(std::chrono::milliseconds((((carsInStation - 1) * 30) / pumpsInStation) - 30));
        return true;
    }
    // If no pumps are available, wait until one becomes available.
    stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
    return false;
}
Now when you have the code in this form, there is a load balancing issue which is important to fix if you don't want to "exhaust" one pump or if it too might have a lock inside. The issue lies in acquirePump where you are checking the availability of free pumps in the same order for each car. A simple tweak you can make to balance it better is like so:
int Station::acquirePump()
{
    // loop through the pumps using the bitmask to check if they are available
    ScopedLocker locker(&stationMutex);
    for (int n = 0, i = startIndex; n < pumpsInStation; ++n, i = (i + 1) % pumpsInStation)
    {
        // Check bitmask to see if pump is open
        if ((freeMask & (1 << i)) == 0)
        {
            // Change the starting index used to search for a free pump for
            // the next car.
            startIndex = (startIndex + 1) % pumpsInStation;
            // Turning the bit on
            freeMask |= (1 << i);
            return i;
        }
    }
    return -1;
}
Another thing I have to ask is if it's really necessary (ex: for memory efficiency) to use bit flags to indicate whether a pump is used. If you can use an array of bool instead, you'll be able to avoid locking completely and simply use atomic operations to acquire and release pumps, and that'll avoid creating a traffic jam of locked threads.
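A hedged sketch of that lock-free variant (MAX_PUMPS is an assumed compile-time bound): each flag is claimed with an atomic exchange rather than under a mutex:

#include <atomic>

const int MAX_PUMPS = 32;              // assumed upper bound on the pump count
std::atomic<bool> pumpBusy[MAX_PUMPS]; // zero-initialized: all pumps start free

int acquirePumpLockFree(int pumpsInStation) {
    for (int i = 0; i < pumpsInStation; i++) {
        // exchange returns the previous value; false means we just claimed it
        if (!pumpBusy[i].exchange(true, std::memory_order_acquire))
            return i;
    }
    return -1; // none free right now
}

void releasePumpLockFree(int i) {
    pumpBusy[i].store(false, std::memory_order_release);
}

A car that finds every pump busy still needs some way to wait, so in practice you might keep the condition variable around for that case.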
Imagine that the mutex has a queue associated with it, containing the waiting threads. Now, one of your threads manages to get the mutex that protects the bitmask of occupied stations, checks if one specific place is free. If it isn't, it releases the mutex again and loops, only to go back to the end of the queue of threads waiting for the mutex. Firstly, this is unfair, because the first one to wait is not guaranteed to get the next free slot, only if that slot happens to be the one on its loop counter. Secondly, it causes an extreme amount of context switches, which is bad for performance. Note that your approach should still produce correct results in that no two cars collide while accessing a single filling station, but the behaviour is suboptimal.
What you should do instead is this:
1. Lock the mutex to get exclusive access to the possible filling stations.
2. Locate the next free filling station.
3. If none of the stations are free, wait on the condition variable and restart at point 2.
4. Mark the slot as occupied and release the mutex.
5. Fill up the car (this is where the sleep in the simulation actually makes sense; the other one doesn't).
6. Lock the mutex.
7. Mark the slot as free and signal the condition variable to wake up others.
8. Release the mutex again.
Just in case that part isn't clear to you, waiting on a condition variable implicitly releases the mutex while waiting and reacquires it afterwards!
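For illustration, a hedged sketch of those steps using the question's members (steps 1-4 folded into one blocking acquire; acquirePumpBlocking is my name for it):

int Station::acquirePumpBlocking()
{
    std::unique_lock<std::mutex> lock(*stationMutex);   // step 1
    for (;;) {
        for (int i = 0; i < pumpsInStation; i++) {      // step 2
            if ((freeMask & (1 << i)) == 0) {
                freeMask |= (1 << i);                   // step 4: mark occupied
                return i;                               // mutex released by ~unique_lock
            }
        }
        stationCondition->wait(lock);                   // step 3: wait, then rescan
    }
}

void Station::releasePump(int i)
{
    std::lock_guard<std::mutex> lock(*stationMutex);    // step 6
    freeMask &= ~(1 << i);                              // step 7: mark free...
    stationCondition->notify_one();                     // ...and wake a waiter
}                                                       // step 8: unlock on scope exit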
I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently is only for integers?
I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my CUDA days, so I may just have gotten rusty).
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        while (!atomic_compare_exchange(&locks[j], &lock, 1));
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});
As such it is not a big problem, since I should write into a shared variable first and only write to global memory after all threads in a tile have completed, but I just don't understand this behavior.
All the atomic add ops are only for integer types. You can do what you want without locks using 128-bit CAS (compare-and-swap) operations for float_4 (I'm assuming this is 4 floats), but there are no 256-bit CAS ops, which is what you would need for double_4. What you have to do is write a loop which atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to test & swap the value if it is still the original (and loops if not, i.e. some other thread changed the value between the read and the write). Note that the 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
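To make the pattern concrete, here is the same read-modify-CAS loop for a single float in standard C++ atomics (a sketch of the pattern only, not C++ AMP code; in AMP you would perform the same dance with atomic_compare_exchange on the value's bit pattern):

#include <atomic>

// CAS-loop sketch: add 'value' to 'target' without a lock
void atomicAddFloat(std::atomic<float>& target, float value)
{
    float expected = target.load();
    // if another thread changed target between the load and the CAS,
    // compare_exchange_weak refreshes 'expected' and we simply retry
    while (!target.compare_exchange_weak(expected, expected + value))
        ; // retry
}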
If the critical section is short, you can create your own lock using atomic operations:
int lock = 1; // shared between threads; here 1 means the lock is free
while (__sync_lock_test_and_set(&lock, 0) == 0) // trying to acquire the lock
{
    // yield the thread or go to sleep
}
// critical section, do the work

// release lock
lock = 1;
The advantage is that you save the overhead of the OS locks.
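In portable C++11 the same idea is usually written with std::atomic_flag (note the sense is inverted relative to the snippet above, where 1 meant "free"):

#include <atomic>
#include <thread>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT; // clear == unlocked

void criticalWork()
{
    while (lock_flag.test_and_set(std::memory_order_acquire)) // spin until acquired
        std::this_thread::yield();                            // be polite while waiting
    // critical section, do the work
    lock_flag.clear(std::memory_order_release);               // release the lock
}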
The question has as such been answered by others and the answer is that you need to handle double atomics yourself. There is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange fails, it actually changes the expected value! Thus the first thread would expect lock to be zero and write a 1 into it. The next thread would expect 0 and would fail to write a 1, but the exchange then wrote a 1 into the variable holding the expected value. This means that the next time that thread tried the exchange, it expected a 1 in the lock! It got that, and then it wrongly believed it had acquired the lock.
I was absolutely not aware that &lock would receive a 1 on a failed exchange!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        // note that, if locks[j] != lock then lock is set to 1,
        // meaning that ACE will be true the next time if locks[j] == 1,
        // meaning the while will terminate even though someone else has the lock
        while (!atomic_compare_exchange(&locks[j], &lock, 1));
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});
It seems that a fix is to do
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        while (!atomic_compare_exchange(&locks[j], &lock, 1))
        {
            lock = 0; // reset the expected value
        }
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});
First of all, I know that it can be implemented with a mutex and condition variable, but I want the most efficient implementation possible.
I would like a semaphore with a fast-path when there's no contention. On Linux this is easy with a futex; for example, here's a wait:
if (AtomicDecrementIfPositive(_counter) > 0) return; // Uncontended
AtomicAdd(&_waiters, 1);
do
{
    if (syscall(SYS_futex, &_counter, FUTEX_WAIT_PRIVATE, 0, nullptr, nullptr, 0) == -1) // Sleep
    {
        AtomicAdd(&_waiters, -1);
        throw std::runtime_error("Failed to wait for futex");
    }
}
while (AtomicDecrementIfPositive(_counter) <= 0);
AtomicAdd(&_waiters, -1);
and post:
AtomicAdd(&_counter, 1);
if (Load(_waiters) > 0 && syscall(SYS_futex, &_counter, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0) == -1) throw std::runtime_error("Failed to wake futex"); // Wake one
At first I thought for Windows to just use NtWaitForKeyedEvent(). The problem is it's not a direct substitution because it doesn't atomically check the value at _counter before going into the kernel, and so can miss the wake from NtReleaseKeyedEvent(). Worse, then NtReleaseKeyedEvent() would block.
What's the best solution?
Windows has native semaphores with CreateSemaphore. Until and unless you have some kind of documented performance problem doing it the normal way, you shouldn't even consider optimizations that are fragile or hardware-specific.
I think something like this should work:
// bottom 16 bits: post count
// top 16 bits: wait count
struct Semaphore { unsigned val; }
wait(struct Semaphore *s)
{
retry:
    do
        old = s->val;
        if old had posts (bottom 16 bits != 0)
            new = old - 1
            wait = false
        else
            new = old + 65536
            wait = true
    until successful CAS of &s->val from old to new
    if wait == true
        wait on keyed event
        goto retry;
}

post(struct Semaphore *s)
{
    do
        old = s->val;
        if old had waiters (top 16 bits != 0)
            // perhaps new = old - 65536 and remove the "goto retry" above?
            // not sure, but this is safer...
            new = old - 65536 + 1
            release = true
        else
            new = old + 1
            release = false
    until successful CAS of &s->val from old to new
    if release == true
        release keyed event
}
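A hedged translation of that pseudocode into C++ atomics, to show the packing arithmetic; the keyed-event calls are left as comments since NtWaitForKeyedEvent/NtReleaseKeyedEvent are undocumented ntdll functions, and keyedEventHandle is an assumed handle:

#include <atomic>

// bottom 16 bits: post count; top 16 bits: wait count
struct Semaphore { std::atomic<unsigned> val{0}; };

void wait(Semaphore& s)
{
    for (;;) {
        unsigned old = s.val.load();
        bool mustWait = (old & 0xFFFF) == 0;       // no posts available?
        unsigned next = mustWait ? old + 0x10000   // register as a waiter
                                 : old - 1;        // consume one post
        if (!s.val.compare_exchange_weak(old, next))
            continue;                              // raced with someone; retry the CAS
        if (!mustWait)
            return;                                // fast path: got a post
        // block here: NtWaitForKeyedEvent(keyedEventHandle, &s, 0, nullptr);
        // after being released, loop around and try to grab a post again
    }
}

void post(Semaphore& s)
{
    for (;;) {
        unsigned old = s.val.load();
        bool release = (old >> 16) != 0;             // is anyone waiting?
        unsigned next = release ? old - 0x10000 + 1  // unregister one waiter, add a post
                                : old + 1;           // just add a post
        if (!s.val.compare_exchange_weak(old, next))
            continue;
        if (release) {
            // wake one waiter: NtReleaseKeyedEvent(keyedEventHandle, &s, 0, nullptr);
        }
        return;
    }
}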
edit: that said, I'm not sure this would help you a lot. Your thread pool usually should be big enough that a thread is always ready to process your request. This means that not only waits, but also posts will always take the slow path and go to the kernel. So, counting semaphores are probably the one primitive where you do not really care about a userspace-only fastpath. Stock Win32 semaphores should be good enough. That said, I'm happy to be proven wrong!
I vote for your first idea, i.e. critical section and condition variable. A critical section is fast enough, and it does use an interlocked operation before it goes to sleep. Or, you can experiment with SRWLocks instead of a critical section. Condition variables (and SRWLocks) are very fast - their only problem is that they are not available on XP, but maybe you do not need to target that platform.
Qt has all kinds of things like QMutex and QSemaphore which are implemented in the same spirit as what you presented in your question.
Actually, I would suggest replacing the futex stuff with the usual OS-provided synchronization primitives; it should not matter much since that is the slow path anyway.