I'm processing items in multiple threads, and the producers may output them into buffers out of order. Some later pipeline stages are not memoryless, so I need to restore the order of the partially processed items. A dedicated thread gathers them from the buffers output by the previous stage's workers into a standard heap-based priority queue, and pops from the top of the heap whenever the item's counter is the successor of the last counter popped.
The items are stamped with a 32-bit unsigned counter by the hardware that generates them. At several thousand items per second, the counter wraps around after a few days. How do I handle this without switching to 64-bit counters? The program needs to be able to run indefinitely.
[Edit]
One idea I had: since the heap is limited in size to a few million items, I can modify the heap comparator to check the difference between the two counters against a threshold of, say, half the maximum value of the unsigned type; if the difference exceeds the threshold, assume a wraparound has occurred. The downside is the overhead of an extra conditional per comparison in the heap operations, and I don't know whether it can be reduced to a combination of subtraction/cast/etc. with just a single comparison.
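It can be: the wraparound check folds into the comparison itself via modular (serial-number, RFC 1982-style) arithmetic. A minimal sketch (counterLess is a placeholder name; the signed cast is implementation-defined before C++20 but two's-complement on mainstream compilers), valid as long as live counters never span more than half the 32-bit range, which holds here since the heap only ever holds a few million items:

#include <cstdint>

// Wraparound-aware "less than" for 32-bit sequence counters. The unsigned
// subtraction wraps modulo 2^32, and the cast to signed reduces
// "a precedes b" to a single sign test -- no extra branch.
inline bool counterLess(std::uint32_t a, std::uint32_t b)
{
    return static_cast<std::int32_t>(a - b) < 0;
}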
How about using a second queue? The insert operation switches queues on wraparound, and popping switches when the current queue is empty; alternatively, just keep a flag indicating the active queue.
I am using a std::queue to buffer messages on my network (a CAN bus in this case). During an interrupt I add the message to the "inbox". My main program then checks every cycle whether the queue is empty and, if not, handles the messages. The problem is that the queue is popped until empty (it exits the while (!inbox.empty()) loop), but the next time I push data to it, it works as normal BUT the old data is still hanging around at the back.
For example, the first message pushes a "1" to the queue. The loop reads
1
Next message is "2". Next read is
2
1
If I were to get TWO messages, "3" and "4", in before another read, then the next read would be
3
4
2
1
I am very confused. I am also working with an STM32F0 ARM chip and mbed online, and have no idea if this is working poorly on the hardware or what!
I was concerned about thread safety, so I added an extra buffer queue and only push to the inbox when it is "unlocked". And since running it this way I have not seen any conflict occur anyway!
Pusher code:
if (bInboxUnlocked) {
    while (!inboxBuffer.empty()) {
        inbox.push(inboxBuffer.front());
        inboxBuffer.pop();
    }
    inbox.push(msg);
} else {
    inboxBuffer.push(msg);
    printf("LOCKED!");
}
Main program read code:
bInboxUnlocked = 0;
while (!inbox.empty()) {
    printf("%d\r\n", inbox.front().data);
    inbox.pop();
}
bInboxUnlocked = 1;
Thoughts, anyone? Am I using this wrong? Are there other ways to easily accomplish what I am doing? I expect the buffers to be small enough to implement as a small circular array, but with std::queue at hand I was hoping not to have to do that.
Based on what I can figure out from a basic Google search, your CPU is essentially a single-core CPU. If so, there should not be any memory-fencing issues to deal with here.
If, on the other hand, you had multiple CPU cores to deal with, it would be necessary either to insert explicit fences in key places or to employ C++11 classes like std::mutex, which take care of this for you.
But going with the original use case of a single CPU, and no memory fencing issues, if you can guarantee that:
A) there's some definite upper limit on the number of messages your interrupt-handling code will buffer in the queue before it gets drained, and:
B) the messages you're buffering are PODs
Then a potential alternative to std::queue worth exploring here is to roll your own simple queue, using nothing more than a static std::array (or maybe a std::vector), an int head pointer, and an int tail pointer. A Google search should find plenty of examples of this simple algorithm:
The puller checks whether head != tail; if so, it reads the message in queue[head] and increments head, where incrementing means head = (head + 1) % queuesize. The pusher checks whether incrementing tail (also modulo queuesize) would result in head; if so, the queue has filled up (something that shouldn't happen, given the prerequisites of this approach). If not, it puts the message into queue[tail] and increments tail.
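A minimal sketch of such a queue for this single-core ISR/main-loop case (Msg and QUEUE_SIZE are placeholder names; volatile suffices here only because a single core removes the need for real fences):

#include <array>

struct Msg { int data; };
enum { QUEUE_SIZE = 32 };            // max messages buffered between drains

static std::array<Msg, QUEUE_SIZE> queue;
static volatile int head = 0;        // next slot to read  (main loop owns)
static volatile int tail = 0;        // next slot to write (interrupt owns)

bool push(const Msg& m)              // called from the interrupt handler
{
    int next = (tail + 1) % QUEUE_SIZE;
    if (next == head)
        return false;                // full: shouldn't happen, per (A)
    queue[tail] = m;
    tail = next;
    return true;
}

bool pop(Msg& out)                   // called from the main loop
{
    if (head == tail)
        return false;                // empty
    out = queue[head];
    head = (head + 1) % QUEUE_SIZE;
    return true;
}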
If all of these operations are done in the right order, the net effect would be the same as using std::queue but:
1) Without the overhead of std::queue and the heap allocation it uses. Should be a major win on an embedded platform.
2) Since the queue lives in contiguous memory, it should take advantage of CPU caching, as is typically the case with traditional CPUs.
I have a data structure which consists of 1,000 array elements, each of which is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread safe; there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well; if not, you will actually lose performance compared to single threading.
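For illustration, a minimal sketch of that whole-structure lock, assuming the pop/calculate/write/update steps can collapse into one critical section (Grid, moveElement and the "empty" marker are hypothetical):

#include <array>
#include <mutex>

struct Grid {
    std::array<std::array<int, 8>, 1000> cells;
    int min = 0, max = 0;  // indices of smallest/largest populated elements
    std::mutex lock;       // guards cells, min and max together

    // Pop from 'from', write to 'to', update the pointers -- all inside one
    // critical section, so no thread ever observes a half-finished move.
    void moveElement(int from, int to) {
        std::lock_guard<std::mutex> guard(lock);
        cells[to] = cells[from];
        cells[from] = {};              // hypothetical "empty" marker
        if (to < min) min = to;
        if (to > max) max = to;
        // (a real version would also rescan min/max when 'from' was one)
    }
};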
After that you can consider having a separate lock for different sections of the data structure. You will need to analyse carefully what you are using, when and where, and work out what would be useful to split. For example you might have chunks of the sub-arrays, with each chunk having its own lock.
Be careful of deadlocks in this situation, though: you might have a thread claim lock 32 and then want lock 79 while another thread already holds 79 and then wants 32. Make sure you always claim locks in the same order.
The fastest option (if possible) may even be to give each thread its own copy of the data structure; each processes 1/N of the work and the results are merged at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.
I have a large vector of objects and I just need to iterate over the vector using multiple threads and read the objects (no modification to the data or the vector). What is the most efficient method to do this? Could it be done in a lock-free fashion, maybe using an atomic variable? What is the most readable implementation of such a multithreaded process?
Edit:
I do not want more than one thread to read the same element of the vector (reading is time-consuming in this case). When one thread is reading an element, I want the next thread to read the first not-yet-read element. For example, while thread 1 is reading object 1, I want thread 2 to read object 2. Whenever one of them is done, it can read object 3, and so on.
Splitting the input into equal parts is really easy: it doesn't use locks and doesn't cause memory sharing. So try that, measure how much time each thread needs to complete, and check whether the difference is significant.
If the difference is significant, consider using an array with one atomic<bool> per element. Before reading an element, the thread does a compare_exchange_strong on the flag for that element (I think you can even use memory_order_relaxed, but use memory_order_acq_rel at first and only try relaxed if the performance doesn't satisfy you) and only actually processes the element if the exchange succeeds; otherwise it tries the next element, because someone else is processing or has already processed the current one.
If you can't, then you can use a single atomic<int> to store the index of the next element to be processed. The threads just use fetch_add (or the postfix ++) to atomically get the index of the next element to process while incrementing the counter (the memory-ordering considerations are the same as above). If the variance in reading times is high (as determined in step 1), you will have low contention on the atomic variable, so it will perform well.
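A sketch of that shared-counter variant (Object, process and readAll are placeholder names standing in for the real types and work):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Object { int payload[8]; };               // stand-in element type
void process(const Object&) { /* the time-consuming read */ }

void readAll(const std::vector<Object>& objects, unsigned numThreads)
{
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t)
        workers.emplace_back([&] {
            // Each fetch_add claims exactly one not-yet-read element,
            // so no element is read twice and none is skipped.
            for (std::size_t i = next.fetch_add(1); i < objects.size();
                 i = next.fetch_add(1))
                process(objects[i]);
        });
    for (auto& w : workers)
        w.join();
}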
If the contention is still too high and you see a significant slowdown, try to estimate in advance how long each element will take to read. If you can, sort your vector by estimated read time and have the n-th thread read every n-th element, so that the load is split more evenly.
In my CUDA project I need a data structure (available to all threads in the block) that works like a "stash". The stash has multiple spaces, each of which can be either empty or full. I need the data structure to hand out an empty space whenever a thread asks for one: the thread asks for a space in the stash, puts something in, and marks the position as full. I cannot use a plain FIFO of items because fetching from the stash is random: any position (and multiple positions) can be marked empty or full.
My initial version uses an array to record whether each space is empty. Each thread loops through the positions (using atomicCAS) until it finds an empty spot. But with this algorithm the search time depends on how full the stash is, which is not acceptable in my design.
How can I design a data structure whose fetch and write-back times do not depend on how full the stash is?
Does this remind anyone of a similar algorithm?
Thanks
You could implement this with a FIFO containing a list of free locations.
At startup you fill the FIFO with all locations.
Then whenever you want a space, you take the next element from the FIFO.
When you are finished with the slot, you can place the address back into the FIFO again.
This should have O(1) allocation and deallocation time.
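A sketch of the idea in host-side C++ (on the GPU, CUDA's atomicAdd intrinsic plays the role of fetch_add here); it assumes the ring never underflows, i.e. a slot index is never popped before its push has become fully visible:

#include <atomic>
#include <vector>

struct FreeList {
    std::vector<int> ring;          // circular buffer of free slot indices
    std::atomic<unsigned> head{0};  // next position to pop
    std::atomic<unsigned> tail;     // next position to push

    explicit FreeList(unsigned slots) : ring(slots), tail(slots) {
        for (unsigned i = 0; i < slots; ++i)
            ring[i] = static_cast<int>(i);  // initially every slot is free
    }

    int alloc() {                   // O(1), however full the stash is
        unsigned h = head.fetch_add(1);
        return ring[h % ring.size()];
    }

    void release(int slot) {        // O(1)
        unsigned t = tail.fetch_add(1);
        ring[t % ring.size()] = slot;
    }
};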
You could implement a hash table (separate chaining) with the thread ID as the key.
It is more or less an array of linked lists. This way you need not put a lock on the entire array as you did earlier; instead, you use atomicCAS only while reading the linked list at a specific index. Thereby you can have n threads running in parallel, where n is the array size.
Note: the distribution across threads, however, depends on the hash function.
I need to design a data structure that supports the following operations:
Search the data structure for an element based on a key that is an interval: for example, the value for interval 1-5 may be 3, for 6-11 it may be 5, and so on. The intervals are contiguous and do not overlap.
Find the previous and next interval; this is almost as frequent as searching for an interval.
split the intervals, join consecutive intervals
Concurrency: I have restricted the design to one writer thread and multiple reader threads, as follows. The writer thread can split or join intervals, or modify the value in an interval. Any reader thread reads the value only within one interval (a reader may read multiple intervals, but then I don't have to serialize all the reads; it is okay to have writes in between two reads). There are about 20-80 reads by each reader per write. Further, I still have to decide the number of readers, but it would be around 2-8.
I am considering a list, for adding and deleting elements in the middle. There will only be a limited number of intervals, so a map probably isn't right. This kind of access (one writer, multiple readers) is not well supported by the STL list; boost::intrusive::list seems appropriate. On top of the intrusive list, I will have to acquire locks to read/write the intervals.
Also, I understand an intrusive list may give better cache locality (along with appropriate memory allocation for the contained objects) than the STL list.
Is the approach all right? If so, I would also be interested in your experience with intrusive::list, particularly in a multithreaded application.
You have 2 different issues here:
How to represent your data structure
How to make it thread safe, in an efficient manner
Your data structure will be doing (20-80) x (2-8) reads for every write.
(1). First, let's assume your range is a data structure as follows:
struct Interval
{
    Interval(int start, int length)
      : m_start(start),
        m_length(length)
    {}

    int m_start;
    int m_length;
    int value; // Or whatever
};
Since reads massively outnumber writes, lookup needs to be fast, while modifications don't.
Using a list for your data structure means O(N) lookups and O(1) modification - exactly the wrong way around.
The simplest possible representation of your structure is a vector. If the intervals are held in sorted order, lookups are O(logN) and modifications are O(N).
To implement this, just add a comparator to Interval:
bool operator<(const Interval& rhs) const
{
    return m_start < rhs.m_start;
}
You can then locate the interval containing a given key in O(log N): use std::upper_bound with a comparator on the start values and step back one element (std::lower_bound works similarly when searching for an interval's exact start).
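For instance, a lookup helper along these lines (findInterval is a hypothetical name; it assumes a non-empty vector sorted by m_start whose first interval starts at or below key):

#include <algorithm>
#include <iterator>
#include <vector>

const Interval& findInterval(const std::vector<Interval>& v, int key)
{
    // First interval starting strictly after key, then step back one:
    // that is the interval whose range contains key.
    auto it = std::upper_bound(v.begin(), v.end(), key,
        [](int k, const Interval& i) { return k < i.m_start; });
    return *std::prev(it);
}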
Next and previous interval are O(1) - decrement or increment the returned iterator.
Splitting an interval means inserting a new element after the current one and adjusting the length of the current - O(N).
Joining two intervals means adding the length of the next one to the current one and erasing the next one - O(N).
You should reserve() enough space in the vector for the maximum number of elements to minimise resizing overhead.
(2). Following Knuth, 'premature optimisation is the root of all evil'.
A single read/write lock on the structure holding your vector<Interval> will more than likely be sufficient. The only possible problems are (2a) writer starvation, because readers monopolise the lock, or (2b) reader starvation, because the writer's updates take too long.
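A sketch of that single lock, using std::shared_timed_mutex from C++14 (std::shared_mutex in C++17); readValue and splitAt are hypothetical names, and Interval/findInterval are as sketched above:

#include <mutex>
#include <shared_mutex>
#include <vector>

std::vector<Interval> intervals;        // sorted by m_start, as above
std::shared_timed_mutex intervalsMutex;

int readValue(int key)
{
    std::shared_lock<std::shared_timed_mutex> lock(intervalsMutex);
    return findInterval(intervals, key).value;  // readers run concurrently
}

void splitAt(int key, int newValue)
{
    std::unique_lock<std::shared_timed_mutex> lock(intervalsMutex);
    // ... locate the interval, shrink it, insert the new one: the writer
    // has exclusive access while it mutates the vector ...
}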
(2a) If (and only if) you face writer starvation, you could make the locking more granular. It's extremely likely that this won't be the case, though. To do this:
Make your vector hold its intervals by pointer, not value. This is so the resizes do not move your objects around in memory. Have each interval contain a read/write lock.
For reads:
Take the read lock of the collection, then of the interval you want. If you don't need to read any other intervals, give up the collection lock as soon as you have acquired the interval lock, to allow other threads to proceed.
If you need to read other buckets, you can read-lock them in any order until you give up the collection read lock, at which point the writer may add or remove any intervals you haven't locked. Order doesn't matter while acquiring these locks, since the writer cannot be changing the vector while you hold a read lock on the collection, and read locks do not contend.
For writes:
Take the write lock of the collection, then of the interval you want. Note that you must hold the collection write lock for the entirety of any update that adds or removes intervals. You can give up the collection lock if you are only updating one interval; otherwise, keep holding it and acquire a write lock on every interval you will modify. You can acquire the interval locks in any order, since no reader can acquire new read locks without the collection lock.
The above works by favouring the writer thread, which should eliminate starvation.
(2b) If you face reader starvation, which is even more unlikely, the best solution is to split the collection writes and reads apart. Hold the collection by shared pointer, and have a single write lock on it.
For reads:
Take the write lock and a copy of the shared_ptr, then give up the write lock. The reader can now read the collection without any locks (it's immutable).
For writes:
Take a shared_ptr to the collection as per the reader, giving up the lock. Make a private copy of the collection and modify it (no locks required, since it's a private copy). Take the write lock again and replace the existing shared_ptr with your new collection. The last thread to finish with the old collection destroys it; all future threads use the newly updated collection.
Note that this algorithm only works with one writer, as per your problem description.
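A sketch of this copy-on-update scheme (snapshot and update are hypothetical names; single writer assumed, per the question, and 'current' initialised before any threads start):

#include <memory>
#include <mutex>
#include <vector>

std::shared_ptr<const std::vector<Interval>> current =
    std::make_shared<std::vector<Interval>>();  // the live collection
std::mutex ptrMutex;                            // guards 'current' only

std::shared_ptr<const std::vector<Interval>> snapshot()  // readers
{
    std::lock_guard<std::mutex> guard(ptrMutex);
    return current;             // afterwards, read *snapshot() lock-free
}

void update()                   // the single writer
{
    auto copy = std::make_shared<std::vector<Interval>>(*snapshot());
    // ... split/join/modify *copy freely: nobody else can see it yet ...
    std::lock_guard<std::mutex> guard(ptrMutex);
    current = std::move(copy);  // old vector dies with its last reader
}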
A concurrent binary tree might be a good fit, allowing reads and writes to different intervals to proceed in parallel.