tbb: parallel find first element - c++

I have got this problem:
Find the first element in a list, for which a given condition holds.
Unfortunately, the list is quite long (100,000 elements), and evaluating the condition for each element takes in total about 30 seconds using a single thread.
Is there a way to cleanly parallelize this problem? I have looked through all the tbb patterns, but could not find any fitting.
UPDATE: for performance reasons, I want to stop as early as possible once an item is found and skip processing the rest of the list. That's why I believe I cannot use parallel_while or parallel_do.

I'm not too familiar with libraries for this, but just thinking aloud: could you not have a group of threads iterating at the same stride but from different starting points?
Say you decide to have n threads (= number of cores or whatever); each thread is given a specific starting point below n, so the first thread starts at begin() and the next item it compares is begin() + n, etc.; the second thread starts at begin() + 1 and its next comparison is begin() + 1 + n, etc.
This way you can have a group of threads iterating in parallel through the list; the iteration itself is presumably not expensive, just the comparison. No node will be compared more than once, and you can have some flag which is set when a match is made by any of the threads, and all threads should check this flag before iterating/comparing.
I think it's pretty straightforward to implement(?)
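A rough sketch of what I mean (untested; assumes a random-access container, C++11 threads, and a placeholder predicate slow_check(); the atomic holds the smallest matching index found so far, so every thread can stop early and the earliest match wins):
#include <atomic>
#include <thread>
#include <vector>

template <typename T, typename Pred>
std::size_t strided_find_first(const std::vector<T>& items, Pred slow_check,
                               unsigned n_threads) {
    std::atomic<std::size_t> best(items.size());    // items.size() = "not found"
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        threads.emplace_back([&, t] {
            for (std::size_t i = t; i < items.size(); i += n_threads) {
                if (i >= best.load(std::memory_order_relaxed))
                    break;                          // an earlier match already exists
                if (slow_check(items[i])) {
                    std::size_t prev = best.load();
                    while (i < prev && !best.compare_exchange_weak(prev, i))
                        ;                           // keep the smallest matching index
                    break;
                }
            }
        });
    }
    for (auto& th : threads) th.join();
    return best.load();                             // == items.size() if no match
}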

I think the best way to solve this problem with TBB is parallel_pipeline.
There should be (at least) two stages in the pipeline. The 1st stage is serial; it just reads the next element from the list and passes it to the 2nd stage. This 2nd stage is parallel; it evaluates the condition of interest for a given element. As soon as the condition is met, the second stage sets a flag (which should be either atomic or protected with a lock) to indicate that a solution is found. The first stage must check this flag and stop reading the list once the solution is found.
Since condition evaluation is performed in parallel for a few elements, it can happen that a found element is not the first suitable one in the list. If this is important, you also need to keep an index of the element, and when a suitable solution is found you detect whether its index is less than that of a previously known solution (if any).
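A minimal sketch of such a pipeline (assuming a std::list<Element> and a placeholder predicate matches(); this uses the classic TBB spelling, newer oneTBB writes tbb::filter_mode instead of tbb::filter):
#include <atomic>
#include <list>
#include "tbb/pipeline.h"

struct Work { const Element* e; std::size_t index; };

std::size_t find_first(const std::list<Element>& items) {
    const std::size_t NONE = items.size();
    std::atomic<std::size_t> best(NONE);            // smallest matching index so far
    auto it = items.begin();
    std::size_t next = 0;
    tbb::parallel_pipeline(16,                      // max number of live tokens
        // 1st stage, serial: hand out elements until done or a match is known
        tbb::make_filter<void, Work>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> Work {
                if (it == items.end() || best.load() != NONE) {
                    fc.stop();
                    return Work{nullptr, 0};
                }
                Work w{&*it, next++};
                ++it;
                return w;
            }) &
        // 2nd stage, parallel: evaluate the expensive condition
        tbb::make_filter<Work, void>(tbb::filter::parallel,
            [&](Work w) {
                if (w.e && matches(*w.e)) {
                    std::size_t prev = best.load();
                    while (w.index < prev && !best.compare_exchange_weak(prev, w.index))
                        ;                           // keep only the smallest index
                }
            }));
    return best.load();                             // NONE means no element matched
}
Since the serial stage dispatches elements in order, any element not yet dispatched has a larger index than everything in flight, so stopping on the first recorded match is safe.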
HTH.

ok, I have done it this way:
Put all elements into a tbb::concurrent_bounded_queue<Element> elements.
Create an empty tbb::concurrent_vector<Element> results.
Create a boost::thread_group, and create several threads that run this logic:
logic to run in parallel:
Element e;
while (results.empty() && elements.try_pop(e)) {
if (slow_and_painfull_check(e)) {
results.push_back(e);
}
}
So when the first element is found, all other threads will stop processing the next time they check results.empty().
It is possible that two or more threads are working on an element for which slow_and_painfull_check returns true, so I just put the result into a vector and deal with this outside of the parallel loop.
After all threads in the thread group have finished, I check all elements in the results and use the one that comes first.
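For reference, a sketch of the surrounding driver code (untested; Item tags each element with its original position so the earliest match can be picked afterwards, and the names are illustrative):
#include <boost/thread.hpp>
#include "tbb/concurrent_queue.h"
#include "tbb/concurrent_vector.h"

struct Item { std::size_t index; Element e; };      // index = position in the list

tbb::concurrent_bounded_queue<Item> elements;       // pre-filled with all items
tbb::concurrent_vector<Item> results;

void worker() {
    Item it;
    while (results.empty() && elements.try_pop(it))
        if (slow_and_painfull_check(it.e))
            results.push_back(it);
}

void run() {
    boost::thread_group group;
    for (unsigned i = 0; i < boost::thread::hardware_concurrency(); ++i)
        group.create_thread(worker);
    group.join_all();
    // several threads may have found a match; keep the earliest one
    const Item* first = 0;
    for (std::size_t i = 0; i < results.size(); ++i)
        if (!first || results[i].index < first->index)
            first = &results[i];
    // first == 0 means no element satisfied the condition
}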

You can take a look at http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html for implementations of parallel algorithms.
In particular, you need the find_if algorithm: http://www.cplusplus.com/reference/algorithm/find_if/

I see two opportunities for parallelism here: evaluating one element on multiple threads, or evaluating multiple elements at once on different threads.
There isn't enough information to determine the difficulty or the effectiveness of evaluating one element on multiple threads. If this is easy, the 30-second evaluation time could be reduced.
I do not see a clean fit into TBB for this problem. There are issues with lists not having random access iterators, determining when to stop, and guaranteeing the first element is found. There may be some games you can play with the ranges to get it to work though.
You could use some lower level thread constructs to implement this yourself as well, but there are a number of places for incorrect results to be returned. To prevent such errors, I would recommend using an existing algorithm. You could convert the list to an array (or some other structure with random access iterators) and use the experimental libstdc++ Parallel Mode find_if algorithm user383522 referenced.

If it's a linked list, a parallel search isn't going to add much speed. However, linked lists tend to perform poorly with caches. You may get a tiny performance increase if you have two threads: one does the find_first_element, and one simply iterates through the list, making sure not to get more than X (100?) items ahead of the first thread. The second thread doesn't do any comparisons, but ensures that the items are cached as well as possible for the first thread. This may help your time, or it might make little difference, or it might hinder. Test everything.

Can't you transform the list into a balanced tree or similar? Such data structures are easier to process in parallel; usually you win back the overhead you paid to make it balanced in the first place... For example, if you write functional-style code, check this paper: Balanced trees inhabiting functional parallel programming

If you are using GCC, GNU OpenMP provides parallel std functions
link

I've never heard of the Intel TBB library, but a quick open and scan of the tutorial led me to parallel_for, which seems like it will do the trick.

Related

How to parallelize std::partition using TBB

Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?
Here is what I'm thinking:
if the array is small, std::partition it (serial) and return
else, treat the array as 2 interleaved arrays using custom iterators (interleave in cache-sized blocks)
start a parallel partition task for each pair of iterators (recurse to step 1)
swap elements between the two partition/middle pointers*
return the merged partition/middle pointer
*I am hoping in the average case this region will be small compared to the length of the array or compared to the swaps required if partitioning the array in contiguous chunks.
Any thoughts before I try it?
I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, has Θ(N) work, and Θ(P + lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.
I'm assuming the input is a random-access sequence. The steps are:
Allocate a temporary array big enough to hold a copy of the input sequence.
Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.
There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.
If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.
Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.
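For illustration, here is one possible (untested) rendering of steps 2-5 above for a std::vector and a unary predicate; grain sizes and the affinity_partitioner are omitted for brevity:
#include <algorithm>
#include <vector>
#include "tbb/parallel_for.h"

template <typename T, typename Pred>
std::size_t parallel_partition(std::vector<T>& a, Pred pred, std::size_t K) {
    const std::size_t N = a.size();
    std::vector<T> tmp(N);                        // step 1: temporary array
    std::vector<std::size_t> mid(K);              // # of "true" items per block
    auto lo = [&](std::size_t b) { return b * N / K; };
    // step 3: move each block into tmp and partition it there
    tbb::parallel_for(std::size_t(0), K, [&](std::size_t b) {
        auto s = tmp.begin() + lo(b), e = tmp.begin() + lo(b + 1);
        std::move(a.begin() + lo(b), a.begin() + lo(b + 1), s);
        mid[b] = std::partition(s, e, pred) - s;
    });
    // step 4: serial exclusive scans for the destination offsets
    std::vector<std::size_t> t_off(K), f_off(K);
    std::size_t T = 0, F = 0;
    for (std::size_t b = 0; b < K; ++b) {
        t_off[b] = T; T += mid[b];
        f_off[b] = F; F += (lo(b + 1) - lo(b)) - mid[b];
    }
    // step 5: move both parts of each block to their final places
    tbb::parallel_for(std::size_t(0), K, [&](std::size_t b) {
        auto s = tmp.begin() + lo(b);
        std::move(s, s + mid[b], a.begin() + t_off[b]);
        std::move(s + mid[b], tmp.begin() + lo(b + 1), a.begin() + T + f_off[b]);
    });
    return T;                                     // index of the partition point
}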
Why not parallelize something similar to std::partition_copy instead? The reasons are:
for std::partition, in-place swaps as in Adam's solution require logarithmic complexity due to the recursive merging of the results.
you'll pay memory for parallelism anyway when using threads and tasks.
if the objects are heavy, it is more reasonable to swap (shared) pointers anyway.
if the results can be stored concurrently then threads can work independently.
It's pretty straightforward to apply a parallel_for (for random-access iterators) or tbb::parallel_for_each (for non-random-access iterators) to start processing the input range. Each task can store its 'true' and 'false' results independently. There are lots of ways to store the results, some off the top of my head:
using tbb::parallel_reduce (only for random-access iterators), store the results locally to the task body and move-append them in join() from another task
use tbb::concurrent_vector's method grow_by() to copy local results in a bunch, or just push_back() each result separately on arrival.
cache thread-local results in tbb::combinable TLS container and combine them later
The exact semantics of std::partition_copy can be achieved by copying from the temporary storage from above, or
(only for random-access output iterators) use atomic<size_t> cursors to synchronize where to store the results (assuming there is enough space)
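As one example, a sketch of the tbb::combinable variant from the list above (each thread accumulates its own true/false vectors, which are stitched together at the end, so relative order across threads is not preserved):
#include <utility>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/combinable.h"
#include "tbb/parallel_for.h"

template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>>
parallel_partition_copy(const std::vector<T>& in, Pred pred) {
    tbb::combinable<std::pair<std::vector<T>, std::vector<T>>> tls;
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, in.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            auto& local = tls.local();            // this thread's true/false vectors
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                (pred(in[i]) ? local.first : local.second).push_back(in[i]);
        });
    std::pair<std::vector<T>, std::vector<T>> out;
    tls.combine_each([&](std::pair<std::vector<T>, std::vector<T>>& local) {
        out.first.insert(out.first.end(), local.first.begin(), local.first.end());
        out.second.insert(out.second.end(), local.second.begin(), local.second.end());
    });
    return out;
}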
Your approach should be correct, but why not follow the regular divide-and-conquer (or parallel_for) method? For two threads:
split the array in two. Turn your [start, end) into [start, middle), [middle, end).
run std::partition on both ranges in parallel.
merge the partitioned results. This can be done with a parallel_for.
This should make better use of the cache.
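A sketch of that divide and conquer (assuming random-access iterators; the merge here is a plain std::rotate of the left half's false part with the right half's true part rather than a parallel_for, which keeps the idea visible):
#include <algorithm>
#include "tbb/parallel_invoke.h"

template <typename It, typename Pred>
It dc_partition(It start, It end, Pred pred) {
    if (end - start < 4096)                      // small range: serial cutoff
        return std::partition(start, end, pred);
    It middle = start + (end - start) / 2;
    It m1, m2;
    tbb::parallel_invoke(
        [&] { m1 = dc_partition(start, middle, pred); },
        [&] { m2 = dc_partition(middle, end, pred); });
    // layout is [true1][false1][true2][false2]; rotate false1 past true2;
    // std::rotate returns the new partition point (C++11)
    return std::rotate(m1, middle, m2);
}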
It seems to me like this should parallelize nicely, any thoughts before I try it?
Well... maybe a few:
There's no real reason to create more tasks than you have cores. Since your algorithm is recursive, you also need to take care not to create additional threads once you reach that limit, because it would just be needless effort.
Keep in mind that splitting and merging the arrays costs you processing power, so set the split size in a way that won't actually slow your calculations down. Splitting a 10-element array can be tempting, but won't get you where you want to be. Since the complexity of std::partition is linear, it's fairly easy to overestimate the speed of this task.
Since you asked and gave an algorithm, I hope you actually need parallelization here. If so - there's nothing much to add, the algorithm itself looks really fine :)

Approximate sort (array/vector), predictable runtime

Background:
I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.
What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is ready a long time before the end of the time slice. Unluckily, the world is rarely ever ideal, and "hundred thousands" becomes "tens of millions" before you know it.
Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).
However, not all events are equally important. In case the available time is not sufficient, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load, so it isn't a perfect solution).
The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't precisely a free operation either (or parallel), and then I end up with objects being in some non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is something like a vector that is sorted, or at least "somewhat approximately sorted", which one can traverse sequentially afterwards. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that each process e.g. 1/64 of the range (or another size), slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importance that can be discarded.
Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may bring forth worst case complexity whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.
tl;dr
So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.
The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.
In a different question, where humans were supposed to do an approximate sort using "subjective compare"(?), shell sort was proposed. On various sorting demo applets, it seems that, at least for the "random shuffle" input typical in these, shell sort can indeed do an "approximate sort" that doesn't look too bad after 3-4 passes over the data (and at least the read tap is strictly sequential). Unluckily, choosing gap values that work well seems to be somewhat of a black art, and runtime estimates seem to involve a lot of gazing into the crystal ball as well.
Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to move far out elements by a great distance quickly. Again, judging from sorting demo applets, it seems like 3-4 passes already give a rather reasonable "approximate sort".
MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32bit integers in which most of the most significant bits are all zero! One would probably have to do an initial pass to find the most significant bit in the whole set, followed by 2-3 actual sort passes?
Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?
What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. (posted as an answer now as requested by Damon)
Use a separate vector for each priority. Then you don't need to sort them.
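A tiny sketch of the bucket idea (assuming priorities roughly in [10, 1000] as stated in the question, and a hypothetical Event with a priority field; only pointers are stored, so nothing heavy is moved):
#include <array>
#include <vector>

struct Event { int priority; /* ... */ };           // placeholder

constexpr int kBuckets = 8;
std::array<std::vector<Event*>, kBuckets> buckets;

void add(Event* e) {
    int b = e->priority * kBuckets / 1001;          // map priority to a bucket
    if (b >= kBuckets) b = kBuckets - 1;            // boosted priorities may exceed 1000
    if (b < 0) b = 0;
    buckets[b].push_back(e);
}
// process buckets from highest to lowest; drop or postpone whatever is left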
Sounds like a nice example where near-sort algorithms can be useful.
About a decade back, Chazelle developed a nice data structure that works somewhat like a heap. The key difference is the time complexity, though: it has constant time for all important operations, e.g. insert, remove, find lowest element, etc.
The trick of this data structure is that it breaks the O(n log n) complexity barrier by allowing some error in the sort order.
To me that sounds pretty much like what you need. The data structure is called a soft heap and is explained on Wikipedia:
https://en.wikipedia.org/wiki/Soft_heap
There are other algorithms that trade some error for speed as well. You'll find them if you google for Near Sort Algorithms.
If you try that algorithm, please give some feedback on how it performs in practice; I'm really eager to hear from you.
Sounds like you want to use std::partition: move the part that interests you to the front, and the others to the back. Its complexity is in the order of O(n), but it is cache-friendly, so it's probably a lot faster than sorting.
If you have limited "bandwidth" in processing events (say a 128K per time quantum), you could use std::nth_element to select the 128K (minus some percentage lost due to making that computation) most promising events (assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).
std::vector<Event> events;
auto const guaranteed_bandwidth = 1<<17; // 128K is always possible to process
if (events.size() <= guaranteed_bandwidth) {
// let all N workers loose on [begin(events), end(events)) range
} else {
auto nth = guaranteed_bandwidth * loss_from_nth_element;
std::nth_element(begin(events), begin(events) + nth, end(events));
// let all N workers loose on [begin(events), nth) range
// reprioritize [nth, end(events)) range and append to events for next time quantum
}
This guarantees that in the case your bandwidth threshold is reached, you process the most valuable elements first. You could even speed up the nth_element by a poor man's parallelization (e.g. let each of N workers compute M*128K/N best elements for small M in parallel, and then do a final merge and another nth_element on the M*128K elements).
The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DOS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallelized) and you'd actually process nothing. But if the processing time per event is much larger (say a few thousand cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.
NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector, this is not shown for brevity.
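A sketch of that poor man's parallelization (untested; N workers each pre-select candidates from one slice, then a final serial nth_element runs on the much smaller candidate set; M, N and the priority-ordering operator< are assumptions following the text, and the candidate set is assumed to hold at least nth elements):
#include <algorithm>
#include <thread>
#include <vector>

std::vector<Event> preselect(std::vector<Event>& events, std::size_t nth,
                             unsigned N /*workers*/, unsigned M /*slack factor*/) {
    std::vector<std::thread> workers;
    const std::size_t chunk = events.size() / N;
    const std::size_t per_worker = std::min<std::size_t>(chunk, M * nth / N);
    for (unsigned w = 0; w < N; ++w) {
        workers.emplace_back([&, w] {
            auto first = events.begin() + w * chunk;
            auto last  = (w + 1 == N) ? events.end() : first + chunk;
            // move this slice's best candidates to its front
            std::nth_element(first, first + per_worker, last);
        });
    }
    for (auto& t : workers) t.join();
    // gather the slice fronts and run the final, much smaller nth_element
    std::vector<Event> candidates;
    for (unsigned w = 0; w < N; ++w) {
        auto first = events.begin() + w * chunk;
        candidates.insert(candidates.end(), first, first + per_worker);
    }
    std::nth_element(candidates.begin(), candidates.begin() + nth, candidates.end());
    return candidates;      // the first nth entries are the most promising events
}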
If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing each worker does is run your approximate fast sorting algorithm of choice on its individual piece of the array. Then they can each process their piece in order: roughly performing higher-priority items first, while also being very cache friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.
This would work best if the priorities of items to process are randomly distributed. If there is some ordering to them you'll wind up with a thread being flooded by or starved of high priority items to process.
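A minimal sketch of this per-thread scheme (by_importance_desc, deadline_passed and process are placeholders):
#include <algorithm>
#include <thread>
#include <vector>

void run_quantum(std::vector<Event>& events, unsigned n_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = events.size() / n_threads;
    for (unsigned w = 0; w < n_threads; ++w) {
        workers.emplace_back([&, w] {
            auto first = events.begin() + w * chunk;
            auto last  = (w + 1 == n_threads) ? events.end() : first + chunk;
            std::sort(first, last, by_importance_desc);   // only 1/n of the data
            for (auto it = first; it != last && !deadline_passed(); ++it)
                process(*it);                             // most important ones first
        });
    }
    for (auto& t : workers) t.join();
    // whatever falls past the deadline is dropped or postponed with a boost
}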

Multithreaded bruteforce comparisons between elements to create a graph

I have N elements which need to be compared with each other to create a graph. That gives N*(N-1)/2 comparisons in total.
I want to multithread those comparisons. I also have several constraints:
Each element is quite big (it is a matrix, actually), so copying all elements into each thread would take too much memory.
Each comparison should occur, meaning I cannot skip one.
A new element can be added to the list at any time; this is very tricky because I need to track what has been done, so as to do just the new ones.
Since the number of comparisons could be huge, like 20 million, I cannot have a queue that big.
Lastly, one could stop the process at any time; I must be able to resume where I was, even in another execution of the app.
So far I have a Master thread which contains all the elements and several workers in a thread pool. The worker threads compare a list of pairs or a range of elements. I have thought of a comparison generator which gives the next X comparisons on demand.
How could I build this generator ?
Should I copy every pair for the workers, or use a ReadWriteLock directly from the worker to read the data from the Master?
How could I track the progress on every thread ?
How could I stop and resume the state of the comparisons ?
I am sorry if that's a lot of questions.
Thank you !
Assuming reads are thread-safe (they usually are, as long as no one is writing), a simple solution is to subdivide the tasks among the set of worker threads in some manner, doing so in advance. For instance, for n workers, you could allocate pair (x, y) to worker x mod n. The only communication is letting each worker know its ordinal (0…n-1). Each thread should drop its answers into a private array, which can be collated after everyone finishes.
A more sophisticated model that accommodates varying worker productivity is to push every value 0…N-1 onto a queue. Each worker thread pulls a number, x, off the queue, evaluates every (x, y) pair, and then goes back for another x.
If you want to take the time, it's more efficient to enqueue pairs so as to minimise cache-thrashing. This is a tricky problem. Essentially, you want to enqueue pairs from small clusters of elements so that every pair within a cluster is evaluated at approximately the same time. As tricky as this is, it could make a huge difference to the efficiency of your algorithm.
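As a sketch of the queue-of-x model (an atomic counter hands out the next first index x, and each worker then evaluates every pair (x, y) with y > x; compare() and record_edge() are placeholders; persisting the row_done flags is enough to resume in a later run, and when a new element arrives only the pairs involving it need to be generated):
#include <atomic>
#include <cstddef>
#include <vector>

std::atomic<std::size_t> next_row(0);

void worker(std::size_t N, std::vector<char>& row_done) {
    for (;;) {
        const std::size_t x = next_row.fetch_add(1);   // claim the next row
        if (x >= N) return;
        if (row_done[x]) continue;                     // finished in a previous run
        for (std::size_t y = x + 1; y < N; ++y)
            record_edge(x, y, compare(x, y));          // read-only access to elements
        row_done[x] = 1;                               // checkpoint: persist row_done
    }
}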

Efficient way to organize used and unused elements in a large concurrent array

I have about 18 million elements in an array that are initialized and ready to be used by a simple manager called ElementManager (this number will climb to a little more than a billion in later iterations of the program). A class, A, which must use the elements, communicates with ElementManager, which returns the next available element for consumption. That element is now in use and cannot be reused until recycled, which may happen often. Class A is concurrent, that is, it can ask ElementManager for an available element from several threads. The elements in this case are objects that store three vertices to make a triangle.
Currently, ElementManager is using an Intel TBB concurrent_bounded_queue called mAllAvailableElements. There is also another container (a TBB concurrent_vector) that contains all elements, regardless of whether they are available for use or not, called mAllElements. When class A asks for the next available element, the manager tries to pop one from the queue. The popped element is now in use.
Now when class A has done what it has to do, control is handed to class B which now has to iterate through all elements that are in use and create meshes (to take advantage of concurrency, the array is split into several smaller arrays to create submeshes which scales with the number of available threads - the reason for this is that creating a mesh must be done serially). For this I am currently iterating over the container mAllElements (this is also concurrent) and grabbing any element that is in use. The elements, as mentioned above, contain polygonal information to create meshes. Iteration in this case takes a long time as it has to check each element and query whether it is in use or not, because if it is not in use then it should not be part of a mesh.
Now imagine if only 1 million out of the possible 18 million elements were in use (but more than 5-6 million were recycled). Worse yet, due to constant updates to only part of the mesh (which happens concurrently) means the in use elements are fragmented throughout the mAllElements container.
I thought about this for quite some time now, and one flawed solution that I came up with was to create another queue of elements named mElementsInUse, which is also a concurrent_queue, and to push any element that is now in use onto it. The problem with this approach is that, since it is a queue, any element in that queue can be recycled at any time (by an update in a part of the mesh) and declared not in use; and since I can only pop the front element, this approach fails. The only other approach I can think of is to defragment the concurrent_vector mAllElements every once in a while when no operations are taking place.
I think my approach to this problem is wrong and thus my post here. I hope I explained the problem in enough detail. It seems like a common memory management problem, but I cannot come up with any search terms to search for it.
How about using a bit vector to indicate which of your elements are in use? It's easy to partition it for parallel processing when building your full mesh, and you can use atomic operations on words in the vector and thus avoid locks.
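A sketch of such a bit vector (assuming 64-bit words and GCC/Clang's __builtin_ctzll; fetch_or/fetch_and keep the marking lock-free, and iteration skips unused regions a word at a time, which also makes it easy to hand each thread its own word range):
#include <atomic>
#include <cstdint>
#include <vector>

struct InUseBits {
    std::vector<std::atomic<std::uint64_t>> words;
    explicit InUseBits(std::size_t n) : words((n + 63) / 64) {}

    void set(std::size_t i)   { words[i / 64].fetch_or(std::uint64_t(1) << (i % 64)); }
    void clear(std::size_t i) { words[i / 64].fetch_and(~(std::uint64_t(1) << (i % 64))); }

    // visit every element currently marked in use within [first_word, last_word)
    template <typename F>
    void for_each_set(std::size_t first_word, std::size_t last_word, F f) const {
        for (std::size_t w = first_word; w < last_word; ++w) {
            std::uint64_t bits = words[w].load(std::memory_order_relaxed);
            while (bits) {
                unsigned b = __builtin_ctzll(bits);   // lowest set bit (GCC/Clang)
                f(w * 64 + b);
                bits &= bits - 1;                     // clear it and continue
            }
        }
    }
};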

Concurrent binary chop algorithm

Is there a way (or is it even theoretically possible) to implement a binary search algorithm concurrently? I'm guessing the answer may well be no for two reasons:
Despite lots of Googling I haven't found a concurrent implementation anywhere
Each iterative cycle of the binary chop depends on the values from the previous one, so even if each iteration was a separate thread it would have to block until the previous one completed, making it sequential.
However, I'd like some clarification on this front (and if it is possible, any links or examples?)
At first, it looks like binary search is completely nonparallel. But notice that there are only three possible outcomes:
You hit the element
The element searched for is before the element you hit
The element is after
So we start three parallel processes:
Hit the element
Assume the element is before, search here
Assume the element is after, search there
As soon as we know the result from the first of these, we can kill the one which is not going to find the element. But at the same time, the process that searched in the right spot has doubled the search rate; that is, the current speedup is 2 out of a possible 3.
Naturally, this approach can be generalized if you have more than 3 cores at your disposal. An important aside is that this way of thinking is what is done inside hardware. Look up carry-lookahead adders for instance.
I think you can figure out the answer! To parallelize, there must be some work that can be divided. In the case of binary search, there is nothing that could possibly be divided and parallelized: binary search jumps into the middle of an array of values, and this work cannot be divided, and so on until it finds the solution.
What in your opinion could be parallelized?
If you have n worker threads, you can split the array in n segments and run n binary searches concurrently, combining the results when they are ready. Apart from this cheap trick, I can see no obvious way to introduce parallelism.
You could always try a not-quite-binary search: if you have n cores then you can split the array into n+1 pieces. From there you compare against each of the cut points and see whether the value is larger or smaller than each one; this leaves you with 1/(n+1) of the original search space (e.g. a fifth with four cores) as opposed to half, as you will be able to select a smaller section.
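A serial sketch of that (n+1)-way narrowing (untested; the n probes per step are where the n cores would each do one comparison):
#include <cstddef>
#include <vector>

std::ptrdiff_t nary_search(const std::vector<int>& a, int value, std::size_t n) {
    std::size_t lo = 0, hi = a.size();            // search space is [lo, hi)
    while (lo < hi) {
        const std::size_t base = lo, len = hi - lo;
        // probe n evenly spaced cut points and keep the one sub-range
        // that can still contain the value
        for (std::size_t k = 1; k <= n; ++k) {
            std::size_t cut = base + k * len / (n + 1);
            if (cut < lo) continue;               // duplicate cut, already covered
            if (cut >= hi) break;                 // past the remaining range
            if (a[cut] == value) return static_cast<std::ptrdiff_t>(cut);
            if (a[cut] < value) lo = cut + 1;     // value lies right of this cut
            else hi = cut;                        // value lies left of this cut
        }
    }
    return -1;                                    // not found
}
Each step shrinks the range by roughly a factor of n+1, so the step count drops from log2(N) to log(n+1)(N); whether the per-step synchronization of n cores pays for that is very much workload-dependent.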