I have N elements that need to be compared with each other to build a graph. That gives N*(N-1)/2 comparisons in total.
I want to multithread those comparisons, and I have several constraints:
Each element is quite big, it is a matrix actually, so copying all elements in each thread would take too much memory.
Each comparison should occur, meaning I cannot skip one.
A new element can be added to the list at any time. This is tricky, because I need to track what has already been done so that only the new comparisons are run.
Since the number of comparisons could be huge (around 20 million), I cannot have a queue that big.
Lastly, the process can be stopped at any time, and I must be able to resume where I left off, even in a later execution of the app.
So far I have a master thread that holds all the elements and several workers in a thread pool. Each worker compares a list of pairs or a range of elements. I was thinking of a comparison generator that hands out the next X comparisons on demand.
How could I build this generator?
Should I copy every pair for the workers, or use a ReadWriteLock so that the workers read the data directly from the master?
How could I track the progress of every thread?
How could I stop and resume the state of the comparisons?
I am sorry if that's a lot of questions.
Thank you !
Assuming reads are thread-safe (they usually are, as long as no one is writing), a simple solution is to divide the tasks among the worker threads in advance. For instance, for n workers, you could allocate pair (x, y) to worker x mod n. The only communication needed is letting each worker know its ordinal (0…n-1). Each thread should drop its answers into a private array, and the arrays can be collated once every worker has finished.
A more sophisticated model that accommodates varying worker productivity is to push every value 0…N-1 onto a queue. Each worker thread pulls a number, x, off the queue, evaluates every (x, y) pair, and then goes back for another x.
If you want to take the time, it's more efficient to enqueue pairs so as to minimise cache-thrashing. This is a tricky problem. Essentially, you want to enqueue pairs from small clusters of elements so that every pair within a cluster is evaluated at approximately the same time. As tricky as this is, it could make a huge difference to the efficiency of your algorithm.
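As a rough illustration of the queue-based model above, here is a minimal sketch in standard C++ in which an atomic row counter plays the role of the queue and each worker keeps its results in a private vector. The names `Edge` and `compare_all_pairs` and the `compare` callback are hypothetical, and the sketch assumes the elements are only read while the workers run. It does not attempt the cache-friendly clustering, and it does not handle elements added during a run.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One graph edge: the result of comparing element x with element y (x < y).
struct Edge {
    std::size_t x;
    std::size_t y;
    double score;
};

// Evaluates compare(x, y) exactly once for every pair x < y.
// `compare` is a hypothetical callback that reads the shared elements.
std::vector<Edge> compare_all_pairs(
    std::size_t n, unsigned workers,
    const std::function<double(std::size_t, std::size_t)>& compare) {
    assert(workers > 0);
    std::atomic<std::size_t> next_row{0};           // stands in for the queue of x values
    std::vector<std::vector<Edge>> partial(workers); // one private result array per worker

    auto work = [&](unsigned id) {
        for (;;) {
            const std::size_t x = next_row.fetch_add(1); // pull the next x
            if (x >= n) break;                           // no rows left
            for (std::size_t y = x + 1; y < n; ++y) {
                partial[id].push_back({x, y, compare(x, y)});
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned id = 0; id < workers; ++id) pool.emplace_back(work, id);
    for (auto& t : pool) t.join();

    // Collate the private arrays once every worker has finished.
    std::vector<Edge> all;
    for (const auto& p : partial) all.insert(all.end(), p.begin(), p.end());
    return all;
}
```

Because progress is a single counter plus the rows currently in flight, persisting the counter and the finished rows is one possible way to support stop and resume, although that part is not shown here.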
Related
I have a collection C of elements that can be partitioned into nonempty subsets S_1, S_2,..., S_n. There's a routine expensiveCalculation(Subset S) that performs an expensive calculation on all of the elements of a subset.
Elements can be added to C dynamically, and each one will either end up in an existing subset or be the first element of a new one. When a new element is added to a subset, the expensive calculation needs to be rerun for that subset.
I want to limit calls to expensiveCalculation(), so each subset gets an associated timer that starts when a new element is added to it. If any more elements are added to the subset while the timer is active, the timer is reset. Once the timer on S_k times out, it triggers a call to expensiveCalculation(S_k).
What I would like to know is the following:
Does this problem have a textbook name? I have a strong feeling that it should be a fairly common problem.
I need to implement this in the AWS ecosystem, so any advice on which AWS services would be ideal for solving this would be appreciated.
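The timer behaviour described above resembles what is commonly called debouncing. As a language-level illustration only (it says nothing about which AWS service to use), here is a minimal sketch of the bookkeeping: one deadline per subset, pushed back on every addition. The class name `Debouncer`, its methods, and the use of string keys for subsets are all hypothetical.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

// Tracks one resettable timer per subset key.
class Debouncer {
public:
    using Clock = std::chrono::steady_clock;

    explicit Debouncer(Clock::duration quiet_period) : quiet_(quiet_period) {}

    // Call whenever an element is added to subset `key`: starts or resets its timer.
    void touch(const std::string& key, Clock::time_point now) {
        deadline_[key] = now + quiet_;
    }

    // Returns the subsets whose timer has run out and forgets them.
    // The caller would run expensiveCalculation on each returned subset.
    std::vector<std::string> take_expired(Clock::time_point now) {
        std::vector<std::string> due;
        for (auto it = deadline_.begin(); it != deadline_.end();) {
            if (it->second <= now) {
                due.push_back(it->first);
                it = deadline_.erase(it);
            } else {
                ++it;
            }
        }
        return due;
    }

private:
    Clock::duration quiet_;
    std::unordered_map<std::string, Clock::time_point> deadline_;
};
```

Passing the current time in explicitly keeps the sketch deterministic and easy to test; a real service would poll `take_expired` periodically or schedule a wake-up for the earliest deadline.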
I have a worker class, and I can submit jobs to the worker. The worker keeps these jobs and runs them sequentially in order of priority (the priority can be basically any unsigned int). For this case std::priority_queue or even a std::set/map could be used to store jobs ordered by priority, and the worker would then be able to extract them in order in O(1). Adding jobs would be O(log N).
Now, the requirement that I have is to be able to change the priority of any submitted job. In the case of std::set/map I'd need to remove the job and add it back with a different priority. This would be O(log N), and on top of that set/map would reallocate nodes internally afaik (this might possibly be avoided with C++17 though). What makes it unusual is that in my case I'll update job priorities far more often than I schedule or execute jobs. Basically I might schedule a job once, and before it's executed I may end up updating its priority thousands of times. In fact, the priority of each job will change something like 10-20 times a second.
In my case it's reasonably safe to assume that I won't have more than 10K jobs in the queue. At the start of my process I expect it always to grow to 10K or so jobs; as these jobs are removed the queue should eventually be almost empty all the time, and occasionally 10-50 new jobs would be added, but it shouldn't grow beyond 1000 jobs. Jobs would be removed at a rate of a few jobs a second. Because of my weird requirement for such frequent priority updates, std::priority_queue or a set doesn't seem like a good fit. A plain std::list seems to be a better choice: a priority change or update/removal is O(1), and when I need to remove a job it's O(N) to walk the entire list to find the highest-priority item, which should happen less frequently than modifying priorities.
One other observation is that even though job priorities change often, these changes do not necessarily result in an ordering change, e.g. I could possibly simply update the key element of my set (by casting away constness or making the key mutable?) if that change would still keep the modified element between its left and right nodes. What would you suggest for such a priority queue? Any boost container or custom data structure design is OK.
In case of set/map I use priority as a key. To make keys unique in my case each key is actually two integers: job sequence number (derived from atomic int that I increment for each new request) and actual priority number. This way if I add multiple jobs with the same priority, they will be executed in order they were scheduled, as sequence numbers would keep them ordered.
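For reference, the set-based design described in this question could look roughly like the sketch below. The names `JobKey`, `RunsBefore`, `JobSet`, and `change_priority` are hypothetical, and the sketch assumes that a larger priority number runs first, which the question does not state. It shows the erase-and-reinsert cost that the question wants to avoid.

```cpp
#include <cstdint>
#include <set>

// Composite key: the priority plus a sequence number that makes keys unique.
struct JobKey {
    unsigned priority;
    std::uint64_t sequence;
};

// Orders jobs so that *begin() is the next job to run:
// higher priority first (an assumption), then lower sequence number first.
struct RunsBefore {
    bool operator()(const JobKey& a, const JobKey& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.sequence < b.sequence;
    }
};

using JobSet = std::set<JobKey, RunsBefore>;

// Changing a priority means removing the key and inserting a new one: O(log N).
inline void change_priority(JobSet& jobs, const JobKey& old_key, unsigned new_priority) {
    auto it = jobs.find(old_key);
    if (it == jobs.end()) return;            // unknown job: nothing to do
    const JobKey updated{new_priority, it->sequence};
    jobs.erase(it);
    jobs.insert(updated);
}
```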
A simple priority heap should fit your requirements. Insertion, removal and priority change are all O(log n). But you said that a priority change usually does not change the order. So with a priority heap, when you change a priority you check the changed item against its parent and its 2 children, and if none of the heap conditions is violated, no up-heap or down-heap action is required. The full O(log n) time will therefore be needed only rarely; in practice it will be more like O(1).
Now for efficient operation it is crucial that given an item I you can find the position of that item in the heap in O(1) and access the parent and children.
If the heap simply contains the items in an array then that is all just pointer arithmetic. The drawback is that reordering the heap means copying the items.
If you store pointers to items in the heap then you also have to store a back reference to the position in the heap in the items themselves. When you reorder the heap you then only swap the pointers and update the back references.
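The back-reference idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name `IndexedHeap` is hypothetical, jobs are identified by small integer ids rather than pointers, a larger priority number is assumed to run first, and the sequence-number tie-break is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

const std::size_t kNotInHeap = static_cast<std::size_t>(-1);

// Max-heap of job ids; pos_[id] is the back reference to the id's heap slot.
class IndexedHeap {
public:
    explicit IndexedHeap(std::size_t capacity)
        : prio_(capacity, 0), pos_(capacity, kNotInHeap) {}

    bool empty() const { return heap_.empty(); }
    bool contains(std::size_t id) const { return pos_[id] != kNotInHeap; }

    void push(std::size_t id, unsigned priority) {
        assert(id < pos_.size() && !contains(id));
        prio_[id] = priority;
        pos_[id] = heap_.size();
        heap_.push_back(id);
        sift_up(pos_[id]);
    }

    // Changes a priority. Returns true only if the heap had to be reordered;
    // when parent and children are still consistent this is O(1).
    bool update(std::size_t id, unsigned priority) {
        assert(contains(id));
        prio_[id] = priority;
        const std::size_t i = pos_[id];   // O(1) lookup through the back reference
        if (sift_up(i)) return true;
        return sift_down(i);
    }

    // Removes and returns the id with the highest priority.
    std::size_t pop() {
        assert(!empty());
        const std::size_t top = heap_.front();
        swap_slots(0, heap_.size() - 1);
        heap_.pop_back();
        pos_[top] = kNotInHeap;
        if (!heap_.empty()) sift_down(0);
        return top;
    }

private:
    void swap_slots(std::size_t a, std::size_t b) {
        std::swap(heap_[a], heap_[b]);
        pos_[heap_[a]] = a;               // keep back references in sync
        pos_[heap_[b]] = b;
    }
    bool sift_up(std::size_t i) {
        bool moved = false;
        while (i > 0) {
            const std::size_t parent = (i - 1) / 2;
            if (prio_[heap_[parent]] >= prio_[heap_[i]]) break;
            swap_slots(i, parent);
            i = parent;
            moved = true;
        }
        return moved;
    }
    bool sift_down(std::size_t i) {
        bool moved = false;
        for (;;) {
            std::size_t best = i;
            const std::size_t l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap_.size() && prio_[heap_[l]] > prio_[heap_[best]]) best = l;
            if (r < heap_.size() && prio_[heap_[r]] > prio_[heap_[best]]) best = r;
            if (best == i) break;
            swap_slots(i, best);
            i = best;
            moved = true;
        }
        return moved;
    }

    std::vector<std::size_t> heap_;  // heap order, stores job ids
    std::vector<unsigned> prio_;     // prio_[id] = current priority
    std::vector<std::size_t> pos_;   // pos_[id] = slot in heap_, or kNotInHeap
};
```

With jobs 0, 1, 2 at priorities 10, 20, 5, changing job 2 to priority 7 leaves the heap untouched, while changing it to 30 moves it to the top.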
Basically you are looking for an IndexPriorityQueue. You can implement your own variant of the index priority queue based on your requirements.
An index priority queue allows you to decrease or increase a key, i.e. you can raise and lower the priority of your jobs.
Here is a Java implementation, IndexMinQueue; I hope it helps.
I have got this problem:
Find the first element in a list, for which a given condition holds.
Unfortunately, the list is quite long (100,000 elements), and evaluating the condition for every element takes about 30 seconds in total on a single thread.
Is there a way to cleanly parallelize this problem? I have looked through all the TBB patterns, but could not find one that fits.
UPDATE: for performance reason, I want to stop as early as possible when an item is found and stop processing the rest of the list. That's why I believe I cannot use parallel_while or parallel_do.
I'm not too familiar with libraries for this, but just thinking aloud, could you not have a group of threads iterating at the same stride from different starting points?
Say you decide to have n threads (= number of cores or whatever). Each thread is given its own starting offset below n: the first thread starts on begin(), and the next item it compares is begin() + n, and so on; the second thread starts on begin() + 1, and its next comparison is n items further on, and so on.
This way you can have a group of threads iterating in parallel through the list; the iteration itself is presumably not expensive, just the comparison. No node will be compared more than once, and you can have some condition which is set when any of the threads finds a match and which every thread checks before iterating/comparing.
I think it's pretty straightforward to implement(?)
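Here is a minimal sketch of the strided idea using only the standard library. It assumes the elements sit in a `std::vector` (or can be copied into one) so that each thread can jump by its stride cheaply; the function name `parallel_find_first` is hypothetical. Instead of a plain found flag it shares the smallest matching index seen so far, so that the result is the first match in list order and not just any match.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the index of the first element satisfying pred, or items.size() if none does.
template <typename T, typename Pred>
std::size_t parallel_find_first(const std::vector<T>& items, Pred pred, unsigned n_threads) {
    assert(n_threads > 0);
    std::atomic<std::size_t> best{items.size()};  // smallest matching index so far

    auto scan = [&](std::size_t start) {
        for (std::size_t i = start; i < items.size(); i += n_threads) {
            // Indices only grow, so once we are past a known match we can stop.
            if (i >= best.load()) return;
            if (pred(items[i])) {
                std::size_t cur = best.load();
                // Publish i only if it is smaller than the current best.
                while (i < cur && !best.compare_exchange_weak(cur, i)) {
                }
                return;
            }
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) pool.emplace_back(scan, t);
    for (auto& th : pool) th.join();
    return best.load();
}
```

A thread gives up only when its next index is at or beyond a match that is already known, so the thread that owns the true first match always reaches and evaluates it.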
I think the best way to solve this problem with TBB is parallel_pipeline.
There should be (at least) two stages in the pipeline. The 1st stage is serial; it just reads the next element from the list and passes it to the 2nd stage. This 2nd stage is parallel; it evaluates the condition of interest for a given element. As soon as the condition is met, the second stage sets a flag (which should be either atomic or protected with a lock) to indicate that a solution is found. The first stage must check this flag and stop reading the list once the solution is found.
Since condition evaluation is performed in parallel for a few elements, it can happen that a found element is not the first suitable one in the list. If this is important, you also need to keep an index of the element, and when a suitable solution is found you detect whether its index is less than that of a previously known solution (if any).
HTH.
ok, I have done it this way:
Put all elements into a tbb::concurrent_bounded_queue<Element> elements.
Create an empty tbb::concurrent_vector<Element> results.
Create a boost::thread_group, and create several threads that run this logic:
logic to run in parallel:
    Element e;
    // Stop as soon as any thread has published a result or the queue is empty.
    while (results.empty() && elements.try_pop(e)) {
        if (slow_and_painfull_check(e)) {
            results.push_back(e);
        }
    }
So when the first element is found, all other threads will stop processing the next time they check results.empty().
It is possible that two or more threads are working on an element for which slow_and_painfull_check returns true, so I just put the result into a vector and deal with this outside of the parallel loop.
After all threads in the thread group have finished, I check all elements in the results and use the one that comes first.
you can take a look at http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html for parallel algorithms implementations.
And in particular you need find_if algorithm http://www.cplusplus.com/reference/algorithm/find_if/
I see two opportunities for parallelism here: evaluating one element on multiple threads, or evaluating multiple elements at once on different threads.
There isn't enough information to determine either the difficulty or the effectiveness of evaluating one element on multiple threads. If this is easy, the time spent on each element could be reduced.
I do not see a clean fit into TBB for this problem. There are issues with lists not having random access iterators, determining when to stop, and guaranteeing the first element is found. There may be some games you can play with the ranges to get it to work though.
You could use some lower-level thread constructs to implement this yourself as well, but there are a number of places where incorrect results could be returned. To prevent such errors, I would recommend using an existing algorithm. You could convert the list to an array (or some other structure with random access iterators) and use the experimental libstdc++ Parallel Mode find_if algorithm user383522 referenced.
If it's a linked list, a parallel search isn't going to add much speed. However, linked lists tend to perform poorly with caches. You may get a tiny performance increase if you have two threads: one does the find_first_element, and one simply iterates through the list, making sure not to get more than X (100?) ahead of the first thread. The second thread doesn't do any comparisons, but will ensure that the items are cached as well as possible for the first thread. This may help your time, or it might make little difference, or it might hinder. Test everything.
Can't you transform the list into a balanced tree or similar? Such data structures are easier to process in parallel, and usually you win back the overhead you paid for balancing it in the first place... For example, if you write functional-style code, check this paper: Balanced trees inhabiting functional parallel programming
If you are using GCC, libstdc++'s parallel mode (built on GNU OpenMP) provides parallel versions of the standard algorithms
link
I've never heard of the Intel tbb library but a quick open and scan of the Tutorial led me to parallel_for which seems like it will do the trick.
I do not have formal CS training, so bear with me.
I need to do a simulation, which can be abstracted to the following (omitting the details):
We have a list of real numbers representing the times of events. In each step, we
remove the first event, and
as a result of "processing" it, a few other events may get inserted into the list at a strictly later time
and repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant time insertion, then I have trouble finding the position where to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets and keep track of how many events there are in each bucket. I then process the buckets one by one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.
A min-heap might suit your needs. There's an explanation here and I think STL provides the priority_queue for you.
Insertion time is O(log N), removal is O(log N)
It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
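As an illustration of the `std::greater` remark, here is a minimal sketch of an event loop built on `std::priority_queue`. The alias `EventQueue`, the function `run_simulation`, and the rule for spawning follow-up events are all made up for the example; a real simulation would substitute its own processing step.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// With std::greater, top() is the smallest time, i.e. the earliest event.
using EventQueue =
    std::priority_queue<double, std::vector<double>, std::greater<double>>;

// Processes at most max_steps events and returns their times in processing order.
std::vector<double> run_simulation(EventQueue& events, std::size_t max_steps) {
    std::vector<double> processed;
    while (!events.empty() && processed.size() < max_steps) {
        const double now = events.top();  // earliest pending event
        events.pop();                     // O(log N)
        processed.push_back(now);
        // Hypothetical processing rule: every event before t = 3 spawns
        // one follow-up event at a strictly later time.
        if (now < 3.0) events.push(now + 1.5);  // O(log N)
    }
    return processed;
}
```

Starting from events at 2.0, 0.5 and 1.0, this processes eight events in nondecreasing time order, from 0.5 up to 4.0.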
I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue
A balanced binary search tree keeps its items sorted and has faster access times than a linear list. Search, insert and delete times are O(log(n)).
But it depends whether the items have to be sorted all the time, or only after the process is finished. In the latter case a hash table is probably faster. At the end of the process you then would copy the items to an array or a list and sort it.
I have about 18 million elements in an array that are initialized and ready to be used by a simple manager called ElementManager (this number will climb to a little more than a billion in later iterations of the program). A class, A, which must use the elements, communicates with ElementManager, which returns the next available element for consumption. That element is now in use and cannot be reused until it is recycled, which may happen often. Class A is concurrent, that is, it can ask ElementManager for an available element from several threads. Each element in this case is an object that stores three vertices to make a triangle.
Currently, the ElementManager is using Intel TBB concurrent_bounded_queue called mAllAvailableElements. There is also another container (a TBB concurrent_vector) that contains all elements, regardless of whether they are available for use or not, called mAllElements. Class A asks for the next available element, the manager tries to pop the next available element from the queue. The popped element is now in use.
Now when class A has done what it has to do, control is handed to class B which now has to iterate through all elements that are in use and create meshes (to take advantage of concurrency, the array is split into several smaller arrays to create submeshes which scales with the number of available threads - the reason for this is that creating a mesh must be done serially). For this I am currently iterating over the container mAllElements (this is also concurrent) and grabbing any element that is in use. The elements, as mentioned above, contain polygonal information to create meshes. Iteration in this case takes a long time as it has to check each element and query whether it is in use or not, because if it is not in use then it should not be part of a mesh.
Now imagine that only 1 million of the possible 18 million elements are in use (but more than 5-6 million have been recycled). Worse yet, constant updates to only part of the mesh (which happen concurrently) mean that the in-use elements are fragmented throughout the mAllElements container.
I thought about this for quite some time now and one flawed solution that I came up with was to create another queue of elements named mElementsInUse, which is also a concurrent_queue. I can push any element that is now in use. Problem with this approach is that since it is a queue, any element in that queue can be recycled at any time (an update in a part of the mesh) and declared not in use and since I can only pop the front element, this approach fails. The only other approach I can think of is to defragment the concurrent_vector mAllElements every once in a while when no operations are taking place.
I think my approach to this problem is wrong and thus my post here. I hope I explained the problem in enough detail. It seems like a common memory management problem, but I cannot come up with any search terms to search for it.
How about using a bit vector to indicate which of your elements are in use? It's easy to partition it for parallel processing when building your full mesh, and you can use atomic operations on words in the vector and thus avoid locks.
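A minimal sketch of such a bit vector follows, assuming one bit per element index and 64-bit words. The class name `AtomicBitVector` and its methods are hypothetical. Each worker building a submesh would call `collect_in_use` on its own range of words, and whole words of zeros are skipped without looking at individual elements.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per element: 1 = in use, 0 = available. Lock-free set/clear.
class AtomicBitVector {
public:
    explicit AtomicBitVector(std::size_t n_bits)
        : n_bits_(n_bits), words_((n_bits + 63) / 64) {
        for (auto& w : words_) w.store(0);
    }

    std::size_t word_count() const { return words_.size(); }

    // Marks element i as in use; returns whether it was already in use.
    bool set(std::size_t i) {
        assert(i < n_bits_);
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        return (words_[i / 64].fetch_or(mask) & mask) != 0;
    }

    // Marks element i as recycled; returns whether it was in use before.
    bool clear(std::size_t i) {
        assert(i < n_bits_);
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        return (words_[i / 64].fetch_and(~mask) & mask) != 0;
    }

    bool test(std::size_t i) const {
        assert(i < n_bits_);
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        return (words_[i / 64].load() & mask) != 0;
    }

    // Appends the indices of in-use elements in words [first_word, last_word).
    // Different threads can scan disjoint word ranges in parallel.
    void collect_in_use(std::size_t first_word, std::size_t last_word,
                        std::vector<std::size_t>& out) const {
        for (std::size_t w = first_word; w < last_word; ++w) {
            const std::uint64_t bits = words_[w].load();
            if (bits == 0) continue;  // 64 unused elements skipped at once
            for (unsigned b = 0; b < 64; ++b) {
                if (bits & (std::uint64_t{1} << b)) out.push_back(w * 64 + b);
            }
        }
    }

private:
    std::size_t n_bits_;
    std::vector<std::atomic<std::uint64_t>> words_;
};
```

At 18 million elements the vector occupies a little over 2 MB, so scanning it is far cheaper than touching every element to ask whether it is in use.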