How does mapPartitions behave in a loop?

I want to understand how the mapPartitions function behaves in the following code. Does it create separate partitions in each iteration and assign them to the nodes, or are the partitions and the mapping of partitions to nodes preserved across iterations?
Ideally I would like to keep the same partitioning for the whole loop.
for i in range(10):
    x = rdd.mapPartitions(fun).reduce(lambda a, b: a + b)

It depends. If rdd is cached, the partitions will be computed once and preserved across iterations, unless there is some kind of failure and a task is rescheduled on another worker. Otherwise the RDD will be recomputed in each iteration. In that case the answer depends on the lineage of the rdd. If there is no shuffling involved, or you use deterministic partitioning and ordering, then the answer is positive. Otherwise it is unlikely you'll see the same content in each iteration.
If you are concerned about performance, then caching will be enough. If you are thinking about performing some side effects inside mapPartitions and you want these to be preserved between iterations, then you cannot depend on that.

Related

Data structure for a priority queue of jobs with priority that changes often

I have a worker class, and I can submit jobs to the worker. The worker keeps these jobs and runs them sequentially in order of priority (priority can be any unsigned int, basically). For this case std::priority_queue or even a std::set/map could be used to store jobs ordered by priority, and then the worker would be able to extract them in order in O(1). Adding jobs would be O(log N).
Now, the requirement that I have is to be able to change the priority of any submitted job. In the case of std::set/map I'd need to remove the job and add it back with a different priority. This would be O(log N), and on top of that, with set/map it would reallocate nodes internally afaik (this might possibly be avoided with C++17 though). What makes it unusual is that in my case I'll update job priorities way more often than scheduling or executing them. Basically I might schedule a job once, and before it's executed I may end up updating its priority thousands of times. In fact, the priority of each job will be changed something like 10-20 times a second.
In my case it's reasonably safe to assume that I won't have more than 10K jobs in the queue. At the start of my process I expect it always to grow to 10K or so jobs, and as these jobs are removed the queue should eventually be almost empty all the time; occasionally there would be 10-50 new jobs added, but it shouldn't grow to more than 1000 jobs. Jobs would be removed at a rate of a few jobs a second. Because of my unusual requirement of very frequent priority updates, std::priority_queue or a set doesn't seem like a good fit. A plain std::list seems to be a better choice: priority change or update/removal is O(1), and when I need to remove jobs it's O(N) to walk the entire list to find the highest priority item, which should happen less frequently than modifying priorities.
One other observation that even though job priorities change often, these changes do not necessarily result in ordering change, e.g. I could possibly simply update key element of my set (by casting away constness or making key mutable?) if that change would still keep that modified element between left and right nodes. What would you suggest for such priority queue? Any boost container or custom data structure design is OK.
In case of set/map I use priority as a key. To make keys unique in my case each key is actually two integers: job sequence number (derived from atomic int that I increment for each new request) and actual priority number. This way if I add multiple jobs with the same priority, they will be executed in order they were scheduled, as sequence numbers would keep them ordered.

A simple priority heap should fit your requirements. Insertion, removal and priority change are all O(log n). But you said that usually the priority change would not result in a change in the order. So in the case of a priority heap, when you change the priority you would check the changed item against the parent and the 2 children, and if none of the heap conditions are violated, no up or down heap action is required. So the full O(log n) time will only rarely be needed. In practice it will be more like O(1).
Now for efficient operation it is crucial that given an item I you can find the position of that item in the heap in O(1) and access the parent and children.
If the heap simply contains the items in an array then that is all just pointer arithmetic. The drawback is that reordering the heap means copying the items.
If you store pointers to items in the heap, then you also have to store a back reference to the position in the heap in the items themselves. When you reorder the heap you then only swap the pointers and update the back references.
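To make the pointer-plus-back-reference idea concrete, here is a minimal sketch of such a heap. The names Job, JobHeap and heap_pos are made up for illustration, higher numbers are assumed to mean higher priority, and the sequence-number tie-breaking from the question is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical job record; heap_pos is the back reference into the heap array.
struct Job {
    unsigned priority = 0;
    std::size_t heap_pos = 0;
};

// Max-heap of Job pointers: the highest priority sits at index 0.
class JobHeap {
public:
    void push(Job* j) {
        j->heap_pos = heap_.size();
        heap_.push_back(j);
        sift_up(j->heap_pos);
    }

    Job* top() const { return heap_.front(); }
    bool empty() const { return heap_.empty(); }

    Job* pop() {
        assert(!heap_.empty());
        Job* result = heap_.front();
        place(0, heap_.back());   // move the last item to the root
        heap_.pop_back();
        if (!heap_.empty()) sift_down(0);
        return result;
    }

    // Both sifts stop after one comparison round when the new priority
    // still fits between the parent and the children.
    void update(Job* j, unsigned new_priority) {
        j->priority = new_priority;
        sift_up(j->heap_pos);
        sift_down(j->heap_pos);
    }

private:
    // Store the pointer and keep its back reference in sync.
    void place(std::size_t i, Job* j) { heap_[i] = j; j->heap_pos = i; }

    void sift_up(std::size_t i) {
        Job* j = heap_[i];
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;
            if (heap_[parent]->priority >= j->priority) break;
            place(i, heap_[parent]);
            i = parent;
        }
        place(i, j);
    }

    void sift_down(std::size_t i) {
        Job* j = heap_[i];
        for (;;) {
            std::size_t child = 2 * i + 1;
            if (child >= heap_.size()) break;
            if (child + 1 < heap_.size() &&
                heap_[child + 1]->priority > heap_[child]->priority) ++child;
            if (heap_[child]->priority <= j->priority) break;
            place(i, heap_[child]);
            i = child;
        }
        place(i, j);
    }

    std::vector<Job*> heap_;
};
```

Only pointers move during reordering, so the Job objects themselves stay where they are and the worker can call update(&job, new_priority) directly on a job it already holds.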

Basically you are looking for an IndexPriorityQueue. You can implement your own variant of the index priority queue based on your requirements.
An index priority queue allows you to decrease or increase a key, i.e. you can increase and decrease the priority of your jobs.
The following is a Java implementation of the IndexMinQueue, hope it helps you. IndexMinQueue

How to parallelize std::partition using TBB

Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?
Here is what I'm thinking:
1. if the array is small, std::partition it (serial) and return
2. else, treat the array as 2 interleaved arrays using custom iterators (interleave in cache-sized blocks)
3. start a parallel partition task for each pair of iterators (recurse to step 1)
4. swap elements between the two partition/middle pointers*
5. return the merged partition/middle pointer
*I am hoping in the average case this region will be small compared to the length of the array or compared to the swaps required if partitioning the array in contiguous chunks.
Any thoughts before I try it?

I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, has Θ(N) work, and Θ(P+ lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.
I'm assuming the input is a random-access sequence. The steps are:
1. Allocate a temporary array big enough to hold a copy of the input sequence.
2. Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
3. Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
4. Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
5. Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.
There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.
If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.
Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.
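The steps above can be sketched roughly as follows. This is not TBB code: plain std::thread (one thread per block) stands in for tbb::parallel_for to keep the example self-contained, block_partition and num_blocks (the K above) are made-up names, T is assumed to be default-constructible, and the offsets of the second parts are counted forward from the end of the first parts rather than backward from the end of the output, which gives the same positions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Partitions data so that elements satisfying pred come first.
// Returns the index of the first element of the second group.
template <typename T, typename Pred>
std::size_t block_partition(std::vector<T>& data, Pred pred, std::size_t num_blocks) {
    const std::size_t n = data.size();
    if (n == 0) return 0;
    num_blocks = std::max<std::size_t>(1, std::min(num_blocks, n));
    const std::size_t block_len = (n + num_blocks - 1) / num_blocks;
    num_blocks = (n + block_len - 1) / block_len;  // drop empty trailing blocks

    std::vector<T> temp(n);                    // temporary copy of the input
    std::vector<std::size_t> mid(num_blocks);  // count of "true" items per block

    auto block_begin = [&](std::size_t b) { return b * block_len; };
    auto block_end = [&](std::size_t b) { return std::min(n, (b + 1) * block_len); };

    // Move each block into temp and partition it there, one thread per block.
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        workers.emplace_back([&, b] {
            auto first = temp.begin() + block_begin(b);
            auto last = std::move(data.begin() + block_begin(b),
                                  data.begin() + block_end(b), first);
            mid[b] = static_cast<std::size_t>(std::partition(first, last, pred) - first);
        });
    }
    for (auto& w : workers) w.join();
    workers.clear();

    // Serial exclusive scans: destination offset of each block's two parts.
    std::vector<std::size_t> true_off(num_blocks), false_off(num_blocks);
    std::size_t total_true = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        true_off[b] = total_true;
        total_true += mid[b];
    }
    std::size_t running = total_true;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        false_off[b] = running;
        running += (block_end(b) - block_begin(b)) - mid[b];
    }

    // Move both parts of each block back to their final places, in parallel.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        workers.emplace_back([&, b] {
            auto first = temp.begin() + block_begin(b);
            auto middle = first + mid[b];
            auto last = temp.begin() + block_end(b);
            std::move(first, middle, data.begin() + true_off[b]);
            std::move(middle, last, data.begin() + false_off[b]);
        });
    }
    for (auto& w : workers) w.join();
    return total_true;
}
```

With a real task scheduler you would hand the blocks to a fixed pool instead of spawning one thread per block.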

Why not parallelize something similar to std::partition_copy instead? The reasons are:
for std::partition, in-place swaps as in Adam's solution require logarithmic complexity due to recursive merge of the results.
you'll pay memory for parallelism anyway when using the threads and tasks.
if the objects are heavy, it is more reasonable to swap (shared) pointers anyway
if the results can be stored concurrently then threads can work independently.
It's pretty straightforward to apply a parallel_for (for random-access iterators) or tbb::parallel_for_each (for non-random-access iterators) to start processing the input range. Each task can store the 'true' and 'false' results independently. There are lots of ways to store the results, some off the top of my head:
using tbb::parallel_reduce (only for random-access iterators), store the results locally to the task body and move-append them in join() from another task
use tbb::concurrent_vector's method grow_by() to copy local results in a bunch or just push_back() each result separately on arrival.
cache thread-local results in tbb::combinable TLS container and combine them later
The exact semantics of std::partition_copy can be achieved by copy from the temporary storage from above or
(only for random-access output iterators) use atomic<size_t> cursors to synchronize where to store the results (assuming there is enough space)
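As an illustration of the atomic-cursor variant, here is a minimal sketch. It uses std::thread and std::atomic instead of TBB to stay self-contained; parallel_partition_copy and its parameters are made-up names, and both outputs are assumed to be pre-sized to hold the whole input.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Copies elements of in that satisfy pred into out_true and the rest into
// out_false. Returns how many elements were written to each output.
template <typename T, typename Pred>
std::pair<std::size_t, std::size_t> parallel_partition_copy(
        const std::vector<T>& in, std::vector<T>& out_true,
        std::vector<T>& out_false, Pred pred, unsigned num_threads) {
    assert(out_true.size() >= in.size() && out_false.size() >= in.size());
    if (num_threads == 0) num_threads = 1;
    std::atomic<std::size_t> true_cursor{0}, false_cursor{0};
    const std::size_t chunk = (in.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(in.size(), t * chunk);
        const std::size_t end = std::min(in.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                // fetch_add hands every writer a unique output slot
                if (pred(in[i])) out_true[true_cursor.fetch_add(1)] = in[i];
                else out_false[false_cursor.fetch_add(1)] = in[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return {true_cursor.load(), false_cursor.load()};
}
```

The relative order of elements in each output is not preserved, and one atomic increment per element can become a bottleneck; reserving slots in batches per thread would reduce the contention.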

Your approach should be correct, but why not follow the regular divide-and-conquer (or parallel_for) method? For two threads:
split the array in two. Turn your [start, end) into [start, middle), [middle, end).
run std::partition on both ranges in parallel.
merge the partitioned results. This can be done with a parallel_for.
This should make better use of the cache.
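For the two-thread case, a minimal sketch could look like this. The merge here is a serial std::rotate rather than the parallel_for suggested above, simply to keep the example short, and two_way_partition is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Partitions data with two threads and returns the index of the first
// element that does not satisfy pred.
template <typename T, typename Pred>
std::size_t two_way_partition(std::vector<T>& data, Pred pred) {
    auto first = data.begin();
    auto last = data.end();
    auto middle = first + (last - first) / 2;

    auto left_mid = first;
    // Left half on a second thread, right half on the calling thread.
    std::thread left([&] { left_mid = std::partition(first, middle, pred); });
    auto right_mid = std::partition(middle, last, pred);
    left.join();

    // Now [left_mid, middle) holds "false" items and [middle, right_mid)
    // holds "true" items; rotating swaps the two runs into place.
    auto result = std::rotate(left_mid, middle, right_mid);
    return static_cast<std::size_t>(result - first);
}
```

Only the elements between the two middle pointers are touched by the merge, so it stays cheap when the "false" run of the left half and the "true" run of the right half are short.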

It seems to me like this should parallelize nicely, any thoughts before I try it?
Well... maybe a few:
There's no real reason to create more tasks than you have cores. Since your algorithm is recursive, you also need to make sure you stop creating additional threads once you reach your limit, because it would just be needless effort.
Keep in mind that splitting and merging the arrays costs you processing power, so set the split size in a way which won't actually slow your calculations down. Splitting a 10-element array can be tempting, but won't get you where you want to be. Since the complexity of std::partition is linear, it's fairly easy to overestimate the speed of the task.
Since you asked and gave an algorithm, I hope you actually need parallelization here. If so - there's nothing much to add, the algorithm itself looks really fine :)

Approximate sort (array/vector), predictable runtime

Background:
I need to process some hundred thousand events (producing results) given a hard time limit. The clock is literally ticking, and when the timer fires, whatever is done at that point must be flushed out.
What isn't ready by that time is either discarded (depending on an importance metric) or processed during the next time quantum (with an "importance boost", i.e. adding a constant to the importance metric).
Now ideally, the CPU is much faster than needed, and the whole set is ready a long time before the end of the time slice. Unluckily, the world is rarely ever ideal, and "hundred thousands" becomes "tens of millions" before you know.
Events are added to the back of a queue (which is really a vector) as they come in, and are processed from the front during the respective next quantum (so the program always processes the last quantum's input).
However, not all events are equally important. In case the available time is not sufficient, it would be preferable to drop unimportant events rather than important ones (this is not a strict requirement, since important events will be copied to the next time quantum's queue, but doing so further adds to the load so it isn't a perfect solution).
The obvious thing to use would be, of course, a priority queue / heap. Unluckily, heapifying 100k elements isn't precisely a free operation either (or parallel), and then I end up with objects being in some non-obvious and not necessarily cache-friendly memory locations, and pulling elements from a priority queue doesn't parallelize nicely.
What I would really like is something like a vector that is sorted or at least "somewhat approximately sorted", which one can traverse sequentially afterwards. This would trivially allow me to create e.g. 12 threads (or any other number, one per CPU) that process e.g. 1/64 of the range (or another size) each, slowly advancing from the front to the end, and eventually dropping/postponing what's left over -- which will be events of little importance that can be discarded.
Simply sorting the complete range using std::sort would be the easiest, most straightforward solution. However, the time it takes to sort items reduces the time available to actually process elements within the fixed time budget, and sorting time is for the most part single-CPU time (and parallel sort isn't that great either).
Also, doing a perfect sort (which isn't really needed) may bring forth worst case complexity whereas an approximate sort should ideally perform at its optimum and have a very predictable cost.
tl;dr
So, what I'm looking for is a way to sort an array/vector only approximately, but fast, and with a predictable (or guaranteed) runtime.
The sort key would be a small integer typically between 10 and 1000. Being postponed to the next time quantum might increase ("priority boost") that value by a small amount, e.g. 100 or 200.
In a different question where humans are supposed to do an approximate sort using "subjective compare"(?) shell sort was proposed. On various sorting demo applets, it seems like at least for the "random shuffle" input that's typical in these, shell sort can indeed do an "approximate sort" that doesn't look too bad with 3-4 passes over the data (and at least the read-tap is strictly sequential). Unluckily it seems to be somewhat of a black art to choose gap values that work well, and runtime estimates seem to involve a lot of looking into the crystal ball as well.
Comb sort with a relatively large shrink factor (such as 2 or 3?) seems tempting as well, since it visits memory strictly sequentially (on both taps) and is able to move far out elements by a great distance quickly. Again, judging from sorting demo applets, it seems like 3-4 passes already give a rather reasonable "approximate sort".
MSD radix sort comes to mind, though I am not sure how it would perform given typical 16/32bit integers in which most of the most significant bits are all zero! One would probably have to do an initial pass to find the most significant bit in the whole set, followed by 2-3 actual sort passes?
Is there a better algorithm or a well-known working approach with one of the algorithms I mentioned?

What comes to mind is to iterate over the vector and if some event is less important, don't process it but put it aside. As soon as the entire vector has been read, have a look at the events put aside. Of course you can use several buckets with different priorities. And only store references there, you don't want to move megabytes of data. (posted as an answer now as requested by Damon)

Use a separate vector for each priority. Then you don't need to sort them.
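A minimal sketch of the one-vector-per-priority idea, assuming the sort key is a small integer as stated in the question. Event, bucket_by_importance and max_importance are made-up names, and the buckets store indices rather than copies of the events.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event record with a small integer importance.
struct Event {
    unsigned importance;
    int payload;
};

// Returns one bucket per importance level; buckets[k] holds the indices of
// all events with importance k, in arrival order.
std::vector<std::vector<std::size_t>> bucket_by_importance(
        const std::vector<Event>& events, unsigned max_importance) {
    std::vector<std::vector<std::size_t>> buckets(max_importance + 1);
    for (std::size_t i = 0; i < events.size(); ++i) {
        // Clamp so that a boosted importance cannot index out of range.
        unsigned k = std::min(events[i].importance, max_importance);
        buckets[k].push_back(i);
    }
    return buckets;
}
```

Processing then walks the buckets from the highest index down and stops when the timer fires. The cost is a single pass over the events, independent of their order.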

Sounds like a nice example where near-sort algorithms can be useful.
About a decade ago Chazelle developed a nice data structure that works somewhat like a heap. The key difference is the time complexity: it has constant amortized time for all important operations, e.g. insert, remove, find lowest element, etc.
The trick of this data structure is that it breaks the O(n*log n) complexity barrier by allowing for some error in the sort order.
To me that sounds pretty much what you need. The data-structure is called soft heap and explained on wikipedia:
https://en.wikipedia.org/wiki/Soft_heap
There are other algorithms that allow for some error in favor of speed as well. You'll find them if you google for "Near Sort Algorithms".
If you try that algorithm, please give some feedback. I'm really eager to hear how it performs in practice.

Sounds like you want to use std::partition: move the part that interests you to the front, and the others to the back. Its complexity is in the order of O(n), but it is cache-friendly, so it's probably a lot faster than sorting.
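A minimal sketch of that idea, with a made-up Event type and a hypothetical importance threshold that decides what counts as interesting:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event record with a small integer importance.
struct Event {
    unsigned importance;
    int payload;
};

// Moves all events with importance >= threshold to the front and returns
// how many there are; the relative order inside each group is not kept.
std::size_t move_important_to_front(std::vector<Event>& events, unsigned threshold) {
    auto mid = std::partition(events.begin(), events.end(),
                              [threshold](const Event& e) { return e.importance >= threshold; });
    return static_cast<std::size_t>(mid - events.begin());
}
```

The workers would then process the first part of the vector and postpone or drop the rest.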

If you have limited "bandwidth" in processing events (say 128K per time quantum), you could use std::nth_element to select the 128K (minus some percentage lost due to making that computation) most promising events (assuming you have an operator< that compares priorities) in O(N) time. Then you process those in parallel, and when you are done, you reprioritize the remainder (again in O(N) time).
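#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<Event> events;
std::size_t const guaranteed_bandwidth = 1 << 17; // 128K is always possible to process
double const loss_from_nth_element = 0.9; // assumed fraction left after the selection cost; tune by measurement
if (events.size() <= guaranteed_bandwidth) {
    // let all N workers loose on [begin(events), end(events)) range
} else {
    auto const nth = static_cast<std::ptrdiff_t>(guaranteed_bandwidth * loss_from_nth_element);
    // assumes operator< orders the most valuable events first
    std::nth_element(begin(events), begin(events) + nth, end(events));
    // let all N workers loose on [begin(events), begin(events) + nth) range
    // reprioritize [begin(events) + nth, end(events)) range and append to events for next time quantum
}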
This guarantees that in the case that your bandwidth threshold is reached, you process the most valuable elements first. You could even speed up the nth_element by a poor man's parallelization (e.g. let each of N workers compute M*128K/N best elements for small M in parallel, and then do a final merge and another nth_element on the M*128K elements).
The only weakness is that in case your system is really overloaded (billions of events, maybe due to some DoS attack) it could take more than the entire quantum to run nth_element (even when quasi-parallelized) and you actually process nothing. But if the processing time per event is much larger (say a few 1,000 cycles) than a priority comparison (say a dozen cycles), this should not happen under regular loads.
NOTE: for performance reasons, it's of course better to sort pointers/indices into the main event vector, this is not shown for brevity.

If you have N worker threads, give each worker thread 1/Nth of the original unsorted array. The first thing the worker will do is run your approximate fast sorting algorithm of preference on its individual piece of the array. Then, they can each process their array piece in order, roughly performing higher priority items first, and also being very cache friendly. This way, you don't take a hit for trying to sort the entire array, or even trying to approximately sort the entire array; and what little sorting there is, is entirely parallelized. Sorting 10 pieces individually is much cheaper than sorting the whole thing.
This would work best if the priorities of items to process are randomly distributed. If there is some ordering to them you'll wind up with a thread being flooded by or starved of high priority items to process.
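A minimal sketch of the per-worker layout. std::sort stands in for whatever approximate sort you prefer, and sort_chunks and num_workers are made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Splits items into num_workers contiguous pieces and sorts each piece on
// its own thread; the pieces are not merged afterwards.
template <typename T, typename Compare>
void sort_chunks(std::vector<T>& items, unsigned num_workers, Compare comp) {
    if (num_workers == 0) num_workers = 1;
    const std::size_t chunk = (items.size() + num_workers - 1) / num_workers;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(items.size(), w * chunk);
        const std::size_t end = std::min(items.size(), begin + chunk);
        workers.emplace_back([&items, begin, end, comp] {
            std::sort(items.begin() + begin, items.begin() + end, comp);
            // A real worker would now process items[begin, end) in this
            // order until the time quantum runs out.
        });
    }
    for (auto& t : workers) t.join();
}
```

Passing std::greater<int>() (or a comparator on the importance field) puts the most important items at the front of each piece.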

tbb: parallel find first element

I have got this problem:
Find the first element in a list, for which a given condition holds.
Unfortunately, the list is quite long (100,000 elements), and evaluating the condition for each element takes about 30 seconds in total using a single thread.
Is there a way to cleanly parallelize this problem? I have looked through all the tbb patterns, but could not find any fitting.
UPDATE: for performance reason, I want to stop as early as possible when an item is found and stop processing the rest of the list. That's why I believe I cannot use parallel_while or parallel_do.

I'm not too familiar with libraries for this, but just thinking aloud, could you not have a group of threads iterating at the same stride from different starting points?
Say you decide to have n threads (= number of cores or whatever). Each thread is given a specific starting point below n: the first thread starts on begin() and the next item it compares is begin() + n, and so on; the second thread starts on begin() + 1 and its next comparison is also n items further, etc.
This way you can have a group of threads iterating in parallel through the list; the iteration itself is presumably not expensive, just the comparison. No node will be compared more than once, and you can have some condition which is set when a match is made by any of the threads, which all threads check before iterating/comparing.
I think it's pretty straightforward to implement(?)
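A minimal sketch of the strided search, assuming the elements sit in a random-access container such as std::vector (parallel_find_first is a made-up name). Instead of a plain found flag it keeps the smallest matching index in an atomic, so the result is the first match even if a later one is found earlier in time.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the index of the first element satisfying pred, or items.size()
// if there is none.
template <typename T, typename Pred>
std::size_t parallel_find_first(const std::vector<T>& items, Pred pred,
                                unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::atomic<std::size_t> best{items.size()};  // smallest match so far

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t checks indices t, t + n, t + 2n, ...
            for (std::size_t i = t; i < items.size(); i += num_threads) {
                if (i >= best.load()) break;  // an earlier match is known
                if (!pred(items[i])) continue;
                // Lower best to i unless another thread got in below i.
                std::size_t seen = best.load();
                while (i < seen && !best.compare_exchange_weak(seen, i)) {}
                break;
            }
        });
    }
    for (auto& w : workers) w.join();
    return best.load();
}
```

A thread only stops once its own index has passed the best match known so far, so threads that are still before that match keep checking.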

I think the best way to solve this problem with TBB is parallel_pipeline.
There should be (at least) two stages in the pipeline. The 1st stage is serial; it just reads the next element from the list and passes it to the 2nd stage. This 2nd stage is parallel; it evaluates the condition of interest for a given element. As soon as the condition is met, the second stage sets a flag (which should be either atomic or protected with a lock) to indicate that a solution is found. The first stage must check this flag and stop reading the list once the solution is found.
Since condition evaluation is performed in parallel for a few elements, it can happen that a found element is not the first suitable one in the list. If this is important, you also need to keep an index of the element, and when a suitable solution is found you detect whether its index is less than that of a previously known solution (if any).
HTH.

ok, I have done it this way:
Put all elements into a tbb::concurrent_bounded_queue<Element> elements.
Create an empty tbb::concurrent_vector<Element> results.
Create a boost::thread_group, and create several threads that run this logic:
logic to run in parallel:
Element e;
while (results.empty() && elements.try_pop(e)) {
    if (slow_and_painfull_check(e)) {
        results.push_back(e);
    }
}
So when the first element is found, all other threads will stop processing the next time they check results.empty().
It is possible that two or more threads are working on an element for which slow_and_painfull_check returns true, so I just put the result into a vector and deal with this outside of the parallel loop.
After all threads in the thread group have finished, I check all elements in the results and use the one that comes first.

you can take a look at http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html for parallel algorithms implementations.
And in particular you need find_if algorithm http://www.cplusplus.com/reference/algorithm/find_if/

I see two opportunities for parallelism here: evaluating one element on multiple threads, or evaluating multiple elements at once on different threads.
There isn't enough information to determine the difficulty nor the effectiveness of evaluating one element on multiple threads. If this is easy, the 30 second per element time could be reduced.
I do not see a clean fit into TBB for this problem. There are issues with lists not having random access iterators, determining when to stop, and guaranteeing the first element is found. There may be some games you can play with the ranges to get it to work though.
You could use some lower level thread constructs to implement this yourself as well, but there are a number of places for incorrect results to be returned. To prevent such errors, I would recommend using an existing algorithm. You could convert the list to an array (or some other structure with random access iterators) and use the experimental libstdc++ Parallel Mode find_if algorithm user383522 referenced.

If it's a linked list, a parallel search isn't going to add much speed. However, linked lists tend to perform poorly with caches. You may get a tiny performance increase if you have two threads: one does the find_first_element, and one simply iterates through the list, making sure not to get more than X (100?) ahead of the first thread. The second thread doesn't do any comparisons, but will ensure that the items are cached as well as possible for the first thread. This may help your time, or it might make little difference, or it might hinder. Test everything.

Can't you transform the list to a balanced tree or similar? Such data structures are easier to process in parallel - usually you get back the overhead you may have paid in making it balanced in the first place... For example, if you write functional-style code, check this paper: Balanced trees inhabiting functional parallel programming

If you are using GCC, GNU OpenMP provides parallel std functions
link

I've never heard of the Intel tbb library but a quick open and scan of the Tutorial led me to parallel_for which seems like it will do the trick.

How to repeatedly insert elements into a sorted list fast

I do not have formal CS training, so bear with me.
I need to do a simulation, which can be abstracted away to the following (omitting the details):
We have a list of real numbers representing the times of events. In each step, we
remove the first event, and
as a result of "processing" it, a few other events may get inserted into the list at a strictly later time
and repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant time insertion, then I have trouble finding the position where to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets, and keep track of how many events there are in each bucket. Then I process the buckets one by one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.

A min-heap might suit your needs. There's an explanation here and I think STL provides the priority_queue for you.
Insertion time is O(log N), removal is O(log N)

It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
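A minimal sketch of what that looks like for the simulation in the question. EventQueue and process_next are made-up names, and the delays argument is a stand-in for whatever "processing" an event produces.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Min-queue of event times: top() is the earliest pending event.
using EventQueue =
    std::priority_queue<double, std::vector<double>, std::greater<double>>;

// Removes the earliest event and schedules follow-up events, each one
// `delay` time units after it. Returns the time of the processed event.
double process_next(EventQueue& queue, const std::vector<double>& delays) {
    assert(!queue.empty());
    const double now = queue.top();
    queue.pop();                                   // O(log N)
    for (double d : delays) queue.push(now + d);   // O(log N) each
    return now;
}
```

If the events carry more than a time stamp, the same pattern works with a struct and a comparator that orders by time.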

I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue

A balanced binary search tree keeps its items sorted and has faster access times than a linear list. Search, insert and delete times are O(log(n)).
But it depends whether the items have to be sorted all the time, or only after the process is finished. In the latter case a hash table is probably faster. At the end of the process you then would copy the items to an array or a list and sort it.
