In theory, is find_end parallelizable? - c++

I'm currently working on an open-std proposal to bring parallel functionality to the project I am working on, but I've run into a roadblock with find_end.
Now find_end can be described as:
An algorithm that searches for the last subsequence of elements [s_first, s_last) in the
range [first, last). The first version uses operator== to compare the
elements, the second version uses the given binary predicate p.
Its requirements are laid out on cppreference. Now I had no problem parallelizing find/find_if/find_if_not, etc. These could easily be split up into separate partitions that were executed asynchronously, and I had no trouble. The problem with find_end is that splitting the algorithm up into chunks is not a solution, because if we have, say, a vector:
1 2 3 4 5 1 2 3 8
and we want to search for 1 2.
OK, first off, I separate the vector into chunks asynchronously and just search for the range in each chunk, right? Seemed easy enough to me; however, what happens if for some reason there are only 3 available cores, so the vector is separated into 3 chunks:
1 2 3|4 5 1|2 3 8
Now I have a problem: the second 1 2 range is split across different partitions. This is going to lead to a lot of invalid results whenever someone's x cores end up splitting the search results into y different partitions. I was thinking I would do some sort of recursive search: search the chunks -> merge the y chunks into y/2 chunks -> search again -> and so on, but this just seems so inefficient, when the whole point of this algorithm is to improve efficiency. I might be overthinking this ordeal as well.
tl;dr, is there a way to parallelize find_end in a way I am not thinking of?

Yes, there is a way.
Let N be the size of the range you are looking for.
Once you've separated your vector into 3 chunks (3 separate worker threads):
1 2 3|4 5 1|2 3 8
You can allow each thread to run across its right adjacent chunk (if any) for N-1 elements (since only read operations are involved on the sequence, this is perfectly fine and thread-safe).
In this case (N = 2):
Core 1 runs on 1 2 3 4
Core 2 runs on 4 5 1 2
Core 3 runs on 2 3 8
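In code, a minimal sketch of that overlapping-chunk idea might look like this (the name parallel_find_end and the chunking arithmetic are my own, and random-access iterators are assumed):

#include <algorithm>
#include <future>
#include <iterator>
#include <vector>

// Search each chunk, extended m - 1 elements to the right, in its own
// task; the right-most successful result is the overall last match.
template <class It>
It parallel_find_end(It first, It last, It s_first, It s_last,
                     std::ptrdiff_t workers)
{
    const std::ptrdiff_t n = last - first;      // haystack length
    const std::ptrdiff_t m = s_last - s_first;  // needle length
    if (m == 0 || n < m || workers <= 0) return last;

    const std::ptrdiff_t chunk = (n + workers - 1) / workers;
    std::vector<std::future<It>> tasks;

    for (std::ptrdiff_t w = 0; w < workers; ++w) {
        It begin = first + std::min(w * chunk, n);
        It end   = first + std::min((w + 1) * chunk + m - 1, n);
        tasks.push_back(std::async(std::launch::async, [=] {
            It r = std::find_end(begin, end, s_first, s_last);
            return r == end ? last : r;         // normalize "not found" to last
        }));
    }

    It result = last;
    for (auto& t : tasks) {
        It r = t.get();
        if (r != last && (result == last || r > result))
            result = r;                         // keep the right-most match
    }
    return result;
}

Because each extended chunk ends m - 1 elements past its boundary, a match that straddles the boundary is found by the chunk it starts in, while no chunk can see a match that starts outside it; taking the right-most successful result therefore reproduces find_end's answer.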

Since the point of find_end is to find the last occurrence of a needle in a haystack, parallelization by splitting the haystack into contiguous segments often produces no benefit: if the needle is actually in the last segment, the work done by every processor other than the one assigned to the last segment is wasted, and the time is precisely what it would have been with a single processor. In theory, the parallel evaluation lets you cap the maximum search time, which helps if (1) the processors are not in competition with other tasks and (2) there are relatively few instances of the needle in the haystack.
In addition, you need to be able to coordinate process termination: each process can abandon the search when it finds a match or when its younger sibling has either found a match or abandoned the search. Once process 0 has found a match or run out of places to look, the lowest-indexed process with a match wins.
An alternative is to interleave the searches. If you have k processors, then processor 0 is given the sequences which end at last-0, last-k, last-2k..., processor 1 is given the sequences which end at last-1, last-k-1, last-2k-1... and in general processor i (0 ≤ i < k) works on last-i, last-k-i, last-2k-i...
Process coordination is slightly different from the first alternative. Again, each individual process can stop as soon as it finds a match. Also, any process can stop as soon as its current target is less than the latest match found by another process.
While that should result in reasonable parallelization of the search, it's not clear to me that it will do better than a non-parallelized linear-time algorithm such as Knuth-Morris-Pratt or Boyer-Moore, either of which can be trivially modified to search right-to-left. These algorithms are particularly useful in the not uncommon case where the needle is a compile-time constant, since the necessary shift tables can then be precomputed. The non-interleaved parallelization can benefit from KMP or BM as well, with the same caveat as above: it is likely that most of the participating processes will prove not to have been useful.
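For illustration, C++17 already ships precomputed-shift-table searching as std::boyer_moore_searcher. The standard library has no right-to-left variant, so this single-threaded sketch (my own, not part of the answer) finds the last match by repeated forward searches that reuse one set of tables:

#include <algorithm>
#include <functional>
#include <string>

// Find the last occurrence of needle in haystack; the Boyer-Moore
// shift tables are built once and reused by every probe.
std::string::const_iterator find_last(const std::string& haystack,
                                      const std::string& needle)
{
    if (needle.empty()) return haystack.end();
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    auto last_hit = haystack.end();
    auto pos = haystack.begin();
    for (;;) {
        auto hit = std::search(pos, haystack.end(), searcher);
        if (hit == haystack.end()) break;
        last_hit = hit;
        pos = hit + 1;   // resume just past this match's start
    }
    return last_hit;
}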

Why are these algorithms running faster than they should be?

I created a C++ program that outputs the input size vs. execution time (in microseconds) of algorithms and writes the results to a .csv file. Upon importing the .csv into LibreOffice Calc and plotting graphs, I noticed that binary search for input sizes up to 10000 runs in constant time, even though I search for an element not in the array. Similarly, up to the same input size, merge sort seems to run in linear time instead of the linear-logarithmic time it takes in all cases.
Insertion Sort and Bubble Sort run just fine and the output plot resembles their worst case quadratic complexity very closely.
I provide the input arrays from a file. For n = 5, the contents of the file are as follows. Each line represents an input array:
5 4 3 2 1
4 3 2 1
3 2 1
2 1
1
The results.csv file on running insertion sort is:
Input,Time(ms)
5,4
4,3
3,2
2,2
1,2
The graph of binary search for maximum input 100 is here.
Also, the graph of merge sort for maximum input 1000 is here, and it looks a lot like it is linear (the values in the table suggest so too).
Any help as to why this is happening will be greatly appreciated.
Here is a link to the github repository for the source code: https://github.com/dhanraj-s/Time-Complexity
Complexity is about asymptotic worst case behaviour.
...worst case...
Even a quadratic algorithm may fall back to a linear variant if the input allows. Its complexity is still quadratic, because in the worst case the algorithm can only guarantee a quadratic runtime.
...asymptotic...
It might well be that the asymptotic behaviour for the algorithms starts to settle in only for input sizes much bigger than what you chose.
That being said, in practice complexity alone is not the most useful metric; if you care about performance, you need to measure.
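To see why the binary-search plot looks flat: searching 10000 elements costs at most about log2(10000) ≈ 14 comparisons, which is well under a microsecond and therefore below the timer's resolution. A common remedy, sketched below (the sizes and repetition count are arbitrary choices of mine, not the asker's harness), is to time many repetitions and divide:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(10000);
    for (int i = 0; i < 10000; ++i) v[i] = i;    // sorted input

    const int reps = 1'000'000;
    volatile bool found = false;                 // keep the call from being optimized out
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        found = std::binary_search(v.begin(), v.end(), -1);  // element not present
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::nano> per_call = (t1 - t0) / double(reps);
    std::cout << "avg per search: " << per_call.count() << " ns\n";
}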

Algorithm for Enumerating Hamiltonian Cycles of a Complete Graph (Permutations where loops, reverses, wrap-arounds or repeats don't count)

I want to generate all the Hamiltonian Cycles of a complete undirected graph (permutations of a set where loops and reverses count as duplicates, and are left out).
For example, permutations of {1,2,3} are
Standard Permutations:
1,2,3
1,3,2
2,1,3
2,3,1
3,1,2
3,2,1
What I want the program/algorithm to print for me:
1,2,3
Since 3,2,1 is just 1,2,3 backwards, 3,1,2 is just 1,2,3 rotated one place, etc.
I see a lot of discussion on the number of these cycles a given set has, and algorithms to find if a graph has a Hamiltonian cycle or not, but nothing on how to enumerate them in a complete, undirected graph (i.e. a set of numbers that can be preceded or succeeded by any other number in the set).
I would really like an algorithm or C++ code to accomplish this task, or if you could direct me to where there is material on the topic. Thanks!
You can place some restrictions on the output to eliminate the unwanted permutations. Let's say we want to permute the numbers 1, ..., N. To avoid some special cases, assume that N > 2.
To eliminate simple rotations we can require that the first place is 1. This is true, because an arbitrary permutation can always be rotated into this form.
To eliminate reverses we can require that the number at the second place must be smaller than the number at the last place. This is true, because from the two permutations starting with 1 that are reverses of each other, exactly one has this property.
So a very simple algorithm could enumerate all permutations and leave out the invalid ones. Of course optimisations are possible; for example, permutations that do not start with 1 can easily be avoided during the generation step.
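A minimal sketch of that scheme (assuming the set is {1, ..., n}; the variable names are mine): fix 1 in the first slot, permute the remaining elements with std::next_permutation, and keep a permutation only when the element after the 1 is smaller than the last element.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    const int n = 4;
    std::vector<int> rest;                       // the elements 2, ..., n
    for (int i = 2; i <= n; ++i) rest.push_back(i);

    do {
        if (rest.front() < rest.back()) {        // rules out the reversal
            std::cout << 1;                      // fixed leading 1 rules out rotations
            for (int x : rest) std::cout << ',' << x;
            std::cout << '\n';
        }
    } while (std::next_permutation(rest.begin(), rest.end()));
}

For n = 4 this prints exactly (4-1)!/2 = 3 cycles: 1,2,3,4 and 1,2,4,3 and 1,3,2,4.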
An uber-lazy way to check whether a path is the same as one that starts at a different point in the cycle (i.e., the same loop, or the reverse of the same loop) is this:
Decide that by convention all cycles will start from the lowest vertex number and continue in the direction of the lower of the two adjacent ordinals.
Hence, all of the above paths would be described in the same way.
The second useful bit of information here:
If you would like to check that two paths are the same, you can concatenate one with itself and check whether it contains either the second path or the reverse of the second path.
That is,
1 2 3 1 2 3
contains all of the above paths or their reverses. Since the process of finding all Hamiltonian cycles seems much, much slower than the slight inefficiency of this algorithm, I felt I could throw it in :)
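A sketch of that check (the function name same_cycle is mine; vertices are assumed distinct, as they are in a Hamiltonian cycle):

#include <algorithm>
#include <vector>

// True if a and b are the same cycle up to rotation and/or reversal.
bool same_cycle(const std::vector<int>& a, const std::vector<int>& b) {
    if (a.size() != b.size()) return false;
    std::vector<int> doubled(a);
    doubled.insert(doubled.end(), a.begin(), a.end());  // a concatenated with itself
    std::vector<int> rev(b.rbegin(), b.rend());
    return std::search(doubled.begin(), doubled.end(), b.begin(), b.end()) != doubled.end()
        || std::search(doubled.begin(), doubled.end(), rev.begin(), rev.end()) != doubled.end();
}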

Concurrent binary chop algorithm

Is there a way (or is it even theoretically possible) to implement a binary search algorithm concurrently? I'm guessing the answer may well be no for two reasons:
Despite lots of Googling I haven't found a concurrent implementation anywhere
Each iterative cycle of the binary chop depends on the values from the previous one, so even if each iteration was a separate thread it would have to block until the previous one completed, making it sequential.
However, I'd like some clarification on this front (and if it is possible, any links or examples?)
At first glance, it looks like binary search is completely non-parallelizable. But notice that there are only three possible outcomes:
You hit the element
The element searched for is before the element you hit
The element is after
So we start three parallel processes:
Hit the element
Assume the element is before, search here
Assume the element is after, search there
As soon as we know the result from the first of these, we can kill the one that is not going to find the element. At the same time, the process that searched in the right spot has doubled the search rate; that is, the current speedup is 2 out of a possible 3.
Naturally, this approach can be generalized if you have more than 3 cores at your disposal. An important aside is that this way of thinking is what is done inside hardware. Look up carry-lookahead adders for instance.
I think you can figure out the answer! To parallelize, there must be some work that can be divided. In the case of binary search there is nothing that could possibly be divided and parallelized: it probes the middle of an array of values, and that single probe cannot be divided, and so on until it finds the solution.
What in your opinion could be parallelized?
If you have n worker threads, you can split the array into n segments and run n binary searches concurrently, combining the results when they are ready. Apart from this cheap trick, I can see no obvious way to introduce parallelism.
You could always try a not-quite-binary search: if you have n cores, then you can split the array into n+1 pieces. From there you probe each of the cut points and check whether the value is larger or smaller than each cut point; this leaves you with 1/(n+1) of the original search space (a fifth of it with four cores) as opposed to half, as you will be able to select a smaller section.
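A sequential sketch of that cut-point scheme, which is also the generalization mentioned in the earlier answer (the name kary_search is mine; the k probes per round are done in a loop here, but each is independent and could in principle go to its own core):

#include <vector>

// Probe k evenly spaced cut points per round, then recurse into the
// single sub-range the target can still be in.
int kary_search(const std::vector<int>& v, int target, int k) {
    if (k < 1) k = 1;
    int lo = 0, hi = static_cast<int>(v.size());   // half-open [lo, hi)
    while (lo < hi) {
        int step = (hi - lo) / (k + 1);
        if (step == 0) {                           // few elements left: plain scan
            for (int i = lo; i < hi; ++i)
                if (v[i] == target) return i;
            return -1;
        }
        int new_lo = lo, new_hi = hi;
        for (int j = 1; j <= k; ++j) {             // the k cut points
            int cut = lo + j * step;
            if (v[cut] == target) return cut;
            if (v[cut] < target) new_lo = cut + 1;
            else { new_hi = cut; break; }
        }
        lo = new_lo; hi = new_hi;
    }
    return -1;
}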

Efficient way to sort a concatenation of lists (STL), merge sort hint, partially sorted

I have a situation where I get a list of values that are already partially sorted. There are N blocks in my final list, each block is sorted. So I end up having a list of data like this (slashes are just for emphasis):
1 2 3 4 5 6 7 8 / 1 2 3 4 5 / 2 3 4 5 6 7 8 9 / 1 2 3 4
I have these in a vector as a series of pointers to the objects. Currently I just use std::sort with a custom comparator to do the sorting. I would guess this is sub-optimal, as my sequence is something of a degenerate case.
Are there any other stl functions, hints, or otherwise that I could use to provide an optimal sort of such data? (Boost libraries are also fine).
Though I can't easily break up the input data I certainly can determine where the sub-sequences start.
You could try std::merge, although this algorithm can only merge two sorted collections at a time, so you would have to call it in a loop. Also note that std::list provides merge as a member function.
EDIT Actually std::inplace_merge might be an even better candidate.
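For instance, a sketch of the pairwise loop (the name merge_blocks and the bounds convention, block start offsets plus one final end offset, are my own):

#include <algorithm>
#include <vector>

// Merge adjacent sorted blocks pairwise, in place, until one remains.
void merge_blocks(std::vector<int>& v, std::vector<std::size_t> bounds) {
    while (bounds.size() > 2) {
        std::vector<std::size_t> next;
        std::size_t i = 0;
        for (; i + 2 < bounds.size(); i += 2) {    // merge blocks i and i+1
            std::inplace_merge(v.begin() + bounds[i],
                               v.begin() + bounds[i + 1],
                               v.begin() + bounds[i + 2]);
            next.push_back(bounds[i]);
        }
        for (; i < bounds.size(); ++i)             // carry an odd block over
            next.push_back(bounds[i]);
        bounds = std::move(next);
    }
}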
This calls for a “multiway merge”. The standard library doesn’t have an appropriate algorithm for that. However, the parallel extension of the GCC standard library does:
__gnu_parallel::multiway_merge.
You can iterate over all of the lists at once, keeping an index into each list and comparing only the items at those indices.
This can be significantly faster than a regular sort: O(n log k) for k lists (effectively O(n) when k is a small constant) vs. O(n log n), where n is the total number of items in all the lists.
See the Wikipedia article on merge algorithms.
C++ has std::merge for this, but it will not handle multiple lists at once, so you may want to craft your own version that does.
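A common way to craft that version is a min-heap of the current heads, one per list; a sketch (multiway_merge is my own name, not a standard function):

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge via a min-heap of (current value, which run it came from).
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& runs) {
    using Entry = std::pair<int, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);  // next index in each run

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, r] = heap.top();
        heap.pop();
        out.push_back(val);
        if (++pos[r] < runs[r].size())             // refill from the same run
            heap.push({runs[r][pos[r]], r});
    }
    return out;
}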
If you can spare the memory, mergesort will perform very well for this. For best results, merge the smallest two chains at a time, until you only have one.

How to traverse a binary tree in a thread safe way?

I need a way to traverse a binary tree, using multiple threads and store elements that matches a criteria into a list.
How do I do that, in a thread safe way?
As SDG points out, the answer depends a lot on the exact nature of your problem. If you want to decompose the traversal (i.e. traverse in parallel), then you can have threads acting on different sub-trees after, say, level 2. Each thread can then append to its own list, which can be merged/concatenated at a join point. The simplest thing to do is to prevent modifications to the tree while doing a traversal.
I just have to add that you don't keep firing off threads after you reach your level; you only do it once, so at level 2 you fire off a maximum of 4 threads. Each traversal thread treats its subtree as its own rooted tree. You also don't do this unless you have a buttload of nodes and a reasonably balanced tree (buttload is a technical term meaning "measure"). The part of the traversal up to the splitting point is traversed by the UI thread. If this were my problem, I would think long and hard about what it was I needed to achieve, as it may make all the difference.
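A sketch of that decomposition with std::async, splitting one level down rather than two for brevity (the Node and collect names are my own illustration):

#include <future>
#include <vector>

struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Plain traversal; each thread appends to a list it owns, so no locks
// are needed while the tree itself is not being modified.
template <class Pred>
void collect(const Node* n, Pred match, std::vector<int>& out) {
    if (!n) return;
    collect(n->left, match, out);
    if (match(n->value)) out.push_back(n->value);
    collect(n->right, match, out);
}

template <class Pred>
std::vector<int> parallel_collect(const Node* root, Pred match) {
    std::vector<int> result;
    if (!root) return result;

    auto lf = std::async(std::launch::async, [&] {
        std::vector<int> v; collect(root->left, match, v); return v; });
    auto rf = std::async(std::launch::async, [&] {
        std::vector<int> v; collect(root->right, match, v); return v; });

    if (match(root->value)) result.push_back(root->value);  // the split node
    for (auto* f : {&lf, &rf}) {                             // join and concatenate
        auto v = f->get();
        result.insert(result.end(), v.begin(), v.end());
    }
    return result;
}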
Let me add one more thing (is this becoming a Monty Python sketch?). You don't really need to concat or merge the result lists into a new list if all you need is to process the results. Even if you need the results ordered, it is still better to sort each list separately (perhaps in parallel) and then "merge" them in a GetNextItem pull fashion. That way you don't need much additional memory. You can merge 4 lists at once in this fashion by having two "buffers" (which can be pointers/indices to the actual entries). I'm trying to find a way to explain it without drawing a pic:
index:  0 1 2 3 4 5 6 7 8
L1[0]:  4 4 4 5 5 6 8      \
                            B1[L2,3]   (B1 holds 3, taken from L2)
L2[1]:  3 4 5 5 6 7 7 8 9  /
L3[1]:  2 2 4 4 5 5 6 8    \
                            B2[L3,2]   (B2 holds 2, taken from L3)
L4[0]:  2 4 5 5 6 7 7 8 9  /
You keep pulling from whichever buffer satisfies the order you need. If you pull from B2, then you only need to update B2 and its sub-lists (in this case we pulled the 2 from L3 and moved L3's index to the next entry).
You miss a few points that would help with an answer.
If the multiple threads are all read-only in their traversal, the tree does not change for the duration of their traversal, and they all put the found matches into lists that the traversal threads own, then you should have no worries at all.
As you relax any of those constraints, you will need to either add in locking, or other appropriate means of making sure they play nicely together.
The easiest way would be to lock the entry points of the binary tree class, and assume it's locked on the recursive traversal functions (for insertion, lookup, deletion).
If you have many readers and fewer writers, you can use ReaderLocks and WriterLocks to allow concurrent lookups, but completely lock on mutations.
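With C++17 that maps directly onto std::shared_mutex; a coarse-grained sketch (the SafeTree class is my own illustration, and it leaks nodes for brevity):

#include <mutex>
#include <shared_mutex>

// Reader/writer locking at the tree's entry points: any number of
// concurrent lookups, exclusive access for mutation.
class SafeTree {
public:
    bool contains(int key) const {
        std::shared_lock lock(mutex_);   // shared: readers run concurrently
        return find(root_, key);
    }
    void insert(int key) {
        std::unique_lock lock(mutex_);   // exclusive: writer blocks everyone
        root_ = insert(root_, key);
    }
private:
    struct Node { int key; Node* left = nullptr; Node* right = nullptr; };
    static bool find(const Node* n, int k) {
        while (n) {
            if (k == n->key) return true;
            n = k < n->key ? n->left : n->right;
        }
        return false;
    }
    static Node* insert(Node* n, int k) {
        if (!n) return new Node{k};
        if (k < n->key) n->left = insert(n->left, k);
        else if (k > n->key) n->right = insert(n->right, k);
        return n;
    }
    mutable std::shared_mutex mutex_;
    Node* root_ = nullptr;
};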
If you want something finer grained, it will get much more complicated. You'll have to define what you really need from "thread safety", you might have to pare down your binary tree API, and you'll probably need to lock nodes individually, possibly for the duration of a sub-tree traversal.