Parallel "insertions" into a binary trie in Haskell - concurrency

I have a list of n-bit "words"
type BitWord = [Bool]
and a trie which stores the word from the top to bottom:
data Tree = Bs Tree Tree -- Bs (zero_bit) (one_bit)
          | X            -- incomplete word
          | B            -- final bit of word
I have a function:
seenPreviously :: BitWord -> Tree -> (Tree,Bool)
The function steps through the bits in the BitWord, while descending through the Tree going left at a zero bit and vice versa. We return a new tree with this BitWord "merged in", along with True if we had to add subtrees at some point (i.e. the BitWord was not in the trie already) and False otherwise.
I map this function over a [BitWord], passing the Tree as state.
My question is this: Could this benefit from parallelism offered by Control.Parallel? And if so, how can I reason about laziness and evaluation only to weak head normal form, etc.?
My instinct is that I can be inserting (actually building a subtree) down a left branch while doing the same thing down the right branch, as two independent threads. Something like:
parallelMapSt :: [BitWord] -> Tree -> [Bool]
parallelMapSt [] _ = []
parallelMapSt (w:ws) t = let (b,t') = seenPreviously w t
                             bs = parallelMapSt ws t'
                         in t' `par` bs `pseq` (b:bs)
The thread evaluating b is dependent on some previously sparked threads (the ones belonging to the BitWords that share some common prefix with w), but not all of them, so it would seem that there is opportunity to do work in parallel here, but I'm really not sure.

Looks like a great candidate for the use of par when traversing the tree... Much like the binary-trees benchmark. Try writing some programs on this type, and measuring the effect of par.

Returning whether a word was in the trie unnecessarily sequentializes your program. If you really do need this information, it will probably be difficult to parallelize efficiently.
However, if we can rephrase the problem a bit such that order and disposition of insertions doesn't matter, the problem is pretty straightforward:
import Control.Parallel

data Tree = Bs Bool         -- ^ is an empty word inserted here?
               (Maybe Tree) -- ^ '0' subtree
               (Maybe Tree) -- ^ '1' subtree
  deriving Show

insertMany :: [[Bool]] -> Maybe Tree
insertMany [] = Nothing
insertMany xss = hasEnd `par` fs `par` ts `pseq` Just (Bs hasEnd fs ts)
  where
    hasEnd = any null xss
    fs = insertMany [ xs | False : xs <- xss ]
    ts = insertMany [ xs | True  : xs <- xss ]
I don't have multiple cores at the moment, so I can't test this, but it should scale well. We've basically got a parallel radix sort in just a few lines -- not too shabby!
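To sanity-check the structure even without multiple cores, a lookup helper can be bolted on. The `member` function below is hypothetical (it is not part of the answer above), but it follows directly from the `Tree` type:

```haskell
import Control.Parallel (par, pseq)

data Tree = Bs Bool (Maybe Tree) (Maybe Tree)
  deriving Show

insertMany :: [[Bool]] -> Maybe Tree
insertMany [] = Nothing
insertMany xss = hasEnd `par` fs `par` ts `pseq` Just (Bs hasEnd fs ts)
  where
    hasEnd = any null xss
    fs = insertMany [ xs | False : xs <- xss ]
    ts = insertMany [ xs | True  : xs <- xss ]

-- Hypothetical helper: was this word one of the inserted ones?
member :: [Bool] -> Maybe Tree -> Bool
member _          Nothing           = False
member []         (Just (Bs e _ _)) = e
member (False:bs) (Just (Bs _ f _)) = member bs f
member (True:bs)  (Just (Bs _ _ t)) = member bs t

main :: IO ()
main = do
  let t = insertMany [[True,False], [False]]
  print (member [True,False] t)  -- True
  print (member [True] t)        -- False
```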

Why don't you just try it and see? Time the execution of the program with 1 thread and several, and see if there's a difference. Sparks in Haskell are really very cheap, so don't worry if you create a lot of them.

Related

How can I implement Tail Recursion on list in Haskell to generate sublists?

I want to make the biggersort function recall itself recursively on the tail of the list. My code below works and gives me the following output:
[6,7,8]
I want to continue by starting next with 3 then 1 .. to the last element.
What I want is something like :
[[6,7,8],[3,5,7,8],[1,5,7,8]..]
My code:
import Data.List
import System.IO
import Data.List
import System.IO

biggersort :: Ord a => [a] -> [a]
biggersort [] = []
biggersort (p:xs) = [p] ++ (biggersort greater)
  where
    greater = filter (>= p) xs

main = do
  print $ biggersort [6,3,1,5,2,7,8,1]
You can access the list starting from each position using tails, so that a particularly concise implementation of the function you want in terms of the function you have is:
biggersorts = map biggersort . tails
There is one very noticeable downside (and one less noticeable downside) to this implementation, though. The noticeable downside is that computation on shorter lists is repeated when processing longer lists, leading to O(n^2) best-case time; the less obvious downside is that there is no memory sharing between elements of the result list, leading to O(n^2) worst-case memory usage. These bounds can be improved to O(n) average-case/O(n^2) worst-case time* and O(n) worst-case memory usage.
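For concreteness, here is the tails-based version as a complete program. Note the trailing `[]`, which comes from the empty suffix that tails always produces and which the question's desired output elides:

```haskell
import Data.List (tails)

biggersort :: Ord a => [a] -> [a]
biggersort [] = []
biggersort (p:xs) = p : biggersort (filter (>= p) xs)

-- Run biggersort from every starting position.
biggersorts :: Ord a => [a] -> [[a]]
biggersorts = map biggersort . tails

main :: IO ()
main = print (biggersorts [6,3,1,5,2,7,8,1])
-- [[6,7,8],[3,5,7,8],[1,5,7,8],[5,7,8],[2,7,8],[7,8],[8],[1],[]]
```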
The idea is to start from the end of the list, and build towards the front. At each step, we look at all the rest of the results to see if there's one we can reuse. So:
import Data.List (find)
import Data.Maybe (fromMaybe)

biggersorts :: Ord a => [a] -> [[a]]
biggersorts [] = [[]]
biggersorts (a:as) = (a:as') : rec where
  as' = fromMaybe [] (find bigger rec)
  rec = biggersorts as
  bigger (a':_) = a <= a'
  bigger _ = False -- True is also fine
Beware that because of the sharing, it can be tricky to compare performance of these two fairly. The usual tricks for fully evaluating the output of this function don't play super nicely with sharing; so it's tricky to write something that obviously fully evaluates the outputs but also visits each subterm O(1) times (I think it's less important that this latter property be obvious). The most obvious way to do it (to me) involves rewriting both functions using some mildly advanced techniques, so I will elide that here.
* It seems like it ought to be possible to improve this to O(n*log(n)) worst-case time. During the recursion, build a cache :: Map a [a] to go with the rec :: [[a]]; the intuition for cache is that it tells which element of rec to use at each existing "boundary" a value. Updating the cache at each step involves splitting it at the current value and throwing away the bottom half.
I find the average-case time analysis harder for this variant; is it still O(n)? It seems plausible that the Map has O(1) average size during the run of this variant, but I wasn't able to convince myself of this guess. If not, is there another variant that achieves O(n) average-case/O(n*log(n)) best-case? ...is there one that does O(n) worst-case? (Probably the same counting argument used to bound the runtime of sorting rules this out...?)
Dunno!
You can make a function that will use biggersort as a helper function, so:
biggersort :: Ord a => [a] -> [a]
biggersort [] = []
biggersort (p:xs) = p : biggersort (filter (>= p) xs)

biggersorts :: Ord a => [a] -> [[a]]
biggersorts [] = []
biggersorts xa@(_:xs) = biggersort xa : biggersorts xs

main = print (biggersorts [6,3,1,5,2,7,8,1])
This then prints:
Prelude> biggersorts [6,3,1,5,2,7,8,1]
[[6,7,8],[3,5,7,8],[1,5,7,8],[5,7,8],[2,7,8],[7,8],[8],[1]]

Check if two lists of lists are the same size, including all elements of the inner lists

Let's say, I have two lists of lists like this:
["hello", "hi", "hey"] and ["apple", "an", "hey"]
And I want to check if the two lists contain the same amount of lists and the corresponding inner lists contain the same amount of elements. In this case, this function should return True. It should also work for endless lists (so in a case like [[1..], [1..]] and [[1..], [3,4,5]] it should return False, since only the first list's second element is endless; the second list's second element isn't).
So far, I've come up with this:
listsSize (x:xs) (y:ys)
  | length x == length y && length xs == length ys = True
  | otherwise = False
But it only checks if the lists contain the same number of lists, it doesn't evaluate the inner lists.
A simple (though not quite fully featured) approach is to start by implementing a simpler function that just checks whether two lists have the same shape:
sameShape :: [a] -> [b] -> Bool
Since length never finishes for infinite lists, you'll need to use pattern matching directly:
sameShape [] [] = ...
sameShape [] (y:ys) = ...
sameShape (x:xs) [] = ...
sameShape (x:xs) (y:ys) = ...
The caveat is that sameShape can't really ever return True if both lists are infinite; it should be able to get the right answer in a finite amount of time for all other kinds of inputs, though.
To check that a nested list has the same shape, you can check that the outer list has the same shape and that each pairing of inner lists has the same shape:
sameShapes :: [[a]] -> [[b]] -> Bool
sameShapes xss yss = and (sameShape xss yss : zipWith sameShape xss yss)
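Filling in the blanks one plausible way (this completion is mine, not the answer's), the compositional version runs like this:

```haskell
-- One possible completion of sameShape: walk both lists in lockstep.
sameShape :: [a] -> [b] -> Bool
sameShape []     []     = True
sameShape []     (_:_)  = False
sameShape (_:_)  []     = False
sameShape (_:xs) (_:ys) = sameShape xs ys

-- Same-shaped outer lists, and pairwise same-shaped inner lists.
sameShapes :: [[a]] -> [[b]] -> Bool
sameShapes xss yss = and (sameShape xss yss : zipWith sameShape xss yss)

main :: IO ()
main = do
  print (sameShapes ["hello", "hi", "hey"] ["apple", "an", "hey"])  -- True
  print (sameShapes [[1,2],[3]] [[4],[5,6]])                        -- False
```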
This is a nice compositional solution, but it suffers from a problem: it's pretty hard to know which order to make the calls to sameShape in when some lists are infinite and others not. Here are the possibilities, each with an example input where it could in principle compute the answer but spins forever:
Check the outer lists' shapes first. (This is what sameShapes above does.) Then checking the lists repeat [] and "abc" : repeat [] will never finish.
Check the first elements' shapes first. Checking the lists [repeat 'a', ""] and [repeat 'a', "abc"] will never finish.
In fact, the previous bullet generalizes: if you pick any particular element position's shapes first, there's a counterexample. If you look at index n first, then checking the lists formed by putting repeat 'a' at index n in both inputs and ""/"abc" at index n+1 will never finish.
So, unfortunately, these checks need to be interleaved with each other. There's a fairly standard trick called "diagonalization" that makes this kind of thing possible. Think of the outer list being the collection of rows and each inner list being a collection of column values for that row:
    1   2   3   4  . . .
a  a1  a2  a3  a4  . . .
b  b1  b2  b3  b4  . . .
c  c1  c2  c3  c4  . . .
d  d1  d2  d3  d4  . . .
.   .   .   .   .
.   .   .   .   .
.   .   .   .   .
You can make sure you visit every position by ordering them according to how far away they are from the top left, so in this order:
_/ _/ _/
|/_ _/ _/
_/ _/
_/ _/
|/_ _/
_/
_/
|/_
(etc.) These are the diagonals that justify the name "diagonalization". If our two nested lists have different shapes, this guarantees that eventually we will try to look at an index that is out of bounds for one but not the other.
A very efficient implementation is kind of complicated. Fortunately, an inefficient version isn't too hard. We'll slowly look at more and more rows, tracking the visible ones in xss and yss and the ones we haven't started looking at in xss' and yss'. Each time we take a look, we'll look at one column in each row. You can see some more description of a similar solution to another problem in another answer of mine. Here's how it looks for this problem*:
sameShapesDiagonal :: [[a]] -> [[b]] -> Bool
sameShapesDiagonal = go [] [] where
  go xss yss xss' yss' = done || noMismatchYet where
    done = and ([null xss', null yss'] ++ map null xss ++ map null yss)
    noMismatchYet = True
      && null xss' == null yss'
      && and (zipWith (\xs ys -> null xs == null ys) xss yss)
      && go (xssh ++ map (drop 1) xss) (yssh ++ map (drop 1) yss) xsst ysst
    (xssh, xsst) = splitAt 1 xss'
    (yssh, ysst) = splitAt 1 yss'
If given any mismatching shapes, this function will return False in finite time. If given matching finite shapes, this function will return True in finite time. If given matching shapes with some infinite list, it will never return; but it is not possible to both return and satisfy the previous two correctness properties in such a case.
Here are some examples of using it:
> sameShapesDiagonal ["hello", "hi", "hey"] ["apple", "an", "hey"]
True
> sameShapesDiagonal [[1..], [1..]] [[1..], [3,4,5]]
False
> sameShapesDiagonal [[1..], [1..]] [[3,4,5], [1..]]
False
> sameShapesDiagonal (repeat []) ("abc" : repeat [])
False
> sameShapesDiagonal [repeat 'a', ""] [repeat 'a', "abc"]
False
* Lest you copy and paste this as your own solution to a school exercise, let me assure you that there are multiple things about my implementation that would raise the eyebrows of somebody expecting to grade beginners' code. To avoid suspicion, you will need to understand this answer, then write your own implementation (either of the simple first algorithm or the more complicated second algorithm) using only techniques that are comfortable to you and your peers.
You can't* check whether a list is endless without restricting how you can make endless lists. Given that, it is also impossible to check whether two lists are both endless: suppose we had such a function f :: [a] -> [a] -> Bool. Then I could write isEndless xs = f xs xs to implement the "is this list endless" check.
If we restrict the question to be "at most one of the lists can be endless", then it becomes possible, via the same method that you would use to check whether they have the same length, which is to walk both of them simultaneously until one of them runs out. This will, as expected, loop if you give it two infinite lists.
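A minimal sketch of that lockstep walk, under the "at most one of the lists can be endless" restriction (the name sameLength is mine):

```haskell
-- Advance through both lists together; True only if they run out together.
-- Loops forever if both lists are endless, as discussed above.
sameLength :: [a] -> [b] -> Bool
sameLength []     []     = True
sameLength (_:xs) (_:ys) = sameLength xs ys
sameLength _      _      = False

main :: IO ()
main = do
  print (sameLength "abc" [1,2,3])  -- True
  print (sameLength [1..] "hi")     -- False (endless vs. finite)
```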
P.S.:
if b then True else False
is the same as just b. Similarly, your guard can be converted to just length x == length y && length xs == length ys
* Assuming I had an isEndless :: [a] -> Bool which only takes finite time, I could solve the halting problem. Sketch: assume I have some types Program and Step, respectively representing a program and a single step of its execution (e.g. in a Turing machine). Then I could write steps :: Program -> [Step] giving me a list of the "steps" of execution a program will make, because I can simulate the program running. Note how the result of steps is potentially infinite. Then,
halts :: Program -> Bool
halts p = not (isEndless (steps p))
telling me for an arbitrary program whether it halts.

C stack overflow in Haskell, while handling strings

I want to find all the possible partitions of a string into a list of non-empty strings.
For example, if I give "sun" as input,
I want to create this output: [["s","u","n"], ["s","un"], ["su","n"], ["sun"]].
I have created a simple function with recursion, but it gives a stack overflow error that I can't fix. Please help:
partition :: String -> [[String]]
partition w = [[head w]:fix | fix <- partition (tail w)]
           ++ [((head w):fix):fixfix | (fix:fixfix) <- partition (tail w)]
The essential problem is that you're missing the base case for the recursion, so you have an infinite loop.
The simple thing is just to replace the head/tail mess with pattern matching, which will solve that problem as a side effect.
partition [] = [[]]
partition (w:ws) =
  [[w]:fix | fix <- partition ws] ++
  [(w:fix):fixfix | (fix:fixfix) <- partition ws]
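As a quick check, the repaired function reproduces the question's expected output:

```haskell
partition :: [a] -> [[[a]]]
partition [] = [[]]
partition (w:ws) =
  [[w]:fix | fix <- partition ws] ++
  [(w:fix):fixfix | (fix:fixfix) <- partition ws]

main :: IO ()
main = print (partition "sun")
-- [["s","u","n"],["s","un"],["su","n"],["sun"]]
```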
This turns out to work okay, somewhat to my surprise. Why was I surprised? I figured that, with optimization, GHC would use common subexpression elimination to rewrite it to
partition [] = [[]]
partition (w:ws) =
  [[w]:fix | fix <- partitionws] ++
  [(w:fix):fixfix | (fix:fixfix) <- partitionws]
  where partitionws = partition ws
That would be bad: it would save the entire partition ws calculation across the ++, using lots of memory. But it seems GHC is clever enough these days not to do that.
To be more confident, you can avoid the common subexpression, by accumulating a "continuation" explaining how you'll process each element you produce.
part :: ([[a]] -> [b]) -> [a] -> [b]
part f [] = f []
part f (w:ws) =
  part (\fix -> f ([w]:fix)) ws ++
  part (\q -> case q of
          [] -> []
          fix:fixfix -> f ((w:fix):fixfix)) ws

partition = part (:[])
For reasons I don't know, this version is a couple times faster than the simple one.
If you don't care about the order in which the elements are produced, you can avoid the space leak risk much more simply (and perhaps even faster) by doing something like this:
partition [] = [[]]
partition (w:ws) =
  [ q
  | m <- partition ws
  , q <- ([w]:m) : [(w:fix):fixfix | fix:fixfix <- [m]]
  ]
This is almost as simple as the simplest solution.

How to re-arrange a list based on a set of steps?

I'm trying to re-arrange a list based on these steps:
First move every operator (+,-,*) 1 index to the left by switching it with the element to the left.
Then find any '+' or '-' two indexes ahead of a *, and move the '+' or '-' to the index before the *.
Example
["a","-","2","*","b","+","c"]
["-","a","*","2","+","b","c"]
["-","a","+","*","2","b","c"]
I have an imperative programming background, so my initial idea was to have an iterator as an argument, and keep track of the position in the index like that, but I could not get it to work. My second idea was to embrace Haskell and use list comprehension with generators, but I struggled there as well. Any ideas or solutions are appreciated!
You can make use of explicit recursion. You can for example move elements one position to the left with:
isOperator :: String -> Bool
isOperator "+" = True
isOperator "-" = True
isOperator "*" = True
isOperator _ = False

stepOne :: [String] -> [String]
stepOne (x:xs@(o:xs'))
  | isOperator o = o : x : stepOne xs'
  | otherwise = x : stepOne xs
stepOne xa@[_] = xa
stepOne [] = []
Here the (x:xs@(o:xs')) pattern matches lists with two or more elements. The first element is x, the second is o, and the remaining elements are stored in xs'; xs is the tail of the "outer" cons. We check if o is an operator: if it is, we swap it with x and recurse on the tail xs'. If o is not an operator, we emit x and recurse on xs.
For the given sample data, we get:
Prelude> stepOne ["a","-","2","*","b","+","c"]
["-","a","*","2","+","b","c"]
I leave step two as an exercise.
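For completeness, here is one possible reading of step two in the same style. This is a sketch of my interpretation of the rule, not a definitive spec: it moves a "+" or "-" that sits two positions after a "*" to just before that "*":

```haskell
-- Sketch: when a "+" or "-" appears two positions after a "*",
-- move it to just before the "*". Guard failure falls through
-- to the next equation, which simply keeps the element.
stepTwo :: [String] -> [String]
stepTwo ("*":x:p:xs)
  | p == "+" || p == "-" = p : "*" : x : stepTwo xs
stepTwo (x:xs) = x : stepTwo xs
stepTwo [] = []

main :: IO ()
main = print (stepTwo ["-","a","*","2","+","b","c"])
-- ["-","a","+","*","2","b","c"]
```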
That being said, one of the success stories of Haskell is probably parsing. Several libraries and tools exist, like parsec [Hackage] and attoparsec [Hackage]. happy [haskell.org] is a compiler compiler that can construct a parser in Haskell based on a grammar. You thus do not necessarily need to perform infix-to-prefix conversion yourself, but can let tools do the work for you.

Generate all permutations of a list including different sizes and repeated elements

I wanted to create the function genAllSize ::[a] -> [[a]], that receives a list l and generates all the lists sorted by size that can be built with the elements of the list l; i.e.
> genAllSize [2,4,8]
[[],[2],[4],[8],[2,2],[4,2],[8,2],[2,4],[4,4],[8,4],[2,8],[4,8],[8,8],[2,2,2],[4,2,2],[8,2,2], ...
How would you do it? I came up with a solution using permutations from Data.List but I do not want to use it.
Given an input list xs, select a prefix of that in a non deterministic way
For each element in the prefix, replace it with any element of xs, in a non deterministic way
Result:
> xs = [2,4,8]
> inits xs >>= mapM (const xs)
[[],[2],[4],[8],[2,2],[2,4],[2,8],[4,2],[4,4],[4,8],[8,2],[8,4],
[8,8],[2,2,2],[2,2,4],[2,2,8],[2,4,2],[2,4,4],[2,4,8],[2,8,2],
[2,8,4],[2,8,8],[4,2,2],[4,2,4],[4,2,8],[4,4,2],[4,4,4],[4,4,8],
[4,8,2],[4,8,4],[4,8,8],[8,2,2],[8,2,4],[8,2,8],[8,4,2],[8,4,4],
[8,4,8],[8,8,2],[8,8,4],[8,8,8]]
The other answers seem sort of complicated. I'd do it this way:
> [0..] >>= flip replicateM "abc"
["","a","b","c","aa","ab","ac","ba","bb","bc","ca","cb","cc","aaa","aab",...
Hmm, I guess you need a lazy infinite list of cycling subsequences. One naive way could be like
Prelude> take 100 $ nub . subsequences . cycle $ [2,4,8]
[[],[2],[4],[2,4],[8],[2,8],[4,8],[2,4,8],[2,2],[4,2],[2,4,2],[8,2],[2,8,2],[4,8,2],[2,4,8,2],[4,4],[2,4,4],[8,4],[2,8,4],[4,8,4],[2,4,8,4],[2,2,4],[4,2,4],[2,4,2,4],[8,2,4],[2,8,2,4],[4,8,2,4],[2,4,8,2,4],[8,8],[2,8,8],[4,8,8],[2,4,8,8],[2,2,8],[4,2,8],[2,4,2,8],[8,2,8],[2,8,2,8],[4,8,2,8],[2,4,8,2,8],[4,4,8],[2,4,4,8],[8,4,8],[2,8,4,8],[4,8,4,8],[2,4,8,4,8],[2,2,4,8],[4,2,4,8],[2,4,2,4,8],[8,2,4,8],[2,8,2,4,8],[4,8,2,4,8],[2,4,8,2,4,8],[2,2,2],[4,2,2],[2,4,2,2],[8,2,2],[2,8,2,2],[4,8,2,2],[2,4,8,2,2],[4,4,2],[2,4,4,2],[8,4,2],[2,8,4,2],[4,8,4,2],[2,4,8,4,2],[2,2,4,2],[4,2,4,2],[2,4,2,4,2],[8,2,4,2],[2,8,2,4,2],[4,8,2,4,2],[2,4,8,2,4,2]]
A simple and highly efficient option:
genAllSize [] = [[]]
genAllSize [a] = iterate (a:) []
genAllSize xs =
  [] : [x:q | q <- genAllSize xs, x <- xs]
(Thanks to Will Ness for a small but very nice simplification.)
This solution takes advantage of the fact that a valid solution list is either empty or an element of the argument list consed onto a shorter valid solution list. Unlike Daniel Wagner's solution, this one doesn't resort to counting. My tests suggest that it performs extremely well under typical conditions.
Why do we need a special case for a one-element list? The general case performs extremely badly for that, because it maps over the same list over and over with no logarithmic slowdown.
But what's the deal with that call to genAllSize with the very same argument? Wouldn't it be better to save the result to increase sharing?
genAllSize [] = [[]]
genAllSize xs = p
  where
    p = [] : [x:q | q <- p, x <- xs]
Indeed, on a theoretical machine with unlimited constant-time memory, this is optimal: walking the list takes worst-case O(1) time for each cons. In practice, it's only a good idea if a great many entries will be realized and retained. Otherwise, there's a problem: most of the list entries will be retained indefinitely, dramatically increasing memory residency and the amount of work the garbage collector needs to do. The non-bold sharing version above still offers amortized O(1) time per cons, but it needs very little memory (logarithmic rather than linear).
Examples
genAllSize "ab" =
["","a","b","aa","ba"
,"ab","bb","aaa","baa"
,"aba","bba","aab","bab"
,"abb","bbb","aaaa",...]
genAllSize "abc" =
["","a","b","c","aa","ba"
,"ca","ab","bb","cb","ac"
,"bc","cc","aaa","baa"
,"caa","aba","bba","cba"
,"aca",...]
An explicit option
You can also use two accumulators:
{-# LANGUAGE BangPatterns #-}

genAllSize [] = [[]]
genAllSize [a] = iterate (a:) []
genAllSize (x:xs) = go ([], []) where
  go (curr, remain) = curr : go (step curr remain)
  step [] [] = ([x], [xs])
  step (_:ls) ((r:rs):rss) =
    (r:ls, rs:rss)
  step (_:ls) ([] : rs) =
    (x : ls', xs : rs')
    where
      !(ls', rs') = step ls rs
This version keeps track of the current "word" and also the remaining available "letters" in each position. The performance seems comparable in general, but a bit better with regard to memory residency. It's also much harder to understand!
This produces the elements in a different order within each length than your example, but it meets the definition of the text of your question. Changing the order is easy - you have to replace <*> with a slightly different operator of your own making.
import Control.Applicative
import Control.Monad

rinvjoin :: Applicative both => both a -> both (both a)
rinvjoin = fmap pure

extendBranches options branches = (<|>) <$> options <*> branches

singletonBranchExtensions = rinvjoin

genAllSize [] = []
genAllSize xs = join <$> iterate (extendBranches extensions) $ initialBranches
  where extensions = singletonBranchExtensions xs
        initialBranches = pure empty