Stack overflow when generating large sequences of letters in OCaml

Given an alphabet ["a"; "b"; "c"] I want to dump all sequences of length 25 to a file. (Letters can repeat in a sequence; it's not a permutation.) The problem is, I get a Stack overflow during evaluation (looping recursion?) when I try using the following code:
let addAlphabetToPrefix alphabet prefix =
  List.map (function letter -> (prefix ^ letter)) alphabet;;

let rec generateWords alphabet counter words =
  if counter > 25 then
    words
  else
    let newWords =
      List.flatten (List.map (function word -> addAlphabetToPrefix alphabet word) words)
    in
    generateWords alphabet (counter + 1) newWords;;

generateWords ["a"; "b"; "c"] 0 [""];;  (* Produces a stack overflow. *)
Is there a better way of doing this? I was thinking of generating the entire list first, and then dumping the entire list to a file, but do I have to repeatedly generate partials lists and then dump? Would making something lazy help?
Why exactly is a stack overflow occurring? AFAICT, my generateWords function is tail-recursive. Is the problem that the words list I'm generating is getting too big to fit into memory?

Your functions are being compiled as tail calls; I confirmed this from the linearized code obtained with the -dlinear option of the native compiler, ocamlopt[.opt].
The fact of the matter is that your heap is growing exponentially, and a word length of 25 is unsustainable with this method. A length of 11 works fine (and is the highest I could manage).
Yes, there is a better way to do this. You can generate each sequence directly from its index in lexicographic order, or use Gray codes (same page). Either approach needs storage for only one word at a time, can be run in parallel, and will never blow the stack. You might overflow the machine integer with the index method, though, in which case you can switch to big integers at the cost of speed, or to Gray codes (which may be difficult to parallelize, depending on the Gray code).
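To make the index idea concrete, here is a minimal OCaml sketch (the names word_of_index and dump_all are mine, not from the question). Since the alphabet has three letters, the i-th word of length n is simply i written in base 3, so only one word ever needs to live in memory:

let alphabet = [| 'a'; 'b'; 'c' |]

(* Build the i-th word of length n by reading off the base-3 digits of i. *)
let word_of_index n i =
  let buf = Bytes.create n in
  let rec fill pos i =
    if pos < 0 then Bytes.to_string buf
    else (Bytes.set buf pos alphabet.(i mod 3); fill (pos - 1) (i / 3))
  in
  fill (n - 1) i

(* Write every word of length n to the channel, one per line. *)
let dump_all n oc =
  let rec ipow b e = if e = 0 then 1 else b * ipow b (e - 1) in
  for i = 0 to ipow 3 n - 1 do        (* 3^25 still fits in a 63-bit int *)
    output_string oc (word_of_index n i);
    output_char oc '\n'
  done

(* Usage: dump_all 25 (open_out "words.txt") -- beware, the file is enormous. *)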

OCaml optimizes tail recursion, so your code should work, except that the standard library's List.map function is, unfortunately, not tail-recursive. The stack overflow is probably occurring in one of those calls, as your lists get rather large.
Batteries Included and Jane Street's Core library both provide tail-recursive versions of map. Try one of those and see if it fixes the problem.
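If you'd rather not pull in a library, a tail-recursive map is only a couple of lines; here is a minimal sketch (the name map_tr is mine):

let map_tr f xs = List.rev (List.rev_map f xs)

(* Or spelled out with an explicit accumulator: *)
let map_tr' f xs =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: rest -> go (f x :: acc) rest
  in
  go [] xs

Both build the result in reverse and flip it once at the end, so they use constant stack space regardless of the list length.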

Related

How can I calculate the length of a list containing lists in OCaml

I am a beginner in OCaml and I am stuck in my project.
I would like to count the number of elements of each list contained in a list.
Then test whether each of those lists is even or odd (in length).
let listoflists = [[1;2] ; [3;4;5;6] ; [7;8;9]]
Expected output:
l1 = even
l2 = even
l3 = odd
The problem is that:
List.tl listoflists
gives the length of the rest of the list, so 2.
-> How can I calculate the length of the lists one by one?
-> Or how could I get the lists and put them one by one in a variable?
For the odd/even function, I have already done it!
Tell me if I'm not clear, and thank you for your help.
Unfortunately it's not really possible to help you very much because your question is unclear. Since this is obviously a homework problem I'll just make a few comments.
Since you talk about putting values in variables you seem to have some programming experience. But you should know that OCaml code tends to work with immutable variables and values, which means you have to look at things differently. You can have variables, but they will usually be represented as function parameters (which indeed take different values at different times).
If you have no experience at all with OCaml it is probably worth working through a tutorial. The OCaml.org website recommends the first 6 chapters of the OCaml manual here. In the long run this will probably get you up to speed faster than asking questions here.
You ask how to do a calculation on each list in a list of lists. But you don't say what the answer is supposed to look like. If you want separate answers, one for each sublist, the function to use is List.map. If instead you want one cumulative answer calculated from all the sublists, you want a fold function (like List.fold_left).
You say that List.tl calculates the length of a list, or at least that's what you seem to be saying. But of course that's not the case: List.tl returns all but the first element of a list. The length of a list is calculated by List.length.
If you give a clearer definition of your problem and particularly the desired output you will get better help here.
Use List.iter f xs to apply function f to each element of the list xs.
Use List.length to compute the length of each list.
Even numbers are evenly divisible by two, so if you divide an even number by two the remainder will be zero. Use the mod operator to get the remainder of the division. Alternatively, you can rely on the fact that in binary representation odd numbers always end with 1, so you can use land (logical and) to test the least significant bit.
If you need to refer to the position of the list element, use List.iteri f xs. The List.iteri function will apply f to two arguments, the first will be the position of the element (starting from 0) and the second will be the element itself.
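Putting those hints together, a minimal sketch for the example in the question (the labels l1, l2, ... mirror the desired output):

let listoflists = [ [1; 2]; [3; 4; 5; 6]; [7; 8; 9] ]

let () =
  List.iteri
    (fun i xs ->
       let parity = if List.length xs mod 2 = 0 then "even" else "odd" in
       Printf.printf "l%d = %s\n" (i + 1) parity)
    listoflists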

a fully backtracking star operator in parsec

I am trying to build a real, fully backtracking + combinator in Parsec.
That is, one that receives a parser and tries to find one or more instances of the given parser.
That would mean that parse_one_or_more foolish_a would be able to match nine 'a' characters in a row, for example (see the code below for context).
As far as I understand it, the reason why it does not currently do so is that, after foolish_a finds a match (the first two a's), the many1 (try p1) never gives up on that match.
Is this possible in Parsec? I'm pretty sure it would be very slow (this simple example is already exponential!) but I wonder if it can be done. It is for a programming challenge that runs without a time limit -- I would not want to use it in the wild.
import Text.Parsec
import Text.Parsec.String (Parser)
parse_one_or_more :: Parser String -> Parser String
parse_one_or_more p1 = (many1 (try p1)) >> eof >> return "bababa"
foolish_a = parse_one_or_more (try (string "aa") <|> string "aaa")
good_a = parse_one_or_more (string "aaa")
-- |
-- >>> parse foolish_a "unused triplet" "aaaaaaaaa"
-- Left...
-- ...
-- >>> parse good_a "unused" "aaaaaaaaa"
-- Right...
You are correct - Parsec-like libraries can't do this in a way that works for any input. Parsec's implementation of (<|>) is left-biased and commits to the left parser if it matches, regardless of anything that may happen later in the grammar. When the two arguments of (<|>) overlap, as in (try (string "aa") <|> string "aaa"), there is no way to make Parsec backtrack into it and try the right-hand match once the left side has succeeded.
If you want to do this, you will need a different library, one that doesn't have a (<|>) operator that's left-biased and commits.
Yes: since Parsec produces a recursive-descent parser, you would rather make an unambiguous guess first to minimize the need for backtracking. So if your first guess is "aa" and it overlaps with a later guess "aaa", backtracking becomes necessary. Sometimes a grammar is LL(k) for some k > 1 and you use backtracking out of pure necessity.
The only time I use try is when I know that the backtracking is quite limited (k is low). For example, I might have an operator ? that overlaps with another operator ?//; I want to parse ? first because of precedence rules, but I want the parser to fail in case it's followed by // so that it can eventually reach the correct parse. Here k = 2, so the impact is quite low, but also I don't need an operator here that lets me backtrack arbitrarily.
If you want a parser combinator library that lets you fully backtrack all the time, this may come at a severe cost to performance. You could look into Text.ParserCombinators.ReadP's +++ symmetric choice operator, which tries both alternatives. This is an example of what Carl suggested: a <|> that is not left-biased and does not commit.
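The "list of successes" idea behind ReadP's symmetric choice is small enough to sketch. Here is a toy version in OCaml (the language of the first questions on this page); all the names are mine, and it is only meant to show how a non-committing choice lets a one-or-more combinator backtrack arbitrarily far, at exponential cost:

(* A parser returns every (value, remaining input) pair, so choice simply
   keeps both branches alive instead of committing to the left one. *)
type 'a parser = char list -> ('a * char list) list

let return x : 'a parser = fun input -> [ (x, input) ]

let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun input -> List.concat_map (fun (x, rest) -> f x rest) (p input)

let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser =
  fun input -> p input @ q input        (* symmetric: try both *)

let str s : string parser =
  fun input ->
    let rec eat i input =
      if i = String.length s then [ (s, input) ]
      else
        match input with
        | c :: rest when c = s.[i] -> eat (i + 1) rest
        | _ -> []
    in
    eat 0 input

(* One or more p's; p must consume input or this would not terminate. *)
let rec one_or_more p = p >>= fun _ -> (one_or_more p <|> return "")

let eof : string parser = function [] -> [ ("", []) ] | _ -> []

let explode s = List.init (String.length s) (String.get s)

(* The analogue of foolish_a: it now accepts nine a's, e.g. as "aaa" three times. *)
let foolish_a s =
  (one_or_more (str "aa" <|> str "aaa") >>= fun _ -> eof) (explode s)

let accepted = foolish_a "aaaaaaaaa" <> []   (* true *)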

Does using Haskell's (++) operator to append to list cause multiple list traversals?

Does appending to a Haskell list with (++) cause lists to be traversed multiple times?
I tried a simple experiment in GHCI.
The first run:
$ ghci
GHCi, version 7.8.4: http://www.haskell.org/ghc/ :? for help
Prelude> let t = replicate 9999999 'a' ++ ['x'] in last t
'x'
(0.33 secs, 1129265584 bytes)
The second run:
$ ghci
GHCi, version 7.8.4: http://www.haskell.org/ghc/ :? for help
Prelude> let t = replicate 9999999 'a' in last t
'a'
(0.18 secs, 568843816 bytes)
The only difference is the ++ ['x'] to append a last element to the list. It causes the runtime to increase from 0.18 s to 0.33 s, and the allocation to increase from roughly 569 MB to roughly 1.13 GB.
So it seems that indeed it does cause multiple traversals. Can someone confirm on more theoretical grounds?
You can't conclude from these numbers whether the first run does two traversals, or one traversal in which each step takes more time and allocates more memory than the single traversal in the second run.
In fact, it's the latter that is happening here. You can think of the two evaluations like this:
in the second expression let t = replicate 9999999 'a' in last t, in each step but the last one, last evaluates its argument, which causes replicate to allocate a cons cell and decrement a counter, and then the cons cell is consumed by last.
in the first expression let t = replicate 9999999 'a' ++ ['x'] in last t, in each step but the last one, last evaluates its argument, which causes (++) to evaluate its first argument, which causes replicate to allocate a cons cell and decrement a counter, and then that cons cell is consumed by (++) and (++) allocates a new cons cell, and then that new cons cell is consumed by last.
So the first expression is still a single traversal, it's just one that does more work per step.
Now if you wanted to you could divide up all this work into "the work done by last" and "the work done by (++)" and call those two "traversals"; and that can be a useful approach for understanding the total amount of work done by your program. But due to Haskell's laziness, the two "traversals" are really interleaved as described above, so most people would say that the list is traversed just once.
I'd like to talk a bit about what happens when we enable optimizations, because it can transform the performance characteristics of the program pretty radically. I'll be looking at the Core output produced by ghc -O2 Main.hs -ddump-simpl -dsuppress-all. Also, I run the compiled programs with +RTS -s to get info about memory usage and running time.
With GHC 7.8.4 the two versions of the code run in the same amount of time and with the same amount of heap allocation. That's because replicate 9999999 'a' ++ ['x'] is replaced with a call genlist 9999999, where genlist looks like the following (not exactly the same; I'm translating liberally from the original Core):
genlist :: Int -> [Char]
genlist n | n <= 1    = "ax"
          | otherwise = 'a' : genlist (n - 1)
Since we do generation and concatenation in a single step, we allocate each list cell just once.
With GHC 7.10.1, we get fancy new optimizations for list processing. Now both of our programs allocate about as much memory as a print $ "Hello World" program (about 52 KB on my machine). This is because we skip the list creation entirely. Now last is fused away too; we instead get a call getlast 9999999, with getlast being:
getlast :: Int -> Char
getlast 1 = 'x'
getlast n = getlast (n - 1)
In the executable we'll have a small machine-code loop that counts down from 9999999 to 1. GHC is not quite smart enough to skip all computation and go straight to returning 'x', but it does a good job nevertheless, and in the end it gives us something rather different from the original code.

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate its vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large.
I was thinking that maybe I could cut down the time a bit by only looking at keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means the key will be a suggestion, meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help, as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while removing a letter creates only a few "bad words", addition or substitution gives you many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental steps; normally the search speed should keep improving with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would perform better)
Key points:
The Levenshtein distance computation uses an array; at each step the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum in the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row is greater than 2 you can abandon the computation. A simple strategy is to return threshold + 1 as the distance (see the sketch below).
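To illustrate the preliminary step and the short-circuit, here is a rough OCaml sketch of the row-by-row Levenshtein distance with a threshold (the function name and the optional threshold argument are my own choices, not the poster's code):

(* Row-by-row Levenshtein distance. If every entry of the current row already
   exceeds the threshold, the final distance will too, so we bail out early
   and report threshold + 1. *)
let levenshtein ?(threshold = max_int) a b =
  let la = String.length a and lb = String.length b in
  let prev = Array.init (lb + 1) (fun j -> j) in   (* row for the empty prefix *)
  let curr = Array.make (lb + 1) 0 in
  let rec loop i prev curr =
    if i > la then prev.(lb)
    else begin
      curr.(0) <- i;
      for j = 1 to lb do
        let cost = if a.[i - 1] = b.[j - 1] then 0 else 1 in
        curr.(j) <- min (min (curr.(j - 1) + 1) (prev.(j) + 1)) (prev.(j - 1) + cost)
      done;
      if Array.fold_left min max_int curr > threshold then threshold + 1
      else loop (i + 1) curr prev                  (* swap the two rows *)
    end
  in
  loop 1 prev curr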
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we keep the words that achieve the smallest distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as the key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
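A minimal OCaml sketch of that idea (the module and function names are mine): bucket the dictionary by word length, so a query of length l with at most d errors only has to look at buckets l - d .. l + d:

module IntMap = Map.Make (Int)

(* Insert a word into its length bucket. *)
let add_word dict w =
  let l = String.length w in
  IntMap.update l (function None -> Some [ w ] | Some ws -> Some (w :: ws)) dict

(* All dictionary words whose length is within d of the query's length. *)
let candidates dict word d =
  let l = String.length word in
  List.concat_map
    (fun len -> Option.value (IntMap.find_opt len dict) ~default:[])
    (List.init (2 * d + 1) (fun i -> l - d + i))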
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix row by row, one word is scanned entirely (the word we are looking for) but the other (the referent) is not: we only use one letter per row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either, and you can skip words for as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
4. Prefixes and re-use
Now we'll also try to re-use the computations as much as possible (and not just the "it does not work" result).
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when you begin computing carwash, you can in fact start iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest reference word (you should know the largest word length in your dictionary).
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a trie or a Patricia tree to store the dictionary. However, neither is an STL data structure, and you would need to augment it to store in each subtree the range of word lengths it contains, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory-explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
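For a rough idea of what the augmented structure could look like, here is a hypothetical OCaml sketch of a trie whose nodes also record the range of word lengths stored below them (so whole subtrees can be skipped once the length difference alone exceeds the error budget); all names are mine:

module CMap = Map.Make (Char)

type trie = {
  word : bool;              (* a dictionary word ends at this node *)
  min_len : int;            (* shortest word stored in this subtree *)
  max_len : int;            (* longest word stored in this subtree *)
  children : trie CMap.t;
}

let empty = { word = false; min_len = max_int; max_len = 0; children = CMap.empty }

let rec insert t w i =
  let len = String.length w in
  let t = { t with min_len = min t.min_len len; max_len = max t.max_len len } in
  if i = len then { t with word = true }
  else
    let child =
      match CMap.find_opt w.[i] t.children with Some c -> c | None -> empty
    in
    { t with children = CMap.add w.[i] (insert child w (i + 1)) t.children }

let add t w = insert t w 0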
You should have a look at Peter Norvig's explanation of how to write a spelling corrector.
How to write a spelling corrector
Everything is well explained in that article; as an example, the (Python 2) code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for the spelling corrector.
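To make the BK-tree idea concrete, here is a hypothetical OCaml sketch (all names are mine; the compact lev function is deliberately naive and could be replaced by the row-by-row version sketched earlier). Children are keyed by their edit distance to the node's word, and the triangle inequality prunes whole subtrees during a query:

(* A compact (inefficient) Levenshtein distance, enough to drive the tree. *)
let rec lev a b =
  if a = "" then String.length b
  else if b = "" then String.length a
  else
    let tl s = String.sub s 1 (String.length s - 1) in
    if a.[0] = b.[0] then lev (tl a) (tl b)
    else 1 + min (lev (tl a) b) (min (lev a (tl b)) (lev (tl a) (tl b)))

module DistMap = Map.Make (Int)

type bktree = Node of string * bktree DistMap.t

let rec insert (Node (w, children)) word =
  let d = lev w word in
  if d = 0 then Node (w, children)                 (* word already present *)
  else
    let child =
      match DistMap.find_opt d children with
      | Some t -> insert t word
      | None -> Node (word, DistMap.empty)
    in
    Node (w, DistMap.add d child children)

(* Collect every word within distance tol of the query; children whose key
   differs from d by more than tol cannot contain a match. *)
let rec query (Node (w, children)) word tol acc =
  let d = lev w word in
  let acc = if d <= tol then w :: acc else acc in
  DistMap.fold
    (fun k child acc -> if abs (k - d) <= tol then query child word tol acc else acc)
    children acc

(* Usage with the question's example dictionary: *)
let tree = List.fold_left insert (Node ("good", DistMap.empty)) [ "god"; "goodbye"; "hello" ]
let suggestions = query tree "godd" 1 []   (* ["god"; "good"], in some order *)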
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.

Could these patterns be matched by regular expression or context free grammar?

This is an exercise of compiler. We are asked if it's possible to match the following patterns with regular expression or context free grammar:
n 'a' followed by n 'b', like 'aabb'
palindrome, like 'abbccbba'
n 'a', then n 'b', then n 'c', like 'aabbcc'
Note that n could be any positive integer. (Otherwise it's too simple)
Only the 3 characters 'a', 'b', 'c' can appear in the text to parse.
I'm confused because, as far as I can see, none of these patterns can be described by a regular expression or a context-free grammar.
The critical question is: how much and what kind of memory do you need?
In the case of problem 1, you need to somehow keep track of the number of a terminals as you are parsing the b terminals. Since you know you need one for one, a stack is clearly sufficient (you can push an a for each a you read and pop one off with every b). Since a pushdown automaton is equivalent to a CFG in expressive power, you can create a CFG for problem 1, for example the grammar shown just below.
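For example, the classic grammar for problem 1 (with n required to be positive, per the note in the question) needs only two productions, mirroring the PDA's push/pop pairing:
S -> a S b
S -> a b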
In the case of problem 2, the technique that the PDA uses in problem 1 should suggest a technique you could use here as well: a PDA can push the first half of the input onto its stack, then pop it off as its reverse comes in.
In the case of problem 3, if you use the stack technique for counting the a terminals against the b terminals, that's all well and good, but what happened to your stack memory? It was consumed while matching the b's. Was it sufficient? No: you would have needed to store the number of a's somewhere else in order to check the c's, so a CFG cannot express this language.
Here's an attempt at a simple CFG for problem 2 (as written it accepts the empty input and only even-length palindromes, but you'll get the idea; see the extra productions below):
S -> a S a
S -> b S b
S -> c S c
S -> ɛ
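To also cover odd-length palindromes, add the single-terminal productions:
S -> a
S -> b
S -> c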