Dart performance for iterables using where - List

I'm trying to get better performance from a pattern check over a really wide list of strings.
I need the first 5 occurrences that match a given pattern.
I was wondering whether
list.where(pattern in string).take(5)
is lazily computed and stops after 5 occurrences are found, or
does it evaluate the whole where and then take the first 5? (In that case, is there a whereXFirstOccurrences method where X is a number?)
Thank you.
Edit:
I did some investigating:
var index = 0;
myList.where((element) {
  bool isSuggestion = /* the conditions */;
  if (isSuggestion) index++;
  return isSuggestion;
})
.take(x)
.toList();
print(index);
The index is always at most equal to x, so I guess it's lazy evaluation as mentioned below. Thank you :)

Iterables are lazy.
If you do list.where(computation).take(5), it:
Doesn't do anything at all, until you start iterating.
It doesn't do anything except when you call moveNext on the iterator.
And it stops doing anything once moveNext has returned false, which it does after five elements here, because of the take(5).
If you just use for (var v in list.where(...).take(5)) ... you won't see those steps, but they are still there. The loop stops after finding five values, and no further elements are looked at than the ones needed to find the first five satisfying the where condition.
That might still be a lot of strings looked at, if the condition is very picky. If there are only four matching strings in the input, you will go through all of the input when looking for the first five matches.
Optimizing the pattern itself can definitely be valuable as well.
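For what it's worth, here is a rough Python analogue of the experiment from the question (the word list and the startswith check are made up for illustration): filter() plus itertools.islice() behaves like Dart's where().take(), and the counter shows that only as many elements are examined as are needed to produce the first five matches.

from itertools import islice

words = ["apple", "banana", "apricot", "avocado", "cherry", "almond",
         "fig", "apron", "arrow", "anchor", "grape"]
checked = 0

def matches(w):
    global checked
    checked += 1                 # count how many elements the filter actually sees
    return w.startswith("a")

# filter(...) and islice(...) are both lazy, like Dart's where().take()
first_five = list(islice(filter(matches, words), 5))
print(first_five)                # ['apple', 'apricot', 'avocado', 'almond', 'apron']
print(checked)                   # 8, not 11: iteration stopped once five matches were found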

Related

Constructing palindrome from a list of words

Recently I was looking through some interview questions and found an interesting one:
You are given a list of words. Find if two words can be joined together to form a palindrome. E.g., consider the list {bat, tab, cat}; then bat and tab can be joined together to form a palindrome.
Expecting an O(nk) solution, where n = number of words and k is the word length.
There can be multiple pairs; just return true if one is found.
Also, in the comments one of the approaches was this:
1) Add the first word to the trie ( A B)
2) Take the second word (D E E D B A) and reverse it (A B D E E D)
3) See how many letters in the reversed word you can match in the trie (the first 2)
4) Take the rest of the string (D E E D) and see if it is a palindrome; if it is, you are done: return true
5) add the second word to the trie (D E E D B A)
6) go back to step 2 with the next word
7) when out of words return false
But in my opinion this is not an O(nk) solution.
Can anyone suggest a solution? Or explain why the algorithm described above is O(nk)?
The algorithm is correct, or at least it gets quite close; there are minor technical issues. In step 4 one should save the proposed solution if it's better than the current one, and in step 7 return it, or report that it was impossible to make a palindrome.
The main idea is to process words into cores and prefixes. If a core is a palindrome, then we need to match the prefix against another word. The trie serves as a "database" of processed strings, so with each new word one can check all possible extensions at once. If words were kept separately, one would need to compare against the prefix of each word separately.
(Edit: I think there still is a small loophole, in case there are two words in the trie which start the same, and the incoming one would make a palindrome with the shorter one but not the longer one, but I won't go into details. Handling it would complicate the algorithm but wouldn't affect the complexity.)
It also is O(n*k). Adding a word to, or checking a prefix against, a trie takes a number of steps proportional to the number of characters, so in this case it is bounded by k, just like tree operations are O(h) where h is the height of the tree. So in conclusion:
Step 1 takes k steps.
Step 2 takes k steps.
Step 3 also takes at most k steps.
Step 4 also takes fewer than k steps, but we can bound it by k.
Step 5 also takes k steps.
Steps 2 to 5 are done n-1 times.
Of course each step has a different dominant operation, so it is hard to specify the exact constant, but all of them are bounded by k, so the complexity is O(c*(n-1)*k), which essentially is O(n*k).
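To make the trie idea concrete, here is a rough Python sketch of the procedure (the function names and the two bookkeeping flags are mine, not from the answer). It checks every end-of-word node along the path rather than only the longest match, and it runs a second pass over the reversed words to cover the opposite concatenation order, which also addresses the loophole mentioned in the edit above.

def is_pal(s):
    return s == s[::-1]

def _pair_with_earlier(words):
    # Trie nodes are plain dicts with two extra keys:
    #   '$' -> an earlier word ends exactly at this node
    #   '#' -> some earlier word passing through this node has a palindromic
    #          remainder below it (an empty remainder counts)
    root = {}
    for w in words:
        # check w against every word already stored in the trie
        node, r, found = root, w[::-1], False
        for i, ch in enumerate(r):
            # stored word == r[:i] and the rest of r is a palindrome
            # => stored word + w is a palindrome
            if node.get('$') and is_pal(r[i:]):
                found = True
                break
            if ch not in node:
                break
            node = node[ch]
        else:
            # all of reversed(w) matched: any stored word continuing below with
            # a palindromic remainder (or ending right here) also works
            found = node.get('#', False)
        if found:
            return True
        # insert w
        node = root
        for i, ch in enumerate(w):
            if is_pal(w[i:]):
                node['#'] = True
            node = node.setdefault(ch, {})
        node['$'] = node['#'] = True
    return False

def palindrome_pair_exists(words):
    # the first pass checks pairs in list order; the pass over reversed words
    # covers the opposite concatenation order
    return _pair_with_earlier(words) or _pair_with_earlier([w[::-1] for w in words])

print(palindrome_pair_exists(["bat", "tab", "cat"]))   # True ("bat" + "tab")
print(palindrome_pair_exists(["abc", "xyz"]))          # False

Note that the is_pal check on the remainder is itself up to k steps, so this simple sketch is not strictly O(n*k); it is only meant to illustrate the trie bookkeeping.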
There's a really interesting discussion of this in an article from Dr. Dobbs, way back in 2004. The full explanation is a little long, but the general idea is:
Suppose you start with Lion, where the pivot is left of the actual word. I can calculate the center of the string, which is position two. The pivot is at zero, so the string is too heavy on the right, but at the moment, Lion qualifies as a partial palindrome. The "dot" at the pivot point matches the dot at the pivot point, so there is at least one correct character, albeit the same character. You now wish to prepend words that end with noil, attempting to convert the string to noil.Lion. I use to mean any string of characters. If you're successful, then you need to locate words starting with so that they can be appended to the string.
Note that he defines a partial palindrome as:
A string is a partial palindrome if, working from the pivot point outwards, either the left or right end of the string is encountered before a mismatch occurs.

Rules of regex engines. Greediness, eagerness and laziness of regexes

As we all know, a regex engine uses two rules when it goes about its work:
Rule 1: The Match That Begins Earliest Wins, or: regular expressions are eager.
Rule 2: Regular expressions are greedy.
These lines appear in the tutorial:
The two of these rules go hand in hand.
It's eager to give you a result, so what it does is it tries to just
keep letting that first one do all the work.
While we're already in the middle of it, let's keep going, get to the
end of the string and then when it doesn't work out, then it will
backtrack and try another one.
It doesn't backtrack back to the beginning; it doesn't try all sorts
of other combinations.
It's still eager to get you a result, so it says, what if I just gave
back one?
Would that allow me to give a result back?
If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking
for some kind of a better match or match that's further along.
I don't quite understand these lines (especially the 2nd sentence ("While we're...") and the last one ("It doesn't have to keep backtracking")).
And these lines about lazy mode:
It still defers to the overall match just like the greedy one does
clearly.
I don't understand the following analogy:
It's not necessarily any faster or slower to choose a lazy strategy or
a greedy strategy, but it will probably match different things.
Now as far as is faster or slower, it's a little bit like saying, if
you've lost your car keys and your sunglasses inside your house, is it
better to start looking in the kitchen or to start looking in the
living room?
You don't know which one's going to yield the best result, and you
don't know which one's going to find the sunglasses first or the keys
first; it's just about different strategies of starting the search.
So you will likely get different results depending on where you start,
but it's not necessarily faster to start in one place or the other.
What does 'faster or slower' mean?
I'm going to draw a scheme of how it works (in both cases), and I will contemplate these questions until I find out what's going on here!
I need to understand it exactly and unambiguously.
Thanks.
Let's try with an example.
For an input of "this is input for test input on regex" and a regex like /this.*input/,
the match will be "this is input for test input".
What will be done is:
It starts examining the string and gets a match with "this is input".
But now it's in the middle of the string, so it continues to see if it can match more (this is the "While we're already in the middle of it, let's keep going").
It will match up to "this is input for test input" and continue to the end of the string;
at the end, there are things which are not part of the match, so the engine "backtracks" to the last place it matched.
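A quick way to see this behaviour, sketched in Python's re module (the strings are the ones from this walkthrough): the greedy .* runs to the end of the string and then backtracks to the last "input", while a lazy .*? stops at the first one.

import re

text = "this is input for test input on regex"

# greedy: .* grabs everything, then backtracks to the last "input"
print(re.search(r"this.*input", text).group())    # this is input for test input

# lazy: .*? gives back as little as possible, stopping at the first "input"
print(re.search(r"this.*?input", text).group())   # this is input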
For the last part, it's more about alternated (OR'ed) regexes.
Consider the input string cdacdgabcdef and the regex (ab|a).*
A common mistake is thinking it will return the more precise alternative (in this case 'abcdef'), but it will return 'acdgabcdef', because the a match is the first one to begin matching.
What happens here is: "there's something matching this part, let's continue to the next part of the pattern and forget about the other options in this part."
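The same alternation example, sketched in Python (re follows the same leftmost-match rule):

import re

print(re.search(r"(ab|a).*", "cdacdgabcdef").group())
# acdgabcdef: the 'a' at index 2 starts the earliest possible match, so the
# engine never goes looking for the later 'ab' once that match succeeds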
For the lazy and greedy questions, the link from @AvinashRaj is clear enough; I won't repeat it here.

Exclude elements from vector based on regular expression pattern

I have some data which I want to clean up using a regular expression in R.
It is easy to find how to get elements that contain certain patterns, or that do not contain certain words (strings), but I can't find out how to exclude elements that match a pattern.
How could I use a general function to only keep those elements from a vector which do not contain PATTERN?
I prefer not to give an example, as this might lead people to answer using other (though usually nice) ways than the intended one: excluding based on a regular expression. Here goes anyway:
How to exclude all the elements that contain any of the following characters:
'pyfgcrl
vector <- c("Cecilia", "Cecily", "Cecily's", "Cedric", "Cedric's", "Celebes",
"Celebes's", "Celeste", "Celeste's", "Celia", "Celia's", "Celina")
The result would be an empty vector in this case.
Edit: From the comments, and with a little testing, one would find that my suggestion wasn't correct.
Here are two correct solutions:
vector[!grepl("['pyfgcrl]", vector)] ## kohske
grep("['pyfgcrl]", vector, value = TRUE, invert = TRUE) ## flodel
If either of them wants to re-post and accept credit for their answer, I'm more than happy to delete mine here.
Explanation
The general function that you are looking for is grepl. From the help file for grepl:
grepl returns a logical vector (match or not for each element of x).
Additionally, you should read the help page for regex which describes what character classes are. In this case, you create a character class ['pyfgcrl], which says to look for any character in the square brackets. You can then negate this with !.
So, up to this point, we have something that looks like:
!grepl("['pyfgcrl]", vector)
To get what you are looking for, you subset as usual.
vector[!grepl("['pyfgcrl]", vector)]
For the second solution, offered by @flodel: grep by default returns the positions where matches are made, and the value = TRUE argument lets you return the actual string values instead. invert = TRUE means to return the values that were not matched.

How to find longest palindrome [duplicate]

Possible Duplicate:
Write a function that returns the longest palindrome in a given string
I have a C++ assignment which wants me to write a program that finds the longest palindrome in a given text. For example, if the text is asdqerderdiedasqwertunut, my program should find tunut at index 19. However, if the input is changed to astunutsaderdiedasqwertunut, it should find astunutsa at index 0 instead of tunut at index 22.
So, that is my problem. I am a beginner at the subject; I only know the string class, loops, and ifs. It would be great if you could help me with this.
Thanks in advance.
The idea is very simple:
Write a function is_palindrome(string) that takes a string, and returns true if it is a palindrome and false if it is not
With that function in hand, write two nested loops cutting out different substrings from the original string. Pass each substring to is_palindrome(string), and pick the longest one among the strings returning true.
You can further optimize your program by examining longest substrings ahead of shorter ones. If you examine substrings from longest to shortest, you'll be able to return as soon as you find the first palindrome.
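A rough sketch of that brute force, in Python rather than C++ since the structure translates directly (longer substrings are tried first, so the first palindrome found is the longest):

def is_palindrome(s):
    return s == s[::-1]

def longest_palindrome(text):
    n = len(text)
    # examine longer substrings before shorter ones
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = text[start:start + length]
            if is_palindrome(candidate):
                return candidate, start
    return "", -1

print(longest_palindrome("asdqerderdiedasqwertunut"))      # ('tunut', 19)
print(longest_palindrome("astunutsaderdiedasqwertunut"))   # ('astunutsa', 0)

This is O(n^3) in the worst case, which is fine for an assignment-sized input.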
Dasblinkenlight's idea is pretty good, but it's faster this way:
A palindrome has either an even or an odd number of letters, so you have two cases. Let's start with the even case: you need to find two consecutive identical letters, and then check whether the letter immediately before them is identical to the letter immediately after, working outwards. The odd case is the same, except that at first you only need one letter. I don't speak English that well, so I hope you understood. :)
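That is the expand-around-centre idea; here is a rough Python sketch of it (again, the original question is C++):

def longest_palindrome(text):
    best_start, best_len = 0, 0

    def expand(left, right):
        # grow outward while the characters mirror each other
        while left >= 0 and right < len(text) and text[left] == text[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1          # (start, length)

    for center in range(len(text)):
        for start, length in (expand(center, center),       # odd-length palindromes
                              expand(center, center + 1)):   # even-length palindromes
            if length > best_len:
                best_start, best_len = start, length

    return text[best_start:best_start + best_len], best_start

print(longest_palindrome("astunutsaderdiedasqwertunut"))    # ('astunutsa', 0)

Each centre is expanded at most n times, so this is O(n^2) instead of the O(n^3) brute force.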

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment, but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word. It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large.
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates only a few "bad words", for addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein distance.
The solution is described in incremental steps; normally the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would perform better)
Key points:
The Levenshtein distance computation uses a matrix; at each step the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum in the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row is greater than 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
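A minimal sketch of that short-circuited, row-by-row computation (in Python; the answer targets C++ but the structure is the same, and the threshold and test values are mine):

def levenshtein_capped(a, b, threshold):
    # distance between a and b, computed row by row; if the distance is
    # guaranteed to exceed `threshold`, give up and return threshold + 1
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        if min(current) > threshold:   # row minima only grow, so we can stop
            return threshold + 1
        previous = current
    return previous[-1]

print(levenshtein_capped("staick", "stack", 2))    # 1
print(levenshtein_capped("godd", "good", 2))       # 1
print(levenshtein_capped("fantastic", "god", 2))   # 3, i.e. threshold + 1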
1. First attempt
Let's begin simply.
We'll implement a linear scan: for each word we compute the (short-circuited) distance, and we list those words which achieved the smallest distance so far.
It works very well on smallish dictionaries.
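As a rough sketch (re-using the levenshtein_capped function from the preliminary step, with the example words from the question):

def suggest(word, dictionary, max_edit=2):
    # levenshtein_capped is the capped distance from section 0 above
    best, best_dist = [], max_edit
    for candidate in dictionary:
        d = levenshtein_capped(word, candidate, best_dist)
        if d > best_dist:
            continue                    # worse than the best found so far
        if d < best_dist:
            best, best_dist = [], d     # strictly better: restart the list
        best.append(candidate)
    return best

print(suggest("godd", ["hello", "goodbye", "fantastic", "good", "god"]))
# ['good', 'god']

Only the candidates at the smallest distance seen so far are kept, and the cap shrinks as better candidates are found.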
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as the key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
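A rough illustration of the (length, word) key in Python, using bisect on a sorted list of pairs (the word list is again the question's example):

import bisect

# (length, word) pairs kept sorted, so all words of a given length are contiguous
entries = sorted((len(w), w) for w in ["hello", "goodbye", "fantastic", "good", "god"])

def candidates_by_length(word, max_edit=2):
    lo = bisect.bisect_left(entries, (len(word) - max_edit, ""))
    hi = bisect.bisect_left(entries, (len(word) + max_edit + 1, ""))
    return [w for _, w in entries[lo:hi]]

print(candidates_by_length("godd"))   # ['god', 'good', 'hello'] -- lengths 2..6 only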
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either, and you can skip words as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
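The binary-search variant of the skip can be sketched like this (assuming plain lowercase ASCII words; the helper name is mine):

import bisect

def skip_dead_prefix(sorted_words, start_index, dead_prefix):
    # first index >= start_index whose word does NOT begin with dead_prefix;
    # bump the last character to get the smallest string above that prefix range
    upper_bound = dead_prefix[:-1] + chr(ord(dead_prefix[-1]) + 1)
    return bisect.bisect_left(sorted_words, upper_bound, lo=start_index)

words = ["car", "carbon", "cartoon", "carwash", "cast", "cat"]
print(skip_dead_prefix(words, 0, "car"))   # 4 -> jump straight to "cast"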
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when you begin computing carwash, you actually begin iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest reference (you should know the largest word length in your dictionary).
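A rough Python sketch of that row re-use over a sorted dictionary (using a growing list of rows instead of the pre-allocated array described above; the names and the threshold are mine):

def search_with_row_reuse(word, sorted_dictionary, threshold=2):
    matches = []
    rows = [list(range(len(word) + 1))]   # rows[i] = distances for referent[:i]
    previous_ref = ""
    for ref in sorted_dictionary:
        # how many leading characters does ref share with the previous referent?
        common = 0
        while (common < min(len(ref), len(previous_ref))
               and ref[common] == previous_ref[common]):
            common += 1
        del rows[common + 1:]             # rows for the shared prefix stay valid
        for i in range(common, len(ref)): # compute rows only for the new suffix
            prev, cur = rows[-1], [i + 1]
            for j, ch in enumerate(word, start=1):
                cost = 0 if ch == ref[i] else 1
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
            rows.append(cur)
        if rows[-1][-1] <= threshold:
            matches.append((ref, rows[-1][-1]))
        previous_ref = ref
    return matches

print(search_with_row_reuse("godd", ["fantastic", "god", "good", "goodbye", "hello"]))
# [('god', 1), ('good', 1)]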
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia tree to store the dictionary. However, it's not an STL data structure and you would need to augment it to store, in each subtree, the range of word lengths that are stored, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at Peter Norvig's explanation of how to write a spelling corrector:
How to write a spelling corrector
Everything is well explained in this article; as an example, the (Python 2) code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
I hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for a spelling corrector.
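For reference, a small BK-tree sketch in Python (the node layout and names are mine, not taken from the linked article): each child edge is labelled with its distance to the parent word, and the triangle inequality lets a query skip whole subtrees.

def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})           # node = (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return                        # already present
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def query(self, word, max_dist):
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                results.append((node, d))
            # triangle inequality: only children whose edge label lies in
            # [d - max_dist, d + max_dist] can contain matches
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(["hello", "goodbye", "fantastic", "good", "god"])
print(tree.query("godd", 1))   # [('good', 1), ('god', 1)] (order may vary)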
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well versed in C++.