How to check if any element of an array is in another array? (D)

I have the string "aabbccddffgg" and I need to check whether it contains at least one element from the array ["bb", "cc"].
What is the best way to do it in D?

find and canFind support a variadic number of needles, so using them is the easiest:
"aabbccddffgg".canFind("bb", "cc");
Learn more about std.algorithm.searching.canFind here.
If you don't know the number of needles at compile-time, it depends a bit on how much you know about the string, but the naive approach would be to loop multiple times over the string:
auto eles = ["bb", "cc"];
eles.any!(e => "aabbccddffgg".canFind(e));
If you know more about the sub elements, there are better approaches.
For example, if you know that all needles have the same length n, you could create a sliding window of size n and check whether your needles occur in one of the sliding windows:
auto eles = ["bb", "cc"];
"aabbccddffgg".slide(2).canFind!(e => eles.canFind!equal(e));
Learn more about std.range.slide here.
The same idea also works in the general case:
auto eles = ["bb", "cc"];
string s = "aabbccddffgg";
s.enumerate
.map!(e => s.drop(e.index))
.canFind!(e => eles.canFind!(reverseArgs!startsWith)(e));
Note that drop uses slicing and happens lazily in O(1) without any memory allocation.
Of course, there are still more efficient approaches with more advanced string matching algorithms.

Related

Replace all occurrences based on map

This question might have already been answered but I haven't found it yet.
Let's say I have a std::map<string,string> which contains string pairs of <replace_all_this, to_this>.
I checked Boost's format library, which is close but not perfect:
std::map<string, string> m;
m["$search1"] = "replace1";
m["$search2"] = "replace2";
format fmter1("let's try to %1% and %2%.");
for (auto r : m) {
    fmter1 % r.second;
}
// would print "let's try to replace1 and replace2."
This would work, but I lose control over which replacement goes where.
Actually I'd like to have this as result:
format fmter1("let's try to $search2 and $search1 even if their order is different in the map.");
...
//print: "let's try to replace2 and replace1 even if their order is different in the map".
Please note: map can contain more items, and items can occur multiple times in the formatter.
What is the way to go for this in 2020? I'd like it to be efficient and fast, so I'd rather avoid iterating over the map multiple times.
There may be new libraries but there is no new algorithm to do that faster than what we have so far.
Assuming your format implies a $<name> syntax for your variables, you can search for the first '$', read the <name>, search for it in the map, then do the replacement. This allows you to either skip the replacement or process it too (i.e. make it recursive, where a variable can reference another).
I don't think that doing it the other way around would be any faster: going through the map and searching for the names in the string means you'd be parsing the string many times, and if you have many variables, it will be a huge waste if most of them are not likely to be part of your string. Also, if you want to prevent some level of recursion, it's very complicated.
Where you can optimize is in calculating the size of the resulting string and allocating that buffer once, instead of using +=, which is likely to be slower.
I have such an implementation in my snaplogger. The variable has to be between brackets and it can include multiple parameters to further tweak the data. There is documentation here about what is supported by that function, and as written, you can easily extend the class to add more features. It's probably not a one-to-one match for what you're looking for, but it shows you that there are not 20 ways of implementing that function.
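A minimal sketch of that single-pass scan, assuming variables are written as '$' followed by alphanumeric characters and the map keys are stored without the leading '$' (all names and strings below are purely illustrative):

#include <cctype>
#include <iostream>
#include <map>
#include <string>

// Single pass over the input: on each '$', read the following name, look it
// up in the map, and append either the replacement or the original text.
// Reserving the output up front avoids most reallocations from repeated +=.
std::string replace_vars(const std::string& in,
                         const std::map<std::string, std::string>& vars)
{
    std::string out;
    out.reserve(in.size());                      // rough lower bound on result size
    for (std::size_t i = 0; i < in.size(); )
    {
        if (in[i] != '$') { out += in[i++]; continue; }
        std::size_t j = i + 1;
        while (j < in.size() && std::isalnum(static_cast<unsigned char>(in[j])))
            ++j;
        auto it = vars.find(in.substr(i + 1, j - i - 1));
        if (it != vars.end()) out += it->second;        // known variable: substitute
        else                  out.append(in, i, j - i); // unknown: keep as-is
        i = j;
    }
    return out;
}

int main()
{
    std::map<std::string, std::string> m{{"search1", "replace1"},
                                         {"search2", "replace2"}};
    std::cout << replace_vars("let's try to $search2 and $search1.", m) << '\n';
}

Making it recursive (a replacement that itself contains a $variable) would just mean running replace_vars on it->second before appending, with some guard against cycles.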

String storage optimization

I'm looking for some C++ library that would help to optimize memory usage by storing similar (not exact) strings in memory only once. It is not Flyweight or string interning, which can only store exactly identical objects/strings once. The library should be able to analyze and understand that, for example, two particular strings of different length have identical first 100 characters; this substring should be stored only once.
Example 1:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2";
in this case it is obvious that the only difference between these two strings is the last character, so it would be a great saving in memory if we held only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings "1" and "2"
Example 2:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2";
this is a more complicated case, where there are multiple identical substrings: one copy of "http://www.contoso.com/some/path/app.aspx?ch", then two strings "1" and "2", one copy of "=test", and since we already have the strings "1" and "2" stored we don't need any additional strings.
So, is there such a library? Is there something that can help to develop such a library relatively fast? Strings are immutable, so there is no need to worry about updating indexes or locks for thread safety.
If the strings have common prefixes, the solution may be to use a radix tree (a space-optimized trie, http://en.wikipedia.org/wiki/Radix_tree) for string representation. Then you only store a pointer to a tree leaf, and you get the whole string back by walking up to the tree root.
hello world
hello winter
hell

        [2]
       /
h-e-l-l-o-' '-w-o-r-l-d-[0]
               \
                i-n-t-e-r-[1]
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)
libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html
SGI documentation: http://www.sgi.com/tech/stl/Rope.html
But I think you need to construct your strings carefully for the rope to work properly. Maybe find the longest common prefix and suffix of every new string with the previous string, and then express the new string as the concatenation of the previous string's prefix, the unique middle part, and the previous string's suffix.
For example 1, what I can come up with is a Radix Tree, a space-optimized version of a Trie. I did a quick Google search and found quite a few implementations in C++.
For example 2, I am also curious about the answer!
First of all, note that std::string is not immutable and you have to make sure that none of these strings are accidentally modified.
This depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.
The two strings you have provided differ only after the "?ch" part. If you expect that many strings will have long common prefixes of roughly the same size, you can do the following:
Let's say the size of a prefix is 43 chars. Let s be a string. Then, we can consider s[0-42] a key into the hash table and the rest of the string as a value.
For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1" the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. If the key already exists in the hash table, you can just add the value to the collection of values associated with the key. Otherwise, add the key/value pair.
This is just an example, what the key is and what the value is depend on how you are going to access these strings.
Also, if all strings have "=test" in them, then you don't have to store this with every value. You can just store it once and then insert it when retrieving a string. So given the value "ch1=test1", what will be stored is just "ch11". This depends on the pattern of the strings.
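As a rough sketch of the prefix/suffix split described above (the class name, the fixed prefix length of 43 and the choice of std::unordered_map are just illustrative assumptions):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Store each string as (fixed-length prefix) -> list of tails, so a prefix
// shared by many strings is kept in memory only once.
class PrefixStore
{
public:
    explicit PrefixStore(std::size_t prefix_len) : prefix_len_(prefix_len) {}

    void insert(const std::string& s)
    {
        // Strings shorter than the prefix are keyed by themselves with an empty tail.
        std::string key  = s.substr(0, prefix_len_);
        std::string tail = s.size() > prefix_len_ ? s.substr(prefix_len_) : "";
        table_[key].push_back(tail);
    }

    std::size_t distinct_prefixes() const { return table_.size(); }

private:
    std::size_t prefix_len_;
    std::unordered_map<std::string, std::vector<std::string>> table_;
};

int main()
{
    PrefixStore store(43);   // prefix length suggested in the answer above
    store.insert("http://www.contoso.com/some/path/app.aspx?ch=test1");
    store.insert("http://www.contoso.com/some/path/app.aspx?ch=test2");
    std::cout << store.distinct_prefixes() << " distinct prefix(es) stored\n";   // prints 1
}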

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, cause there would surely be thousands of permutations for each word? It also seems like it'd be very very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental steps; normally the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would give better performance)
Key Points:
The Levenshtein Distance computation uses an array; at each step the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum of the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row is greater than 2, you can abandon the computation. A simple strategy is to return the threshold + 1 as the distance; a minimal sketch follows below.
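Here is one way that short-circuited computation could look, keeping only two rows (the example words are illustrative and taken from the question):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Levenshtein distance, short-circuited: as soon as the minimum of the current
// row exceeds `threshold`, no later row can do better, so we report threshold + 1.
std::size_t levenshtein(const std::string& a, const std::string& b,
                        std::size_t threshold)
{
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        curr[0] = i;
        std::size_t row_min = curr[0];
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,            // deletion
                                 curr[j - 1] + 1,        // insertion
                                 prev[j - 1] + cost });  // substitution
            row_min = std::min(row_min, curr[j]);
        }
        if (row_min > threshold) return threshold + 1;   // short-circuit
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

int main()
{
    std::cout << levenshtein("godd", "good", 2) << '\n';       // 1
    std::cout << levenshtein("godd", "fantastic", 2) << '\n';  // 3, i.e. "more than 2"
}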
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we list those words which achieve the smallest distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using as a key the couple (length, word) instead of just word, you can restrict your search to the range of length [length - edit, length + edit] and greatly reduce the search space.
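A small sketch of that restriction, assuming the dictionary is a sorted set keyed on (length, word) as suggested; the candidates still need the distance check afterwards, and the example words come from the question:

#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// All words of a given length form a contiguous range in the set, so a query
// of length n with at most `edit` errors only needs lengths [n - edit, n + edit].
using Dict = std::set<std::pair<std::size_t, std::string>>;

std::vector<std::string> length_candidates(const Dict& dict,
                                           const std::string& query,
                                           std::size_t edit)
{
    std::size_t lo = query.size() > edit ? query.size() - edit : 0;
    std::size_t hi = query.size() + edit;

    std::vector<std::string> out;
    auto first = dict.lower_bound({lo, std::string()});
    auto last  = dict.lower_bound({hi + 1, std::string()});
    for (auto it = first; it != last; ++it)
        out.push_back(it->second);       // Levenshtein check still to be done
    return out;
}

int main()
{
    Dict dict{{5, "hello"}, {7, "goodbye"}, {9, "fantastic"}, {4, "good"}, {3, "god"}};
    for (const auto& w : length_candidates(dict, "godd", 1))
        std::cout << w << '\n';          // god, good, hello
}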
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), then the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either, and you can skip words as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefixes (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
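A small sketch of the binary-search skip, assuming the dictionary is a sorted std::vector<std::string> (the helper name and the extra words are mine, for illustration):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Given a sorted dictionary and a prefix that has already exceeded the error
// budget, return the first position whose word does not start with that prefix.
// Incrementing the last character of the prefix yields the smallest string that
// sorts after everything sharing the prefix (assuming that character is not the
// maximum char value).
std::vector<std::string>::const_iterator
skip_prefix(const std::vector<std::string>& sorted_words,
            std::vector<std::string>::const_iterator from,
            std::string prefix)
{
    prefix.back() += 1;                  // "car" -> "cas"
    return std::lower_bound(from, sorted_words.end(), prefix);
}

int main()
{
    std::vector<std::string> words{"car", "carpet", "cartoon", "carwash", "cat"};
    auto it = skip_prefix(words, words.begin(), "car");
    std::cout << *it << '\n';            // "cat": everything starting with "car" skipped
}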
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when you begin computing carwash, you in fact begin iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest reference (you should know the largest word length in your dictionary).
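A sketch of the idea so far (sorted dictionary, rows indexed by the referent's letters, rows of the shared prefix reused); the function name and the example dictionary are only for illustration, and the threshold short-circuit from step 0 is omitted for brevity:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// matrix[r][c] = distance between the first r letters of the dictionary word
// and the first c letters of the query. Rows belonging to the prefix shared
// with the previously processed word are kept instead of recomputed.
std::vector<std::string> suggest(const std::vector<std::string>& sorted_dict,
                                 const std::string& query,
                                 std::size_t max_dist)
{
    std::size_t max_len = 0;
    for (const auto& w : sorted_dict) max_len = std::max(max_len, w.size());

    std::vector<std::vector<std::size_t>> matrix(
        max_len + 1, std::vector<std::size_t>(query.size() + 1, 0));
    for (std::size_t c = 0; c <= query.size(); ++c) matrix[0][c] = c;

    std::vector<std::string> out;
    std::string prev;
    for (const auto& word : sorted_dict)
    {
        // Length of the prefix shared with the previous word:
        // rows 0..common are already correct and are reused as-is.
        std::size_t common = 0;
        while (common < word.size() && common < prev.size() &&
               word[common] == prev[common])
            ++common;

        for (std::size_t r = common + 1; r <= word.size(); ++r)
        {
            matrix[r][0] = r;
            for (std::size_t c = 1; c <= query.size(); ++c)
            {
                std::size_t cost = (word[r - 1] == query[c - 1]) ? 0 : 1;
                matrix[r][c] = std::min({ matrix[r - 1][c] + 1,
                                          matrix[r][c - 1] + 1,
                                          matrix[r - 1][c - 1] + cost });
            }
        }
        if (matrix[word.size()][query.size()] <= max_dist)
            out.push_back(word);
        prev = word;
    }
    return out;
}

int main()
{
    std::vector<std::string> dict{"fantastic", "god", "good", "goodbye", "hello"};
    for (const auto& w : suggest(dict, "godd", 1))
        std::cout << w << '\n';          // god, good
}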
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia Tree to store the dictionary. However, it's not an STL data structure, and you would need to augment it to store in each subtree the range of word lengths that are stored, so you'll have to make your own implementation. It's not as easy as it seems, because there are memory explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at Peter Norvig's explanation of how to write a spelling corrector:
How to write a spelling corrector
Everything is well explained in this article; as an example, the Python code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker many data structures would be useful, for example a BK-tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation for the same here.
My earlier code link may be misleading; this one is correct for a spelling corrector.
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.

Efficient algorithm for searching for one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex, matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the Boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
E.g., if the strings are abcx, bcy and cz, and I've read abc so far, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next will move me to the string-1-matched state etc., and any char other than x, y, z will move to the initial state, and I will not need to retract back to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong suggestion that Boost is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Knuth-Morris-Pratt builds a DFSM out of the string for which you are searching (Aho-Corasick extends the idea to several strings at once) so that the test and the cursor motion are combined in a single operation. However, it was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position at which they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) at what leftward offset from the cursor the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
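To make the bad-character table concrete, here is a minimal single-needle Boyer-Moore-Horspool sketch (a simplification of full Boyer-Moore that keeps only the rightmost-occurrence table; names and test strings are illustrative):

#include <array>
#include <iostream>
#include <string>

// Precompute, for every byte value, how far the window may slide when that
// byte sits under the window's last position. Bytes that do not occur in the
// needle allow a full jump of needle.size().
std::size_t horspool_find(const std::string& haystack, const std::string& needle)
{
    if (needle.empty() || haystack.size() < needle.size()) return std::string::npos;

    std::array<std::size_t, 256> shift;
    shift.fill(needle.size());
    for (std::size_t i = 0; i + 1 < needle.size(); ++i)
        shift[static_cast<unsigned char>(needle[i])] = needle.size() - 1 - i;

    std::size_t pos = 0;
    while (pos + needle.size() <= haystack.size())
    {
        if (haystack.compare(pos, needle.size(), needle) == 0)
            return pos;                                       // match at the cursor
        unsigned char last = haystack[pos + needle.size() - 1];
        pos += shift[last];                                   // slide the cursor right
    }
    return std::string::npos;
}

int main()
{
    std::cout << horspool_find("some incoming text with bcy inside", "bcy") << '\n';  // 24
}

For ~10 needles you would either run this once per needle, or merge the tables as described above (roughly, taking for each byte the smallest shift over all needles).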
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It computes a simple hash to fit the result into 12 bits and then looks in a table to see if there's a hit. There are four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, only one or two tests are made per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer-Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle, and I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer-Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, plain C memcmp() should do fine. If you can tell the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards, rbo
There is always Boyer-Moore.
Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. For every search string you need to build such a finite state machine.
You can do it with the very popular Lex & Yacc tools, or with their Flex and Bison counterparts.
You can use Lex for getting tokens of the string.
Compare your pre-defined strings with the tokens returned from the lexer.
When a match is found, perform the desired action.
There are many sites which describe Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/

How to speed up calculation of length of longest common substring?

I have two very large strings and I am trying to find out their Longest Common Substring.
One way is using suffix trees (supposed to have very good complexity, though a complex implementation), and the other is the dynamic programming method (both are mentioned on the Wikipedia page linked above).
Using dynamic programming
The problem is that the dynamic programming method has a huge running time (complexity is O(n*m), where n and m are lengths of the two strings).
What I want to know (before jumping to implement suffix trees): Is it possible to speed up the algorithm if I only want to know the length of the common substring (and not the common substring itself)?
These will make it run faster, though it'll still be O(nm).
One optimization in space (which might also save you a little allocation time) is noticing that LCSuff only depends on the previous row -- therefore, if you only care about the length, you can optimize the O(nm) space down to O(min(n,m)).
The idea is to keep only two rows -- the current row that you are processing, and the previous row that you just processed, and throw away the rest.
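A minimal sketch of that two-row version, computing only the length (the function name and test strings are illustrative):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Length of the longest common substring using only two DP rows:
// row i depends solely on row i-1, so the full O(n*m) table is unnecessary.
std::size_t lcs_length(const std::string& a, const std::string& b)
{
    const std::string& shorter = a.size() < b.size() ? a : b;
    const std::string& longer  = a.size() < b.size() ? b : a;

    std::vector<std::size_t> prev(shorter.size() + 1, 0), curr(shorter.size() + 1, 0);
    std::size_t best = 0;

    for (std::size_t i = 1; i <= longer.size(); ++i)
    {
        for (std::size_t j = 1; j <= shorter.size(); ++j)
        {
            curr[j] = (longer[i - 1] == shorter[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);           // keep only the previous row
    }
    return best;
}

int main()
{
    std::cout << lcs_length("dynamicprogramming", "programmingcontest") << '\n';  // 11
}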
Will it be faster in practice? Yes. Will it be faster regarding Big-Oh? No. The dynamic programming solution is always O(n*m).
The problem that you might run into with suffix trees is that you trade the suffix tree's linear time scan for a huge penalty in space. Suffix trees are generally much larger than the table you'd need to implement for a dynamic programming version of the algorithm. Depending on the length of your strings, it's entirely possible that dynamic programming will be faster.
Good luck :)
Here's a simple algorithm which finishes in O((m+n)*log(m+n)) and is much easier to implement than the suffix tree algorithm, which runs in O(m+n).
Start with min common length (minL) = 0 and max common length (maxL) = min(m,n)+1.
1. If (minL == maxL - 1), the algorithm finishes with common length = minL.
2. Let L = (minL + maxL)/2.
3. Hash every substring of length L in S, with key = hash, value = start index.
4. Hash every substring of length L in T, with key = hash, value = start index. Check whether any hash occurs in both tables; if yes, check whether the corresponding substrings really are equal (to rule out hash collisions).
5. If there really is a common substring of length L, set minL = L, otherwise set maxL = L. Go to 1.
The remaining problem is how to hash all substrings of length L in O(n) time. You can use a polynomial formula as follows:
Hash(string s, offset i, length L) = s[i] * p^(L-1) + s[i+1] * p^(L-2) + ... + s[i+L-2] * p + s[i+L-1]; choose any constant prime number p.
then Hash(s, i+1, L) = Hash(s, i, L) * p - s[i] * p^L + s[i+L];
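A sketch of the whole scheme (binary search over L plus the rolling hash above, computed modulo 2^64 via unsigned overflow; on a hash hit the characters are still compared explicitly, as in step 4; names and test strings are illustrative):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Rolling polynomial hash of every length-L window of s, mapped hash -> one start index.
static std::unordered_map<std::uint64_t, std::size_t>
hash_windows(const std::string& s, std::size_t L, std::uint64_t p)
{
    std::unordered_map<std::uint64_t, std::size_t> h;
    if (L == 0 || L > s.size()) return h;

    std::uint64_t pL = 1;
    for (std::size_t i = 0; i < L; ++i) pL *= p;            // p^L

    std::uint64_t cur = 0;
    for (std::size_t i = 0; i < L; ++i) cur = cur * p + (unsigned char)s[i];
    h[cur] = 0;
    for (std::size_t i = 1; i + L <= s.size(); ++i)
    {
        // Hash(s, i, L) = Hash(s, i-1, L) * p - s[i-1] * p^L + s[i+L-1]
        cur = cur * p - (unsigned char)s[i - 1] * pL + (unsigned char)s[i + L - 1];
        h[cur] = i;
    }
    return h;
}

// Steps 3-5: do S and T share a substring of length L? Characters are compared
// on a hash hit to rule out false positives.
static bool has_common(const std::string& S, const std::string& T, std::size_t L)
{
    const std::uint64_t p = 1000003;                        // any prime works
    if (L == 0) return true;
    auto hs = hash_windows(S, L, p);
    auto ht = hash_windows(T, L, p);
    for (const auto& [hash, idx] : ht)
    {
        auto it = hs.find(hash);
        if (it != hs.end() && S.compare(it->second, L, T, idx, L) == 0)
            return true;
    }
    return false;
}

// Binary search on the answer: if a common substring of length L exists,
// so does one of every shorter length, so the predicate is monotone.
std::size_t longest_common_len(const std::string& S, const std::string& T)
{
    std::size_t lo = 0, hi = std::min(S.size(), T.size()) + 1;
    while (lo + 1 < hi)
    {
        std::size_t mid = (lo + hi) / 2;
        if (has_common(S, T, mid)) lo = mid; else hi = mid;
    }
    return lo;
}

int main()
{
    std::cout << longest_common_len("dynamicprogramming", "programmingcontest") << '\n';  // 11
}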
Myers' bit-vector algorithm can help you. It works by using bit manipulation and is a much faster approach.