Parent string of two given strings - c++

Given two strings, we have to find the shortest string such that both given strings are subsequences of it. In other words, we need to find a string from which deleting some characters yields each of the given strings. I was thinking of brute force and LCS, but in vain.
12345 and 11234 should result in 112345
WWA and WWS have WWAS as an answer.
LCS (the DP one) is pretty memory inefficient and brute force is just childish. What should I do?

Perhaps you could do a global alignment with Needleman-Wunsch and a high mismatch penalty, to prefer indels. At the end, merge the alignment into a "parent string" by taking letters from matching positions, and then a letter from either of the inserted letters, e.g.:
WW-A
||
WWS-
WWSA
Or:
-12345
 ||||
11234-
112345
Memory is O(nm), but a modification narrows that down to O(min(n,m)).
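For reference, here is a minimal C++ sketch (the function name is mine, not from the answer) of the equivalent dynamic program that builds one shortest common supersequence directly; the alignment-and-merge above corresponds to the same recurrence:

#include <algorithm>
#include <string>
#include <vector>

std::string shortestCommonSupersequence(const std::string& a, const std::string& b) {
    const std::size_t n = a.size(), m = b.size();
    // dp[i][j] = length of the shortest common supersequence of a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> dp(n + 1, std::vector<std::size_t>(m + 1));
    for (std::size_t i = n + 1; i-- > 0; )
        for (std::size_t j = m + 1; j-- > 0; ) {
            if (i == n)            dp[i][j] = m - j;                     // only b remains
            else if (j == m)       dp[i][j] = n - i;                     // only a remains
            else if (a[i] == b[j]) dp[i][j] = dp[i + 1][j + 1] + 1;      // shared letter
            else                   dp[i][j] = std::min(dp[i + 1][j], dp[i][j + 1]) + 1;
        }
    // Walk the table once more to emit one optimal parent string.
    std::string out;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j])                      { out += a[i]; ++i; ++j; }
        else if (dp[i + 1][j] <= dp[i][j + 1]) { out += a[i]; ++i; }
        else                                   { out += b[j]; ++j; }
    }
    out += a.substr(i);
    out += b.substr(j);
    return out;
}

For the question's examples this returns "112345" for ("12345", "11234") and "WWAS" for ("WWA", "WWS"). Keeping only two rows gives the length in O(min(n,m)) memory, though reconstructing the string then takes extra work.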

There's a well-defined algorithm in the standard library that may serve your purpose:
std::set_union();
The condition is that your input ranges must be sorted.
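As a quick illustration (assuming, as stated, that both inputs are already sorted, which happens to hold for the 12345/11234 example):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    // std::set_union keeps the maximum multiplicity of each element,
    // but it requires both ranges to be sorted.
    std::string a = "12345", b = "11234";
    std::string result;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(result));
    std::cout << result << '\n';   // prints 112345
}

This only produces a valid "parent string" when the order of characters doesn't matter (i.e. the inputs are, or can be treated as, sorted), which is exactly the condition stated above.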


String storage optimization

I'm looking for a C++ library that would help optimize memory usage by storing similar (not identical) strings in memory only once. It is not flyweight or string interning, which can only store exact objects/strings once. The library should be able to analyze and recognize that, for example, two particular strings of different lengths have identical first 100 characters, and store that substring only once.
Example 1:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2";
In this case it is obvious that the only difference between these two strings is the last character, so it would be a great memory saving if we held only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings "1" and "2".
Example 2:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2";
This is a more complicated case, where there are multiple identical substrings: one copy of "http://www.contoso.com/some/path/app.aspx?ch", then the two strings "1" and "2", and one copy of "=test"; since we already have the strings "1" and "2" stored, we don't need any additional strings.
So, is there such a library? Is there something that can help to develop such a library relatively quickly? The strings are immutable, so there is no need to worry about updating indexes or locking for thread safety.
If the strings have a common prefix, one solution may be to use a radix tree (a compressed trie, http://en.wikipedia.org/wiki/Radix_tree) to represent the strings. Then you only need to store a pointer to a tree leaf, and you get the whole string back by walking up to the tree root.
hello world
hello winter
hell

        [2]
       /
h-e-l-l-o-' '-w-o-r-l-d-[0]
               \
                i-n-t-e-r-[1]
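A minimal sketch of that idea (the type and function names are mine): each stored string keeps only a pointer to its leaf, and the full text is rebuilt by walking up to the root.

#include <string>

struct RadixNode {
    std::string piece;            // fragment of text on the edge leading into this node
    RadixNode*  parent = nullptr; // nullptr at the root
};

// Rebuild the full string for a stored entry by walking leaf -> root.
std::string rebuild(const RadixNode* leaf) {
    std::string s;
    for (const RadixNode* n = leaf; n != nullptr; n = n->parent)
        s.insert(0, n->piece);    // prepend each ancestor's fragment
    return s;
}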
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)
libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html
SGI documentation: http://www.sgi.com/tech/stl/Rope.html
But I think you need to construct your strings carefully for a rope to work properly. Maybe find the longest common prefix and suffix of every new string with the previous string, and then express the new string as the concatenation of the previous string's prefix, the unique middle part, and the previous string's suffix.
For example 1, what I can come up with is a radix tree, a space-optimized version of a trie. A simple Google search turns up quite a few implementations in C++.
For example 2, I am also curious about the answer!
First of all, note that std::string is not immutable and you have to make sure that none of these strings are accidentally modified.
This depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.
The two strings you have provided differ only after the "?ch" part. If you expect that many strings will have long common prefixes of roughly the same size, you can do the following:
Let's say the size of a prefix is 43 chars. Let s be a string. Then we can consider s[0..42] a key into the hash table and the rest of the string the value.
For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1", the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. If the key already exists in the hash table, you can just add the value to the collection of values associated with that key. Otherwise, add the key/value pair.
This is just an example, what the key is and what the value is depend on how you are going to access these strings.
Also, if all strings have "=test" in them, then you don't have to store it with every value. You can store it once and insert it back when retrieving a string. So given the value "ch1=test1", what is stored is just "ch11". This depends on the pattern of the strings.
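A minimal sketch of that layout (the 43-character cut and the container choice are just this answer's example, not a library API):

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class PrefixStore {
    static const std::size_t kPrefixLen = 43;   // length of the shared prefix
    std::unordered_map<std::string, std::vector<std::string>> buckets_;
public:
    void insert(const std::string& s) {
        const std::size_t cut = std::min(kPrefixLen, s.size());
        // The prefix is stored once as the key; only the tail is stored per string.
        buckets_[s.substr(0, cut)].push_back(s.substr(cut));
    }
};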

quickly find a short string that isn't a substring in a given string

I've been trying to serialize some data with a delimiter and ran into issues.
I'd like to be able to quickly find a string that isn't a substring of a given string (in case the data contains my current delimiter), so that I can use it as the delimiter.
If I didn't care about size, the quickest way would be to look at a character of the given string, pick a different character, and build a string of the given string's length consisting only of that character.
There may be a way to do some sort of check, testing first the middle characters, then the middle of the first and last segments... but I didn't see a clear algorithm there.
My current idea, which is fairly quick but not optimal, is:
Initialize a hash with all characters as keys and 0 as each count.
Read the string's characters as bytes, using the hash to count them.
Walk the keys to find the character with the smallest count, stopping immediately if I find one whose count is zero.
Use a run of that character of length count + 1 as the delimiter.
I believe that is O(n), though obviously not the shortest possible delimiter. But the delimiter will always be no more than n/256 + 1 characters.
I could also try some sort of trie-based construction, but I'm not quite sure how to implement that, and that's O(n^2), right?
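A minimal C++ sketch of the counting idea above (the function name is mine):

#include <array>
#include <cstddef>
#include <string>

// Count every byte, pick the least frequent one, and repeat it count + 1 times:
// the data contains at most `count` copies of that byte in total, so it cannot
// contain such a run, making it a safe delimiter.
std::string make_delimiter(const std::string& data) {
    std::array<std::size_t, 256> count{};                 // zero-initialized
    for (unsigned char c : data) ++count[c];
    std::size_t best = 0;
    for (std::size_t c = 1; c < 256; ++c)
        if (count[c] < count[best]) best = c;             // count[best] == 0 is ideal
    return std::string(count[best] + 1, static_cast<char>(best));
}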
https://cs.stackexchange.com/questions/21896/algorithm-request-shortest-non-existing-substring-over-given-alphabet
may be helpful.
Your character-counting method isn't sufficient because you're only considering the current string. The whole point of a delimiter is that you're separating multiple strings, so you'd need to count across all of them.
I see two potential alternative solutions:
Pick a delimiter and escape that delimiter in the strings.
You can use URI::Escape to escape a specific character, say &, and use that as the delimiter.
Specify the size of your string before you send it. That way you know exactly how many characters to pull. Essentially pack and unpack
And because I'm already on the train of alternative solutions, might as well propose all of the other serialization modules out there: Comparison of Perl serialization modules
I like the theory behind a task like this, but it rings too much like an XY Problem.
I agree with @Miller that your best bet is to pick a character and escape it in the text.
However, this is not what you asked, so I'll attempt to answer the question.
I take it these strings are long, so finding the delimiter is time-sensitive.
In straight Perl, the hash idea may well be as fast as you can get. As a native C extension, you can do better. I say this because my experience is that Perl array access is pretty slow for some reason, and this algorithm uses arrays to good effect:
int n_used_chars = 0;
int chars[256], loc_of_char[256];
for (int i = 0; i < 256; i++) chars[i] = loc_of_char[i] = i;
for (int i = 0; i < string_length; i++) {
    unsigned char c = string[i];
    int loc = loc_of_char[c];
    if (loc >= n_used_chars) {
        // Character c has not been used before. Swap it down into the used set.
        chars[loc] = chars[n_used_chars];
        loc_of_char[chars[loc]] = loc;
        chars[n_used_chars] = c;
        loc_of_char[c] = n_used_chars++;
    }
}
// At this point chars[0..n_used_chars - 1] contains all the used chars,
// and chars[n_used_chars..255] contains the unused ones!
This will be O(n) and very fast in practice.
What if all the characters are used? Then things get interesting... There are 64K two-byte combinations. We could use the trick above, and both arrays would be 64K. Initialization and memory would be expensive. Would it be worthwhile? Perhaps not.
If all characters are used, I would use a randomized approach: guess a delimiter and then scan the string to verify it's not contained.
How to make the guess in a prudent way?
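One possible sketch of that guess-and-verify loop (the names and starting length are mine; this is not necessarily the "prudent" guess being asked about):

#include <random>
#include <string>

// Guess random candidates and verify with a scan; grow the candidate when it
// keeps colliding, since longer random strings are far less likely to occur.
std::string find_absent_string(const std::string& data) {
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> byte(0, 255);
    std::string guess(2, '\0');
    for (int attempts = 0; ; ++attempts) {
        for (char& c : guess) c = static_cast<char>(byte(rng));
        if (data.find(guess) == std::string::npos) return guess;
        if (attempts % 8 == 7) guess.push_back('\0');   // lengthen after a few misses
    }
}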

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory, though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that, while removing a letter creates only a few "bad words", addition or substitution creates many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein distance.
The solution is described in incremental steps; the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would perform better)
Key points:
The Levenshtein distance computation uses a matrix; at each step the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum in the previous row
The latter property allows a short-circuited implementation: if you want to limit yourself to 2 errors (the threshold), then whenever the minimum of the current row exceeds 2 you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
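A minimal sketch of that short-circuited distance (names are mine), keeping only two rows and bailing out as soon as the whole row exceeds the threshold:

#include <algorithm>
#include <string>
#include <vector>

int levenshtein_capped(const std::string& query, const std::string& word, int threshold) {
    std::vector<int> prev(word.size() + 1), cur(word.size() + 1);
    for (std::size_t j = 0; j <= word.size(); ++j) prev[j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= query.size(); ++i) {
        cur[0] = static_cast<int>(i);
        int row_min = cur[0];
        for (std::size_t j = 1; j <= word.size(); ++j) {
            int cost = (query[i - 1] == word[j - 1]) ? 0 : 1;
            cur[j] = std::min({ prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost });
            row_min = std::min(row_min, cur[j]);
        }
        if (row_min > threshold) return threshold + 1;   // abandon: the distance can only grow
        std::swap(prev, cur);
    }
    return prev[word.size()];
}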
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we keep a list of the words that achieve the smallest distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as the key instead of just the word, you can restrict your search to the length range [length - edit, length + edit] and greatly reduce the search space.
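For instance (the container choice is mine, not prescribed by the answer), grouping the dictionary by length makes the restriction trivial:

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Only the buckets whose length lies within `threshold` of the query's length
// need to be scanned with the capped Levenshtein distance.
void collect_candidates(const std::map<std::size_t, std::vector<std::string>>& by_length,
                        const std::string& query, std::size_t threshold,
                        std::vector<std::string>& out) {
    const std::size_t lo = query.size() > threshold ? query.size() - threshold : 0;
    const std::size_t hi = query.size() + threshold;
    for (auto it = by_length.lower_bound(lo); it != by_length.end() && it->first <= hi; ++it)
        out.insert(out.end(), it->second.begin(), it->second.end());
}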
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix row by row, one word is entirely scanned (the word we are looking for) but the other (the referent) is not: we only use one letter per row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words sharing the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either, and you can skip words as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
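A sketch of the binary-search variant of the skip described above (assuming a sorted std::vector<std::string> dictionary; the helper name is mine):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Return the index of the first word at or after `from` that does NOT start
// with `prefix`, i.e. where the scan may resume. Assumes `prefix` is non-empty
// and its last character is not the maximum char value.
std::size_t skip_rejected_prefix(const std::vector<std::string>& dict,
                                 std::size_t from, std::string prefix) {
    ++prefix.back();   // smallest string greater than every word starting with prefix
    return std::lower_bound(dict.begin() + from, dict.end(), prefix) - dict.begin();
}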
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when you begin computing carwash, you in fact begin iterating at w.
To do this, simply use an array allocated once at the beginning of your search, and make it large enough to accommodate the largest referent (you should know the largest word length in your dictionary).
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a trie or a Patricia tree to store the dictionary. However, it's not an STL data structure, and you would need to augment it to store in each subtree the range of word lengths stored there, so you'd have to make your own implementation. It's not as easy as it seems, because there are memory-explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at Peter Norvig's explanation of how to write a spelling corrector:
How to write a spelling corrector
Everything is well explained in that article; as an example, the Python code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-tree. Check Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for a spelling corrector.
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.

fastest way to do keyword search in large text in C/C++

Say I have 100 keywords (that can include spaces) and I need to find out how many times they occur in a big piece of text. What would the fast way be to accomplish this?
My current idea is as follows:
turn the keywords into a suffix tree
walk through the text following the nodes and whenever a char does not occur (i.e. node->next == NULL) in the suffix tree, skip to next word and search again
The suffix tree struct would look something like this:
struct node {
    int count; // number of occurrences (only used at a leaf node)
    /* for each lower-case char, have a pointer to either NULL or the next node */
    struct node *children[26];
};
I am sure there is a faster way to do this, but what is it? Space efficiency is not really a big deal for this case (hence the children array for faster lookup), but time efficiency really is. Any suggestions?
The problem with the suffix tree approach is that you have to start the suffix search for each letter of the text to be searched. I think the best way to go would be to arrange a search for each keyword in the text, but using some fast search method with precomputed values, such as Boyer-Moore.
EDIT:
OK, you may be right that the trie is faster. Boyer-Moore is very fast in the average case. Consider, for example, that keywords have a mean length of m. BM can be as fast as O(n/m) for "normal" strings, which would make 100*O(n/m) in total. The trie would be O(n*m) on average (though it is true it can be much faster in real life), so if 100 >> m then the trie would win.
Now for some random ideas on optimization. In some compression algorithms that have to do backward searches, I've seen partial hash tables indexed by two characters of the string. That is, if the string to check starts with the character sequence c1, c2, c3, you can check whether:
if (hash_table[c1 * 256 + c2] == true) check_strings_beginning_with(c1, c2);
then for c2 and c3, and so on. It is surprising how many cases this simple check avoids, as a given hash entry will be true only about 100/65536 of the time (0.1%).
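A small sketch of that two-character prefilter applied to the keyword-search setting (names are mine):

#include <bitset>
#include <string>
#include <vector>

// Mark which two-byte prefixes begin some keyword; at search time, only run
// the expensive matching when the next two bytes of the text are marked.
std::bitset<65536> build_prefilter(const std::vector<std::string>& keywords) {
    std::bitset<65536> starts;
    for (const std::string& k : keywords)
        if (k.size() >= 2)
            starts.set(static_cast<unsigned char>(k[0]) * 256u
                       + static_cast<unsigned char>(k[1]));
    return starts;
}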
This is what I would do.
Put all your keywords in a hash table of key-value pairs, with the keyword as the (you guessed it) key and the number of occurrences of the keyword as the value.
Check each word in the text blob against the hash table. If the word is in the hash table, increment the occurrence count associated with it.
This is a good way because hash table lookup is (or should be) amortized O(1) time. The whole algorithm has linear complexity :).
EDIT: If your keywords can contain spaces, you would need to build a sort of DFA. Scan the file until you find a word that one of your key "phrases" starts with. If the next word (or however many) is part of the "key phrase", then increment the occurrence count.
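A minimal sketch of the single-word case (multi-word "key phrases" would need the DFA-style scan mentioned in the edit):

#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main() {
    // Keyword -> occurrence count; words not in the table are simply skipped.
    std::unordered_map<std::string, int> counts = { {"fox", 0}, {"dog", 0} };
    std::istringstream text("the quick brown fox jumps over the lazy dog");
    std::string word;
    while (text >> word) {
        auto it = counts.find(word);
        if (it != counts.end()) ++it->second;
    }
    for (const auto& kv : counts)
        std::cout << kv.first << ": " << kv.second << '\n';
}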
You seem to be groping your way towards http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
To quote:
The complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
If it is an industrial application, use Boost Regex.
It is tested, fast, and chances are it will save you a lot of pain.
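For example (the alternation pattern is just one way you might use it; a sketch, not a prescription):

#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string text = "the quick brown fox jumps over the lazy dog";
    boost::regex keywords("quick|lazy dog|fox");   // keywords may contain spaces
    boost::sregex_iterator it(text.begin(), text.end(), keywords), end;
    std::cout << std::distance(it, end) << " keyword occurrences\n";
}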

Sorting names with numbers correctly

For sorting item names, I want to support numbers correctly. i.e. this:
1 Hamlet
2 Ophelia
...
10 Laertes
instead of
1 Hamlet
10 Laertes
2 Ophelia
...
Does anyone know of a comparison functor that already supports that?
(i.e. a predicate that can be passed to std::sort)
I basically have two patterns to support: Leading number (as above), and number at end, similar to explorer:
Dolly
Dolly (2)
Dolly (3)
(I guess I could work that out myself: compare character by character, and treat numeric runs differently. However, that would probably break Unicode collation and whatnot.)
That's called alphanumeric sorting.
Check out this link: The Alphanum Algorithm
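A sketch of such a comparator (the name and tie-breaking details are mine, not necessarily identical to the linked Alphanum algorithm), usable directly with std::sort:

#include <cctype>
#include <cstddef>
#include <string>

// Compare two strings, treating runs of digits as numbers rather than text.
// Note: digit runs longer than ~19 characters would overflow stoull; a real
// implementation should compare such runs by length, then lexicographically.
bool alphanum_less(const std::string& a, const std::string& b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (std::isdigit((unsigned char)a[i]) && std::isdigit((unsigned char)b[j])) {
            // Compare whole digit runs numerically.
            std::size_t i2 = i, j2 = j;
            while (i2 < a.size() && std::isdigit((unsigned char)a[i2])) ++i2;
            while (j2 < b.size() && std::isdigit((unsigned char)b[j2])) ++j2;
            unsigned long long na = std::stoull(a.substr(i, i2 - i));
            unsigned long long nb = std::stoull(b.substr(j, j2 - j));
            if (na != nb) return na < nb;
            i = i2; j = j2;
        } else {
            if (a[i] != b[j]) return a[i] < b[j];
            ++i; ++j;
        }
    }
    return a.size() - i < b.size() - j;   // the shorter remainder sorts first
}

Usage would be something like std::sort(names.begin(), names.end(), alphanum_less); which orders "1 Hamlet", "2 Ophelia", "10 Laertes" and "Dolly", "Dolly (2)", "Dolly (3)" as desired.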
I think you can use a pair object, make a vector<pair<int, string>>, and sort this vector.
Pairs are compared based on their first elements, so this way you can get the sort you desire.
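For completeness, a tiny sketch of that pair-based idea (assuming you split the number out of each name yourself):

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // The number goes in the pair's first element, so the default pair
    // ordering (first, then second) yields the desired numeric order.
    std::vector<std::pair<int, std::string>> items = {
        {1, "Hamlet"}, {10, "Laertes"}, {2, "Ophelia"}};
    std::sort(items.begin(), items.end());
    for (const auto& p : items) std::cout << p.first << ' ' << p.second << '\n';
}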