Sorting names with numbers correctly - c++

For sorting item names, I want to support numbers correctly. i.e. this:
1 Hamlet
2 Ophelia
...
10 Laertes
instead of
1 Hamlet
10 Laertes
2 Ophelia
...
Does anyone know of a comparison functor that already supports that?
(i.e. a predicate that can be passed to std::sort)
I basically have two patterns to support: a leading number (as above), and a number at the end, similar to Explorer:
Dolly
Dolly (2)
Dolly (3)
(I guess I could work that out myself: compare character by character, and treat numeric runs differently. However, that would probably break Unicode collation and whatnot.)

That's called alphanumeric sorting.
Check out this link: The Alphanum Algorithm
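Just to illustrate the idea (this is a sketch, not that library's implementation), a digit-aware comparator you can pass to std::sort could look like this:

#include <cctype>
#include <string>

// Sketch of a natural-order ("alphanum") predicate: runs of digits are
// compared as numbers, everything else is compared character by character.
bool alphanumLess(const std::string& a, const std::string& b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (std::isdigit(static_cast<unsigned char>(a[i])) &&
            std::isdigit(static_cast<unsigned char>(b[j]))) {
            // Extract both digit runs and compare them numerically.
            std::size_t i2 = i, j2 = j;
            while (i2 < a.size() && std::isdigit(static_cast<unsigned char>(a[i2]))) ++i2;
            while (j2 < b.size() && std::isdigit(static_cast<unsigned char>(b[j2]))) ++j2;
            unsigned long long na = std::stoull(a.substr(i, i2 - i));
            unsigned long long nb = std::stoull(b.substr(j, j2 - j));
            if (na != nb) return na < nb;
            i = i2; j = j2;
        } else {
            if (a[i] != b[j]) return a[i] < b[j];
            ++i; ++j;
        }
    }
    return a.size() - i < b.size() - j;
}

Usage: std::sort(names.begin(), names.end(), alphanumLess);. It handles both "2 Ophelia" vs "10 Laertes" and "Dolly (2)" vs "Dolly (10)", but it compares the non-digit parts as raw chars, so as you suspected it does not respect Unicode collation.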

I think you can use a pair object, build a std::vector<std::pair<int, std::string>>, and then sort this vector.
Pairs are compared based on their first elements, so this way you can get the sort you desire.
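For example (a minimal sketch; you would parse the leading number out of each name yourself):

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

int main() {
    // std::pair's built-in operator< compares the number first, then the name.
    std::vector<std::pair<int, std::string>> items = {
        {1, "Hamlet"}, {10, "Laertes"}, {2, "Ophelia"}
    };
    std::sort(items.begin(), items.end());
    // items is now {1, Hamlet}, {2, Ophelia}, {10, Laertes}
}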

C++ poker handrange combo removal

The calling ranges are stored in an array like this, where KK+ means pocket pairs KK and better:
const char* prw_preflop_ICMrange[] = {
"AA+",
"KK+",
"KK+,AKs",
};
I would like to write a function that removes the card combos of my own holding from the opponent's range, e.g.:
I hold "Ac8d" and the opponent's range is "KK+,AKs"; the function should loop through his calling range and remove 3 combos from AA and 1 combo from AKs. Likewise it should loop through all Ax and 8x hand possibilities and remove the combos that involve the Ac and 8d.
The function should then return a number representing my opponent's actual calling range, e.g.:
(16 combos - 4 combos)/(1326 combos - 101 combos) = 0.0098
(his calling range of 0.0121 is actually narrower due to card removal)
How could you achieve something like this?
Create a function that generates all combinations, then use it to populate an instance of std::vector<std::string> with the generated values.
Use the pre-populated vector as input to functions that return the counts of combinations. The functions should not remove anything from the list, only count the occurrences where the conditions are met, since you are not interested in the combinations themselves but rather in their number.
In cases where you really do want the combinations, the function should build a new vector and return it rather than deleting values from the input vector.
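For illustration, a minimal sketch of the counting step (the function name and the 4-character combo encoding such as "AcAd" are my assumptions, not part of the question):

#include <algorithm>
#include <string>
#include <vector>

// Sketch: count the combos in the pre-generated list that do not contain
// either of the hero's hole cards.
int countLiveCombos(const std::vector<std::string>& combos,   // e.g. "AcAd", "AhKh", ...
                    const std::string& holeCard1,             // e.g. "Ac"
                    const std::string& holeCard2)             // e.g. "8d"
{
    return static_cast<int>(std::count_if(combos.begin(), combos.end(),
        [&](const std::string& combo) {
            return combo.find(holeCard1) == std::string::npos &&
                   combo.find(holeCard2) == std::string::npos;
        }));
}

Running it once over the combos of the opponent's range and once over all 1326 combos gives the numerator and denominator of the fraction above.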

Parent string of two given strings

Given 2 strings, we have to find a string smallest in length such that the given strings are subsequences of it. In other words, we need to find a string such that deleting some characters results in the given strings. I was thinking of brute force and LCS, but in vain.
12345 and 11234 should result in 112345
WWA and WWS have an answer WWAS
LCS is pretty memory inefficient (the DP one) and brute force is just childish. What should I do?
Perhaps you could do a global alignment with Needleman-Wunsch and a high mismatch penalty, to prefer indels. At the end, merge the alignment into a "parent string" by taking letters from matching positions, and then a letter from either of the inserted letters, e.g.:
WW-A
||
WWS-
WWSA
Or:
-12345
||||
11234-
112345
Memory is O(nm), but a modification narrows that down to O(min(n,m)).
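For reference, here is a minimal sketch of the underlying dynamic program. It is the standard shortest-common-supersequence table rather than a full Needleman-Wunsch aligner, but it produces the same kind of "parent string" and has the same O(nm) memory:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

std::string shortestCommonSupersequence(const std::string& a, const std::string& b)
{
    const std::size_t n = a.size(), m = b.size();
    // dp[i][j] = length of the shortest string containing a[0..i) and b[0..j)
    std::vector<std::vector<std::size_t>> dp(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) dp[i][0] = i;
    for (std::size_t j = 0; j <= m; ++j) dp[0][j] = j;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            dp[i][j] = (a[i-1] == b[j-1]) ? dp[i-1][j-1] + 1
                                          : std::min(dp[i-1][j], dp[i][j-1]) + 1;

    // Walk the table backwards to emit the actual parent string.
    std::string out;
    std::size_t i = n, j = m;
    while (i > 0 && j > 0) {
        if (a[i-1] == b[j-1])              { out += a[i-1]; --i; --j; }
        else if (dp[i-1][j] <= dp[i][j-1]) { out += a[i-1]; --i; }
        else                               { out += b[j-1]; --j; }
    }
    while (i > 0) { out += a[i-1]; --i; }
    while (j > 0) { out += b[j-1]; --j; }
    return std::string(out.rbegin(), out.rend());
}

int main() {
    std::cout << shortestCommonSupersequence("12345", "11234") << "\n"; // 112345
    std::cout << shortestCommonSupersequence("WWA", "WWS") << "\n";     // WWSA (one optimal answer)
}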
There's a well-defined algorithm in the standard library that could serve your purpose:
std::set_union();
The condition is that your input ranges must be sorted.

Exclude elements from vector based on regular expression pattern

I have some data which I want to clean up using a regular expression in R.
It is easy to find how to get elements that contain certain patterns, or do not contain certain words (strings), but I can't find out how to do this for excluding cells containing a pattern.
How could I use a general function to only keep those elements from a vector which do not contain PATTERN?
I prefer not to give an example, as this might lead people to answer using other (though usually nice) ways than the intended one: excluding based on a regular expression. Here goes anyway:
How to exclude all the elements that contain any of the following characters:
'pyfgcrl
vector <- c("Cecilia", "Cecily", "Cecily's", "Cedric", "Cedric's", "Celebes",
"Celebes's", "Celeste", "Celeste's", "Celia", "Celia's", "Celina")
The result would be an empty vector in this case.
Edit: From the comments, and with a little testing, one would find that my suggestion wasn't correct.
Here are two correct solutions:
vector[!grepl("['pyfgcrl]", vector)] ## kohske
grep("['pyfgcrl]", vector, value = TRUE, invert = TRUE) ## flodel
If either of them wants to re-post and accept credit for their answer, I'm more than happy to delete mine here.
Explanation
The general function that you are looking for is grepl. From the help file for grepl:
grepl returns a logical vector (match or not for each element of x).
Additionally, you should read the help page for regex which describes what character classes are. In this case, you create a character class ['pyfgcrl], which says to look for any character in the square brackets. You can then negate this with !.
So, up to this point, we have something that looks like:
!grepl("['pyfgcrl]", vector)
To get what you are looking for, you subset as usual.
vector[!grepl("['pyfgcrl]", vector)]
For the second solution, offered by @flodel: grep by default returns the positions where matches are made; the value = TRUE argument lets you return the actual string values instead, and invert = TRUE means to return the values that were not matched.

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have been given next to no guidance, so I was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guidelines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here, so while I think I know a way to get this to work, I'm really not too sure about how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then, for each correctly spelled word that's loaded in, add the correctly spelled word as a key in the map and populate the vector with all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled words. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, because there would surely be thousands of permutations for each word? It also seems like it'd be very, very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simplest way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while removing a letter creates only a few "bad words", for addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental steps; the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would be better performance-wise)
Key points:
The Levenshtein Distance computation uses an array; at each step the next row is computed solely from the previous row
The minimum distance in a row is always greater than (or equal to) the minimum in the previous row
The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (threshold), then whenever the minimum of the current row is greater than 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance (a minimal sketch follows below).
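Here is a sketch of that short-circuited computation (the function name boundedLevenshtein and the two-row layout are my own choices):

#include <algorithm>
#include <string>
#include <vector>

// Sketch: classic two-row Levenshtein that returns threshold + 1 as soon as
// every entry of the current row exceeds the threshold.
std::size_t boundedLevenshtein(const std::string& a, const std::string& b,
                               std::size_t threshold)
{
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        std::size_t rowMin = curr[0];
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,            // deletion
                                 curr[j - 1] + 1,        // insertion
                                 prev[j - 1] + cost });  // substitution
            rowMin = std::min(rowMin, curr[j]);
        }
        if (rowMin > threshold) return threshold + 1;    // abandon the computation
        prev.swap(curr);
    }
    return prev[b.size()];
}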
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the (short-circuited) distance and we list those words which achieved the smallest distance so far.
It works very well on smallish dictionaries.
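A minimal sketch of that scan, assuming the boundedLevenshtein sketch from step 0 above:

#include <string>
#include <vector>

// Sketch: scan the whole dictionary and keep the words with the smallest
// short-circuited distance seen so far.
std::vector<std::string> suggest(const std::vector<std::string>& dict,
                                 const std::string& query, std::size_t threshold)
{
    std::vector<std::string> best;
    std::size_t bestDist = threshold + 1;
    for (const std::string& word : dict) {
        std::size_t d = boundedLevenshtein(word, query, threshold);
        if (d < bestDist)                         { bestDist = d; best.assign(1, word); }
        else if (d == bestDist && d <= threshold)   best.push_back(word);
    }
    return best;
}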
2. Improving the data structure
The Levenshtein distance is at least equal to the difference in length.
By using the pair (length, word) as the key instead of just the word, you can restrict your search to the range of lengths [length - edit, length + edit] and greatly reduce the search space.
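A sketch of that restriction, assuming the dictionary is kept as a sorted std::vector of (length, word) pairs:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sketch: only scan words whose length is within `threshold` of the query's length.
void searchByLength(const std::vector<std::pair<std::size_t, std::string>>& dict,
                    const std::string& query, std::size_t threshold)
{
    std::size_t lo = query.size() > threshold ? query.size() - threshold : 0;
    auto first = std::lower_bound(dict.begin(), dict.end(),
                                  std::make_pair(lo, std::string()));
    auto last  = std::lower_bound(dict.begin(), dict.end(),
                                  std::make_pair(query.size() + threshold + 1,
                                                 std::string()));
    for (auto it = first; it != last; ++it) {
        // run the short-circuited distance against it->second here
    }
}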
3. Prefixes and pruning
To improve on this, we can remark that when we build the distance matrix, row by row, one word is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to store the dictionary sorted? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too large); then any word beginning with car won't work either: you can skip words as long as they begin with car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
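A minimal sketch of the binary-search skip (it assumes the last character of the prefix is not already the largest possible char value):

#include <algorithm>
#include <string>
#include <vector>

// Sketch: find the first word in the sorted dictionary, at or after `from`,
// that no longer starts with `prefix`.
std::vector<std::string>::const_iterator
skipPrefix(const std::vector<std::string>& dict,
           std::vector<std::string>::const_iterator from,
           std::string prefix)
{
    ++prefix.back(); // "car" -> "cas", which sorts after every "car..." word
    return std::lower_bound(from, dict.end(), prefix);
}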
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when you begin computing carwash, you can in fact start iterating at w.
To do this, simply use an array allocated right at the beginning of your search, and make it large enough to accommodate the largest reference (you should know what the largest word length in your dictionary is).
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia tree to store the dictionary. However, it's not an STL data structure, and you would need to augment it to store in each subtree the range of word lengths it contains, so you'll have to make your own implementation. It's not as easy as it seems because there are memory-explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at Peter Norvig's explanation of how to write a spelling corrector:
How to write a spelling corrector
Everything is well explained in that article; as an example, the Python code for the spell checker looks like this:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig's website.
For a spell checker, many data structures would be useful, for example a BK-tree. Check out Damn Cool Algorithms, Part 1: BK-Trees. I have done an implementation of the same here.
My earlier code link may be misleading; this one is correct for the spelling corrector.
Off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
It should be quite fast, but I'm not sure about the code itself; I'm not very well-versed in C++.

Deduplicating an array of keywords (but not based on EXACT match)

I have a list of a few thousand terms. There is significant overlap in those terms, but in different forms. For example (ruby, a_ruby), (triathlon, triathlete, triathletes), (nonprofit, non_profit, non_profits).
Most of these have a significant amount of character overlap, but not in exactly the same form, for example (nonprofit and non_profit).
What regex sequence would be best for this? I know that I can use stemming as well, but I am wondering how I can combine that with the regex.
For a single list of a few thousand items, I'd consider an alternate approach.
Sort the list alphabetically then manually remove the duplicates. Whatever regex and subsequent processing you end up with will probably take as much time if not more than going through the list manually.
Of course, I'm assuming this is a one-time proposition. I defer to regex experts for a programmatic solution.
I agree with Bob Kaufman that you should do a first pass to eliminate actual duplicates. After that, you have a problem that regex cannot solve for you; you will need to look into measurements of edit distance to get anywhere with it.
My usual strategy in this situation, which is not perfectly reliable, is as follows:
1) Remove all nonalphanumeric characters.
2) Make all strings lowercase.
3) Put all of the strings in a HashSet (this will remove duplicates).
4) Check for any cases where word and word+"s" are both in the set, and remove the plural one.
5) Output the strings in alphabetical order, and do a quick manual search for duplicates. If any are found, define new rules accordingly.
Other rules you may need:
Replace "&" with "and".
Remove all instances of "inc".
Replace all instances of "television" with "TV".
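If it helps, here is a minimal C++ sketch of steps 1-3, using std::unordered_set in place of the HashSet (the function name and exact rules are placeholders for illustration):

#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: strip non-alphanumeric characters, lowercase, and collapse exact
// duplicates with a hash set.
std::vector<std::string> normalize(const std::vector<std::string>& terms)
{
    std::unordered_set<std::string> seen;
    std::vector<std::string> out;
    for (const std::string& term : terms) {
        std::string key;
        for (unsigned char c : term)
            if (std::isalnum(c))
                key += static_cast<char>(std::tolower(c));
        if (seen.insert(key).second)   // keep only the first spelling of each key
            out.push_back(key);
    }
    return out;
}

With this, "nonprofit", "non_profit", and "non_profits" collapse to "nonprofit" and "nonprofits", and step 4 (the plural rule) then merges those two.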