Having trouble grokking multiple exclusive combinations in C++

I have an ordered string that I need to present to a user:
ABCCDDCBBBCBBDDBCAAA
Objects represented by 'B' are tagged, such that 2 Bs will have a '~' after them.
AB~CCDDCB~BBCBBDDBCAAA
AB~CCDDCBB~BCBBDDBCAAA
AB~CCDDCBBB~CBBDDBCAAA
and so on...
I've used the combinations library by Howard Hinnant and it works well for this simple case. My test code uses a vector of int locations that is passed through for_each_combination.
However, I'm lost as to what to do when I have multiple tags for B.
For example, 4 Bs total need to be tagged, 2 by '~' and 2 by '#'
AB~CCDDCB~B#B#CBBDDBCAAA
AB#CCDDCB~B~B#CBBDDBCAAA
AB~CCDDCB#B~B#CBBDDBCAAA
AB#CCDDCB#B~B~CBBDDBCAAA
ABCCDDCB~B~B#CB#BDDBCAAA
and so on...
The pseudocode I've written out is a cascade. After the first for_each_combination, for each of the resulting combinations, copy every other location to another vector and do another for_each_combination.
Considering the number of combinations I will be working with, I'm hoping there's a better way.

This sounds like homework, so first of all I'm not going to just give you code, and secondly I added the [homework] tag.
Now, with four markers to place, a reasonable solution is a set of nested for loops, four of them.
The markers are restricted to 7 positions, so just count from 0 to 6, inclusive.
Then translate that to positions in the string.
Then output the string with the markers.
I've verified that this is simple to do.
If you need further help with it, just ask a new question (and show what you've got so far).

I believe I've answered my own question after a bit of looking.
First, I switched from using Howard's library to Hervé's combinations library. The main draw is that using next_combination allows me to chain the combination calculations together, like so:
// values is a vector<int> of candidate tag positions
do {
    do {
        for (int v : values) cout << v << ' ';   // print the current combination
        cout << endl;
    } while (next_combination(values.begin() + 3, values.begin() + 5, values.end()));
} while (next_combination(values.begin(), values.begin() + 3, values.end()));
I need to massage this into an iterator, but this is exactly what I need.

Related

How to get all cells that appear more than 5 times?

I have a table in OpenOffice that contains a column with region codes (column J). Using table functions, how can I get all codes that appear more than 5 times and write them in one cell?
Normally I would recommend breaking this problem down into smaller parts using helper columns. Or better yet, move the data into LibreOffice Base which can easily work with distinct values.
However, I managed to come up with a rather large formula that seems to do what you asked. Enter it as an array formula.
=TEXTJOIN(",";1;IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5;IF(ROW(исходник.J2:J552)=MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1;исходник.J2:J552;"")))
I can't test this on your actual data since your example is only an image, but let's say that there are six of both 77 and 37. Then this would show 77,37 as the result.
Here is a breakdown. Look up the functions in LibreOffice Online Help for more information.
=TEXTJOIN(",";1; — Join all results into a single cell, separated by commas.
IF(COUNTIF(исходник.J$2:J$552;исходник.J2:J552)>5; — Find codes that occur more than 5 times. This is the same as what you wrote.
IF(ROW(исходник.J2:J552)= — Compare the next result to the row number that we are currently looking at.
MATCH(исходник.J2:J552;исходник.J$2:J$552;0)+ROW(J$2)-1; — Determine the first row that has this code. We do this to get unique results instead of 6 or more of each code in the result.
исходник.J2:J552;""))) — Return the code. (Your formula simply returns 1 here, which doesn't seem to be what you want.) If it doesn't match, return an empty string rather than 0, because TEXTJOIN ignores empty strings.

How to generate all possible 4 letter words in C++

I want to write a program that can give me all 4-letter words (from the dictionary or outside the dictionary). I code in C++, and so far I've gotten nowhere.
I'm a beginner in C++; I can apply the logic, but I haven't been introduced to the advanced features of C++. It doesn't matter if it takes a long time for this program to finish executing, I just want the solution.
For example:
abcd
king
ngik
cbda
play
lpay
payl
and so on (just a few of the millions of outputs I hope this program to output).
NOTE: The words generated need not make sense and I do not want to discard any combinations, I want it all.
I suggest looping say i from 0 to 26^4 - 1, each time outputting 'A' + i / (26*26*26), 'A' + i / (26*26) % 26, 'A' + i / 26 % 26, and 'A' + i % 26, then a newline.
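A minimal sketch of that counting approach (using lowercase letters to match the question's examples):
#include <iostream>

int main() {
    // treat i as a 4-digit base-26 number; each digit selects one letter
    for (int i = 0; i < 26 * 26 * 26 * 26; ++i) {
        std::cout << char('a' + i / (26 * 26 * 26))
                  << char('a' + i / (26 * 26) % 26)
                  << char('a' + i / 26 % 26)
                  << char('a' + i % 26) << '\n';
    }
}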
Make an array that has all the possible letters in it (add numbers and symbols if you want).
Then use four nested for loops, each looping a number from 0 to the length of the array.
Let's say that the loop variables are a, b, c, d.
In the inner loop (the last one) you can output array[a], array[b], array[c] and array[d] concatenated (stream them individually or build a string; adding chars directly would sum their numeric values).
This gives all the possible combinations, and you can add numbers and symbols as well.
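A quick sketch of those nested loops:
#include <iostream>
#include <string>

int main() {
    const std::string letters = "abcdefghijklmnopqrstuvwxyz";  // add digits/symbols here if you want
    for (std::size_t a = 0; a < letters.size(); ++a)
        for (std::size_t b = 0; b < letters.size(); ++b)
            for (std::size_t c = 0; c < letters.size(); ++c)
                for (std::size_t d = 0; d < letters.size(); ++d)  // inner loop: output one word
                    std::cout << letters[a] << letters[b] << letters[c] << letters[d] << '\n';
}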
Use recursion.
Explore a 26-ary tree depth-first and output the letter every time you go down the corresponding branch; depth = 4.
P.S. Recursive algorithms eat stack memory like hell, so be sure your machine lets programs have enough stack memory. You don't want to run this on a PIC micro with only a 3-level call stack :-)
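A small sketch of that depth-first walk (26-ary tree, depth 4, lowercase letters assumed):
#include <iostream>
#include <string>

// Depth-first walk of a 26-ary tree: each level appends one letter.
void generate(std::string& word, int depth) {
    if (depth == 4) {              // reached a leaf: print the 4-letter word
        std::cout << word << '\n';
        return;
    }
    for (char c = 'a'; c <= 'z'; ++c) {
        word.push_back(c);         // go down the branch for this letter
        generate(word, depth + 1);
        word.pop_back();           // backtrack
    }
}

int main() {
    std::string word;
    generate(word, 0);
}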

Checking if a string contains an English sentence

Right now, I take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next newline, then I do string.find() to see if that English word is somewhere in the string being checked. This takes a VERY long time; each word takes about 1/4 to 1/2 a second to verify.
It works perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (multithreading), but it still only checks about 10 a second. (I need thousands.)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but it still won't reach the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't check for impossible combinations of letters, because that string would be thrown out because of the 'tm' in 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Now your searches for individual words will proceed at a faster pace, because the "false starts" will be eliminated.
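A minimal sketch of that setup: build the failure table once per dictionary word, then reuse it for every candidate string (the function names here are just illustrative).
#include <string>
#include <vector>

// Build the KMP failure table for one dictionary word (done once per word).
std::vector<std::size_t> buildTable(const std::string& word) {
    std::vector<std::size_t> fail(word.size(), 0);
    std::size_t k = 0;
    for (std::size_t i = 1; i < word.size(); ++i) {
        while (k > 0 && word[i] != word[k]) k = fail[k - 1];
        if (word[i] == word[k]) ++k;
        fail[i] = k;
    }
    return fail;
}

// True if word occurs in text; on a mismatch the table lets the search skip
// ahead instead of restarting from scratch, which is what removes the false starts.
bool kmpContains(const std::string& text, const std::string& word,
                 const std::vector<std::size_t>& fail) {
    std::size_t k = 0;
    for (char c : text) {
        while (k > 0 && c != word[k]) k = fail[k - 1];
        if (c == word[k]) ++k;
        if (k == word.size()) return true;
    }
    return false;
}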
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
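A rough sketch of Idea 1, assuming lowercase input; the names buildSuffixIndex and mightContain are just illustrative:
#include <array>
#include <string>

// For each letter, keep the longest suffix of text that begins with that letter
// (i.e. the suffix starting at the letter's first occurrence); empty means "not present".
std::array<std::string, 26> buildSuffixIndex(const std::string& text) {
    std::array<std::string, 26> index;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char c = text[i];
        if (c >= 'a' && c <= 'z' && index[c - 'a'].empty())
            index[c - 'a'] = text.substr(i);
    }
    return index;
}

// A dictionary word only needs to be searched in the suffix for its first letter.
bool mightContain(const std::array<std::string, 26>& index, const std::string& word) {
    if (word.empty() || word[0] < 'a' || word[0] > 'z') return false;
    const std::string& s = index[word[0] - 'a'];
    return !s.empty() && s.find(word) != std::string::npos;
}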
Idea 2
Expand upon that idea by noting the longest word in the dictionary and getting rid of the letters in those stored strings that are more than that distance away.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with Ideas 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y?)|o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives and, if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code that looks at the binary-tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
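For what it's worth, here is a rough sketch of that linked structure written as a plain trie over lowercase letters rather than the first-child/next-sibling layout in the diagram; the '*' in the diagram corresponds to the isWord flag, and the code-generation variant from the side note would emit equivalent switch/if logic instead.
#include <array>
#include <memory>
#include <string>

// A minimal trie over lowercase letters.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    bool isWord = false;   // the '*' marker: a dictionary word ends here
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->next[c - 'a'];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true;
}

// Does any dictionary word start at position pos of text?
bool matchAt(const TrieNode& root, const std::string& text, std::size_t pos) {
    const TrieNode* node = &root;
    for (std::size_t i = pos; i < text.size(); ++i) {
        node = node->next[text[i] - 'a'].get();
        if (!node) return false;        // ran out of alternatives: no match here
        if (node->isWord) return true;  // reached a '*': some word matches
    }
    return false;
}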
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
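A toy sketch of how that could look in C++; the bit count, the number of hashes, and the seeded std::hash trick are illustrative choices, not tuned values:
#include <bitset>
#include <functional>
#include <string>

// A toy Bloom filter: k "hash functions" simulated by seeding std::hash with a suffix.
class BloomFilter {
    static constexpr std::size_t kBits = 1 << 20;   // bit-array size (illustrative)
    static constexpr std::size_t kHashes = 4;       // number of hash functions (illustrative)
    std::bitset<kBits> bits_;

    static std::size_t hash(const std::string& s, std::size_t seed) {
        return std::hash<std::string>{}(s + char('A' + seed)) % kBits;
    }

public:
    void add(const std::string& word) {
        for (std::size_t seed = 0; seed < kHashes; ++seed) bits_.set(hash(word, seed));
    }

    // false -> definitely not in the set; true -> probably in the set (confirm exactly).
    bool mightContain(const std::string& word) const {
        for (std::size_t seed = 0; seed < kHashes; ++seed)
            if (!bits_.test(hash(word, seed))) return false;
        return true;
    }
};
You would add every dictionary word once with add(), then call mightContain() on each candidate substring and only run the exact check when it returns true.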
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k
    return costsum
Using the same dictionary as that answer and testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.
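As a rough illustration of the idea (a simple character-bigram model, not anything tuned): train it on space-stripped English text, then score a candidate string by its average transition log-probability and compare against a threshold learned from known-good and known-garbage samples.
#include <array>
#include <cmath>
#include <string>

// Character-bigram model over 'a'..'z'; assumes input is lowercased with spaces stripped.
struct BigramModel {
    std::array<std::array<double, 26>, 26> counts{};   // counts[prev][next]

    void train(const std::string& text) {
        for (std::size_t i = 1; i < text.size(); ++i)
            counts[text[i - 1] - 'a'][text[i] - 'a'] += 1.0;
    }

    // Average log-probability per transition; higher means "more English-like".
    double score(const std::string& text) const {
        double logp = 0.0;
        std::size_t n = 0;
        for (std::size_t i = 1; i < text.size(); ++i, ++n) {
            const auto& row = counts[text[i - 1] - 'a'];
            double total = 26.0;                         // add-one smoothing
            for (double c : row) total += c;
            logp += std::log((row[text[i] - 'a'] + 1.0) / total);
        }
        return n ? logp / n : 0.0;
    }
};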

Complex regex to check for two words and a quantity

OK, what I'm trying to do is check for the presence of:
"TestItem-1"
a number greater than 1
one of the possible words in the list of "KG. Kg, kg, Kilo(s) or Kilogram(s)"
Any of the items could be in any order and within a 6-word limit of each other.
It has to be done in regex, as there is no access to the underlying scripting engine.
This is what I've got. As there isn't a way of checking "greater than", I decided to use a range of 1-999 for the number check.
\b(?:[T|t]estItem-1\W+(?:\w+\W+){1,6}(^[0-9]|[1-9][0-9]|[1-9][0-9][0-9])$)\W+(?:\w+\W+){1,6}[K|k]il[o|os]|[K|k][[G|GS]|[g|gs]]|[|K|k]ilogra[m|ms]\b
Examples of what I need to find would be like -
"TestItem-1 is unstable in quanties above 12 Kilograms"
"1 Kilogram of TestItem-1"
While I wouldn't want to find -
"15 units of TestItem-1"
I know that what I've got isn't working; each section appears to work independently, but not together.
I pass this over to far greater minds than mine :)
You can try something like this:
\b(?:[2-9]|\d\d+)\b\s\b(?:KG.|Kg,|kg,|Kilos?|Kilograms?)\b(?:\S+\s){0,6}\bTestItem-1\b|\bTestItem-1\b(?:\S+\s){0,6}\b(?:[2-9]|\d\d+)\b\s\b(?:KG.|Kg,|kg,|Kilos?|Kilograms?)\b
Not ideal with the duplication but without lookarounds that's the best I could think of. I'll try and improve it in a bit.

Similar String algorithm

I'm looking for an algorithm, or at least a theory of operation, on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two and decide how closely they match. In my case, spelling and punctuation are going to be important, but I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, I learned he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper; I hope it is helpful to those reading this question and looking into similar string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.
Finding the cost of rearrangement is kind of like the pancake sorting problem. You can permute every combination of words of one string against every combination of the other string, trying to minimize a combination of permutation distance and Levenshtein distance on each word pair.
edit:
Now that I have a second I can post a quick example (all 'best' guesses are by inspection, not from actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total substitutions: 2, deletions: 3, insertions: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total substitutions: 0, deletions: 2, insertions: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total substitutions: 2, deletions: 1, insertions: 1)
Any way you make the cost function, the second choice will be the lowest cost, which is what you expected!
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s) + comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
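A sketch of that score using zlib as the compressor (assuming zlib is available; for very short strings the compressor's fixed overhead dominates, so this works best on longer text):
#include <string>
#include <vector>
#include <zlib.h>

// Length of the zlib-compressed form of s: a stand-in for len(comp(s)) above.
std::size_t compressedLen(const std::string& s) {
    uLongf destLen = compressBound(s.size());
    std::vector<Bytef> dest(destLen);
    compress2(dest.data(), &destLen,
              reinterpret_cast<const Bytef*>(s.data()), s.size(),
              Z_BEST_COMPRESSION);
    return destLen;
}

// Symmetric similarity score: 2*(len(comp(s)) + len(comp(t))) - len(comp(s.t)) - len(comp(t.s)).
long similarity(const std::string& s, const std::string& t) {
    long ls = long(compressedLen(s)), lt = long(compressedLen(t));
    long lst = long(compressedLen(s + t)), lts = long(compressedLen(t + s));
    return 2 * (ls + lt) - lst - lts;
}
Identical strings score close to twice their compressed length, while unrelated strings score near zero, as described above.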
One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculating how many edits it takes to transform one string into another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance
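For reference, a standard dynamic-programming implementation of that edit distance:
#include <algorithm>
#include <string>
#include <vector>

// Number of single-character insertions, deletions, and substitutions needed to turn a into b.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, sub});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}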
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello" when compared needed to give me five characters that are the same, in this case the two "hello"s. It would then take the length of the longer of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using namespace std;

int bigger(string a, string b);   // forward declaration so compare can call it

// Return the percentage of the longer string covered by the longest exactly matching run.
int compare(string a, string b){
    return (a.size() > b.size() ? bigger(a, b) : bigger(b, a));
}

int bigger(string a, string b){
    int maxcount = 0, currentcount = 0;   // used to see which run of consecutive matching characters was biggest
    for(size_t i = 0; i < a.size(); ++i){
        for(size_t j = 0; j < b.size(); ++j){
            if(i + j < a.size() && a[i + j] == b[j]){   // bounds check keeps a[i + j] inside the string
                ++currentcount;
            }
            else{
                if(currentcount > maxcount){
                    maxcount = currentcount;
                }
                currentcount = 0;
            }
        }
        if(currentcount > maxcount){   // a run that reaches the end of b still counts
            maxcount = currentcount;
        }
        currentcount = 0;
    }
    return (int)(((float)maxcount / (float)a.size()) * 100);
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But, not mentioned in the question, was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform. Image B also. Superimposing F(A) over F(B) and then transforming back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. The total sum of the spots would indicate an overall similarity. I'm not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string, e.g. they both have "blue" and "sky", and they're in the same sentence, etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for that you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that will solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
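For concreteness, here is a rough sketch under a few simplifying assumptions: it stores signed distances for every ordered word pair of the reference string, skips comparison pairs that never occur in the reference (rather than penalizing them), and leaves out the direction-based weighting and the both-directions averaging described above.
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a string on whitespace.
static std::vector<std::string> tokens(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> out;
    std::string w;
    while (in >> w) out.push_back(w);
    return out;
}

// "word1|word2" -> signed distances between the two words in the reference string.
static std::map<std::string, std::vector<int>> pairDistances(const std::vector<std::string>& w) {
    std::map<std::string, std::vector<int>> d;
    for (std::size_t i = 0; i < w.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            if (i != j) d[w[i] + "|" + w[j]].push_back(int(j) - int(i));
    return d;
}

// Sum, over word pairs of the comparison string, of the smallest difference between the
// pair's distance here and any distance for the same pair in the reference, then divide
// by the square of the comparison string's word count. Lower means more similar.
double similarity(const std::string& reference, const std::string& comparison) {
    auto ref = pairDistances(tokens(reference));
    auto cmp = tokens(comparison);
    double total = 0.0;
    for (std::size_t i = 0; i < cmp.size(); ++i)
        for (std::size_t j = i + 1; j < cmp.size(); ++j) {
            auto it = ref.find(cmp[i] + "|" + cmp[j]);
            if (it == ref.end()) continue;               // pair absent from the reference: skip
            double best = 1e9;
            for (int r : it->second) {
                double diff = std::fabs(double(r - int(j - i)));
                if (diff < best) best = diff;
            }
            total += best;
        }
    return cmp.empty() ? 0.0 : total / double(cmp.size() * cmp.size());
}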