Counting the occurrences of substrings - C++

Is there an efficient algorithm to count the total number of occurrences of a sub-string X in a longer string Y?
To be more specific, what I want is the total number of ways of selecting X.size() elements from Y such that there exists a permutation of the selected elements that matches X.
An example is as follows: count the total number of occurrences of X=AB in the string Y=ABCDBFGHIJ.
The answer is 2: the A paired with the B at position 2, and the A paired with the B at position 5.
I know we can generate all permutations of the long string (which gives N! strings of length N) and use the KMP algorithm to search for and count the occurrences of X in each.
Can we do better than that?
The original problem I am trying to solve is as follows: say we have a large matrix M of size r by c (r and c in the range of 10000s). Given a small matrix P of size a by b (a and b in the range of 10s), find the total number of different selections of a rows and b columns of M (this gives us an a by b "submatrix" H) such that there exists a permutation of the rows and columns of H that matches P.
I think once I can solve the 1-D case, the 2-D case may follow from the solution.
After some research, I found out that this is a subgraph isomorphism problem, which is NP-hard. There are algorithms that solve it efficiently; one can Google it and find many papers on the subject.

After having read, then re-read the question (at @Charlie's suggestion), I have concluded that these answers are not addressing the real issue. I have also concluded that I still do not know exactly what the issue is, but if the OP answers my questions and clarifies the issue, then I will come back and make a better attempt at addressing it. For now, I will leave this as a placeholder...
To find occurrences of a letter or other character:
char buf[]="this is the string to search";
int i, count=0, len;
len = strlen(buf);
for(i=0;i<len;i++)
{
if(buf[i] == 's') count++;
}
or, find occurrences of a sub-string. (Note: strtok() is not suited to this - it treats its second argument as a set of single-character delimiters, not as a substring, and it modifies the buffer it scans - so strstr() is used here instead.)
Not pretty, brute force method.
// strings to search for
char str1[] = "is";
char str2[] = "s";
int count = 0;
char buf[] = "this is the string to search";
char *p;

p = strstr(buf, str1);
while (p) {
    count++;
    p = strstr(p + 1, str1);  // advance past this match
}

p = strstr(buf, str2);
while (p) {
    count++;
    p = strstr(p + 1, str2);
}
count should now contain the total of occurrences of "is", plus occurrences of "s"
[EDIT]
First, let me ask for a technical clarification of your question: given A = "AR", B = "START", the solutions would be "A", "R" and "AR", in this case all found in the 3rd and 4th letters of B. Is that correct? If so, that's easy enough. You can do that with some small modifications and additions to what I have already done above, and if you have questions about that code, I would be happy to address them if I can.
The second part is your real question: searching with better than, or at least the same, efficiency as the KMP algorithm - that's the real trick. If choosing the best approach is the real question, then some Google searching is in order. Because once you find, and settle on, the best approach (efficiency >= KMP) to the sub-string search, the implementation will be a set of simple steps (if you give it enough time), possibly, but not necessarily, using some of the same components of C used above. (Pointer manipulation will be faster than using the string functions, I think.) But these techniques are just implementation, and should always follow a good design. Here are a few Google searches to help you get started... (you may have already been to some of these)
Validating KMP
KMP - Can we do better?
KMP - Defined
KMP - Improvements using Fibonacci String
Once you have made your algorithm selection and begun to implement your design, if you have questions about techniques or want coding suggestions, post them. My guess is there are several people here who would enjoy helping with such a useful algorithm.

If X is to occur in Y this way, then each character of X must be in Y. So we first iterate through X and record the count of each character in an array counts.
Then, for each character with count >= 1, we count the number of times it appears in Y, which can be done trivially in O(n).
From here the answer should just be the product of the combinations C(count_Y(c), count_X(c)) over every character c that occurs in X.
That is, if after the 3rd time reading your question I have finally understood it correctly.
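A minimal sketch of that counting idea (my own illustration, assuming only character multiplicities matter, using the example from the question):

#include <iostream>
#include <string>

// C(n, k) computed iteratively; fine for small counts.
unsigned long long choose(unsigned long long n, unsigned long long k) {
    if (k > n) return 0;
    unsigned long long r = 1;
    for (unsigned long long i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

int main() {
    std::string X = "AB", Y = "ABCDBFGHIJ";
    unsigned long long countX[256] = {0}, countY[256] = {0};
    for (unsigned char c : X) countX[c]++;
    for (unsigned char c : Y) countY[c]++;

    unsigned long long ways = 1;
    for (int c = 0; c < 256; ++c)
        if (countX[c] > 0)
            ways *= choose(countY[c], countX[c]);  // C(count_Y, count_X)

    std::cout << ways << "\n";  // prints 2 for this example
}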


About time complexity of the algorithm

I am trying to solve the following question from LeetCode.com:
Given an input string, reverse the string word by word. Thus, "the sky is blue" should become "blue is sky the".
I came up with the following code snippet:
#include <sstream>
#include <string>
using namespace std;

class Solution {
public:
    void reverseWords(string &s) {
        if (s.empty()) return;
        istringstream iss(s);
        string data, ans;
        while (iss >> data) {
            ans.insert(0, data + " ");      // prepend each word
        }
        s = ans.substr(0, ans.size() - 1);  // drop the trailing space
    }
};
and I was wondering about the time complexity of the same. I think that it is O(n^2) where n is the number of words in the input string. Could someone please confirm?
Thanks. (^_^)
I believe this algorithm's complexity is a bit more (haha) complex than other answers assume, intuitively down to the fact that we're prepending (not appending) in a loop, and also because there's a bit of oversimplifying going on.
To be formal (and correct), the other analyses here aren't using enough variables - let's call w the number of words in the sentence and l the maximum length of a word in this sentence.
Then iss >> data is O(l) ("at most as expensive as the longest word"). Over w iterations of the loop, this is O(wl).
ans.insert(0, data + " ") is more complicated - insert is O(x + y) for x the length of the existing content and y the length of the new content. As the length of the existing content keeps growing (by at most l+1 characters each time), the complexity of this call isn't entirely obvious.
The cost of performing w prepends is at most l + 2l + 3l + ... + wl - on each iteration we have to pay for all the words we've previously added as well as the word we're adding now. This has a closed-form expression:
l(1+2+...+w) = l * w(w+1)/2, and this is O(l*w^2).
Putting it together, the cost of the loop is O(wl + l*w^2), which is just O(l*w^2). It's informally "quadratic" in the number of words, but it depends on more than just one variable n, so it's best to classify it as a function of all the relevant ones.
Ps. One of the easy mistakes to make with big O notation is to always just talk about n - but what is n? In this example, we depend on more than just one variable, so using n can be misleading. insert is O(n) where n is the new length - but if you're already talking about n with regards to some other parameter (like the number of words), mistakes will happen.
PPs. Please point out mistakes/corrections in my analysis!
PPPs. insert isn't guaranteed to be O(x + y) as I claimed above - but it's safe to assume this complexity.
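For contrast, here is a linear-time variant (my own sketch, not from the original answer): collect the words once, then join them in reverse order, so there are no repeated prepends and total work is linear in the length of the string.

#include <sstream>
#include <string>
#include <vector>
using namespace std;

void reverseWordsLinear(string &s) {
    istringstream iss(s);
    vector<string> words;
    string word;
    while (iss >> word) words.push_back(word);  // O(total length)
    string ans;
    for (auto it = words.rbegin(); it != words.rend(); ++it) {
        if (!ans.empty()) ans += ' ';
        ans += *it;  // appending is amortized O(1) per character
    }
    s = ans;
}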
The complexity is O(N), as correctly explained by J.Doe
istringstream extraction is O(n), and prepending at the beginning (with the consequent moves) is another O(n).
See here for an explanation of why the constant 2 is not relevant:
https://en.wikipedia.org/wiki/Big_O_notation#Example
If f(x) is a product of several factors, any constants (terms in the product that do not depend on x) can be omitted.

Given a string, find two identical subsequences with consecutive indexes - C++

I need to construct an algorithm (not necessarily efficient) that, given a string, finds and prints two identical subsequences (by "print" I mean, for example, coloring them). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing I am looking for is called "tight twins", if that helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) Consider the string 231213231.
It has two subsequences of the form "123". The first subsequence uses indexes 3, 4, 6 and the second uses indexes 5, 7, 8; together they cover exactly the consecutive range 3..8, so they have all the properties I need.
2) Consider the string 12341234: the twins "1234" occupy indexes 1-4 and 5-8.
3) Consider the string 12132344: the twins "1234" use indexes 1, 2, 4, 7 and 3, 5, 6, 8.
Now it gets more complicated:
4) Consider the string 13412342: the twins "1342" use indexes 1, 2, 3, 5 and 4, 6, 7, 8, so they interleave rather than sitting side by side.
I think these examples explain well enough what I mean.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code (Windows console API):
#include <windows.h>

HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is the color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
  var n = s.length,
      char_idxs = {};

  // index every character's positions
  for (var i=0; i<n; i++){
    if (char_idxs[s[i]] == undefined){
      char_idxs[s[i]] = [i];
    } else {
      char_idxs[s[i]].push(i);
    }
  }

  var duplicates = new Set();
  for (var c in char_idxs){
    // character with odd count
    if (char_idxs[c].length & 1){
      return false;
    }
    if (char_idxs[c].length > 2){
      for (let j of char_idxs[c]){
        duplicates.add(j);
      }
    }
  }

  function f(i,j,js){
    // base case positive
    if (js.size == n/2 && j == n){
      return true;
    }
    // base case negative
    if (j > n || (n - j < n/2 - js.size)){
      return false;
    }
    // i is not less than j
    if (i >= j){
      return f(i,j + 1,js);
    }
    // this i is in the list of js
    if (js.has(i)){
      return f(i + 1,j,js);
    // yet to find twin, no match
    } else if (s[i] != s[j]){
      return f(i,j + 1,js);
    } else {
      // maybe it's a twin and maybe it's a duplicate
      if (duplicates.has(j)){
        var _js = new Set(js);
        _js.add(j);
        return f(i,j + 1,js) || f(i + 1,j + 1,_js);
      // it's a twin
      } else {
        js.add(j);
        return f(i + 1,j + 1,js);
      }
    }
  }

  return f(0,1,new Set());
}

console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
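A direct C++ transcription of this DP (my own sketch, checking only whether the whole string is TT; it inherits whatever issue is behind the warning above). To find the minimal TT substring as described, you'd run it column by column over every starting position instead.

#include <algorithm>
#include <string>
#include <vector>

// f(i, j) = length of the longest self-avoiding common subsequence of
// the length-i and length-j prefixes of s. s is TT iff that length
// reaches |s|/2 for the full string.
bool isTightTwinsDP(const std::string& s) {
    int len = s.size();
    if (len % 2 != 0) return false;
    std::vector<std::vector<int>> f(len + 1, std::vector<int>(len + 1, 0));
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            int m = f[i - 1][j - 1]
                  + ((s[i - 1] == s[j - 1] && i < j) ? 1 : 0);
            f[i][j] = std::max({f[i - 1][j], f[i][j - 1], m});
        }
    }
    // f is monotone in both arguments, so the maximum over i of
    // f(i, 2n) is just f(2n, 2n).
    return f[len][len] == len / 2;
}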
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all its characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
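As an illustration of the table that Approach 1 builds, here's a brute-force sketch (mine, not the answer author's; it enumerates all 2**N index masks, so it is only workable for tiny N). The surviving pairs would then be checked for the tight-twin property as described.

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::string s = "123123";
    int n = s.size();
    // table[sub] = bitmasks of the index sets where `sub` occurs as a
    // subsequence of length <= N/2
    std::map<std::string, std::vector<unsigned>> table;
    for (unsigned mask = 1; mask < (1u << n); ++mask) {
        std::string sub;
        for (int i = 0; i < n; ++i)
            if (mask & (1u << i)) sub += s[i];
        if (sub.size() > static_cast<size_t>(n) / 2) continue;
        table[sub].push_back(mask);
    }
    // drop all subsequences which appear only once - we don't need them
    for (auto it = table.begin(); it != table.end(); ) {
        if (it->second.size() < 2) it = table.erase(it);
        else ++it;
    }
    for (const auto& e : table)
        std::cout << e.first << " occurs " << e.second.size() << " times\n";
}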
Approach 2. Search for tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and each character appears in it an even number of times; then, its length being L, check whether it contains two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is check all of them. There are more interesting ways to search for tight twins, though.
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of non-deterministic finite automata (NDFAs), which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in an NDFA to put a bound on the size of the herd.
I think an NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here. I do notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates a "wild card" (the probe gave below a threshold quality score) and every other position is a base (A, C, T, or G). I tried to write a quaternary-tree algorithm, but that method was far too memory-intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j = 0; j < counter; j++)          //counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++)
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
//boring stuff

//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        //bounds checks added so we never read past either string
        while (offset < str1.length() && j + offset < str2.length() &&
               str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long matches - strings that have more than, say, 20 or 30 or even more than 80 characters in common - and you probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" question depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurence counter by one, else enter it into the table.
e) Repeat step 3d) with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random / you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
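A rough sketch of the loop in step 3) as I read it (my own illustration, not the answer author's code; filling the table in steps 1)-2) and the same-length X-comparison of step 3c) are elided):

#include <cstddef>
#include <map>
#include <string>

int main() {
    // Comparator: longest strings first, so begin() is the top of the list.
    auto longerFirst = [](const std::string& a, const std::string& b) {
        return a.size() != b.size() ? a.size() > b.size() : a < b;
    };
    std::map<std::string, int, decltype(longerFirst)> table(longerFirst);
    // ... steps 1)-2): enter each read's longest readable substrings here ...
    const std::size_t minLen = 20;  // below this, "too short to be interesting"
    while (!table.empty()) {
        auto top = table.begin();                  // step 3a
        std::string cur = top->first;
        int occurrences = top->second;
        table.erase(top);                          // step 3f
        if (cur.size() < minLen) break;            // step 3g
        if (occurrences > 1) { /* step 3b: found a match, mark it */ }
        ++table[cur.substr(1)];                    // step 3d: drop 1st char
        ++table[cur.substr(0, cur.size() - 1)];    // step 3e: drop last char
    }
}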
I don't see many ways around the fact that you need to compare each string with every other string, including shifting them - that is by itself very expensive, and a computation cluster seems the best approach.
The only thing I see that could be improved is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though it is still a possible option to investigate), and then compare positions quickly with an AND operation, so that you 'just' have to count how many consecutive non-zero values you have. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.
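A sketch of that encoding idea (my own illustration of this answer's suggestion, keeping one 4-bit code per byte rather than packing two per byte). Note it treats X as matching everything, unlike the question's code, which stops a run at an X.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Map each base to a one-hot nibble; 'X' (0x0F) overlaps every base.
uint8_t encodeBase(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        default:  return 0x0F;  // 'X' wildcard
    }
}

std::vector<uint8_t> encode(const std::string& s) {
    std::vector<uint8_t> v(s.size());
    for (size_t i = 0; i < s.size(); ++i) v[i] = encodeBase(s[i]);
    return v;
}

// Longest run of matching positions when b is laid over a at `offset`:
// two positions match iff the AND of their codes is non-zero.
int overlapAt(const std::vector<uint8_t>& a,
              const std::vector<uint8_t>& b, size_t offset) {
    int run = 0, best = 0;
    for (size_t i = 0; i + offset < a.size() && i < b.size(); ++i) {
        if (a[i + offset] & b[i]) best = std::max(best, ++run);
        else run = 0;
    }
    return best;
}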

Topic mining algorithm in C/C++

I am working on a subject-extraction-from-articles algorithm in C++.
First I have written code to remove words like articles, prepositions, etc.
Then the rest of the words get stored in one array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
    ch[i] = strdup(word);
    excluded_string[j] = strdup(word);
    word = strtok(NULL, " ");
    skp = BoyerMoore_skip(ch[i], strlen(ch[i]));
    if (skp != NULL)
    {
        i++;
        continue;
    }
    j++;
}
skp is NULL when ch[i] is not an article or a word from a similar category.
This function checks whether a word is an article, preposition, etc.
Now at the end excluded_string[] contains the set of required words. I want the occurrence count of each word in this array, and then the word with the maximum count - all of them, if more than one word shares the maximum.
What logic should I use?
What I thought of is:
Taking a two-dimensional array. The first column holds the word, and the second column stores its count value.
Then, for each word, scan the array, and for each occurrence of that word increment the count stored in the second column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ unordered_map type should do what you need. There's an example here.
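A minimal sketch along those lines (my own, not the linked example):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> words = {"topic", "mining", "topic", "text", "topic"};

    // One pass: bump each word's counter and track the running maximum.
    std::unordered_map<std::string, int> counts;
    int best = 0;
    for (const auto& w : words) {
        best = std::max(best, ++counts[w]);
    }
    // Print every word that attains the maximum (there may be ties).
    for (const auto& kv : counts)
        if (kv.second == best)
            std::cout << kv.first << " occurs " << best << " times\n";
}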

Levenshtein algorithm: How do I meet these text-editing requirements?

I'm using the Levenshtein algorithm to meet these requirements:
When finding a word of N characters, the words to suggest as corrections from my dictionary database are:
Every dictionary word of N characters that has 1 character of difference from the found word.
Example:
found word: bearn, dictionary word: bears
Every dictionary word of N+1 characters that has N characters equal to the found word.
Example:
found word: bear, dictionary word: bears
Every dictionary word of N-1 characters that has N-1 characters equal to the found word.
Example:
found word: bears, dictionary word: bear
I'm using this implementation of the Levenshtein algorithm in C++ to find when a word has a Levenshtein number of 1 (which is the Levenshtein number for all three cases), but then how do I choose the word to suggest? I read about Boyer-Moore-Horspool and Knuth-Morris-Pratt but I'm not sure how either of them can be helpful.
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

int levenshtein(const string &s1, const string &s2)
{
    string::size_type N1 = s1.length();
    string::size_type N2 = s2.length();
    string::size_type i, j;
    vector<int> T(N2 + 1);

    for (i = 0; i <= N2; i++)
        T[i] = i;

    for (i = 0; i < N1; i++) {
        T[0] = i + 1;
        int corner = i;
        for (j = 0; j < N2; j++) {
            int upper = T[j + 1];
            if (s1[i] == s2[j])
                T[j + 1] = corner;
            else
                T[j + 1] = min(T[j], min(upper, corner)) + 1;
            corner = upper;
        }
    }
    return T[N2];
}
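For the three cases in the question, this function does report a distance of 1 for each pair - a quick check that can be appended to the code above:

#include <cassert>

int main() {
    assert(levenshtein("bearn", "bears") == 1); // N chars, one substitution
    assert(levenshtein("bear", "bears") == 1);  // N+1 chars, one insertion
    assert(levenshtein("bears", "bear") == 1);  // N-1 chars, one deletion
}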
You may also want to add Norvig's excellent article on spelling correction to your reading.
It's been a while since I've read it, but I remember it being very similar to what you're writing about.
As I've said elsewhere, Boyer-Moore isn't really apt for this. Since you want to search for multiple strings simultaneously, the algorithm of Wu and Manber should be more to your liking.
I've posted a proof of concept C++ code in answer to another question. Heed the caveats mentioned there.
Why restrict the suggestion to a single word, why not include a set of words? If you are restricted to a single word, you can order your results by some pre-calculated frequency of usage or something. This frequency could be updated based on what users select from the suggestion.
Also, in the case where there isn't a spelling error in the original word, you might want to prioritize the N+1 cases, which would be more like an autocomplete. Anyway I don't think there is one correct way to do it, maybe if your requirements are more specific, it would be easier to narrow down.
Also, you don't need to know Python to understand the algorithms described in Norvig's article.
If I understand you correctly, then there is no correct answer to your question. You are identifying up to three suggestions for a given word using Levenshtein - it is up to you to come up with a rule to decide which one to use and which ones to filter out. Or perhaps you should use them all?
Just as a matter of interest, the Damerau extension to Levenshtein might be of interest to you, where two swapped characters are also considered to give a score of 1, instead of 2, which is what vanilla Levenshtein returns.
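If that's of interest, here is a sketch of that transposition tweak (the optimal-string-alignment variant; my own adaptation, using a full DP matrix for clarity rather than the single-row version above):

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

int damerau_levenshtein(const string &s1, const string &s2)
{
    size_t N1 = s1.length(), N2 = s2.length();
    // Full (N1+1) x (N2+1) matrix; row 0 / column 0 are pure deletions/insertions.
    vector<vector<int>> D(N1 + 1, vector<int>(N2 + 1));
    for (size_t i = 0; i <= N1; i++) D[i][0] = i;
    for (size_t j = 0; j <= N2; j++) D[0][j] = j;

    for (size_t i = 1; i <= N1; i++) {
        for (size_t j = 1; j <= N2; j++) {
            int cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            D[i][j] = min({ D[i - 1][j] + 1,           // deletion
                            D[i][j - 1] + 1,           // insertion
                            D[i - 1][j - 1] + cost }); // substitution
            // two swapped adjacent characters count as a single edit
            if (i > 1 && j > 1 && s1[i - 1] == s2[j - 2]
                               && s1[i - 2] == s2[j - 1])
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1);
        }
    }
    return D[N1][N2];
}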