Z-Function and unique substrings: broken algorithm parroted everywhere? - c++

I am not a huge math nerd so I may easily be missing something, but let's take the algorithm from https://cp-algorithms.com/string/z-function.html and try to apply it to, say, string baz. This string definitely has a substring set of 'b','a','z', 'ba', 'az', 'baz'.
Let's see how the Z-function works (at least how I understand it):
we take an empty string and add 'b' to it. By definition of the algo z[0] = 0 since it's undefined for size 1;
we take 'b' and add 'a' to it, invert the string, we have 'ab'... now we calculate z-function... and it produces {0, 0}. First element is "undefined" as is supposed, second element should be defined as:
i-th element is equal to the greatest number of characters starting from the position i that coincide with the first characters of s.
so, at i = 1 we have 'b', our string starts with a, 'b' doesn't coincide with 'a' so of course z[i=1]=0. And this will be repeated for the whole word. In the end we are left with z-array of all zeroes that doesn't tell us anything despite the string having 6 substrings.
Am I missing something? There are tons of websites recommending z function for count of distinct substrings but it... doesn't work? Am I misunderstanding the meaning of distinct here?
See test case: https://pastebin.com/mFDrSvtm

When you add a character x to the beginning of a string S, all the substrings of S are still substrings of xS, but how many new substrings do you get?
The new substrings are all prefixes of xS. There are length(xS) of these, but
max(Z(xS)) of these are already substrings of S, so
You get length(xS) - max(Z(xS)) new ones
So, given a string S, just add up all the length(P) - max(Z(P)) for every suffix P of S.
Your test case baz has 3 suffixes: z, az, and baz. All the letters are distinct, so their Z functions are zero everywhere. The result is that the number of distinct substrings is just the sum of the suffix lengths: 3 + 2 + 1 = 6.
Try baa: The only non-zero in the Z functions is Z('aa')[1] = 1, so the number of unique substrings is 3 + 2 - 1 + 1 = 5.
Note that the article you linked to mentions that this is an O(n^2) algorithm. That is correct, although its overhead is low. It's possible to do this in O(n) time by building a suffix tree, but that is quite complicated.
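To make the summation over suffixes concrete, here is a minimal C++ sketch of the approach described above. The function names zFunction and countDistinctSubstrings are illustrative choices, not taken from the linked article, and the Z-function is the standard linear-time version, called once per suffix.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Standard Z-function: z[i] is the length of the longest common prefix
// of s and s.substr(i); z[0] is left at 0 by convention.
std::vector<int> zFunction(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Adds up length(P) - max(Z(P)) over every suffix P of s.
// Each suffix costs O(|P|), so the whole thing is O(n^2).
long long countDistinctSubstrings(const std::string& s) {
    long long total = 0;
    for (std::size_t start = 0; start < s.size(); ++start) {
        const std::string suffix = s.substr(start);
        const std::vector<int> z = zFunction(suffix);
        const int zmax = *std::max_element(z.begin(), z.end());
        total += static_cast<long long>(suffix.size()) - zmax;
    }
    return total;
}
```

Calling countDistinctSubstrings("baz") gives 6 and countDistinctSubstrings("baa") gives 5, matching the two hand-worked cases.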

Related

Count how many substrings exist in a Fibonacci string

The problem is this:
You are given an integer N and a substring SUB
The Fibonacci String follows the following rules:
F[0] = 'A'
F[1] = 'B'
F[k] = F[k - 2] + F[k - 1]
(Meaning F[2] = 'AB', F[3] = 'BAB', F[4] = 'ABBAB', ...)
Task: Count how many times substring SUB appears in F[N]
Sample cases:
Input      Output
4 AB       2
6 BAB      4
(N <= 5 * 10^3, 1 <= SUB.length() <= 50)
I understand the problem overall and want to find a more efficient way to solve it.
My approach is to build the strings with the formula F[k] = F[k - 2] + F[k - 1], then loop over every index i of F[N] up to F[N].length - 1. At each i I extract a substring of F[N] with the same length as SUB (call it F_sub) and check whether F_sub equals SUB; if it does, I increase the count. (Yes, this approach is too slow for the big tests.)
I am also wondering whether dynamic programming is suited to this problem.
Starting with the first 2 strings that are at least as long as SUB, you should switch the representation of the strings F[n]. Instead of remembering the complete string, you only need to remember 3 numbers:
occurrences: the number of times SUB occurs within the string
prefix: The length of the longest prefix of the string that is a proper suffix of SUB
suffix: The length of the longest suffix of the string that is a proper prefix of SUB
Given o, p, and s for F[k] and F[k+1], you can calculate them for the concatenation F[k+2]:
F[k+2].p = F[k].p
F[k+2].s = F[k+1].s
F[k+2].o = F[k].o + F[k+1].o + JOIN(F[k].s,F[k+1].p)
The function JOIN(a,b) calculates the number of occurrences of SUB within the first a characters of SUB joined to the last b characters of SUB. There are only |SUB|^2 possible values. In fact, since all the values for p and s are copied from the first 2 strings, at most 4 values of this function will ever be used. You can calculate them in advance.
F[N].o is the answer you are looking for.
A straightforward implementation of this takes O(N + |SUB|^2), assuming constant time mathematical operations. Since |SUB| <= 50, this is quite efficient.
If the constraint on N was much larger, there's an optimization using matrix exponentiation that could bring the complexity down to O(log N + |SUB|^2), but that's not necessary under the given constraints.
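The o/p/s bookkeeping above could be sketched in C++ roughly as follows. All names (Summary, summarize, join, countInFibonacciString) are made up for illustration. One loud assumption: the count is kept in a 64-bit unsigned integer, which wraps around for large N because the true count grows like the Fibonacci numbers; a real solution would need big integers or whatever modulus the judge specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Brute-force count of (possibly overlapping) occurrences of pat in text.
std::uint64_t countOccurrences(const std::string& text, const std::string& pat) {
    std::uint64_t count = 0;
    for (std::size_t pos = text.find(pat); pos != std::string::npos;
         pos = text.find(pat, pos + 1))
        ++count;
    return count;
}

struct Summary {
    std::uint64_t o;  // occurrences of SUB in the string
    std::size_t p;    // longest prefix of the string that is a proper suffix of SUB
    std::size_t s;    // longest suffix of the string that is a proper prefix of SUB
};

Summary summarize(const std::string& str, const std::string& sub) {
    Summary r{countOccurrences(str, sub), 0, 0};
    for (std::size_t len = 1; len < sub.size() && len <= str.size(); ++len) {
        if (str.compare(0, len, sub, sub.size() - len, len) == 0) r.p = len;
        if (str.compare(str.size() - len, len, sub, 0, len) == 0) r.s = len;
    }
    return r;
}

// Occurrences of SUB in (first a chars of SUB) + (last b chars of SUB).
std::uint64_t join(std::size_t a, std::size_t b, const std::string& sub) {
    return countOccurrences(sub.substr(0, a) + sub.substr(sub.size() - b), sub);
}

std::uint64_t countInFibonacciString(int n, const std::string& sub) {
    assert(n >= 0 && !sub.empty());
    std::string prev = "A", cur = "B";  // F[0], F[1]
    if (n == 0) return countOccurrences(prev, sub);
    int k = 1;  // invariant: prev == F[k-1], cur == F[k]
    while (k < n && prev.size() < sub.size()) {
        std::string next = prev + cur;
        prev = cur;
        cur = next;
        ++k;
    }
    if (k == n) return countOccurrences(cur, sub);
    // Both strings are now at least as long as SUB: switch representation.
    Summary a = summarize(prev, sub), b = summarize(cur, sub);
    for (; k < n; ++k) {
        Summary c{a.o + b.o + join(a.s, b.p, sub), a.p, b.s};
        a = b;
        b = c;
    }
    return b.o;  // wraps modulo 2^64 for large n (assumption, see above)
}
```

For simplicity this sketch recomputes JOIN on every step instead of precomputing the few values that are actually used; with |SUB| <= 50 that cost is small.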

how to find the time at which modified string will be equal to original string

How can I solve a problem where a string is first circularly rotated by 1 letter, then by 2 letters, and so on, and I need to find the time at which the modified string equals the original string? The strings are made up of 'a' and 'b' only.
For example: for the string aabaab, the first rotation gives abaaba and the second rotation gives aabaab, so the answer is 2.
I tried to solve this question but could only do it by brute force.
https://pasteboard.co/HwWR6WZ.png
Any help will be appreciated.
Let s be the original string, what you want is the smallest index i > 0 such that s is a substring at index i in the string ss. You can construct the suffix tree of ss then search s in this tree. This algorithm runs in O(n) time.
For example, consider s = abab, the suffix tree of ss, i.e. abababab looks like ($ represents the end of a string)
                root
             ab/    \b
              /      \
          ab/\$     $/\ab
          /    \    /   \
         *      6  7   $/\ab
     ab/\$             /   \
     /    \           5   $/\ab$
ab$/\$     4              /   \
  /    \                 3     1
 0      2
After searching abab we reach the * node, and there are three leaves representing indices 0,2,4 in its subtree. The answer is the smallest positive index among them, i.e. 2.
The suffix tree can be constructed using suffix array and LCP array in O(n) time.
Since the string contains only 'a' and 'b', you can represent the characters as bits (1s and 0s). Circular shifting followed by a bitwise operation like XOR might then give you better performance.
I don't know what you mean by brute force but I'd do it this way:
Concatenate the original with itself to simulate cyclic boundary conditions.
Search for the original within its double, starting from the 2nd character.
It should run in linear time, O(2*N) at most: I checked clang's implementation of substring find, and it goes through the searched string (i.e. the cycled one) at most once over its size.
It is guaranteed to find a match at index s.size() at the latest, because the string is doubled.
#include <iostream>
#include <string>
using namespace std;
// Returns the smallest index > 0 at which s reappears inside s + s.
auto steps(const string& s) {
    string cycled = s + s;
    return cycled.find(s, 1);
}
int main() {
    string s{"aabaab"};
    cout << steps(s) << '\n';
    return 0;
}
I guess you could optimize for memory and some time by avoiding the copy and working only on the original string, but you'd need to provide a custom iterator that cycles over the string. Then it should be fairly easy to rewrite the algorithm from std::basic_string.
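As a rough illustration of the copy-free idea, the cycling can also be done with modular indexing instead of a custom iterator. This is only a sketch under that substitution: firstMatchingRotation is a made-up name, and the comparison is the naive one, so the worst case is quadratic rather than linear.

```cpp
#include <cstddef>
#include <string>

// Smallest shift in [1, n] such that rotating s left by that many
// characters gives s again. Reads s cyclically via (i + shift) % n,
// so no doubled copy of the string is needed.
std::size_t firstMatchingRotation(const std::string& s) {
    const std::size_t n = s.size();
    for (std::size_t shift = 1; shift < n; ++shift) {
        std::size_t i = 0;
        while (i < n && s[i] == s[(i + shift) % n]) ++i;
        if (i == n) return shift;  // every position matched
    }
    return n;  // a full rotation always restores the string
}
```

For a non-empty string this returns the same index as searching s inside s + s starting from position 1.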

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily an efficient one) that, given a string, finds and prints two identical subsequences (by print I mean color them, for example). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing what I am looking for is called "tight twins", if it helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
#include <windows.h> // HANDLE, GetStdHandle, SetConsoleTextAttribute
using namespace std;
HANDLE hConsole;
hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
    var n = s.length,
        char_idxs = {};
    for (var i=0; i<n; i++){
        if (char_idxs[s[i]] == undefined){
            char_idxs[s[i]] = [i];
        } else {
            char_idxs[s[i]].push(i);
        }
    }
    var duplicates = new Set();
    for (var i in char_idxs){
        // character with odd count
        if (char_idxs[i].length & 1){
            return false;
        }
        if (char_idxs[i].length > 2){
            for (let j of char_idxs[i]){
                duplicates.add(j);
            }
        }
    }
    function f(i,j,js){
        // base case positive
        if (js.size == n/2 && j == n){
            return true;
        }
        // base case negative
        if (j > n || (n - j < n/2 - js.size)){
            return false;
        }
        // i is not less than j
        if (i >= j) {
            return f(i,j + 1,js);
        }
        // this i is in the list of js
        if (js.has(i)){
            return f(i + 1,j,js);
        // yet to find twin, no match
        } else if (s[i] != s[j]){
            return f(i,j + 1,js);
        } else {
            // maybe it's a twin and maybe it's a duplicate
            if (duplicates.has(j)) {
                var _js = new Set(js);
                _js.add(j);
                // logical OR so the result stays a boolean
                return f(i,j + 1,js) || f(i + 1,j + 1,_js);
            // it's a twin
            } else {
                js.add(j);
                return f(i + 1,j + 1,js);
            }
        }
    }
    return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all its characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Look for tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and whether each character appears in it an even number of times. Then, with L being its length, check whether it consists of two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is check all of them. There are more interesting ways to look for tight twins, though.
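The "check all 2**L ways to divide it" step of Approach 2 might look like the following C++ sketch. The name isTightTwins and the cap of 20 characters are assumptions made here to keep the exponential loop bounded; the function tests a single candidate substring and would still have to be run on every even-length substring of the input.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Brute force: try every way to assign each character to twin A or twin B
// and accept if the two resulting subsequences are identical.
// Exponential in the length, so only meant for short candidates.
bool isTightTwins(const std::string& s) {
    const std::size_t n = s.size();
    assert(n <= 20);  // assumed cap: 2^n assignments are enumerated
    if (n == 0 || n % 2 != 0) return false;
    for (std::uint32_t mask = 0; mask < (1u << n); ++mask) {
        std::string a, b;
        for (std::size_t i = 0; i < n; ++i) {
            // bit i set: character i goes to A, otherwise to B
            if ((mask >> i) & 1u) a += s[i];
            else b += s[i];
        }
        if (a == b) return true;  // equal strings also implies equal lengths
    }
    return false;
}
```

For example, "1122" is accepted (positions 0 and 2 against positions 1 and 3), while "1221" is rejected. The even-count test from Approach 2 could be added in front as a cheap filter.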
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think a NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here, but I notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1million) of strings 100 characters long that I'm trying to find the overlaps between. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
//part that compares them all to each other
for (int j=0; j<counter; j++) //counter holds # of DNA
{
    for (int k=j+1; k<counter; k++)
    {
        int test = determineBestOverlap(DNArray[j],DNArray[k]);
        //boring stuff
    }
}
//part that compares strings. Definitely very inefficient,
//although I think the sheer number of comparisons is the main problem
int determineBestOverlap(const string& str1, const string& str2)
{
    int maxCounter = 0, bestOffset = 0;
    //basically just tries overlapping the strings every possible way
    for (size_t j=0; j<str2.length(); j++)
    {
        int counter = 0;
        size_t offset = 0;
        //stop at the end of either string so we never read out of bounds
        while (offset < str1.length() && j+offset < str2.length()
               && str1[offset] == str2[j+offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = static_cast<int>(j);
        }
    }
    return maxCounter;
} //this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need one half terabyte of memory even if you just store one byte per string pair.
However, i assume what you really are interested in is long strings, those that have more than, say, 20 or 30 or even more than 80 characters in common, and you probably don't really want to know if two string pairs have 3 characters in common while 50 others are X and the remaining 47 don't match.
What i'd try if i were you - still without knowing if that fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application that i don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't really have to, store the indexes of every original string together with the substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random or you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways to improve the fact that you need to compare each string with each other including shifting them, and that is by itself super long, a computation cluster seems the best approach.
The only thing I see how to improve is the string comparison by itself: replace A,C,T,G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item on 4 bits, i.e. two per byte (this might not be a good idea though, but still a possible option to investigate), and then compare them quickly with a AND operation, so that you 'just' have to count how many consecutive non zero values you have. That's just a way to process the wildcard, sorry I don't have a better idea to reduce the complexity of the overall comparison.
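A small C++ sketch of the bit-pattern comparison, using the codes listed above. The names encodeBase, encode and compatibleRun are hypothetical, and for simplicity each code is stored in its own byte rather than packed two per byte. Note that X acts as a wildcard here (its AND with any base is non-zero), which differs from the question's code, where an X ends the run.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bit codes from the answer: one bit per base, all four bits for X.
std::uint8_t encodeBase(char c) {
    switch (c) {
        case 'A': return 0x01;
        case 'C': return 0x02;
        case 'T': return 0x04;
        case 'G': return 0x08;
        case 'X': return 0x0F;
        default:  return 0x00;  // unknown symbol: compatible with nothing
    }
}

std::vector<std::uint8_t> encode(const std::string& dna) {
    std::vector<std::uint8_t> out;
    out.reserve(dna.size());
    for (char c : dna) out.push_back(encodeBase(c));
    return out;
}

// Number of consecutive compatible positions when a[0] is aligned with
// b[offset]. Two symbols are compatible when their AND is non-zero.
std::size_t compatibleRun(const std::vector<std::uint8_t>& a,
                          const std::vector<std::uint8_t>& b,
                          std::size_t offset) {
    std::size_t run = 0;
    while (run < a.size() && offset + run < b.size() &&
           (a[run] & b[offset + run]) != 0)
        ++run;
    return run;
}
```

Encoding each fragment once up front keeps the per-comparison work down to one AND and one test per position.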

Given a 2D matrix of characters we have to check whether the given word exist in it or not

Given a 2D matrix of characters, we have to check whether a given word exists in it or not.
e.g.
s f t
d a h
r y o
we can find "rat" in it
(top down, straight, diagonal or any path), even in reverse order, with the least complexity.
my approach is
While traversing the 2d matrix ( a[][] ) row wise.
If ( a[i][j] == first character of given word ) {
search for rest of the letters in 4 directions i.e. right, right diagonally down, down and left diagonally down.
} else if( a[i][j] == last character of the given word ) {
search for remaining characters in reverse order in 4 directions i.e. left, right diagonally up, up, left diagonally up.
}
is there any better approach?
Let me describe a very cool data structure for this problem.
Go ahead and look up Tries.
It takes O(k) time to insert a k-length word into the Trie, and O(k) to look-up the presence of a k-length word.
Video tutorial
If you have problems understanding the data structure, or implementing it, I'll be happy to help you there.
I think I would do this in two phases:
1) Iterate over the array, looking for instances of the first letter in the word.
2) Whenever you find an instance of the first letter, call a function that examines all adjacent cells (e.g. up to 8 of them) to see if any of them are the second letter of the word. For any second-letter matches that are found, this function would call itself recursively and look for third-letter matches in cells adjacent to that (and so on). If the recursion ever gets all the way to the final letter of the word and finds a match for it, then the word exists in the array. (Note that if you're not allowed to use a letter twice you'll need to flag cells as 'already used' in order to prevent the algorithm from re-using them. Probably the easiest way to do that would be to pass-by-value a vector of already-used-cell-coordinates in to the recursive function, and have the recursive function ignore the contents of any cells that are in that list.)
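The two phases could be sketched in C++ like this. The names Grid, Used, searchFrom and containsWord are illustrative. One deliberate difference from the suggestion above: instead of passing a list of used coordinates by value, this sketch shares one grid of 'used' flags and clears each flag when backtracking, which has the same effect with less copying. It assumes a cell may not be used twice.

```cpp
#include <string>
#include <vector>

using Grid = std::vector<std::string>;
using Used = std::vector<std::vector<bool>>;

// Tries to match word[k..] starting at (row, col), moving to any of the
// up to 8 neighbouring cells for each following letter.
bool searchFrom(const Grid& grid, const std::string& word, std::size_t k,
                int row, int col, Used& used) {
    if (row < 0 || col < 0 || row >= static_cast<int>(grid.size()) ||
        col >= static_cast<int>(grid[row].size()))
        return false;
    if (used[row][col] || grid[row][col] != word[k]) return false;
    if (k + 1 == word.size()) return true;  // final letter matched
    used[row][col] = true;
    bool found = false;
    for (int dr = -1; dr <= 1 && !found; ++dr)
        for (int dc = -1; dc <= 1 && !found; ++dc)
            if (dr != 0 || dc != 0)
                found = searchFrom(grid, word, k + 1, row + dr, col + dc, used);
    used[row][col] = false;  // backtrack so other paths may use this cell
    return found;
}

// Phase 1: scan for the first letter; phase 2: recurse from each hit.
bool containsWord(const Grid& grid, const std::string& word) {
    if (word.empty()) return false;
    Used used;
    for (const std::string& line : grid) used.emplace_back(line.size(), false);
    for (int r = 0; r < static_cast<int>(grid.size()); ++r)
        for (int c = 0; c < static_cast<int>(grid[r].size()); ++c)
            if (searchFrom(grid, word, 0, r, c, used)) return true;
    return false;
}
```

With the 3x3 example from the question, containsWord finds "rat" and "tar" but not "dad", because the single 'd' cannot be reused.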
In fact you have 16 sequences here:
sft
dah
ryo
sdr
fay
tho
sao
rat
tfs
had
oyr
rds
yaf
oht
oas
tar
(3 horizontal + 3 vertical + 2 diagonals) * 2 (reversed) = 16. Let n be a size of a matrix. In your example n = 3. Number of sequences = (n + n + 2) * 2 = 4n + 4.
Now you need to determine whether a sequence is a word or not. Create a hash set (unordered_set in C++, HashSet in Java) with words from dictionary (found on the internet). You can check one sequence in O(1).
Look for the first letter of your word using a simple loop, and when you find it use the following recursive function.
The function takes 5 parameters: the word you are looking for, str; the current position k of the letter in str that you are looking for in your array; i and j, the position in your array at which to look for that letter; and the direction d.
The stop conditions will be:
-if k >= strlen(str); return 1; (every letter has been matched)
-if (i, j) is outside the array or arr[i][j] != str[k]; return 0;
If neither of the statements above is true, you increment your letter counter k++, update your i and j according to your value of d, and call your function again via return func(str, k, i, j, d);
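A possible C++ rendering of this fixed-direction recursion is shown below. The names matchInDirection and containsWordInLine are made up, and the single direction parameter d is represented here as a pair of steps (di, dj), which is an assumption about how d would be encoded. A bounds check is added as an extra stop condition.

```cpp
#include <string>
#include <vector>

using Grid = std::vector<std::string>;

// Matches word[k..] along one straight line starting at (i, j) and
// advancing by (di, dj) after every letter.
bool matchInDirection(const Grid& grid, const std::string& word, std::size_t k,
                      int i, int j, int di, int dj) {
    if (k == word.size()) return true;  // every letter has been matched
    if (i < 0 || j < 0 || i >= static_cast<int>(grid.size()) ||
        j >= static_cast<int>(grid[i].size()))
        return false;                   // walked off the array
    if (grid[i][j] != word[k]) return false;
    return matchInDirection(grid, word, k + 1, i + di, j + dj, di, dj);
}

// Starts the recursion from every cell in each of the 8 directions.
bool containsWordInLine(const Grid& grid, const std::string& word) {
    for (int i = 0; i < static_cast<int>(grid.size()); ++i)
        for (int j = 0; j < static_cast<int>(grid[i].size()); ++j)
            for (int di = -1; di <= 1; ++di)
                for (int dj = -1; dj <= 1; ++dj)
                    if ((di != 0 || dj != 0) &&
                        matchInDirection(grid, word, 0, i, j, di, dj))
                        return true;
    return false;
}
```

Because all 8 directions are tried from every cell, reversed words such as "tar" are found without a separate reverse pass.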