Count how many substrings exist in a Fibonacci string - c++

The problem is this:
You are given an integer N and a substring SUB
The Fibonacci String follows the following rules:
F[0] = 'A'
F[1] = 'B'
F[k] = F[k - 2] + F[k - 1]
(Meaning F[2] = 'AB', F[3] = 'BAB', F[4] = 'ABBAB', ...)
Task: Count how many times substring SUB appears in F[n]
Sample cases:
Input: 4 AB  -> Output: 2
Input: 6 BAB -> Output: 4
(N <= 5 * 10^3, 1 <= SUB.length() <= 50)
I have an overall understanding of the problem and want to find a more optimal way to solve it.
My approach follows the formula F[k] = F[k - 2] + F[k - 1]: I build F[k], then loop until i reaches F[k].length() - 1; at each i I extract the substring of F[k] starting at i with the same length as SUB (call it F_sub) and check whether F_sub equals SUB; if it does, I increase the count. (Yes, this approach is not efficient enough for the big tests.)
I am also wondering whether dynamic programming is suited for this problem or not.

Starting with the first 2 strings that are at least as long as SUB, you should switch the representation of the strings F[n]. Instead of remembering the complete string, you only need to remember 3 numbers:
occurrences: the number of times SUB occurs within the string
prefix: The length of the longest prefix of the string that is a proper suffix of SUB
suffix: The length of the longest suffix of the string that is a proper prefix of SUB
Given o, p, and s for F[k] and F[k+1], you can calculate them for the concatenation F[k+2]:
F[k+2].p = F[k].p
F[k+2].s = F[k+1].s
F[k+2].o = F[k].o + F[k+1].o + JOIN(F[k].s,F[k+1].p)
The function JOIN(a,b) calculates the number of occurrences of SUB within the first a characters of SUB joined to the last b characters of SUB. There are only |SUB|^2 values. In fact, since all the values for p and s are copied from the first 2 strings, there are only 4 values of this function that will be used. You can calculate them in advance.
F[N].o is the answer you are looking for.
A straightforward implementation of this takes O(N + |SUB|^2), assuming constant time mathematical operations. Since |SUB| <= 50, this is quite efficient.
If the constraint on N was much larger, there's an optimization using matrix exponentiation that could bring the complexity down to O(log N + |SUB|^2), but that's not necessary under the given constraints.
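For reference, here is a minimal C++ sketch of this idea (the helper names are just for illustration). Instead of the overlap lengths it stores the literal first and last |SUB|-1 characters of each string; the recurrences are the ones above, and since both stored pieces are shorter than SUB, every occurrence found inside suffix + prefix necessarily spans the join. For large N the counts overflow 64 bits, so a real solution would reduce modulo whatever the original problem specifies.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Count occurrences of sub in text (the strings handled here stay short).
long long countOcc(const string& text, const string& sub) {
    if (text.size() < sub.size()) return 0;
    long long c = 0;
    for (size_t i = 0; i + sub.size() <= text.size(); ++i)
        if (text.compare(i, sub.size(), sub) == 0) ++c;
    return c;
}

int main() {
    int n = 6;
    string sub = "BAB";                     // sample case: expected output 4
    size_t L = sub.size();

    // Build explicit Fibonacci strings until the last two are both >= |SUB| long
    // (or until F[n] itself has been built, for small n).
    vector<string> f = {"A", "B"};
    while ((int)f.size() <= n && f[f.size() - 2].size() < L)
        f.push_back(f[f.size() - 2] + f[f.size() - 1]);

    if (n < (int)f.size()) {                // F[n] exists explicitly: just count
        cout << countOcc(f[n], sub) << "\n";
        return 0;
    }

    // Compressed representation: occurrence count plus the literal first and
    // last |SUB|-1 characters of the string.
    struct Rep { long long occ; string pre, suf; };
    auto makeRep = [&](const string& s) {
        return Rep{ countOcc(s, sub), s.substr(0, L - 1), s.substr(s.size() - (L - 1)) };
    };
    int k = (int)f.size() - 2;
    Rep a = makeRep(f[k]), b = makeRep(f[k + 1]);   // a = F[k], b = F[k+1]

    for (int i = k + 2; i <= n; ++i) {      // note: overflows 64 bits for large N
        Rep c;
        c.occ = a.occ + b.occ + countOcc(a.suf + b.pre, sub);  // boundary-spanning hits
        c.pre = a.pre;                      // first |SUB|-1 chars come from F[i-2]
        c.suf = b.suf;                      // last  |SUB|-1 chars come from F[i-1]
        a = b;
        b = c;
    }
    cout << b.occ << "\n";                  // F[n].occ; prints 4 for the sample above
    return 0;
}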

Related

Z-Function and unique substrings: broken algorithm parroted everywhere?

I am not a huge math nerd so I may easily be missing something, but let's take the algorithm from https://cp-algorithms.com/string/z-function.html and try to apply it to, say, string baz. This string definitely has a substring set of 'b','a','z', 'ba', 'az', 'baz'.
Let's see how the z-function works (at least how I understand it):
we take an empty string and add 'b' to it. By definition of the algo z[0] = 0 since it's undefined for size 1;
we take 'b' and add 'a' to it, reverse the string, and we have 'ab'... now we calculate the z-function... and it produces {0, 0}. The first element is "undefined" as it is supposed to be; the second element should be defined as:
i-th element is equal to the greatest number of characters starting from the position i that coincide with the first characters of s.
so, at i = 1 we have 'b', our string starts with 'a', and 'b' doesn't coincide with 'a', so of course z[i=1] = 0. And this will be repeated for the whole word. In the end we are left with a z-array of all zeroes that doesn't tell us anything, despite the string having 6 substrings.
Am I missing something? There are tons of websites recommending the z-function for counting distinct substrings, but it... doesn't work? Am I misunderstanding the meaning of distinct here?
See test case: https://pastebin.com/mFDrSvtm
When you add a character x to the beginning of a string S, all the substrings of S are still substrings of xS, but how many new substrings do you get?
The new substrings are all prefixes of xS. There are length(xS) of these, but max(Z(xS)) of them are already substrings of S, so you get length(xS) - max(Z(xS)) new ones.
So, given a string S, just add up all the length(P) - max(Z(P)) for every suffix P of S.
Your test case baz has 3 suffixes: z, az, and baz. All the letters are distinct, so their Z functions are zero everywhere. The result is that the number of distinct substrings is just the sum of the suffix lengths: 3 + 2 + 1 = 6.
Try baa: The only non-zero in the Z functions is Z('aa')[1] = 1, so the number of unique substrings is 3 + 2 - 1 + 1 = 5.
Note that the article you linked to mentions that this is an O(n^2) algorithm. That is correct, although its overhead is low. It's possible to do this in O(n) time by building a suffix tree, but that is quite complicated.
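A short C++ sketch of this counting rule (compute the z-function of every suffix P and sum length(P) - max(Z(P)); O(n^2) overall, as noted above):

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

// Standard z-function: z[i] = length of the longest common prefix of s and s.substr(i).
vector<int> zFunction(const string& s) {
    int n = s.size();
    vector<int> z(n, 0);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i < r) z[i] = min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) { l = i; r = i + z[i]; }
    }
    return z;
}

// Sum of length(P) - max(Z(P)) over every suffix P of s.
long long countDistinctSubstrings(const string& s) {
    long long total = 0;
    for (size_t start = 0; start < s.size(); ++start) {
        string p = s.substr(start);
        vector<int> z = zFunction(p);
        int mx = z.empty() ? 0 : *max_element(z.begin(), z.end());
        total += (long long)p.size() - mx;
    }
    return total;
}

int main() {
    cout << countDistinctSubstrings("baz") << "\n";   // 6
    cout << countDistinctSubstrings("baa") << "\n";   // 5
}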

Find the character at the `k`th location in the infinite string

I am trying to solve a problem:
Given two strings, s and t, we can form a string x of infinite length, as:
a. Append s to x 1 time;
b. Append t to x 2 times;
c. Append s to x 3 times;
d. Append t to x 4 times;
and so on...
Given k, find the kth character (1 indexed) in the resultant infinite string x.
For example, if s = a, t = bc and k = 4, then the output is b (x = abcbc...). s and t can contain anywhere from 1 to 100 characters, while 1 <= k <= 10^16.
The brute force way of actually constructing string x is trivial but too slow. How do I optimize it further?
In C++, the brute force solution would look like this:
#include <iostream>
#include <string>
using namespace std;

int main() {
    long long repeat = 1, k = 4;            // k can be up to 10^16, so use a 64-bit type
    string s = "a", t = "bc", x;
    bool appendS = true;
    while ((long long)x.size() < k) {
        for (long long i = 1; i <= repeat; i++)
            if (appendS) x += s;
            else x += t;
        appendS = !appendS;
        repeat++;
    }
    cout << x[k - 1];
    return 0;
}
But how do I optimize it, given huge k?
The string looks like
sttsssttttsssssttttttssssssstttttttt...
Group the string into substrings like
(stts)(ssttttss)(sssttttttsss)(ssssttttttttssss)(sssss...
Let
len(s) = a
len(t) = b
len(s+t) = c
Group 1: stts -> length = 2*c.
Group 2: ssttttss -> length = 4*c.
Group 3: sssttttttsss -> length = 6*c.
Continuing the pattern, it is easy to see that the length of the ith group will be 2*i*c.
Let the kth character be in group n.
Total length of first n groups =
2*c + 4*c + 6*c .... + 2*n*c = (2*c)*(1+2+3...+n) = c*n*(n+1)
Since total length of n groups has to be greater than or equal to k,
c*n*(n+1) >= k
n*(n+1) >= k/c
Finding the smallest value of n that satisfies this inequality is a trivial task. Now, the nth group looks something like
ss...(n times) + tttt...(2*n times) + ss...(n times)
Now, you just need to find the position of k - c*(n-1)*n (the offset of k past the first n-1 groups) in this block, which is a simple task.
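Here is a minimal C++ sketch of this grouping approach (kthChar is just an illustrative helper name; the linear search for n could also be replaced by solving c*n*(n+1) >= k directly):

#include <iostream>
#include <string>
using namespace std;

// Group i is s^i t^(2i) s^i and has length 2*i*c, as described above.
char kthChar(const string& s, const string& t, long long k) {
    long long a = s.size(), b = t.size(), c = a + b;
    long long n = 1;
    while (c * n * (n + 1) < k) ++n;          // smallest n with c*n*(n+1) >= k
    long long off = k - c * (n - 1) * n;      // 1-indexed offset inside group n
    if (off <= n * a)                         // first block: s repeated n times
        return s[(off - 1) % a];
    off -= n * a;
    if (off <= 2 * n * b)                     // middle block: t repeated 2n times
        return t[(off - 1) % b];
    off -= 2 * n * b;                         // last block: s repeated n times
    return s[(off - 1) % a];
}

int main() {
    cout << kthChar("a", "bc", 4) << "\n";    // expected: b
}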
The search position is located at the nth append, where n is the first integer for which the running total of appended lengths reaches k. With l1 and l2 the two string lengths, the total length after n appends is
S(n) = l1 * ceil(n/2)^2 + l2 * floor(n/2) * (floor(n/2) + 1),
since the odd-numbered appends contribute l1 times the sum of the first ceil(n/2) odd integers and the even-numbered appends contribute l2 times the sum of the first floor(n/2) even integers.
Hence, the search position is at the nth append, in string number mod(n+1,2)+1 (that is, s for odd n and t for even n), at character Delta k = k - S(n-1) of that append (wrapping around modulo the string's length, since the append consists of n copies of it).
Explanation:
The sum S(n) is just the addition of all appended lengths; since you know both string lengths, you know the whole sum in closed form. And since it is a simple integer algebraic expression, a little linear search over n (or solving the quadratic directly) finds the first n with S(n) >= k, as in any simple integer equation.
Having the nth append number and the sum S(n-1), the position is trivially obtained as Delta k.
Note that S(n) is a closed expression: you do not need any loop to calculate it, just to evaluate it.
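A compact C++ sketch of this closed-form variant (totalLen is an illustrative name for S(n); again, the linear search over n could be replaced by solving the quadratic):

#include <iostream>
#include <string>
using namespace std;

// S(n): total length of the first n appends, in closed form as above.
long long totalLen(long long n, long long l1, long long l2) {
    long long odd = (n + 1) / 2, even = n / 2;     // ceil(n/2) appends of s, floor(n/2) of t
    return l1 * odd * odd + l2 * even * (even + 1);
}

int main() {
    string s = "a", t = "bc";
    long long k = 4, l1 = s.size(), l2 = t.size();
    long long n = 1;
    while (totalLen(n, l1, l2) < k) ++n;           // first n with S(n) >= k
    long long dk = k - totalLen(n - 1, l1, l2);    // Delta k: 1-indexed offset in append n
    const string& w = (n % 2 == 1) ? s : t;        // odd appends use s, even appends use t
    cout << w[(dk - 1) % (long long)w.size()] << "\n";   // expected: b
}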

how to find the minimum number of primatics that sum to a given number

Given a number N (<=10000), find the minimum number of primatic numbers which sum up to N.
A primatic number refers to a number which is either a prime number or can be expressed as power of prime number to itself i.e. prime^prime e.g. 4, 27, etc.
I tried to find all the primatic numbers using a sieve and then stored them in a vector (code below), but now I can't see how to find the minimum number of primatic numbers that sum to a given number.
Here's my sieve:
#include <algorithm>
#include <vector>
#define MAX 10000
typedef long long int ll;

// fast modular exponentiation (not used by the sieve below)
ll modpow(ll a, ll n, ll temp) {
    ll res = 1, y = a;
    while (n > 0) {
        if (n & 1)
            res = (res * y) % temp;
        y = (y * y) % temp;
        n /= 2;
    }
    return res % temp;
}

int isprimeat[MAX + 20];
std::vector<int> primeat;

// Finding all prime numbers till 10000
void seive()
{
    ll i, j;
    isprimeat[0] = 1;
    isprimeat[1] = 1;
    for (i = 2; i <= MAX; i++) {
        if (isprimeat[i] == 0) {
            for (j = i * i; j <= MAX; j += i) {
                isprimeat[j] = 1;
            }
        }
    }
    for (i = 2; i <= MAX; i++) {
        if (isprimeat[i] == 0) {
            primeat.push_back(i);
        }
    }
    // 4 = 2^2, 27 = 3^3 and 3125 = 5^5 are the only non-prime primatics <= 10000
    isprimeat[4] = isprimeat[27] = isprimeat[3125] = 0;
    primeat.push_back(4);
    primeat.push_back(27);
    primeat.push_back(3125);
}

int main()
{
    seive();
    std::sort(primeat.begin(), primeat.end());
    return 0;
}
One method could be to store all primatics less than or equal to N in a sorted list - call this list L - and recursively search for the shortest sequence. The easiest approach is "greedy": pick the largest spans / numbers as early as possible.
for N = 14 you'd have L = {2,3,4,5,7,11,13} (8 and 9 are not primatic, since they are neither prime nor of the form prime^prime), so you'd want to make an algorithm / process that tries these sequences:
13 is too small
13 + 13 -> 13 + 2 will be too large
11 is too small
11 + 11 -> 11 + 4 will be too large
11 + 3 is a match.
You can continue the process by making the search function recurse each time it needs another primatic in the sum, aiming for the minimum number of recursions. To do so, pick primatics from largest to smallest for each position in the sum (the 1st, 2nd, etc.), and include another number in the sum only if the primatics chosen so far are small enough that an additional primatic won't go over N.
I'd have to make a working example to find a small enough N that doesn't result in just 2 numbers in the sum. Note that any natural number can be expressed as the sum of at most 4 squares of natural numbers, and L is a denser set than the set of squares, so I'd think it rare you'd have a result of 3 or more for any N you'd want to compute by hand.
Dynamic Programming approach
I have to clarify that 'greedy' is not the same as 'dynamic programming': greedy can give sub-optimal results. This problem does have a DP solution, though. Again, I won't write the final process in code, but I'll explain it as a point of reference for making a working DP solution.
To do this we need to build up solutions from the bottom up. What you need is a structure that can store known solutions for all numbers up to some N, this list can be incrementally added to for larger N in an optimal way.
Consider that for any N, if it's primatic then the number of terms for N is just 1. This applies for N = 2, 3, 4, 5, 7, 11, 13, 17, 19, and so on. The number of terms for all other N must be at least two, which means either it's a sum of two primatics or a sum of a primatic and some other N.
The first few examples that aren't trivial:
6 - can be either 2+4 or 3+3, all the terms here are themselves primatic so the minimum number of terms for 6 is 2.
10 - can be either 2+8, 3+7, 4+6 or 5+5. However 8 and 6 are not primatic; taking those solutions out still leaves 3+7 and 5+5, so the minimum is 2 terms.
12 - can be either 2+10, 3+9, 4+8, 5+7 or 6+6. Of these only 5+7 consists entirely of primatics (10, 9, 8 and 6 are not primatic), so again 2 terms is the minimum.
14 - ditto; two-primatic solutions exist: 3+11 and 7+7.
The structure for storing all of these solutions needs to be able to iterate across solutions of equal rank / number of terms. You already have a list of primatics, this is also the list of solutions that need only one term.
Sol[term_length] = list(numbers). You will also need a function / cache to look up some N's shortest term length, e.g. S(N) = term_length iff N is in Sol[term_length].
Sol[1] = {2,3,4,5 ...} and Sol[2] = {6,10,12,14 ...} and so on for Sol[3] and onwards.
Any N that is itself primatic is covered by a single term from Sol[1]. Any solution requiring two primatics will be found in Sol[2]. Any solution requiring 3 will be in Sol[3], etc.
What you need to recognize here is that a number S(N) = 3 can be expressed Sol[1][a] + Sol[1][b] + Sol[1][c] for some a,b,c primatics, but it can also be expressed as Sol[1][a] + Sol[2][d], since all Sol[2] must be expressible as Sol[1][x] + Sol[1][y].
This algorithm will in effect search Sol[1] for a given N, then look in Sol[1] + Sol[K] with increasing K, but to do this you will need S and Sol structures roughly in the form shown here (or able to be accessed / queried in a similar manner).
Working Example
Using the above as a guideline I've put this together quickly, it even shows which multi-term sum it uses.
https://ideone.com/7mYXde
I can explain the code in-depth if you want but the real DP section is around lines 40-64. The recursion depth (also number of additional terms in the sum) is k, a simple dual-iterator while loop checks if a sum is possible using the kth known solutions and primatics, if it is then we're done and if not then check k+1 solutions, if any. Sol and S work as described.
The only confusing part might be the use of reverse iterators, it's just to make != end() checking consistent for the while condition (end is not a valid iterator position but begin is, so != begin would be written differently).
Edit - FYI, the first number that takes at least 3 terms is 959 - had to run my algorithm to 1000 numbers to find it. It's summed from 6 + 953 (primatic), no matter how you split 6 it's still 3 terms.
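For reference, here is a compact bottom-up sketch of the same idea in C++. It uses a flat, coin-change-style dp array rather than the Sol / S structures described above, and builds the primatic list with a sieve plus the prime^prime values:

#include <cstdio>
#include <vector>
using namespace std;

int main() {
    const int N = 959;                               // value discussed in the edit above
    // Collect primatics <= N: primes plus the prime^prime values 4, 27, 3125.
    vector<int> primatic;
    vector<bool> composite(N + 1, false);
    for (int i = 2; i <= N; ++i) {
        if (!composite[i]) {
            primatic.push_back(i);
            for (long long j = 1LL * i * i; j <= N; j += i) composite[j] = true;
        }
    }
    for (int pp : {4, 27, 3125}) if (pp <= N) primatic.push_back(pp);

    // dp[v] = minimum number of primatic terms summing to v (coin-change style).
    const int INF = 1000000000;
    vector<int> dp(N + 1, INF);
    dp[0] = 0;
    for (int v = 1; v <= N; ++v)
        for (int p : primatic)
            if (p <= v && dp[v - p] + 1 < dp[v]) dp[v] = dp[v - p] + 1;

    printf("14  -> %d terms\n", dp[14]);             // 2 (e.g. 3 + 11)
    printf("959 -> %d terms\n", dp[959]);            // 3 (e.g. 953 + 3 + 3)
    return 0;
}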

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily efficient) that, given a string, finds and prints two identical subsequences (by print I mean color them, for example). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing what I am looking for is called "tight twins", if it helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
#include <windows.h>
using namespace std;
// inside main():
HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
    var n = s.length,
        char_idxs = {};
    for (var i=0; i<n; i++){
        if (char_idxs[s[i]] == undefined){
            char_idxs[s[i]] = [i];
        } else {
            char_idxs[s[i]].push(i);
        }
    }
    var duplicates = new Set();
    for (var i in char_idxs){
        // character with odd count
        if (char_idxs[i].length & 1){
            return false;
        }
        if (char_idxs[i].length > 2){
            for (let j of char_idxs[i]){
                duplicates.add(j);
            }
        }
    }
    function f(i,j,js){
        // base case positive
        if (js.size == n/2 && j == n){
            return true;
        }
        // base case negative
        if (j > n || (n - j < n/2 - js.size)){
            return false;
        }
        // i is not less than j
        if (i >= j) {
            return f(i,j + 1,js);
        }
        // this i is in the list of js
        if (js.has(i)){
            return f(i + 1,j,js);
        // yet to find twin, no match
        } else if (s[i] != s[j]){
            return f(i,j + 1,js);
        } else {
            // maybe it's a twin and maybe it's a duplicate
            if (duplicates.has(j)) {
                var _js = new Set(js);
                _js.add(j);
                return f(i,j + 1,js) | f(i + 1,j + 1,_js);
            // it's a twin
            } else {
                js.add(j);
                return f(i + 1,j + 1,js);
            }
        }
    }
    return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
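A minimal C++ sketch of the f(i, j) recurrence above, checking only whether one given even-length string has the tight-twin property (the full algorithm described here runs this over prefixes for every starting position; isTightTwin is just an illustrative name):

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
using namespace std;

// Returns true if the whole even-length string s is a pair of tight twins,
// i.e. its maximum self-avoiding common subsequence has length |s|/2.
bool isTightTwin(const string& s) {
    int len = s.size();
    if (len % 2 != 0) return false;
    // f[i][j] = longest SACS of the length-i prefix with the length-j prefix
    vector<vector<int>> f(len + 1, vector<int>(len + 1, 0));
    int best = 0;
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            int m = (s[i - 1] == s[j - 1] && i < j ? 1 : 0) + f[i - 1][j - 1];
            f[i][j] = max({ f[i - 1][j], f[i][j - 1], m });
            if (j == len) best = max(best, f[i][j]);   // max over the last column
        }
    }
    return best == len / 2;
}

int main() {
    cout << boolalpha;
    cout << isTightTwin("abab") << "\n";   // true: "ab" + "ab"
    cout << isTightTwin("aabb") << "\n";   // true: indexes {0,2} and {1,3} both give "ab"
    cout << isTightTwin("abba") << "\n";   // false
}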
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Seek for tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and each character appears in it an even number of times; then, its length being L, check whether it contains two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is to check all of them. There are more interesting ways to seek for tight twins, though.
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think a NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here, but I notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Find the number of strings w of length n over the alphabet {a, b, c}

I'm trying to figure out how to calculate the number of all strings w of length n such that in any substring of length 4 of w, all three letters a, b, c occur. For example, abbcaabca should be printed when n = 9, but aabbcabac should not be included.
I was trying to make a math formula like
3^N - 3 * 2^N + 3 or (3^(N-3))*N!
Can it work this way or do I have to generate them and count them? I'm working with large numbers like 100, and I don't think I can generate them to count them.
You should probably be able to work your way up and start with let's say all possible words of length 4 and then add just one letter and count the possible allowed resulting words. Then you can iteratively go up to high numbers without having to explore all 3^N possibilities.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main() {
    const unsigned w = 4;
    unsigned n = 10;
    vector<string> before, current;
    // obtain all possible permutations of the strings "aabc", "abbc" and "abcc"
    string base = "aabc";
    before.emplace_back(base);
    while (std::next_permutation(base.begin(), base.end())) before.emplace_back(base);
    base = "abbc";
    before.emplace_back(base);
    while (std::next_permutation(base.begin(), base.end())) before.emplace_back(base);
    base = "abcc";
    before.emplace_back(base);
    while (std::next_permutation(base.begin(), base.end())) before.emplace_back(base);
    // iteratively add single letters to the words in the collection and keep a word
    // only if its new last window of 4 still contains all three letters
    size_t posa, posb, posc;
    current = before;                          // so n == w is also handled
    for (unsigned k = 1; k <= n - w; ++k)      // <= so the words actually reach length n
    {
        current.clear();
        for (const auto& it : before)
        {
            posa = it.find("a", k);            // does the letter occur in the last 3 characters?
            posb = it.find("b", k);
            posc = it.find("c", k);
            if (posb != string::npos && posc != string::npos) current.emplace_back(it + "a");
            if (posa != string::npos && posc != string::npos) current.emplace_back(it + "b");
            if (posa != string::npos && posb != string::npos) current.emplace_back(it + "c");
        }
        before = current;
    }
    for (const auto& it : current) cout << it << endl;
    cout << current.size() << " valid words of length " << n << endl;
    return 0;
}
Note that with this you will still run into the exponential wall pretty quickly... In a more efficient implementation I would represent words as integers (NOT vectors of integers, but rather integers in a base-3 representation), but the exponential scaling would still be there. If you are just interested in the number, @Jeffrey's approach is surely better.
The trick is to break down the problem. Consider:
Would knowing how many such strings, of length 50, ending in each pair of letters, help?
(Number of 50-strings ending in AA) times (number of 50-strings starting with B or C)
+
(Number of 50-strings ending in AB) times (number of 50-strings starting with C)
+
... and so on for all other combinations, gives you the number of 100-long strings.
Continue breaking it down, recursively.
Look up dynamic programming.
Also look up large number libraries.
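As a concrete starting point, here is a sketch of the standard per-character DP that these hints point toward (not the half-and-half decomposition sketched above): the state is the last three characters (27 states), and a character may be appended only if the window of 4 it completes contains all three letters. The counts overflow 64 bits well before n = 100, which is where the large-number library comes in (or count modulo a prime, depending on what is asked).

#include <cstdio>
#include <vector>
using namespace std;

int main() {
    const int n = 10;                      // assumes n >= 4; try small n with 64-bit counts
    vector<unsigned long long> cnt(27, 0); // cnt[state] = words ending in these 3 characters

    // seed with all valid words of length 4, keyed by their last 3 characters
    for (int w = 0; w < 81; ++w) {         // all 3^4 words of length 4
        int d[4] = { w % 3, (w / 3) % 3, (w / 9) % 3, (w / 27) % 3 };
        bool seen[3] = { false, false, false };
        for (int i = 0; i < 4; ++i) seen[d[i]] = true;
        if (seen[0] && seen[1] && seen[2])
            cnt[d[1] * 9 + d[2] * 3 + d[3]]++;   // keep the last three characters d[1..3]
    }

    // extend one character at a time
    for (int len = 4; len < n; ++len) {
        vector<unsigned long long> nxt(27, 0);
        for (int st = 0; st < 27; ++st) {
            if (!cnt[st]) continue;
            int p1 = st / 9, p2 = (st / 3) % 3, p3 = st % 3;   // last three characters
            for (int c = 0; c < 3; ++c) {
                bool seen[3] = { false, false, false };
                seen[p1] = seen[p2] = seen[p3] = seen[c] = true;
                if (seen[0] && seen[1] && seen[2])             // new window of 4 is valid
                    nxt[(p2 * 3 + p3) * 3 + c] += cnt[st];
            }
        }
        cnt = nxt;
    }

    unsigned long long total = 0;
    for (auto v : cnt) total += v;
    printf("valid words of length %d: %llu\n", n, total);
    return 0;
}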