Search for a substring in another string using hashing


I wrote code to find a substring in another string using hashing, but it's giving me a wrong result.
A description of how the code works:
Store the first n powers of p=31 in array pows.
Store hashes for each substring s[0..i] in the array h.
Calculate the hash for each substring of length 9 using the h array and store it in a set.
Hash the string t and store its hash.
Compare the hash of t and hashes in the set.
The hash h2[n2-1] should exist in the set but it does not. Could you help me find the bug in the code?
Note: When I use the modular inverse instead of multiplying by pows[i-8] the code runs well.
#include <bits/stdc++.h>
#define m 1000000007
#define N (int)2e6 + 3
using namespace std;
long long pows[N], h[N], h2[N];
set<int> ss;
int main() {
string s = "www.cplusplus.com/forum";
// powers array
pows[0] = 1;
int n = s.length(), p = 31;
for (int i = 1; i < n; i++) {
pows[i] = pows[i - 1] * p;
pows[i] %= m;
}
// hash from 0 to i array
h[0] = s[0] - 'a' + 1;
for (int i = 1; i < n; i++) {
h[i] = h[i - 1] + (s[i] - 'a' + 1) * pows[i];
h[i] %= m;
}
// storing each hash with 9 characters in a set
ss.insert(h[8]);
for (int i = 9; i < n; i++) {
int tp = h[i] - h[i - 9] * pows[i - 8];
tp %= m;
tp += m;
tp %= m;
ss.insert(tp);
}
// print hashes with 9 characters
set<int>::iterator itr = ss.begin();
while (itr != ss.end()) {
cout << *(itr++) << " ";
}
cout << endl;
// t is the string that I want to look for in s
string t = "cplusplus";
int n2 = t.length();
h2[0] = t[0] - 'a' + 1;
for (int i = 1; i < n2; i++) {
h2[i] = h2[i - 1] + (t[i] - 'a' + 1) * pows[i];
h2[i] %= m;
}
// print t hash
cout << h2[n2 - 1] << endl;
return 0;
}

I can see two problems with your code:
When you're computing hashes for substrings of length 9, you're storing the intermediate result (of type long long) in an int variable. This could cause integer overflow and the hash you computed would probably be incorrect.
Given a string s = {s[0], s[1], ..., s[n-1]}, the way you're computing the hash is: h = ∑ s[i] * p^i. In this case, given the prefix hash stored in h, h[r] - h[l - 1] equals the hash of s[l..r] multiplied by p^l, so the hash for a substring s[l..r] (inclusive) should be (h[r] - h[l - 1]) / p^l, instead of what you wrote. This is also why using the modular inverse (which is required to perform division under a modulus) is correct.
I think a more common way to compute hashes is the other way around, i.e. h = ∑ s[i] * p^(n-i-1). This allows you to compute the substring hash as h[r] - h[l - 1] * p^(r-l+1), which does not require computing modular inverses.
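A minimal sketch of that second scheme, under some assumptions that are not in the original post: the names `PrefixHash` and `hashContains` are hypothetical, raw byte values are used instead of `s[i] - 'a' + 1` (the sample string contains '.' and '/'), and `pre` is shifted by one so that `pre[0]` is the hash of the empty prefix. A hash match only means a probable match; a real search would still compare characters on a hit.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int64_t kHashMod = 1000000007;
constexpr std::int64_t kHashBase = 31;

// Hypothetical helper for illustration; not from the original post.
struct PrefixHash {
    std::vector<std::int64_t> pre;  // pre[i] = hash of s[0..i-1]
    std::vector<std::int64_t> pw;   // pw[i]  = kHashBase^i mod kHashMod

    explicit PrefixHash(const std::string& s)
        : pre(s.size() + 1, 0), pw(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            // Horner form: earlier characters end up with higher powers.
            pre[i + 1] = (pre[i] * kHashBase + static_cast<unsigned char>(s[i])) % kHashMod;
            pw[i + 1] = pw[i] * kHashBase % kHashMod;
        }
    }

    // Hash of s[l..r], inclusive and 0-based; no modular inverse needed.
    std::int64_t substr(std::size_t l, std::size_t r) const {
        std::int64_t v = (pre[r + 1] - pre[l] * pw[r - l + 1]) % kHashMod;
        return v < 0 ? v + kHashMod : v;
    }
};

// True if some window of s has the same hash as t (a probable occurrence).
bool hashContains(const std::string& s, const std::string& t) {
    if (t.empty()) return true;
    if (t.size() > s.size()) return false;
    const PrefixHash hs(s), ht(t);
    const std::int64_t target = ht.substr(0, t.size() - 1);
    for (std::size_t i = 0; i + t.size() <= s.size(); ++i) {
        if (hs.substr(i, i + t.size() - 1) == target) return true;
    }
    return false;
}
```

With the strings from the question, `hashContains("www.cplusplus.com/forum", "cplusplus")` finds the window starting at index 4.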

Related

Substrings of equal length comparison using hashing

On an assignment that I have, for a string S, I need to compare two substrings of equal lengths. Output should be "Yes" if they are equal, "No" if they are not equal. I am given the starting indexes of two substrings (a and b), and the length of the substrings L.
For example, for S = "Hello", a = 1, b = 3, L = 2, the substrings are:
substring1 = "el" and substring2 = "lo", which aren't equal, so answer will be "No".
I think hashing each substring of the main string S and writing them all to memory would be a good approach to take. Here is the code I have written for this (I have tried to implement what I learned about this from the Coursera course that I was taking):
This function takes a string and the hashing parameters p and x, and computes a polynomial hash of the given string.
long long PolyHash(string str, long long p, int x){
long long res = 0;
for(int i = str.length() - 1; i > -1; i--){
res = (res * x + (str[i] - 'a' + 1)) % p;
}
return res;
}
The function below just precomputes all hashes and fills up an array called ah, which is initialized in the main function. The array ah consists of n rows and n columns, where n is the string length (half of it gets wasted because I couldn't find how to properly make it work as a triangle, so I had to go for a full rectangular array). Assuming n = 7, then ah[0]-ah[6] are hash values for string[0]-string[6] (meaning all substrings of length 1), ah[7]-ah[12] are hash values for string[0-1]-string[5-6] (meaning all substrings of length 2), and so on until the end.
void PreComputeAllHashes(string str, int len, long long p, int x, long long* ah){
int n = str.length();
string S = str.substr(n - len, len);
ah[len * n + n - len] = PolyHash(S, p, x);
long long y = 1;
for(int _ = 0; _ < len; _++){
y = (y * x) % p;
}
for(int i = n - len - 1; i > -1; i--){
ah[n * len + i] = (x * ah[n * len + i + 1] + (str[i] - 'a' + 1) - y * (str[i + len] - 'a' + 1)) % p;
}
}
And below is the main function. I took p equal to some large prime number, and x to be some manually picked, somewhat "random" prime number.
I take the text as input, initialize hash array, fill the hash array, and then take queries as input, to answer all queries from my array.
int main(){
long long p = 1e9 + 9;
int x = 78623;
string text;
cin >> text;
long long* allhashes = new long long[text.length() * text.length()];
for(int i = 1; i <= text.length(); i++){
PreComputeAllHashes(text, i, p, x, allhashes);
}
int queries;
cin >> queries;
int a, b, l;
for(int _ = 0; _ < queries; _++){
cin >> a >> b >> l;
if(a == b){
cout << "Yes" << endl;
}else{
cout << ((allhashes[l * text.length() + a] == allhashes[l * text.length() + b]) ? "Yes" : "No") << endl;
}
}
return 0;
}
However, one of the test cases for this assignment on Coursera is throwing an error like this:
Failed case #7/14: unknown signal 6 (Time used: 0.00/1.00, memory used: 29396992/536870912.)
Which, I have looked up online, and means the following:
Unknown signal 6 (or 7, or 8, or 11, or some other).This happens when your program crashes. It can be
because of division by zero, accessing memory outside of the array bounds, using uninitialized
variables, too deep recursion that triggers stack overflow, sorting with contradictory comparator,
removing elements from an empty data structure, trying to allocate too much memory, and many other
reasons. Look at your code and think about all those possibilities.
And I've been looking at my code the entire day, and still haven't been able to come up with a solution to this error. Any help to fix this would be appreciated.
Edit: The assignment states that the length of the input string can be up to 500000 characters long, and the number of queries can be up to 100000. This task also has 1 second time limit, which is pretty small for going over characters one by one for each string.

So, I did some research as to how I can reduce the complexity of this algorithm that I have implemented, and finally found it! Turns out there is a super-simple way (well, not if you count the theory involved behind it) to get hash value of any substring, given the prefix hashes of the initial string!
You can read more about it here, but I will try to explain it briefly.
So what do we do - We precalculate all the hash values for prefix-substrings.
Prefix substrings for a string "hello" would be the following:
h
he
hel
hell
hello
Once we have hash values of all these prefix substrings, we can collect them in a vector such that:
h[str] = str[0] + str[1] * P + str[2] * P^2 + str[3] * P^3 + ... + str[N] * P^N
where P is any prime number (I chose p = 263).
Then we need a large modulus so that the values do not grow too large; every operation is reduced modulo it. I chose m = 10^9 + 9.
First I am creating a vector to hold the precalculated powers of P:
vector<long long> p_pow (s.length());
p_pow[0] = 1;
for(size_t i=1; i<p_pow.size(); ++i){
p_pow[i] = (m + (p_pow[i-1] * p) % m) % m;
}
Then I calculate the vector of hash values for prefix substrings:
vector<long long> h (s.length());
for (size_t i=0; i<s.length(); ++i){
h[i] = (m + (s[i] - 'a' + 1) * p_pow[i] % m) % m;
if(i){
h[i] = (m + (h[i] + h[i-1]) % m) % m;
}
}
Suppose I have q queries, each of which consist of 3 integers: a, b, and L.
To check equality for substrings s1 = str[a...a+l-1] and s2 = str[b...b+l-1], I can compare the hash values of these substrings. To get the hash value of a substring from the hash values of the prefix substrings that we just created, we use the following formula:
H[I..J] * P[I] = H[0..J] - H[0..I-1]
Again, you can read about the proof of this in the link.
So, to address each query, I would do the following:
cin >> a >> b >> len;
if(a == b){ // just avoid extra calculation, saves little time
cout << "Yes" << endl;
}else{
long long h1 = h[a+len-1] % m;
if(a){
h1 = (m + (h1 - h[a-1]) % m) % m;
}
long long h2 = h[b+len-1] % m;
if(b){
h2 = (m + (h2 - h[b-1]) % m) % m;
}
if (a < b && h1 * p_pow[b-a] % m == h2 % m || a > b && h1 % m == h2 * p_pow[a-b] % m){
cout << "Yes" << endl;
}else{
cout << "No" << endl;
}
}
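The pieces above can be collected into one small class. This is only a sketch under stated assumptions: the class name `SubstringComparer` and the method `probablyEqual` are hypothetical, the prefix array is shifted by one (so `pre_[i]` covers `s[0..i-1]` and no `if(a)` special case is needed), and raw byte values replace `s[i] - 'a' + 1`. Equal hashes indicate a probable match, not a guaranteed one.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int64_t kCmpBase = 263;         // P from the explanation above
constexpr std::int64_t kCmpMod = 1000000009;   // m = 10^9 + 9

// Hypothetical wrapper for illustration; names are not from the original post.
class SubstringComparer {
public:
    explicit SubstringComparer(const std::string& s)
        : pow_(s.size() + 1, 1), pre_(s.size() + 1, 0) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            pow_[i + 1] = pow_[i] * kCmpBase % kCmpMod;
            // pre_[i + 1] = sum of s[j] * P^j for j in [0, i]
            pre_[i + 1] = (pre_[i] + static_cast<unsigned char>(s[i]) * pow_[i]) % kCmpMod;
        }
    }

    // True if s[a..a+len-1] and s[b..b+len-1] have matching hashes.
    bool probablyEqual(std::size_t a, std::size_t b, std::size_t len) const {
        if (a == b) return true;
        std::int64_t ha = (pre_[a + len] - pre_[a] + kCmpMod) % kCmpMod;  // H(a) * P^a
        std::int64_t hb = (pre_[b + len] - pre_[b] + kCmpMod) % kCmpMod;  // H(b) * P^b
        // Scale the side with the smaller offset so both carry the same power of P.
        if (a < b) {
            ha = ha * pow_[b - a] % kCmpMod;
        } else {
            hb = hb * pow_[a - b] % kCmpMod;
        }
        return ha == hb;
    }

private:
    std::vector<std::int64_t> pow_;  // pow_[i] = P^i mod m
    std::vector<std::int64_t> pre_;  // shifted prefix hashes
};
```

Construction is O(n) and each query is O(1), which is what makes 100000 queries on a 500000-character string feasible.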

Your approach is very complex for such a simple task. Assuming that you only need to do this operation once, you can compare the substrings directly with a for loop; no hashing is needed. Take a look at this code:
for(int i = a, j = b, counter = 0 ; counter < L ; counter++, i++, j++){
if(S[i] != S[j]){
cout << "Not the same" << endl;
return 0;
}
}
cout << "They are the same" << endl;

C26451: Arithmetic overflow using operator '+' on a 4 byte value then casting the result to 8 byte value

I am trying to write a program that searches through a movie script using two different string searching algorithms. However, the warning C26451 (Arithmetic overflow using operator '+' on a 4 byte value then casting the result to 8 byte value) keeps coming up in the hash-calculation part of the Rabin-Karp search. Is there any way to fix this? Any help would be greatly appreciated.
#define d 256
Position rabinkarp(const string& pat, const string& text) {
int M = pat.size();
int N = text.size();
int i, j;
int p = 0; // hash value for pattern
int t = 0; // hash value for txt
int h = 1;
int q = 101;
// The value of h would be "pow(d, M-1)%q"
for (i = 0; i < M - 1; i++)
h = (h * d) % q;
// Calculate the hash value of pattern and first
// window of text
for (i = 0; i < M; i++)
{
p = (d * p + pat[i]) % q;
t = (d * t + text[i]) % q;
}
// Slide the pattern over text one by one
for (i = 0; i <= N - M; i++)
{
// Check the hash values of current window of text
// and pattern. If the hash values match then only
// check for characters on by one
if (p == t)
{
/* Check for characters one by one */
for (j = 0; j < M; j++)
{
if (text[i + j] != pat[j])
break;
}
// if p == t and pat[0...M-1] = txt[i, i+1, ...i+M-1]
if (j == M)
return i;
}
// Calculate hash value for next window of text: Remove
// leading digit, add trailing digit
if (i < N - M)
{
t = (d * (t - text[i] * h) + text[i + M]) % q;// <---- warning is here
// We might get negative value of t, converting it
// to positive
if (t < 0)
t = (t + q);
}
}
return -1;
}

You're adding two int which is 4 bytes in your case, whereas std::string::size_type is probably 8 bytes in your case. Said conversion happens when you do:
text[i + M]
Which is a call to std::string::operator[] taking a std::string::size_type as parameter.
Use std::string::size_type, which is usually the same as size_t, for i and M.
gcc does not give any warning for that, even with -Wall -Wextra -pedantic, so I guess you have enabled nearly every available warning, or something similar.
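As a rough sketch of that advice, the function from the question could be written with std::size_t indices and 64-bit hash arithmetic, so that `i + M` is never computed in a 4-byte type. Assumptions that are not in the original: the `Position` return type is replaced by `std::size_t` with `std::string::npos` for "not found", the name `rabinKarp` is illustrative, and characters are cast to `unsigned char` before entering the hash.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the index of the first occurrence of pat in text,
// or std::string::npos if there is none.
std::size_t rabinKarp(const std::string& pat, const std::string& text) {
    const std::size_t M = pat.size();
    const std::size_t N = text.size();
    if (M == 0) return 0;
    if (M > N) return std::string::npos;

    const std::int64_t d = 256;  // alphabet size, as in the question
    const std::int64_t q = 101;  // small prime modulus, as in the question
    std::int64_t h = 1;          // d^(M-1) mod q
    for (std::size_t i = 0; i + 1 < M; ++i) h = h * d % q;

    std::int64_t p = 0, t = 0;   // hashes of the pattern and the current window
    for (std::size_t i = 0; i < M; ++i) {
        p = (d * p + static_cast<unsigned char>(pat[i])) % q;
        t = (d * t + static_cast<unsigned char>(text[i])) % q;
    }

    for (std::size_t i = 0; i + M <= N; ++i) {
        // Compare characters only when the hashes agree.
        if (p == t && text.compare(i, M, pat) == 0) return i;
        if (i + M < N) {
            // Remove text[i], append text[i + M]; the index sum is size_t arithmetic.
            t = (d * (t - static_cast<unsigned char>(text[i]) * h)
                 + static_cast<unsigned char>(text[i + M])) % q;
            if (t < 0) t += q;
        }
    }
    return std::string::npos;
}
```

Keeping every index as std::size_t means no value is widened after the addition, which is the situation the warning describes.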

C++ - Code Optimization

I have a problem:
You are given a sequence, in the form of a string with characters ‘0’, ‘1’, and ‘?’ only. Suppose there are k ‘?’s. Then there are 2^k ways to replace each ‘?’ by a ‘0’ or a ‘1’, giving 2^k different 0-1 sequences (0-1 sequences are sequences with only zeroes and ones).
For each 0-1 sequence, define its number of inversions as the minimum number of adjacent swaps required to sort the sequence in non-decreasing order. In this problem, the sequence is sorted in non-decreasing order precisely when all the zeroes occur before all the ones. For example, the sequence 11010 has 5 inversions. We can sort it by the following moves: 11010 → 11001 → 10101 → 01101 → 01011 → 00111.
Find the sum of the number of inversions of the 2^k sequences, modulo 1000000007 (10^9+7).
For example:
Input: ??01
-> Output: 5
Input: ?0?
-> Output: 3
Here's my code:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <string.h>
#include <math.h>
using namespace std;
void ProcessSequences(char *input)
{
int c = 0;
/* Count the number of '?' in input sequence
* 1??0 -> 2
*/
for(int i=0;i<strlen(input);i++)
{
if(*(input+i) == '?')
{
c++;
}
}
/* Get all possible combination of '?'
* 1??0
* -> ??
* -> 00, 01, 10, 11
*/
int seqLength = pow(2,c);
// Initialize 2D array of integer
int **sequencelist, **allSequences;
sequencelist = new int*[seqLength];
allSequences = new int*[seqLength];
for(int i=0; i<seqLength; i++){
sequencelist[i] = new int[c];
allSequences[i] = new int[500000];
}
//end initialize
for(int count = 0; count < seqLength; count++)
{
int n = 0;
for(int offset = c-1; offset >= 0; offset--)
{
sequencelist[count][n] = ((count & (1 << offset)) >> offset);
// cout << sequencelist[count][n];
n++;
}
// cout << std::endl;
}
/* Change '?' in former sequence into all possible bits
* 1??0
* ?? -> 00, 01, 10, 11
* -> 1000, 1010, 1100, 1110
*/
for(int d = 0; d<seqLength; d++)
{
int seqCount = 0;
for(int e = 0; e<strlen(input); e++)
{
if(*(input+e) == '1')
{
allSequences[d][e] = 1;
}
else if(*(input+e) == '0')
{
allSequences[d][e] = 0;
}
else
{
allSequences[d][e] = sequencelist[d][seqCount];
seqCount++;
}
}
}
/*
* Sort each sequences to increasing mode
*
*/
// cout<<endl;
int totalNum[seqLength];
for(int i=0; i<seqLength; i++){
int num = 0;
for(int j=0; j<strlen(input); j++){
if(j==strlen(input)-1){
break;
}
if(allSequences[i][j] > allSequences[i][j+1]){
int temp = allSequences[i][j];
allSequences[i][j] = allSequences[i][j+1];
allSequences[i][j+1] = temp;
num++;
j = -1;
}//endif
}//endfor
totalNum[i] = num;
}//endfor
/*
* Sum of all Num of Inversions
*/
int sum = 0;
for(int i=0;i<seqLength;i++){
sum = sum + totalNum[i];
}
// cout<<"Output: "<<endl;
int out = sum%1000000007;
cout<< out <<endl;
} //end of ProcessSequences method
int main()
{
// Get Input
char seq[500000];
// cout << "Input: "<<endl;
cin >> seq;
char *p = &seq[0];
ProcessSequences(p);
return 0;
}
The results were right for small inputs, but for bigger inputs I exceeded the CPU time limit of 1 second, and I also exceeded the memory limit. How can I make it faster and use less memory? What algorithm and data structure should I use? Thank you.

Dynamic programming is the way to go. Imagine you are adding the last character to all sequences.
If it is 1, then you get XXXXXX1. The number of swaps is obviously the same as it was for every sequence so far.
If it is 0, then you need to know the number of ones already in every sequence. The number of swaps increases by the number of ones for every sequence.
If it is ?, you just add the two previous cases together.
You need to calculate how many sequences there are for every length and for every number of ones (the number of ones in a sequence cannot be greater than the length of the sequence, naturally). You start with length 1, which is trivial, and continue with longer lengths. You can get really big numbers, so you should calculate modulo 1000000007 all the time. The program is not in C++, but should be easy to rewrite (the array should be initialized to 0, int is 32-bit, long is 64-bit).
long Mod(long x)
{
return x % 1000000007;
}
long Calc(string s)
{
int len = s.Length;
long[,] nums = new long[len + 1, len + 1];
long sum = 0;
nums[0, 0] = 1;
for (int i = 0; i < len; ++i)
{
if(s[i] == '?')
{
sum = Mod(sum * 2);
}
for (int j = 0; j <= i; ++j)
{
if (s[i] == '0' || s[i] == '?')
{
nums[i + 1, j] = Mod(nums[i + 1, j] + nums[i, j]);
sum = Mod(sum + j * nums[i, j]);
}
if (s[i] == '1' || s[i] == '?')
{
nums[i + 1, j + 1] = nums[i, j];
}
}
}
return sum;
}
Optimization
The code above is written to be as clear as possible and to show the dynamic programming approach. You do not actually need the array [len+1, len+1]. You calculate column i+1 from column i and never go back, so two columns are enough: old and new. If you dig more into it, you find that row j of the new column depends only on rows j and j-1 of the old column. So you can go with one column if you update the values in the right direction (and do not overwrite values you would still need).
The code above uses 64-bit integers. You really need that only in j * nums[i, j]. The nums array contains numbers less than 1000000007, so a 32-bit integer is enough. Even 2*1000000007 fits into a 32-bit signed int, and we can make use of that.
We can optimize the code by nesting the loops inside the conditions instead of the conditions inside the loop. Maybe it is even the more natural approach; the only downside is repeated code.
The % operator is, like every division, quite expensive. j * nums[i, j] is typically far smaller than the capacity of a 64-bit integer, so we do not have to apply the modulo in every step. Just watch the actual value and apply it when needed. The Mod(nums[i + 1, j] + nums[i, j]) can also be optimized, as nums[i + 1, j] + nums[i, j] is always smaller than 2*1000000007.
And finally the optimized code. I switched to C++, as I realized there are differences in what int and long mean, so I'd rather make it clear:
long CalcOpt(string s)
{
long len = s.length();
vector<long> nums(len + 1);
long long sum = 0;
nums[0] = 1;
const long mod = 1000000007;
for (long i = 0; i < len; ++i)
{
if (s[i] == '1')
{
for (long j = i + 1; j > 0; --j)
{
nums[j] = nums[j - 1];
}
nums[0] = 0;
}
else if (s[i] == '0')
{
for (long j = 1; j <= i; ++j)
{
sum += (long long)j * nums[j];
if (sum > std::numeric_limits<long long>::max() / 2) { sum %= mod; }
}
}
else
{
sum *= 2;
if (sum > std::numeric_limits<long long>::max() / 2) { sum %= mod; }
for (long j = i + 1; j > 0; --j)
{
sum += (long long)j * nums[j];
if (sum > std::numeric_limits<long long>::max() / 2) { sum %= mod; }
long add = nums[j] + nums[j - 1];
if (add >= mod) { add -= mod; }
nums[j] = add;
}
}
}
return (long)(sum % mod);
}
Simplification
Time limit still exceeded? There is probably better way to do it. You can either
get back to the beginning and find out mathematically different way to calculate the result
or simplify actual solution using math
I went the second way. What we are doing in the loop is in fact convolution of two sequences, for example:
0, 0, 0, 1, 4, 6, 4, 1, 0, 0,... and 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,...
0*0 + 0*1 + 0*2 + 1*3 + 4*4 + 6*5 + 4*6 + 1*7 + 0*8...= 80
The first sequence is symmetric and the second is linear. In this case, the sum of the convolution can be calculated from the sum of the first sequence, which is 16 (numSum), and the number from the second sequence corresponding to the center of the first sequence, which is 5 (numMult). numSum*numMult = 16*5 = 80. We can replace the whole loop with one multiplication if we are able to update those numbers in each step, which fortunately seems to be the case.
If s[i] == '0' then numSum does not change and numMult does not change.
If s[i] == '1' then numSum does not change, only numMult increments by 1, as we shift the whole sequence by one position.
If s[i] == '?' we add the original and the shifted sequence together. numSum is multiplied by 2 and numMult increments by 0.5.
The 0.5 is a bit of a problem, as it is not a whole number. But we know that the result is a whole number. Fortunately, in modular arithmetic the inverse of two (= 1/2) exists as a whole number in this case. It is h = (mod+1)/2. As a reminder, the inverse of 2 is the number h such that h*2 = 1 modulo mod. Implementation-wise it is easier to multiply numMult by 2 and divide numSum by 2, but that is just a detail; we would need the 0.5 anyway. The code:
long CalcOptSimpl(string s)
{
long len = s.length();
long long sum = 0;
const long mod = 1000000007;
long numSum = (mod + 1) / 2;
long long numMult = 0;
for (long i = 0; i < len; ++i)
{
if (s[i] == '1')
{
numMult += 2;
}
else if (s[i] == '0')
{
sum += numSum * numMult;
if (sum > std::numeric_limits<long long>::max() / 4) { sum %= mod; }
}
else
{
sum = sum * 2 + numSum * numMult;
if (sum > std::numeric_limits<long long>::max() / 4) { sum %= mod; }
numSum = (numSum * 2) % mod;
numMult++;
}
}
return (long)(sum % mod);
}
I am pretty sure there exists some simple way to get this code, yet I am still unable to see it. But sometimes path is the goal :-)

If a sequence has N zeros with indexes zero[0], zero[1], ... zero[N - 1], the number of inversions for it would be (zero[0] + zero[1] + ... + zero[N - 1]) - (N - 1) * N / 2. (you should be able to prove it)
For example, 11010 has two zeros with indexes 2 and 4, so the number of inversions would be 2 + 4 - 1 * 2 / 2 = 5.
For all 2^k sequences, you can calculate the sum of two parts separately and then add them up.
1) The first part is zero[0] + zero[1] + ... + zero[N - 1]. Each 0 in the given sequence contributes index * 2^k and each ? contributes index * 2^(k-1).
2) The second part is (N - 1) * N / 2. You can calculate this using dynamic programming (it may help to read up on it first). In short, use f[i][j] to represent the number of sequences with j zeros using the first i characters of the given sequence.
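One way to sketch this decomposition in O(n) is shown here. It swaps in a different technique for the second part: instead of the f[i][j] table, it counts pairs of zeros directly, since (N - 1) * N / 2 is the number of zero pairs and a pair of positions (i, j) is zero in 2^k * w_i * w_j of the sequences, where w is 1 for '0', 1/2 for '?' and 0 for '1'. The function name `sumInversions` and the use of the modular inverse of 2 for the 1/2 weights are assumptions of this sketch, not part of the answer above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::int64_t kInvMod = 1000000007;

// Sum of inversion counts over all 2^k fillings of '?', modulo 1e9+7.
std::int64_t sumInversions(const std::string& s) {
    const std::int64_t half = (kInvMod + 1) / 2;  // modular inverse of 2

    std::int64_t total = 1;  // 2^k mod kInvMod
    for (char c : s) {
        if (c == '?') total = total * 2 % kInvMod;
    }

    std::int64_t indexSum = 0;      // sum of i * w_i        (first part / 2^k)
    std::int64_t pairSum = 0;       // sum of w_i * w_j, i<j (second part / 2^k)
    std::int64_t weightBefore = 0;  // sum of w_i over positions already seen
    for (std::size_t i = 0; i < s.size(); ++i) {
        const std::int64_t w = (s[i] == '0') ? 1 : (s[i] == '?') ? half : 0;
        indexSum = (indexSum + static_cast<std::int64_t>(i) % kInvMod * w) % kInvMod;
        pairSum = (pairSum + w * weightBefore) % kInvMod;
        weightBefore = (weightBefore + w) % kInvMod;
    }
    return (indexSum - pairSum + kInvMod) % kInvMod * total % kInvMod;
}
```

For the examples in the question, `sumInversions("??01")` gives 5 and `sumInversions("?0?")` gives 3.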

C++, moving a NaN to the end of the array, when output

So, I've made a program that sorts arrays, and I'm trying to sort an array of doubles, including 2-3 arbitrary values I enter, positive infinity, negative infinity and a single NaN. For this purpose I wish to sort the NaN as well.
My code works, but it does not sort the NaN. What I'd like to do is sort it to the end, or have it put at the end of the sorted array. Is there any way I can actually do this? Thanks in advance! The code is as follows:
int main()
{
int start_s = clock();
int n, k = 4, j; // k is number of elements
double x = -0.0;
double i = 0;
double swap = 0;//used in the function as a place holder and used for swapping between other variables
double a[100] = { (1/x) + (1/i), 2.3, 1/x *0, 1/i };//array of double elements // 1/i * 0 is NaN
//(1 / i) * 0
for (n = 0; n < (k - 1); n++) // for loop consists of variables and statements in order to arrange contents of array
{
for (j = 0; j < k - n - 1; j++)
{
if (a[j] > a[j + 1])
{
swap = a[j];
a[j] = a[j + 1];
a[j + 1] = swap;
}
}
}
cout << "The list of sorted elements within the array, is: " << endl; /* Output message to user */
for (int i = 0; i < k; i++)// Loop up to number of elements within the array
{
cout << a[i] << " ";/* Output contents of array */
}
cout << endl; //new line
int stop_s = clock();
cout << "The execution time of this sort, is equal to: " << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << " milliseconds" << endl;
return 0;
}

Since you're in C++ land anyway, why not use it to the full. First, indeed, move the NaN's and then sort. I've taken out 'noise' from your code and produced this, it compiles and runs (edit: on gcc-4.4.3). The main difference is that the NaN's are at the beginning but they're easily skipped since you will get a pointer to the start of non-NaN's.
#include <algorithm>
#include <iostream>
// portable NaN test: NaN is the only value that compares unequal to itself
bool is_nan(double v) { return v != v; }
int main()
{
int k = 4; // k is number of elements
double x = -0.0;
double i = 0;
double a[100] = { (1/x) + (1/i), 2.3, 1/x *0, 1/i };//array of double elements // 1/i * 0 is NaN
double *ptr; // will point at first non-NaN double
// divide the list into two parts: NaN's and non-NaN's
ptr = std::partition(a, a+k, is_nan);
// and sort 'm
// EDIT: of course, start sorting _after_ the NaNs ...
std::sort(ptr, a+k);
std::cout << "The list of sorted elements within the array, is: " << std::endl; /* Output message to user */
for (int i = 0; i < k; i++)// Loop up to number of elements within the array
{
std::cout << a[i] << " ";/* Output contents of array */
}
std::cout << std::endl; //new line
return 0;
}

Do a linear scan, find the NaNs, and move them to the end - by swapping.
Then sort the rest.
You can also fix your comparator, and check for NaN there.
For the actual check see: Checking if a double (or float) is NaN in C++
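A minimal sketch of the "move the NaNs to the end, then sort the rest" idea. It assumes the data lives in a std::vector rather than the raw array from the question, and the function name `sortWithNaNsLast` is hypothetical. std::partition does the swapping that the linear scan describes.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sorts v in ascending order and leaves every NaN at the back.
void sortWithNaNsLast(std::vector<double>& v) {
    // Non-NaN values are moved to the front; NaNs end up in [firstNaN, end).
    auto firstNaN = std::partition(v.begin(), v.end(),
                                   [](double x) { return !std::isnan(x); });
    // Only the NaN-free prefix is sorted, so operator< is a valid ordering.
    std::sort(v.begin(), firstNaN);
}
```

Sorting only the NaN-free range matters because NaN compares false against everything, which breaks the strict weak ordering that std::sort requires.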

You can use isnan() from cmath to check for NaNs. So, you can just change your comparison line from:
if (a[j] > a[j + 1])
to:
if ((!std::isnan(a[j + 1]) && std::isnan(a[j])) || (a[j] > a[j + 1]))
Just a reminder, you need to have:
#include <cmath>
at the top of your code.

Longest prefix string length for all the suffixes

Find the length of the longest prefix string for all the suffixes of the string.
For example suffixes of the string ababaa are ababaa, babaa, abaa, baa, aa and a. The similarities of each of these strings with the string "ababaa" are 6,0,3,0,1,1 respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.
I wrote following code
#include <iostream>
#include <string.h>
#include <stdio.h>
#include <time.h>
int main ( int argc, char **argv) {
size_t T;
std::cin >> T;
char input[100000];
for ( register size_t i = 0; i < T; ++i) {
std::cin >> input;
double t = clock();
size_t len = strlen(input);
char *left = input;
char *right = input + len - 1;
long long sol = 0;
int end_count = 1;
while ( left < right ) {
if ( *right != '\0') {
if ( *left++ == *right++ ) {
sol++;
continue;
}
}
end_count++;
left = input; // reset the left pointer
right = input + len - end_count; // set right to one left.
}
std::cout << sol + len << std::endl;
printf("time= %.3fs\n", (clock() - t) / (double)(CLOCKS_PER_SEC));
}
}
It works fine, but for a string that is 100000 characters long and consists of a single repeated character, i.e. aaaaaaaaaa.......a, it takes a long time. How can I optimize this further?

You can use Suffix Array: http://en.wikipedia.org/wiki/Suffix_array

Let's say your ababaa is a pattern P.
I think you could use the following algorithm:
Create a suffix automaton for all possible suffixes of P.
Walk the automaton using P as input, counting the edges traversed so far. For each accepting state of the automaton, add the current edge count to the total sum. Walk the automaton until you either reach the end of the input or there are no more edges to go through.
The total sum is the result.

Use the Z algorithm to calculate, in O(n), the length of the longest substring starting at each position that is also a prefix of the string, then scan the resulting array and sum its values.
Reference: https://www.geeksforgeeks.org/sum-of-similarities-of-string-with-all-of-its-suffixes/
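A short sketch of that approach, assuming the standard Z-function definition where z[i] is the length of the longest common prefix of s and the suffix starting at i. The names `zFunction` and `similaritySum` are illustrative. z[0] is left at 0 and the full length of the string is added separately, since the string matches itself completely.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// z[i] = length of the longest common prefix of s and s[i..], with z[0] = 0.
std::vector<std::size_t> zFunction(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> z(n, 0);
    std::size_t l = 0, r = 0;  // [l, r) is the rightmost matched window so far
    for (std::size_t i = 1; i < n; ++i) {
        // Reuse what is already known from the window, then extend naively.
        if (i < r) z[i] = std::min(r - i, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]]) ++z[i];
        if (i + z[i] > r) {
            l = i;
            r = i + z[i];
        }
    }
    return z;
}

// Sum of the similarities of s with all of its suffixes.
std::uint64_t similaritySum(const std::string& s) {
    std::uint64_t total = s.size();  // the whole string matches itself
    for (std::size_t len : zFunction(s)) total += len;
    return total;
}
```

For "ababaa" the Z values are 0, 0, 3, 0, 1, 1, so `similaritySum("ababaa")` returns 11, matching the example in the question. The window [l, r) only ever moves right, which keeps the total work linear even for a string of 100000 identical characters.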

From what I see, you are using plain array to evaluate the suffix and though it may turn out to be efficient for some data set, it would fail to be efficient for some cases, such as the one you mentioned.
You would need to implement a prefix tree (trie) or a similar data structure. The code for those isn't straightforward, so if you are not familiar with them, I would suggest you read a little bit about them.

I'm not sure whether a Trie gives you much performance gain.. but I would certainly think about it.
The other idea I had is to try to compress your string. I didn't really think about it, just a crazy idea...
if you have a string like this: ababaa compress it maybe to: abab2a. Then you have to come up with a technique where you can use your algorithm with those strings. The advantage is you can then compare long strings 100000a efficiently with each other. Or more importantly: you can calculate your sum very fast.
But again, I didn't think it through, maybe this is a very bad idea ;)

Here is a Java implementation:
// sprefix
String s = "abababa";
Vector<Integer>[] v = new Vector[s.length()];
int sPrefix = s.length();
v[0] = new Vector<Integer>();
v[0].add(new Integer(0));
for(int j = 1; j < s.length(); j++)
{
v[j] = new Vector<Integer>();
v[j].add(new Integer(0));
for(int k = 0; k < v[j - 1].size(); k++)
if(s.charAt(j) == s.charAt(v[j - 1].get(k)))
{
v[j].add(v[j - 1].get(k) + 1);
v[j - 1].set(k, 0);
}
}
for(int j = 0; j < v.length; j++)
for(int k = 0; k < v[j].size(); k++)
sPrefix += v[j].get(k);
System.out.println("Result = " + sPrefix);
