Modify hash value on replacing a single character in string (c++) - c++

I am using a polynomial hash function to calculate the hash value of a string (consisting of only lowercase english letters) as follows:
int SZ = 105, P = 31;
long long M = 1e12 + 9;
vector <long long> pw;
pw.resize(SZ, 1);
for(int i = 1; i < SZ; i++) {
pw[i] = (pw[i - 1] * P) % M;
}
long long calculateHash(string &s) {
long long h = 0;
for(int i = 0; i < s.length(); i++) {
h = (h + (s[i] - 'a' + 1) * pw[i]) % M;
}
return h;
}
I don't want to recalculate the hash of the entire string in O(N) time when I only have to replace one character at a given position. So, in order to do this in O(1) time, I do the following:
long long h1 = calculateHash(s1);
long long h2 = calculateHash(s2);
// Only one character differs in `s1` and `s2` at index `idx`
// Modifying hash for h1 to incorporate s2[idx] and removing s1[idx]
h1 = (h1 + ((s2[idx] - s1[idx]) * pw[idx])) % M;
Now when I check h1 == h2, they should ideally be equal, right? It works for smaller strings but sometimes fails: I get negative values for h1. I'm not sure whether this is an overflow issue or whether ((s2[idx] - s1[idx]) * pw[idx]) is negative enough to push h1 below zero.
Could anyone suggest a way to recalculate the hash in O(1) time when only one character is changed? Thank you in advance!

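In principle, your idea of updating the resulting value is correct, but what you need is a modulo operation whose result is always non-negative, even for negative inputs. In C++, the % operator keeps the sign of the left operand, so a negative left operand gives a negative (or zero) result.
To emulate the non-negative behaviour with the C++ modulo operator, you could do the following: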
long long tmp=(h1 + ((s2[idx] - s1[idx]) * pw[idx])) % M;
h1=(tmp+M)%M;
The first line is the same operation you have done, and the second line makes the result non-negative, because tmp cannot be less than or equal to -M after the C++ modulo operation. The additional modulo is needed to ensure that the number stays smaller than M, even if tmp was already non-negative.
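Putting the pieces together, a minimal sketch of the O(1) update could look like the following. The helper names buildPowers and replaceChar are made up for this example; M, P and calculateHash follow the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

const long long M = 1000000000009LL;  // 1e12 + 9, as in the question
const long long P = 31;

// pw[i] = P^i mod M (hypothetical helper, mirrors the loop in the question)
std::vector<long long> buildPowers(std::size_t n) {
    std::vector<long long> pw(n, 1);
    for (std::size_t i = 1; i < n; ++i) pw[i] = pw[i - 1] * P % M;
    return pw;
}

// Same polynomial hash as in the question, for lowercase strings.
long long calculateHash(const std::string& s, const std::vector<long long>& pw) {
    long long h = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        h = (h + (s[i] - 'a' + 1) * pw[i]) % M;
    return h;
}

// Hypothetical helper: hash after replacing oldCh by newCh at position idx.
long long replaceChar(long long h, char oldCh, char newCh, std::size_t idx,
                      const std::vector<long long>& pw) {
    long long delta = (newCh - oldCh) * pw[idx] % M;  // lies in (-M, M)
    return ((h + delta) % M + M) % M;                 // fold back into [0, M)
}
```

With M around 1e12 and a character difference of at most 25 in absolute value, every intermediate product stays well inside the range of long long, so the negative results come from the sign of %, not from overflow.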

Related

Implementing the Backward Nondeterministic Dawg Matching algorithm

I'm trying to implement the BNDM algorithm in my code, in order to perform a fast pattern search.
I found some code online and tried to adjust it for my use case:
I think that I did something wrong while changing the values, since the algorithm takes a few minutes to finish (I was expecting it to be faster).
Using std::search takes me 30 seconds (with wildcards).
This takes me around 4-5 minutes (without wildcards).
The reason I'm casting everything to (unsigned char) is because the program crashes otherwise, since both my data and pattern hold hex values.
What I'd like to know is, where did I go wrong with this implementation (why is it running so slow)? and how can I include the ability to search for a pattern that contains wildcards?
EDIT*
The issue with speed has been solved by switching build from debug to release.
Also changing the size of the B array to 256 made it even faster.
The only issue I currently have now is how to implement a way to use wildcards using this algorithm.
Current code:
vector<unsigned int> get_matches(const vector<char> & data, const string & pattern) {
vector<unsigned int> matches;
//vector<char>::const_iterator walk = data.begin();
std::array<std::uint32_t, 256> B{ 0 };
int m = pattern.size();
int n = data.size();
int i, j, s, d, last;
//if (m > WORD_SIZE)
// error("BNDM");
// Pre processing
//memset(B, 0, ASIZE * sizeof(int));
s = 1;
for (i = m - 1; i >= 0; i--) {
B[(unsigned char)pattern[i]] |= s;
s <<= 1;
}
// Searching phase
j = 0;
while (j <= n - m) {
i = m - 1; last = m;
d = ~0;
while (i >= 0 && d != 0) {
d &= B[(unsigned char)data[j + i]];
i--;
if (d != 0) {
if (i >= 0)
last = i + 1;
else
matches.emplace_back(j);
}
d <<= 1;
}
j += last;
}
return matches;
}
B is not big enough: it is indexed by the bytes in the pattern, so it must have 256 elements (assuming an 8-bit byte architecture). But your original code defined it as having pattern.size() elements, which is a much smaller number.
As a consequence, you are using memory outside of B's allocation, which is Undefined Behaviour.
I suggest you use std::array<std::uint32_t, 256>, since you don't ever need to resize B. (Or even better, std::array<std::uint32_t, std::numeric_limits<unsigned char>::max()+1>).
I'm not an expert on this particular search algorithm, but the preprocessing step appears to set bit p in element c of B if the character c matches pattern element p. Since a wildcard pattern element can match any character, it seems reasonable that every element of B should have the bits corresponding to wildcard characters set. In other words, instead of initialising every element of B to 0, initialise them to the mask of wildcard positions in the pattern.
I don't know if that is sufficient to get the algorithm to work with wildcards, but it could be worth a try.
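To make the suggestion concrete, here is a sketch of BNDM with single-byte wildcards. It is not the code from the question: the function name bndmWildcard and the choice of '?' as the wildcard are assumptions for this example, the state is kept in a 64-bit word (so the pattern may be at most 64 bytes long), and the inner loop tests the highest state bit, as in the textbook formulation, rather than d != 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: BNDM where `wildcard` in the pattern matches any byte.
// Assumes 1 <= pattern.size() <= 64 so the state fits in one 64-bit word.
std::vector<std::size_t> bndmWildcard(const std::vector<unsigned char>& data,
                                      const std::string& pattern,
                                      char wildcard = '?') {
    std::vector<std::size_t> matches;
    const std::size_t m = pattern.size(), n = data.size();
    if (m == 0 || m > 64 || n < m) return matches;

    // Bit (m-1-i) stands for pattern position i.
    std::uint64_t wildMask = 0;
    for (std::size_t i = 0; i < m; ++i)
        if (pattern[i] == wildcard) wildMask |= std::uint64_t{1} << (m - 1 - i);

    // Every byte matches at the wildcard positions, so start from wildMask.
    std::array<std::uint64_t, 256> B;
    B.fill(wildMask);
    for (std::size_t i = 0; i < m; ++i)
        if (pattern[i] != wildcard)
            B[static_cast<unsigned char>(pattern[i])] |= std::uint64_t{1} << (m - 1 - i);

    const std::uint64_t high = std::uint64_t{1} << (m - 1);
    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t j = m, last = m;
        std::uint64_t d = ~std::uint64_t{0};
        while (d != 0) {
            d &= B[data[pos + j - 1]];  // read the window right to left
            --j;
            if (d & high) {             // window suffix is a pattern prefix
                if (j > 0) last = j;    // remember where to realign
                else { matches.push_back(pos); break; }
            }
            d <<= 1;
        }
        pos += last;
    }
    return matches;
}
```

For example, searching for "a?c" in the bytes of "abcaxcac" reports offsets 0 and 3. Patterns longer than 64 bytes would need a multi-word state, which this sketch does not attempt.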

Huge fibonacci modulo m C++

I'm trying to calculate Fn mod m, where Fn is the nth Fibonacci number. n may be really huge, so it's really not efficient to calculate Fn in a straightforward way (matrix exponentiation would work, though). The problem statement asks us to do this without calculating Fn, using the distributive property of the modulo:
(a+b)mod m = [a mod m + b mod m] mod m
(Before anyone asks me, I looked up answers to this same problem. I'd like an answer to my specific question, however, since I'm not asking about the algorithm to solve this problem)
Using this and the fact that the nth Fibonacci number is just the sum of the previous two, I don't need to store Fibonacci numbers, but rather only the results of calculating successive modulo operations. In that sense, I should have an array F of size n which has in it stored the results of iteratively calculating Fn mod m using the above property. I have managed to solve this problem using the following code. However, upon reviewing it, I stumbled upon something that rather confused me.
long long get_fibonacci_huge_mod(long long n, long long m) {
long long Fib[3] = {0, 1, 1};
long long result;
long long index;
long long period;
long long F[n+1];
F[0] = 0;
F[1] = 1;
F[2] = 1;
for (long long i = 3; i <= n; i++) {
F[i] = (F[i-2] + F[i-1]) % m;
if (F[i] == 0 && F[i+1] == 1 && F[i+2] == 1) {
period = i;
break;
}
}
index = n % period;
result = F[index];
return result;
}
This solution outputs correct results for any n and m, even if they are quite large. It might get a little bit slow when n is huge, but I'm not worried about that right now. I'm interested in specifically solving the problem this way. I'll try solving it using matrix exponentiation or any other much faster algorithm later.
So my question is as follows. At the beginning of the code, I create an array F of size n+1. Then I iterate through this array calculating Fn mod m using the distributive property. One thing that confused me after writing this loop was the fact that, since F was initialized to all zeros, how is it correctly using F[i+2], F[i+1], if they haven't been calculated yet? I assume that they are being correctly used since the algorithm outputs correct results every time. Perhaps this assumption is wrong?
My question isn't about the algorithm per se, I'm asking about what's going on inside the loop.
Thank you
This is a faulty implementation of a correct algorithm. Let us look at the corrected version first.
long long get_fibonacci_huge_mod(long long n, long long m) {
long long result;
long long index;
long long period = n+1;
long long sz = max(3LL, min (n+1,m*m+1)); // Bound for period; at least 3 so that F[0..2] exist
long long *F = new long long[sz];
F[0] = 0;
F[1] = 1;
F[2] = 1;
for (long long i = 3; i < sz; i++) {
F[i] = (F[i-2] + F[i-1]) % m;
if (F[i] == 1 && F[i-1] == 0) { // we have got back to where we started
period = i-1;
break;
}
}
index = n % period;
result = F[index] % m; // the seed values F[1] and F[2] are not reduced, which matters for m == 1
delete[]F;
return result;
}
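So why does the original code work? Because you got lucky. The array F is a local array, so it is not initialized to zeros; its elements hold indeterminate values until you write to them, and reading F[i+1] and F[i+2] before they are assigned is undefined behaviour (the last iterations also read past the end of the array). The checks for i+1 and i+2 never evaluated to true because of the lucky garbage the array happened to contain. As a result, the code reduced to the naive evaluation of F(n) mod m without using periodicity at all.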
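The same periodicity idea can also be written without any array, keeping only the last two values. This is a variation on the corrected version above, not the code from the answer; the names pisanoPeriod and fibMod are made up for this sketch, and it assumes n >= 0 and 1 <= m small enough that the sum of two values below m fits in 64 bits.

```cpp
#include <cstdint>

// Number of steps until the pair (F[i], F[i+1]) mod m returns to (0, 1).
std::int64_t pisanoPeriod(std::int64_t m) {
    if (m == 1) return 1;  // everything is 0 mod 1
    std::int64_t prev = 0, curr = 1;
    for (std::int64_t i = 1;; ++i) {
        std::int64_t next = (prev + curr) % m;
        prev = curr;
        curr = next;
        if (prev == 0 && curr == 1) return i;  // back at the start
    }
}

// F(n) mod m, computed as F(n mod period) mod m.
std::int64_t fibMod(std::int64_t n, std::int64_t m) {
    std::int64_t k = n % pisanoPeriod(m);
    std::int64_t prev = 0, curr = 1 % m;
    for (std::int64_t i = 0; i < k; ++i) {
        std::int64_t next = (prev + curr) % m;
        prev = curr;
        curr = next;
    }
    return prev;  // after k steps, prev holds F(k) mod m
}
```

For instance, fibMod(10, 7) gives 6, since F(10) = 55 and 55 mod 7 = 6.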

how to find distinct substrings?

Given a string, and a fixed length l, how can I count the number of distinct substrings whose length is l?
The size of character set is also known. (denote it as s)
For example, given a string "PccjcjcZ", s = 4, l = 3,
then there are 5 distinct substrings:
“Pcc”; “ccj”; “cjc”; “jcj”; “jcZ”
I try to use hash table, but the speed is still slow.
In fact I don't know how to use the character size.
I have done things like this
int diffPatterns(const string& src, int len, int setSize) {
int cnt = 0;
node* table[1 << 15];
int tableSize = 1 << 15;
for (int i = 0; i < tableSize; ++i) {
table[i] = NULL;
}
unsigned int hashValue = 0;
int end = (int)src.size() - len;
for (int i = 0; i <= end; ++i) {
hashValue = hashF(src, i, len);
if (table[hashValue] == NULL) {
table[hashValue] = new node(i);
cnt ++;
} else {
if (!compList(src, i, table[hashValue], len)) {
cnt ++;
};
}
}
for (int i = 0; i < tableSize; ++i) {
deleteList(table[i]);
}
return cnt;
}
Hash tables are fine and practical, but keep in mind that if the length of the substrings is L and the length of the whole string is N, then the algorithm is Theta((N+1-L)*L), which is Theta(NL) for most L. Remember, just computing the hash of one substring takes Theta(L) time. There might also be collisions.
Suffix trees can be used, and provide a guaranteed O(N) time algorithm (count the number of paths at depth L or greater), but the implementation is complicated. The saving grace is that you can probably find an off-the-shelf implementation in the language of your choice.
The idea of using a hashtable is good. It should work well.
The idea of implementing your own hashtable as an array of length 2^15 is bad. See Hashtable in C++? instead.
You can use an unordered_set: insert the substrings into the set and then get the size of the set. Since the values in a set are unique, it takes care of not counting substrings that are the same as ones previously found. This should give you close to O(StringSize - SubstringSize) insertions.
#include <iostream>
#include <string>
#include <unordered_set>
int main()
{
std::string test = "PccjcjcZ";
std::unordered_set<std::string> counter;
size_t substringSize = 3;
for (size_t i = 0; i < test.size() - substringSize + 1; ++i)
{
counter.insert(test.substr(i, substringSize));
}
std::cout << counter.size();
std::cin.get();
return 0;
}
Veronica Kham gave a good answer to the question, but we can improve this method to expected O(n) and still use a simple hash table rather than a suffix tree or any other advanced data structure.
Hash function
Let X and Y be two adjacent substrings of length L of the string A, more precisely:
X = A[i, i + L - 1]
Y = A[i + 1, i + 1 + L - 1]
Let us assign a single non-negative integer to each letter of our alphabet, for example a := 1, b := 2 and so on.
Let's define a hash function h now:
h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M
where P is a prime number ideally greater than the alphabet size and M is a very big number denoting the number of different possible hashes, for example you can set M to maximum available unsigned long long int in your system.
Algorithm
The crucial observation is the following:
If you have a hash computed for X, you can compute a hash for Y in
O(1) time.
Let us assume that we have computed h(X), which can obviously be done in O(L) time. We want to compute h(Y). Notice that X and Y differ by only 2 characters (Y drops A[i] and appends A[j + 1], where j = i + L - 1), so we can do that easily using addition and multiplication:
h(Y) = ((h(X) - P^(L-1) * A[i]) * P + A[j + 1]) % M
Basically, we are subtracting letter A[i] multiplied by its coefficient in h(X), multiplying the result by P in order to get proper coefficients for the rest of letters and at the end, we are adding the last letter A[j + 1].
Notice that we can precompute powers of P at the beginning and we can do it modulo M.
Since our hashing functions returns integers, we can use any hash table to store them. Remember to make all computations modulo M and avoid integer overflow.
Collisions
Of course, a collision might occur, but since P is prime and M is really huge, it is a rare situation.
If you want to lower the probability of a collision, you can use two different hashing functions, for example by using a different modulus in each of them. If the probability of a collision is p using one such function, then for two independent functions it is p^2, and we can make it arbitrarily small by this trick.
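A compact sketch of the whole procedure might look like this. It takes M to be 2^64 by letting unsigned 64-bit arithmetic wrap around, as suggested above; the base 1000003 and the function name countDistinct are choices made for this example. It counts distinct hash values, so a collision would make it undercount; storing the substring positions and comparing on a hash hit would make it exact.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

// Counts distinct substrings of length L by their rolling hash values.
// Arithmetic is done modulo 2^64 through unsigned wrap-around.
std::size_t countDistinct(const std::string& s, std::size_t L) {
    if (L == 0 || s.size() < L) return 0;
    const std::uint64_t P = 1000003;  // prime base, larger than any byte value
    auto val = [](char c) {
        return static_cast<std::uint64_t>(static_cast<unsigned char>(c)) + 1;
    };

    std::uint64_t topPow = 1;  // P^(L-1), the coefficient of the first letter
    for (std::size_t i = 1; i < L; ++i) topPow *= P;

    std::uint64_t h = 0;       // hash of the first window, O(L)
    for (std::size_t i = 0; i < L; ++i) h = h * P + val(s[i]);

    std::unordered_set<std::uint64_t> seen{h};
    for (std::size_t i = L; i < s.size(); ++i) {
        // Drop s[i-L], shift the remaining letters, append s[i]: O(1) per step.
        h = (h - topPow * val(s[i - L])) * P + val(s[i]);
        seen.insert(h);
    }
    return seen.size();
}
```

On the example from the question, countDistinct("PccjcjcZ", 3) returns 5.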
Use Rolling hashes.
This will make the runtime expected O(n).
This might be repeating pkacprzak's answer, except, it gives a name for easier remembrance etc.
Suffix Automaton also can finish it in O(N).
It's easy to code, but hard to understand.
Here are papers about it http://dl.acm.org/citation.cfm?doid=375360.375365
http://www.sciencedirect.com/science/article/pii/S0304397509002370

Calculate floor(pow(2,n)/10) mod 10 - sum of digits of pow(2,n)

This is also a math related question, but I'd like to implement it in C++...so, I have a number in the form 2^n, and I have to calculate the sum of its digits ( in base 10;P ). My idea is to calculate it with the following formula:
sum = (2^n mod 10) + (floor(2^n/10) mod 10) + (floor(2^n/100) mod 10) + ...
for all of its digits: floor(n/floor(log2(10))).
The first term is easy to calculate with modular exponentiation, but I'm in trouble with the others.
Since n is big, and I don't want to use my big integer library, I can't calculate pow(2,n) without modulo. A code snippet for the first term:
while (n--){
temp = (temp << 1) % 10;
};
but for the second I have no idea. I also cannot floor them individually, since it would give '0' (2/10). Is it possible to achieve this?
(http://www.mathblog.dk/project-euler-16/ for the easier solution.) Of course I will look for other way if it cannot be done with this method. (for example storing digits in byte array, as in the comment in the link).
Edit: Thanks for the existing answers, but I am looking for a way to solve it mathematically. I've just come up with one idea, which can be implemented without bignum or digit-vectors; I'm going to test whether it works.
So, I have the equation above for the sum. But 2^n/10^k can be written as 2^n/2^(log2 10^k), which is 2^(n-k*log2 10). Then I take its fractional part and its integer part, and do modular exponentiation on the integer part: 2^(n-k*log2 10) = 2^(floor(n-k*log2 10)) * 2^(fract(n-k*log2 10)). After the last iteration I also multiply it by the fractional part modulo 10. If it doesn't work, or if I'm wrong somewhere in the above idea, I'll stick to the vector solution and accept an answer.
Edit: Ok, it seems doing modular exponentiation with non-integer modulo is not possible(?) (or I haven't found anything about it). So, I'm doing the digit/vector based solution.
The code does NOT work fully!
It does not give the good value: (1390 instead of 1366):
typedef long double ldb;
ldb mod(ldb x, ldb y){ //accepts doubles
ldb c(0);
ldb tempx(x);
while (tempx > y){
tempx -= y;
c++;
};
return (x - c*y);
};
int sumofdigs(unsigned short exp2){
int s = 0;
int nd = floor((exp2) * (log10(2.0))) + 1;
int c = 0;
while (true){
ldb temp = 1.0;
int expInt = floor(exp2 - c * log2((ldb)10.0));
ldb expFrac = exp2 - c * log2((ldb)10.0) - expInt;
while (expInt>0){
temp = mod(temp * 2.0, 10.0 / pow(2.0, expFrac)); //modulo with non integer b:
//floor(a*b) mod m = (floor(a mod (m/b)) * b) mod m, but can't code it
expInt--;
};
ldb r = pow(2.0, expFrac);
temp = (temp * r);
temp = mod(temp,10.0);
s += floor(temp);
c++;
if (c == nd) break;
};
return s;
};
You could create a vector of the digits using some of the techniques mentioned in this other question (C++ get each digit in int) and then just iterate over that vector and add everything up.
In the link you mention, you have the answer which will work as is for any number with n <= 63. So... why do you ask?
If you have to program your own everything then you need to know how to calculate a binary division and handle very large numbers. If you don't have to program everything, get a library for large integer numbers and apply the algorithm shown in the link:
BigNumber big_number;
big_number = 1;
big_number <<= n;
int result = 0;
while(big_number != 0) {
result += big_number % 10;
big_number /= 10;
}
return result;
Now, implementing BigNumber would be fun. From the algorithm we see that you need assignment, shift to left, not equal, modulo and division. A BigNumber class can be fully dynamic and allocate a buffer of integers to make said big number fit. It can also be written with a fixed size (as a template for example). But if you don't have the time, maybe this one will do:
https://mattmccutchen.net/bigint/
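If a full big-integer class is more than you need, note that for this particular problem the only operation required is doubling. A minimal sketch, assuming a plain vector of decimal digits stored least significant first (the function name is made up for this example):

```cpp
#include <vector>

// Sum of the decimal digits of 2^n, computed by doubling a digit vector n times.
int sumOfDigitsOfPowerOfTwo(unsigned n) {
    std::vector<int> digits{1};  // the number 1, least significant digit first
    for (unsigned step = 0; step < n; ++step) {
        int carry = 0;
        for (int& d : digits) {  // schoolbook multiplication by 2
            int t = d * 2 + carry;
            d = t % 10;
            carry = t / 10;
        }
        if (carry) digits.push_back(carry);  // the number grew by one digit
    }
    int sum = 0;
    for (int d : digits) sum += d;
    return sum;
}
```

For n = 1000 this returns 1366, the value the question is looking for; the running time is quadratic in the number of digits, which is fine for exponents of this size.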
I implemented this in JavaScript as below for finding the sum of digits of 2^1000: (Check out working CodePen)
function calculate(){
var num = 0, totalDigits = 1,exponent =0,sum=0,i=0,temp=0, carry;
var arr = ['1'];
//Logic to implement how we multiply in daily life using carry forward method
while(exponent<1000){ //Mention the power
carry=0;
for(var j=arr.length-1;j>=0;j--){
temp = arr[j]*2 + carry;
arr[j]= temp%10;
carry = parseInt(temp/10);
if(carry && !j){
arr = [carry].concat(arr); //if the last nth digit multiplication with 2 yields a carry, increase the space!
}
}
exponent++;
}
for(var i=0;i<arr.length;i++){
sum = sum+parseInt(arr[i]);
}
document.getElementById('result').value = sum; //In my HTML code, I am using result textbox with id as 'result'
//console.log(arr);
//console.log(sum);
}

Find two missing numbers

We have a machine with O(1) memory. In a first pass we feed it n numbers, one by one; then we remove two of the numbers and feed it the remaining n-2 numbers in a second pass.
Write an algorithm that finds the two missing numbers.
It can be done with O(1) memory.
You only need a few integers to keep track of some running sums. The integers do not require log n bits (where n is the number of input integers), they only require 2b+1 bits, where b is the number of bits in an individual input integer.
When you first read the stream, add up all the numbers and all of their squares, i.e. for each input number x, do the following:
sum += x
sq_sum += x*x
Then on the second stream do the same thing with two other accumulators, sum2 and sq_sum2. Let a and b be the missing numbers, and write s = sum - sum2 and q = sq_sum - sq_sum2. Now do the following maths:
s = a + b
q = a^2 + b^2
(a + b)(a + b) = a^2 + b^2 + 2ab
(a + b)(a + b) - (a^2 + b^2) = 2ab
s*s - q = 2ab
(a - b)(a - b) = a^2 + b^2 - 2ab
= q - (s*s - q) = 2q - s*s
sqrt(2q - s*s) = sqrt((a - b)(a - b)) = a - b (taking a >= b)
((a + b) - (a - b)) / 2 = b
(a + b) - b = a
You need 2b+1 bits in all intermediate results because you are storing products of two input integers, and in one case multiplying one of those values by two.
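As a sketch of the sum and sum-of-squares method in C++: the two vectors below only stand in for the two passes of the stream, and the function itself keeps just two running totals. The name findTwoMissing is made up for this example, and the code assumes the inputs are small enough that the squares and 2q - s*s fit in a 64-bit signed integer.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the two values present in `first` but absent from `second`,
// smaller one first. Only two accumulators are kept, independent of n.
std::pair<std::int64_t, std::int64_t> findTwoMissing(
        const std::vector<std::int64_t>& first,
        const std::vector<std::int64_t>& second) {
    std::int64_t s = 0, q = 0;  // s = a + b, q = a^2 + b^2 at the end
    for (std::int64_t v : first)  { s += v; q += v * v; }
    for (std::int64_t v : second) { s -= v; q -= v * v; }

    std::int64_t d2 = 2 * q - s * s;  // (a - b)^2
    std::int64_t d = std::llround(std::sqrt(static_cast<long double>(d2)));
    return { (s - d) / 2, (s + d) / 2 };
}
```

For example, with a first pass of 5, 3, 9, 1, 7 and a second pass of 3, 1, 7, the totals are s = 14 and q = 106, so d2 = 16 and the function returns the pair (5, 9).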
Assuming the numbers range from 1 to N and two of them, x and y, are missing, you can do the following:
Use Gauss's formula: sum = N(N+1)/2
sum - actual_sum = x + y
Use the product of the numbers: product = 1*2*...*N = N!
product / actual_product = x * y
Solve for x and y and you have your missing numbers.
In short: go through the array, summing the elements to get actual_sum and multiplying them to get actual_product. Then solve the two equations for x and y. Note that the product grows like N!, so it overflows fixed-width integers for all but small N.
It cannot be done with O(1) memory.
Assume you have a constant k bits of memory - then you can have 2^k possible states for your algorithm.
However, the input is not limited. Assume there are (2^k) + 1 possible answers for (2^k) + 1 different problem cases; by the pigeonhole principle, you will return the same answer for two problems with different answers, and thus your algorithm is wrong.
The following came to my mind as soon as I finished reading the question. But the answers above suggest that it is not possible with O(1) memory or that there should be a constraint on the range of numbers. Tell me if my understanding of the question is wrong. Ok, so here goes
You have O(1) memory - which means you have constant amount of memory.
When the n numbers are passed to you the first time, just keep adding them in one variable and keep multiplying them in another. So at the end of the first pass you have the sum and product of all the numbers in 2 variables, S1 and P1. You have used 2 variables so far (+1 if you are reading the numbers into memory).
When the (n-2) numbers are passed to you the second time, do the same. Store the sum and product of the (n-2) numbers in 2 other variables, S2 and P2. You have used 4 variables so far (+1 if you are reading the numbers into memory).
If the two missing numbers are x and y, then
x + y = S1 - S2
x*y = P1/P2;
You have two equations in two variables. Solve them.
So you have used a constant amount of memory (independent of n).
#include <stdio.h>
/* arr[] holds the numbers 1..n with two of them missing, so n = size + 2 */
void Missing(int arr[], int size)
{
    int xor_all = 0; /* Will hold xor of all elements and of 1..n */
    int set_bit_no;  /* Will have only the rightmost set bit of xor_all */
    int i;
    int n = size + 2;
    int x = 0, y = 0;
    /* Get the xor of all elements in arr[] and {1, 2 .. n};
       every present number cancels out, leaving x ^ y */
    for(i = 0; i < size; i++)
        xor_all ^= arr[i];
    for(i = 1; i <= n; i++)
        xor_all ^= i;
    /* Get the rightmost set bit in set_bit_no */
    set_bit_no = xor_all & ~(xor_all-1);
    /* Now divide elements in two sets by comparing rightmost set
       bit of xor_all with bit at same position in each element. */
    for(i = 0; i < size; i++)
    {
        if(arr[i] & set_bit_no)
            x = x ^ arr[i]; /* XOR of first set in arr[] */
        else
            y = y ^ arr[i]; /* XOR of second set in arr[] */
    }
    for(i = 1; i <= n; i++)
    {
        if(i & set_bit_no)
            x = x ^ i; /* XOR of first set in arr[] and {1, 2, ...n} */
        else
            y = y ^ i; /* XOR of second set in arr[] and {1, 2, ...n} */
    }
    printf("\n The two missing elements are %d & %d ", x, y);
}
Please look at the solution link below. It explains an XOR method.
This method is more efficient than any of the methods explained above.
It might be the same as Victor above, but there is an explanation as to why this works.
Solution here
Here is the simple solution which does not require any quadratic formula or multiplication:
Let's say B is the sum of the two missing numbers.
The pair of missing numbers will be one of:
(1,B-1),(2,B-2),...,(B-1,1)
Therefore, we know that one of those two numbers is less than or equal to half of B (strictly less when B is even, since the two numbers cannot be the same).
We know that we can calculate B (the sum of both missing numbers).
So, once we have B, we find the sum of all numbers in the list which are less than or equal to B/2 and subtract that from the sum of (1 to B/2) to get the first number. Then we get the second number by subtracting the first number from B. In the code below, rem_sum is B.
public int[] findMissingTwoNumbers(int [] list, int N){
if(list.length == 0 || list.length != N - 2)return new int[0];
int rem_sum = (N*(N + 1))/2;
for(int i = 0; i < list.length; i++)rem_sum -= list[i];
int half = rem_sum/2;
if(rem_sum%2 == 0)half--; //both numbers cannot be the same
int rem_half = getRemHalf(list,half);
int [] result = {rem_half, rem_sum - rem_half};
return result;
}
private int getRemHalf(int [] list, int half){
int rem_half = (half*(half + 1))/2;
for(int i = 0; i < list.length; i++){
if(list[i] <= half)rem_half -= list[i];
}
return rem_half;
}