Boyer Moore k-mismatches algorithm fails


I've written a program for string matching with one mismatch for a problem on a programming website, but it gives a wrong answer. I've worked on it extensively and couldn't find test cases where my code fails. Can somebody provide test cases where it fails? I've done the comparison using the Boyer-Moore-Horspool k-mismatches algorithm, as I understand it to be one of the fastest searching algorithms.
The code is as follows:
int BMSearch_k(string text, string pattern, int tlen, int mlen,int pos)
{
int i, j=0,ready[256],skip2[256][mlen-1],neq;
for(i=0; i<256; ++i) ready[i] = mlen;
for(int a=0; a<256;a++) {
for(i = mlen;i>mlen-k;i--)
skip2[i][a] = mlen;
}
for(i = mlen-2;i>=1;i--) {
for(j=ready[pattern[i]]-1;j>=max(i,mlen-k);j--)
skip2[j][pattern[i]] = j-i;
ready[pattern[i]] = max(i,mlen-k);
}
j = mlen-1+pos;
//cout<<"\n--jafffa--\n"<<pos<<"+"<<mlen<<"="<<j<<endl;
while(j<tlen+k) {
//cout<<"\t--"<<j<<endl;
int h = j;
i=mlen-1;
int neq=0,shift = mlen-k;
while(i>=0&&neq<=k) {
//cout<<"\t--"<<i<<endl;
if(i>=mlen-k)
shift = min(shift,skip2[i][text[h]]);
if(text[h]!= pattern[i])
neq++;
i--;
h--;
}
if(neq<=k)
return j-1;
j += shift;
}
return -1;
}

You aren't initialising your arrays correctly,
int i, j=0,ready[256],skip2[256][mlen-1],neq;
for(i=0; i<256; ++i) ready[i] = mlen;
for(int a=0; a<256;a++) {
for(i = mlen;i>mlen-k;i--)
skip2[i][a] = mlen;
}
On the one hand, you declare skip2 as a 256×(mlen-1) array, on the other hand, you fill it as a (mlen+1)×256 array.
In the next loop,
for(i = mlen-2;i>=1;i--) {
for(j=ready[pattern[i]]-1;j>=max(i,mlen-k);j--)
skip2[j][pattern[i]] = j-i;
ready[pattern[i]] = max(i,mlen-k);
}
you use ready[pattern[i]] before it has been set. I don't know if those mistakes are what's causing the failing testcase, but it's easily imaginable that they are.
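One way to keep the declared shape and the access pattern in agreement is to allocate the table by position first and character second, and to index it only through an unsigned character value. The following is a minimal sketch under those assumptions; the names SkipTable, make_skip_table and char_index are hypothetical and are not part of the original program.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One row per pattern position, 256 columns (one per possible byte value).
using SkipTable = std::vector<std::array<int, 256>>;

// Allocates (mlen + 1) rows, every entry initialised to mlen, so that
// table[i][c] is valid for 0 <= i <= mlen and 0 <= c < 256.
SkipTable make_skip_table(int mlen) {
    std::array<int, 256> row;
    row.fill(mlen);
    return SkipTable(static_cast<std::size_t>(mlen) + 1, row);
}

// Maps a char to a non-negative table index, even where plain char is signed.
inline int char_index(char c) {
    return static_cast<unsigned char>(c);
}
```

With this layout an access such as table[j][char_index(pattern[i])] stays in bounds as long as j is between 0 and mlen.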

If Daniel's suggestions do not solve the problem, here are a couple more things that look odd:
return j-1; // I would expect "return j;" here
This seems odd: if you have k=0 and mlen=1, then the highest value that j can take is tlen+k-1, so the highest return value is tlen-2. In other words, matching the pattern 'a' against the string 'a' will not return a match at position 0.
Another oddity is the loop:
for(i = mlen-2;i>=1;i--) // I would expect "for(i = mlen-2;i>=0;i--)" here
It seems odd that in the preprocessing you never access the first character of your pattern (i.e. pattern[0] is not read).
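To hunt for failing test cases, it can help to compare the fast routine against a slow but obviously correct reference on many small inputs. The sketch below is such a reference, not the Boyer-Moore-Horspool variant itself; the function name naive_k_mismatch and the convention of returning the start index of the first match are assumptions made for illustration.

```cpp
#include <string>

// Returns the start index of the first window of text that differs from
// pattern in at most k positions, or -1 if no such window exists.
int naive_k_mismatch(const std::string& text, const std::string& pattern, int k) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int start = 0; start + m <= n; ++start) {
        int mismatches = 0;
        // Stop comparing this window as soon as the mismatch budget is exceeded.
        for (int i = 0; i < m && mismatches <= k; ++i) {
            if (text[start + i] != pattern[i]) ++mismatches;
        }
        if (mismatches <= k) return start;
    }
    return -1;
}
```

Feeding both functions the same short random strings over a two-letter alphabet and printing the first input on which they disagree usually produces a small counterexample quickly.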


Return the index for the maximum value

I have written the following which gets the index value for the maximum number.
int TravellingSalesMan::getMaximum(double *arr){
double temp = arr[0];
int iterator = 0;
for(int i = 0; i < 30; i++){
if(arr[i] > temp){
iterator = i;
}
}
return iterator;
}
But the function keeps stepping into the conditional statement and keeps printing out 29. I am not sure why this is happening.
I also tried using max_element(), but with no luck.
EDIT
The above function is invoked as follows:
static const unsigned int chromosomes = 30;
double value[chromosomes];
for(int i = 0; i < chromosomes; i++){
value[i] = estimateFitness(currPopultaion[i]);
}
int best = 0;
best = getMaximum(value);
cout<<best<<endl; // this just prints out 29

Okay, so I didn't plan on writing an answer, but I saw too many logical mistakes in the code to fit in the comments section!
First of all, the variable name iterator is misleading. It isn't used to iterate over the list, so why create confusion? Better to use something like max_index.
Why start from i=0? Your temp value is arr[0], so there is no point in checking the first element again. Start from i=1!
temp is pointless in that function. The maximum index should initially be 0, and be set to i whenever some arr[i] is greater than arr[max_index].
Passing the length to the function separately is better practice, as it makes the code clearer.
The content of arr is not modified, so better safe than sorry: make the pointer const.
Rewriting the code, it should be:
int TravellingSalesMan::getMaximum(const double *arr,int len)
{
int max_index = 0;
for(int i = 1; i < len; i++)
{
if(arr[i] > arr[max_index])
max_index = i;
}
return max_index;
}
Worth noting, but unchanged in the code above: len, i, and the function result should arguably all be of an unsigned integer type. There is no reason to allow signed indexing, so declare them as unsigned or just size_t, which makes passing a signed value a warning condition for the caller.

You should be assigning a new value to temp when you find a new maximum.
int TravellingSalesMan::getMaximum(double *arr){
double temp = arr[0];
int iterator = 0;
for(int i = 0; i < 30; i++){
if(arr[i] > temp){
iterator = i;
temp = arr[i]; // this was missing
}
}
return iterator;
}
Without this you are finding the largest index of a value greater than the value at index zero.
A much better solution is to simply use std::max_element instead. Pointers can be used as iterators in most (if not all) algorithms requiring iterators.
#include <algorithm>
#include <iostream>
static const unsigned int chromosomes = 30;
double value[chromosomes];
for (unsigned int i=0; i<chromosomes; ++i) {
value[i] = estimate_fitness(current_population[i]);
}
double *max_elm = std::max_element(value, value + chromosomes);
int best = int(max_elm - value);
std::cout << best << std::endl;
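The same idea can be wrapped in a small helper that works on a std::vector and uses std::distance to turn the returned iterator into an index. This is a minimal sketch; the function name index_of_max and the choice to return 0 for an empty input are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Returns the index of the first largest element, or 0 for an empty vector.
std::size_t index_of_max(const std::vector<double>& values) {
    if (values.empty()) return 0;
    const auto it = std::max_element(values.begin(), values.end());
    return static_cast<std::size_t>(std::distance(values.begin(), it));
}
```

When several elements share the maximum value, std::max_element returns the first of them, so the helper returns the smallest such index.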

C++ Part of brute-force knapsack

reader,
Well, I think I've gotten myself a bit confused.
I'm implementing knapsack, and I realized I have only implemented a brute-force algorithm once or twice, so I decided to write another one.
Here's where I got stuck.
Let W be the maximum weight and w(min) the weight of the lightest element, so we can put at most k = W/w(min) items in the knapsack. I'm explaining this so that you, reader, understand why I'm asking my question.
Now, imagine we have 3 types of things we can put in the knapsack, the knapsack can hold 15 units of mass, and each type weighs as many units as its number. We can then put in 15 things of the 1st type, or 7 things of the 2nd type and 1 thing of the 1st type. But combinations like 22222221 and 12222222 (seven 2s and one 1) mean the same to us, and counting both is a waste of whatever resources we pay for the decision. (That's a joke, because brute force is a waste anyway when a cheaper algorithm exists, but I'm very interested.)
As far as I can tell, the kind of selection we need in order to go through all possible combinations is called "combinations with repetitions". Their number is C'(n,k) = (n+k-1)! / ((n-1)! k!).
(While typing this message I spotted a hole in my theory: we will probably need to add an empty, zero-weight, zero-price item to stand for free space, which probably just increases n by 1.)
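Since factorials overflow quickly, the count C'(n,k) can also be computed without forming any factorial, using the identity C'(n,k) = C(n+k-1, k) and multiplying one factor at a time. The following is a minimal sketch; the function name multiset_count is hypothetical, and the intermediate product can still overflow for very large inputs.

```cpp
#include <cstdint>

// Number of combinations with repetitions: k items chosen from n types,
// order ignored. Equals C(n + k - 1, k).
std::uint64_t multiset_count(std::uint64_t n, std::uint64_t k) {
    if (n == 0) return k == 0 ? 1 : 0;
    std::uint64_t result = 1;
    // After step i, result holds C(n - 1 + i, i), which is an integer,
    // so the division in each step is exact.
    for (std::uint64_t i = 1; i <= k; ++i) {
        result = result * (n - 1 + i) / i;
    }
    return result;
}
```

For 7 types and 4 slots this gives 210, which matches (7+4-1)! / ((7-1)! 4!) = 10! / (6! 4!).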
So, what's the problem?
https://rosettacode.org/wiki/Combinations_with_repetitions
The problem is well described at the link above, but I don't really want to use the stack (recursion) that way. I want to generate the combinations in a single loop, going from i=0 to i<C'(n,k).
So, if I can make it, how does it work?
We have:
int prices[n]; //appear mystically
int weights[n]; // same as previous and I guess we place (0,0) in both of them.
int W, k; // W initialized by our lord and savior
k = W/min(weights);
int road[k], finalroad[k]; //all 0
int curP = 0, curW = 0, maxP = 0, maxW = 0;
for (int i = 0; i < rCombNumber(n, k); i ++) {
/* please help me work out how to generate this mask, which consists of k indices, each from 0 to n (each index identifies an element) */
curW = 0;
for (int j = 0; j < k; j ++)
curW += weights[road[j]];
if (curW < W) {
curP = 0;
for (int l = 0; l < k; l ++)
curP += prices[road[l]];
if (curP > maxP) {
maxP = curP;
maxW = curW;
std::copy(road, road + k, finalroad); // arrays cannot be assigned directly; needs <algorithm>
}
}
}
The mask, road, is an array of indices, each of which can take a value from 0 to n; the masks have to be generated as C'(n,k) (see the link above) from { 0, 1, 2, ... , n }, k elements per selection (combinations with repetitions, where order is unimportant).
That's it. Prove me wrong or help me. Many thanks in advance.
And yes, of course the algorithm will take a huge amount of time, but it looks like it should work, and I'm very interested in it.
UPDATE:
What am I missing?
http://pastexen.com/code.php?file=EMcn3F9ceC.txt

The answer was provided by Minoru here: https://gist.github.com/Minoru/745a7c19c7fa77702332cf4bd3f80f9e
It's enough to increment only one end of the array (the last element in the code below), then propagate all of the carries, remember the leftmost position where a carry happened, compute the reset value as the maximum of the elements before that position, and reset every element from that position onward to it.
Here's my code:
#include <iostream>
using namespace std;
static long FactNaive(int n)
{
long r = 1;
for (int i = 2; i <= n; ++i)
r *= i;
return r;
}
static long long CrNK (long n, long k)
{
long long u, l;
u = FactNaive(n+k-1);
l = FactNaive(k)*FactNaive(n-1);
return u/l;
}
int main()
{
const int numberOFchoices=7,kountOfElementsInCombination=4; // const, so the array size below is a compile-time constant
int arrayOfSingleCombination[kountOfElementsInCombination] = {0,0,0,0};
int leftmostResetPos = kountOfElementsInCombination;
int resetValue=1;
for (long long iterationCounter = 0; iterationCounter<CrNK(numberOFchoices,kountOfElementsInCombination); iterationCounter++)
{
leftmostResetPos = kountOfElementsInCombination;
if (iterationCounter!=0)
{
arrayOfSingleCombination[kountOfElementsInCombination-1]++;
for (int anotherIterationCounter=kountOfElementsInCombination-1; anotherIterationCounter>0; anotherIterationCounter--)
{
if(arrayOfSingleCombination[anotherIterationCounter]==numberOFchoices)
{
leftmostResetPos = anotherIterationCounter;
arrayOfSingleCombination[anotherIterationCounter-1]++;
}
}
}
if (leftmostResetPos != kountOfElementsInCombination)
{
resetValue = 1;
for (int j = 0; j < leftmostResetPos; j++)
{
if (arrayOfSingleCombination[j] > resetValue)
{
resetValue = arrayOfSingleCombination[j];
}
}
for (int j = leftmostResetPos; j != kountOfElementsInCombination; j++)
{
arrayOfSingleCombination[j] = resetValue;
}
}
for (int j = 0; j<kountOfElementsInCombination; j++)
{
cout<<arrayOfSingleCombination[j]<<" ";
}
cout<<"\n";
}
return 0;
}
thanks a lot, Minoru
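The same enumeration can also be phrased as a "next combination" step that needs no carry bookkeeping: find the rightmost element that can still grow, increment it, and copy its new value into everything to its right. This is a different formulation from the carry-based code above, offered as a minimal sketch; the function names are hypothetical, and it assumes n >= 1 and a starting combination of all zeros.

```cpp
#include <cstddef>
#include <vector>

// Advances combo (non-decreasing values in [0, n)) to the next combination
// with repetition in lexicographic order. Returns false when none is left.
bool next_combination_with_repetition(std::vector<int>& combo, int n) {
    for (int i = static_cast<int>(combo.size()) - 1; i >= 0; --i) {
        if (combo[i] < n - 1) {
            const int value = combo[i] + 1;
            // Everything from position i onward restarts at the new value,
            // which keeps the sequence non-decreasing.
            for (std::size_t j = static_cast<std::size_t>(i); j < combo.size(); ++j) {
                combo[j] = value;
            }
            return true;
        }
    }
    return false;
}

// Walks every combination of k items from n types and counts them.
long long count_combinations(int n, int k) {
    std::vector<int> combo(static_cast<std::size_t>(k), 0);
    long long count = 1;  // the all-zero starting combination
    while (next_combination_with_repetition(combo, n)) ++count;
    return count;
}
```

Because each combination is produced from the previous one, the loop over all of them needs neither recursion nor a precomputed total.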

String matching algorithm trying to correct it

I'm trying to implement a brute-force string matching algorithm, but it is not working correctly: I get an out-of-bounds index error.
Here is my algorithm:
int main() {
string s = "NOBODY_NOTICED_HIM";
string pattern="NOT";
int index = 0;
for (int i = 0; i < s.size();)
{
for (int j = 0; j < pattern.size();)
{
if(s[index] == pattern[j])
{
j++;
i++;
}
else
{
index = i;
j = 0;
}
}
}
cout<<index<<endl;
return 0;
}
FIXED VERSION
I fixed the out-of-bounds error, but I don't know if the algorithm will work with different strings.
int main() {
string s = "NOBODY_NOTICED_HIM";
string pattern="NOT";
int index = 0;
int i = 0;
while( i < s.size())
{
i++;
for (int j = 0; j < pattern.size();)
{
if(s[index] == pattern[j])
{
index++;
j++;
cout<<"i is " <<i << " j is "<<j <<endl;
}
else
{
index = i;
break;
}
}
}
cout<<i<<endl;
return 0;
}

The inner for loop keeps looping while j is less than pattern.size(), but you are also incrementing i inside its body. When i goes past s.size(), index also goes out of bounds and you get an out-of-bounds error.
The brute-force method has to test the pattern against every possible substring of the same length as the pattern. All such substrings of s are:
['NOB', 'OBO', 'BOD', 'ODY', 'DY_', 'Y_N', '_NO', 'NOT', 'OTI', 'TIC',
'ICE', 'CED', 'ED_', 'D_H', '_HI', 'HIM']
There are many ways to do it: you can compare char by char, or use string operations such as taking a substring. Both are nice exercises for learning.
Starting at index zero in s, you take the first three chars and compare them to the pattern; if they are equal, you have the answer. Otherwise you move on to the substring starting at index one, and so on.
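The procedure described above can be sketched as follows, using std::string::compare to test each window without copying it. The function name find_first and the convention of returning -1 when there is no match are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>

// Returns the index of the first occurrence of pattern in text, or -1.
int find_first(const std::string& text, const std::string& pattern) {
    if (pattern.size() > text.size()) return -1;
    // Try every start position that leaves room for the whole pattern.
    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start) {
        if (text.compare(start, pattern.size(), pattern) == 0) {
            return static_cast<int>(start);
        }
    }
    return -1;
}
```

For the strings in the question, find_first("NOBODY_NOTICED_HIM", "NOT") returns 7. The bound start + pattern.size() <= text.size() is what keeps every access inside the string.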

Almost same code running much slower

I am trying to solve this problem:
Given a string array words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share common letters. You may assume that each word will contain only lower case letters. If no such two words exist, return 0.
https://leetcode.com/problems/maximum-product-of-word-lengths/
You can create a bitmap of the chars in each word to check whether two words share chars, and then calculate the max product.
I have two almost identical methods, but the first passes the checks while the second is too slow. Can you see why?
class Solution {
public:
int maxProduct2(vector<string>& words) {
int len = words.size();
int *num = new int[len];
// compute the bit O(n)
for (int i = 0; i < len; i ++) {
int k = 0;
for (int j = 0; j < words[i].length(); j ++) {
k = k | (1 << (words[i].at(j) - 'a')); // shift by the letter index 0..25; shifting by the raw char code is undefined
}
num[i] = k;
}
int c = 0;
// O(n^2)
for (int i = 0; i < len - 1; i ++) {
for (int j = i + 1; j < len; j ++) {
if ((num[i] & num[j]) == 0) { // if no common letters
int x = words[i].length() * words[j].length();
if (x > c) {
c = x;
}
}
}
}
delete []num;
return c;
}
int maxProduct(vector<string>& words) {
vector<int> bitmap(words.size());
for(int i=0;i<words.size();++i) {
int k = 0;
for(int j=0;j<words[i].length();++j) {
k |= 1 << (words[i][j] - 'a'); // shift by the letter index 0..25; shifting by the raw char code is undefined
}
bitmap[i] = k;
}
int maxProd = 0;
for(int i=0;i<words.size()-1;++i) {
for(int j=i+1;j<words.size();++j) {
if ( !(bitmap[i] & bitmap[j])) {
int x = words[i].length() * words[j].length();
if ( x > maxProd )
maxProd = x;
}
}
}
return maxProd;
}
};
Why is the second function (maxProduct) too slow for LeetCode?
Solution
The second method calls words.size() repeatedly. If you save the result in a variable, it works fine.

Since my comment turned out to be correct I'll turn my comment into an answer and try to explain what I think is happening.
I wrote some simple code to benchmark on my own machine with two solutions of two loops each. The only difference is the call to words.size() is inside the loop versus outside the loop. The first solution is approximately 13.87 seconds versus 16.65 seconds for the second solution. This isn't huge, but it's about 20% slower.
Even though vector.size() is a constant time operation that doesn't mean it's as fast as just checking against a variable that's already in a register. Constant time can still have large variances. When inside nested loops that adds up.
The other thing that could be happening (someone much smarter than me will probably chime in and let us know) is that you're hurting CPU optimizations like branch prediction and pipelining. Every time it gets to the end of the loop it has to stop, wait for the call to size() to return, and then check the loop variable against that return value. If the CPU can look ahead and guess that j is still going to be less than len because it hasn't seen len change (len isn't even modified inside the loop!), it can make a good branch prediction each time and not have to wait.
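A version with the size read once into a local variable might look like the sketch below. It makes no claim about how much faster this is on any particular compiler or judge; the function name max_product is hypothetical, and the code assumes every word contains only the lowercase letters 'a' to 'z', as the problem statement says.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Largest product of the lengths of two words that share no letters.
int max_product(const std::vector<std::string>& words) {
    const std::size_t n = words.size();  // read the size once, outside the loops
    std::vector<unsigned> masks(n, 0u);
    for (std::size_t i = 0; i < n; ++i) {
        for (char c : words[i]) {
            masks[i] |= 1u << (c - 'a');  // bit 0 for 'a' up to bit 25 for 'z'
        }
    }
    int best = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            if ((masks[i] & masks[j]) == 0) {  // no common letters
                const int product = static_cast<int>(words[i].size() * words[j].size());
                if (product > best) best = product;
            }
        }
    }
    return best;
}
```

Writing the outer loop as i < n with the inner loop starting at i + 1 also avoids evaluating words.size() - 1, which wraps around to a huge unsigned value when the vector is empty.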

DP approach for minimum number of characters that should be added to make a string palindrome?

Find the minimum number of characters that must be added to make S a palindrome. For instance, if S = "fft", the string should be changed to "tfft", adding only 1 character.
Now, I used the DP approach for solving this problem, which is as follows:
Let the given input string be S[1.....L]. Then for any substring S[i....j] of the input string, we can find the minimum insertions as:
min_insertions(S[i+1 ...... j-1]) [if S[i] is equal to S[j]]
min(min_insertions(S[i+1......j]), min_insertions(S[i....j-1])) + 1 [otherwise]
I coded this as follows:
#include <iostream>
#include <cstring> // strlen
#include <algorithm> // min
using namespace std;
int dp[100][100];
int main (void)
{
int n,i,j;
char arr[100];
cin>>arr;
n = strlen(arr);
//cout<<"You entered the string as "<<arr<<"\n";
for (i = 0; i < n; i++ )
dp[i][0] = 0;
for ( i = 0; i < n; i++ )
{
for ( j = 0; j < n; j++ )
{
if (arr[i] == arr[j])
dp[i][j] = dp[i+1][j-1];
else
dp[i][j] = min(dp[i+1][j],dp[i][j-1])+1;
}
// cout<<dp[0][n-1];
}
cout<<dp[0][n-1]<<"\n";
return 0;
}
However, this gives a wrong value. Why is this happening? For example, if I enter the string abc, it outputs 1. Is there anything wrong with my logic?

You are not filling your dp array in the right order. For instance, dp[0][2] will ask for dp[1][2], which has not been computed yet.
So the assigning loop should fill the table by increasing substring length, in the form:
for (int h = 0; h < n; h++) {
for (int i = 0; i < n-h; i++) {
dp[i][i+h] = .. // your part
}
}
You also need to be careful about the case h = 0 above, where you don't want to read dp[i+1][i] or dp[i][i-1], and the case h = 1 with arr[i] == arr[i+1], where you don't want to read dp[i+1][i].
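Putting the recurrence and the fill order together, a complete version might look like the following sketch. It fills the table by increasing substring length so that every entry it reads has already been computed, and it treats the short substrings explicitly; the function name min_insertions follows the question's notation, and the use of std::string and std::vector instead of fixed-size arrays is an assumption for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum number of characters to insert into s to make it a palindrome.
int min_insertions(const std::string& s) {
    const int n = static_cast<int>(s.size());
    if (n == 0) return 0;
    // dp[i][j] = answer for the substring s[i..j]; single chars need 0.
    std::vector<std::vector<int>> dp(static_cast<std::size_t>(n),
                                     std::vector<int>(static_cast<std::size_t>(n), 0));
    for (int len = 2; len <= n; ++len) {           // substring length
        for (int i = 0; i + len - 1 < n; ++i) {    // start index
            const int j = i + len - 1;             // end index
            if (s[i] == s[j]) {
                // For length 2 the inner substring is empty and costs 0.
                dp[i][j] = (len == 2) ? 0 : dp[i + 1][j - 1];
            } else {
                dp[i][j] = std::min(dp[i + 1][j], dp[i][j - 1]) + 1;
            }
        }
    }
    return dp[0][n - 1];
}
```

With this order, min_insertions("fft") is 1 and min_insertions("abc") is 2, since "abc" needs two characters (for example "cbabc").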
