Adapting Boyer-Moore Implementation - c++

I'm trying to adapt the Boyer-Moore c(++) Wikipedia implementation to get all of the matches of a pattern in a string. As it is, the Wikipedia implementation returns the first match. The main code looks like:
char* boyer_moore (uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
int i;
int delta1[ALPHABET_LEN];
int *delta2 = malloc(patlen * sizeof(int));
make_delta1(delta1, pat, patlen);
make_delta2(delta2, pat, patlen);
i = patlen-1;
while (i < stringlen) {
int j = patlen-1;
while (j >= 0 && (string[i] == pat[j])) {
--i;
--j;
}
if (j < 0) {
free(delta2);
return (string + i+1);
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
return NULL;
}
I have tried to modify the block after if (j < 0) to add the index to an array/vector and letting the outer loop continue, but it doesn't appear to be working. In testing the modified code I still only get a single match. Perhaps this implementation wasn't designed to return all matches, and it needs more than a few quick changes to do so? I don't understand the algorithm itself very well, so I'm not sure how to make this work. If anyone can point me in the right direction I would be grateful.
Note: The functions make_delta1 and make_delta2 are defined earlier in the source (check Wikipedia page), and the max() function call is actually a macro also defined earlier in the source.

The Boyer-Moore algorithm exploits the fact that when searching for, say, "HELLO WORLD" within a longer string, the letter you find at a given position restricts what can be found around that position if there is to be a match at all. It is a bit like the game Battleship: if you find open sea four cells from the border, you needn't test the four remaining cells in case a 5-cell carrier is hiding there; it can't be.
If you found, for example, a 'D' in the eleventh position, it might be the last letter of HELLO WORLD; but if you found a 'Q', which appears nowhere in HELLO WORLD, the searched-for string can't be anywhere in the first eleven characters, and you can avoid searching there altogether. An 'L', on the other hand, might mean that HELLO WORLD is there, starting at position 11-3 (the third letter of HELLO WORLD is an L), 11-4, or 11-10.
When searching, you keep track of these possibilities using the two delta arrays.
So when you find a match, you ought to do something like this:
if (j < 0)
{
// Found a match occupying positions i+1 .. i+patlen
// Store it in the index array; check we don't overflow it,
// leaving room for the terminating 0.
if (index_counter + 1 >= index_size)
{
free(delta2);
index[index_counter] = 0;
return index_size;
}
index[index_counter++] = i+1;
// Reinitialize j to restart search
j = patlen-1;
// The inner loop left i one position before the match, so this puts
// the window end one position past where it was; the +1 is what
// allows overlapping matches to be found.
i += patlen + 1;
// Do not free delta2
// free(delta2);
// Continue loop without altering i again
continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;
This should return a zero-terminated list of indexes, provided you pass something like a size_t *indexes to the function.
The function will then return 0 (not found), index_size (too many matches) or the number of matches between 1 and index_size-1.
This makes it possible, for example, to fetch additional matches without repeating the whole search for the (index_size-1) substrings already found: you enlarge the indexes array by new_num entries with realloc, then call the function again with the array at offset old_index_size-1, new_num as the new size, and the haystack starting one character past the offset of the match at index old_index_size-1 (not, as I wrote in a previous revision, past the whole length of the needle; see comment).
This approach will report also overlapping matches, for example searching ana in banana will find b*ana*na and ban*ana*.
UPDATE
I tested the above and it appears to work. I modified the Wikipedia code by adding these two includes to keep gcc from grumbling
#include <stdio.h>
#include <string.h>
then I modified the if (j < 0) to simply output what it had found
if (j < 0) {
printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
//free(delta2);
// return (string + i+1);
i += patlen + 1;
j = patlen - 1;
continue;
}
and finally I tested with this
int main(void)
{
char *s = "This is a string in which I am going to look for a string I will string along";
char *p = "string";
boyer_moore(s, strlen(s), p, strlen(p));
return 0;
}
and got, as expected:
Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along
If the string contains two overlapping sequences, BOTH are found:
char *s = "This is an andean andeandean andean trouble";
char *p = "andean";
Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble
To avoid overlapping matches, the quickest way is simply not to store the overlaps. It could be done inside the search itself, but that would mean reinitializing the first delta vector and updating the string pointer; we would also need a second index, i2, to keep the saved indexes from becoming non-monotonic. It isn't worth it. Better:
if (j < 0) {
// We have found a patlen match at i+1
// Is it an overlap?
if (index && (indexes[index] + patlen > i+1)) // previous match extends past the start of this one
{
// Yes, it is. So we don't store it.
// We could store the last of several overlaps
// It's not exactly trivial, though:
// searching 'anana' in 'Bananananana'
// finds FOUR matches, and the fourth is NOT overlapped
// with the first. So in case of overlap, if we want to keep
// the LAST of the bunch, we must save info somewhere else,
// say last_conflicting_overlap, and check twice.
// Then again, the third match (which is the last to overlap
// with the first) would overlap with the fourth.
// So the "return as many non overlapping matches as possible"
// is actually accomplished by doing NOTHING in this branch of the IF.
}
else
{
// Not an overlap, so store it.
indexes[++index] = i+1;
if (index == max_indexes) // Too many matches already found?
break; // Stop searching and return found so far
}
// Adapt i and j to keep searching
i += patlen + 1;
j = patlen - 1;
continue;
}
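
For readers who want something self-contained to experiment with, here is a minimal sketch that collects all matches into a std::vector. It is not the Wikipedia code: it swaps in the simpler Horspool variant (bad-character shift only, no delta2 table), and the name findAll and the allowOverlap flag are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the start offsets of every match of pat in text.
// With allowOverlap == false, the search resumes after the end of each match.
std::vector<std::size_t> findAll(const std::string& text, const std::string& pat,
                                 bool allowOverlap = true)
{
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return hits;

    // Horspool shift table: for each byte, the distance from its last
    // occurrence in pat (ignoring the final position) to the end of pat.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t k = 0; k + 1 < m; ++k)
        shift[static_cast<unsigned char>(pat[k])] = m - 1 - k;

    std::size_t pos = 0; // start of the current alignment
    while (pos + m <= n) {
        // Compare right to left, as Boyer-Moore does.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) {
            hits.push_back(pos);
            pos += allowOverlap ? 1 : m; // keep searching instead of returning
        } else {
            pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
        }
    }
    return hits;
}
```

Calling findAll("banana", "ana") yields offsets 1 and 3, while findAll("banana", "ana", false) yields only offset 1.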

My code won't print when submitted for codecheck even though it compiles without error

I've been assigned this question for my lab (and yes I understand there will be backlash because it's homework). I've been working on this question for a couple of days to no avail and I feel like I'm missing something glaringly obvious.
My code:
int processSuitors(vector<int>& currentSuitors, list<int>& rekt)
{
int sizeSuitors = currentSuitors.size();
int eliminated = 2;
while(sizeSuitors != 1)
{
rekt.push_back(currentSuitors[eliminated]);
currentSuitors.erase(currentSuitors.begin() + eliminated);
sizeSuitors--;
if(eliminated > sizeSuitors)
{
eliminated -= sizeSuitors;
}
}
return currentSuitors[0];
}
Prompt:
In an ancient land, the beautiful princess Eve had many suitors. She decided on the following procedure to determine which suitor she would marry. First, all of the suitors would be lined up one after the other and be assigned numbers. The first suitor would be number 1, the second number 2, and so on up to the last suitor, number n. Starting at the first suitor she would then count three suitors down the line (because of the three letters in her name) and the third suitor would be eliminated from winning her hand and he would be removed from the line. Eve would then continue, counting three more suitors and eliminating every third suitor. When she reached the end of the line she would continue counting from the beginning.
Write a function named processSuitors that takes as arguments an STL vector of type int containing the suitors, and an STL list of type int that will collect all the suitors that are eliminated. The function returns an int storing the position a suitor should stand in to marry the princess if there are n suitors. The function that calls processSuitors will send the vector already filled with n suitors (1, 2, 3... n), and an empty list that needs to be filled with the position number of the suitors that were eliminated, in the order they were eliminated.
Restrictions: You may not create any containers (no arrays, no vectors, etc.); you need to use the vector and the list that are passed as parameters.
Use ONLY the following STL functions:
vector::size
vector::erase
vector::begin
list::push_back
vector::operator[]
The adjacent files are hidden since we are to rely on what is given. Any clean-up of my code would be extremely appreciated as well.
What do you think of this solution?
Keep another vector that marks whether an index in your currentSuitors vector has been removed. Then have a helper function that will always find the next free index.
Instead of trying to reduce currentSuitors, you just keep marking elements in the taken list.
size_t findNextFreeSlot(const vector<bool>& taken, size_t pos)
{
// increment to the next candidate position
pos = (pos + 1) % taken.size();
// search for the first free slot
for (size_t i = 0; i < taken.size(); i++)
{
if (taken[pos] == false)
{
return pos;
}
pos = (pos + 1) % taken.size();
}
// assert(false); // we should never get here as long as there's one free slot index in taken
return -1;
}
int processSuitors(vector<int>& currentSuitors, list<int>& rekt)
{
size_t len = currentSuitors.size();
vector<bool> taken(len); // keep a vector of eliminated indices from current
size_t index = len - 1; // start at the last element so the first advance wraps around to index 0
size_t eliminated = 0;
if (len == 0)
{
return -1;
}
while (eliminated < (len-1))
{
// advance the index three times to the next "untaken" index
index = findNextFreeSlot(taken, index);
index = findNextFreeSlot(taken, index);
index = findNextFreeSlot(taken, index);
taken[index] = true; // claim this index as taken
rekt.push_back(currentSuitors[index]); // add the value at this index to the eliminated list
eliminated++;
}
index = findNextFreeSlot(taken, index); // find the last free index
return currentSuitors[index];
}
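
Since the assignment forbids creating extra containers, here is an alternative sketch that stays within the listed functions (size, erase, begin, push_back, operator[]). It keeps the question's approach of erasing from the vector and only changes how the next index is computed. Returning -1 for an empty vector is an assumption; the prompt does not say what should happen in that case.

```cpp
#include <cstddef>
#include <list>
#include <vector>

int processSuitors(std::vector<int>& currentSuitors, std::list<int>& rekt)
{
    if (currentSuitors.size() == 0)
        return -1; // assumption: signal "no suitors" with -1

    std::size_t index = 0; // where counting starts
    while (currentSuitors.size() > 1) {
        // Counting three suitors from index lands on index + 2,
        // wrapping around the end of the line.
        index = (index + 2) % currentSuitors.size();
        rekt.push_back(currentSuitors[index]);
        currentSuitors.erase(currentSuitors.begin() + index);
        // After erase, index already refers to the suitor that followed
        // the eliminated one, so counting resumes from there.
    }
    return currentSuitors[0];
}
```

With five suitors (1 to 5), this eliminates 3, 1, 5, 2 in that order and returns 4.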

Add character at every vowel

I had a similar question earlier and managed to solve it. Now I'm trying to do it backwards; this is my code at the moment.
Basically I want the end result to be "p5A5SSW5o5R5o5d", but when I run this it just finds the first "o", which is in the first part of the string. I need it to skip the vowels it has already added a 5 before and after.
I want every vowel (I only included AOao, as those are all the vowels appearing in my string) to have a 5 as prefix and suffix. It gets stuck at the first o and doesn't proceed to the next o. I have created a nested for loop: it takes the first character in the encrypt string, proceeds to the inner for loop, and loops through every vowel I've included in the vowel string until it finds a match. Otherwise it returns to the outer for loop, incremented by one. On the first pass it should check the letter "p", on the second pass the letter "A", and so on.
Result: p5A5SSW55o55rod
Expected Result: p5A5SSW5o5R5o5d
In the end I will also want to rotate all the characters, but that's for another task; I think I can just use either an if statement or a switch to do that. If it ends up on a 5, do nothing, otherwise rotate.
I hope I made myself clear and provided you with all the relevant information, otherwise just holler in the comments.
Thanks in advance.
#include <string>
#include <sstream>
const string vowel = "AOao";
string encrypt, decrypt;
encrypt = "pASSWoRod";
decrypt = encrypt;
for (int i=0; i<encrypt.length(); i++){
for (int j=0; j<vowel.length(); j++){
if (encrypt[i] == vowel[j]){
decrypt.insert(decrypt.find(vowel[j]), 1, '5');
decrypt.insert(decrypt.find(vowel[j]) + 1, +1, '5');
}
}
}
return decrypt;
}
find, when not provided a starting point, always finds the first instance.
Searching and keeping track of the string length and where you've already inserted characters is much harder than it seems at first glance (as you've noticed).
Build the result from scratch instead of inserting characters into an initial string.
Also, implement this function (actual implementation left as an exercise):
bool is_vowel(char c);
and then
std::string encrypt = "pASSWoRod";
std::string decrypt;
for (auto c: encrypt)
{
if (is_vowel(c))
{
decrypt += '5';
decrypt += c;
decrypt += '5';
}
else
{
decrypt += c;
}
}
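
Putting the pieces above together, a complete version might look like the sketch below. The body of is_vowel is only one possible implementation (the answer leaves it as an exercise), and the wrapper name wrapVowels is made up for this illustration.

```cpp
#include <string>

// One possible is_vowel: membership test against a fixed vowel list.
bool is_vowel(char c)
{
    static const std::string vowels = "AEIOUaeiou";
    return vowels.find(c) != std::string::npos;
}

// Hypothetical wrapper: builds the result from scratch, surrounding
// every vowel with '5' on both sides.
std::string wrapVowels(const std::string& input)
{
    std::string result;
    for (char c : input) {
        if (is_vowel(c)) {
            result += '5';
            result += c;
            result += '5';
        } else {
            result += c;
        }
    }
    return result;
}
```

For the string in the question, wrapVowels("pASSWoRod") gives "p5A5SSW5o5R5o5d".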
find with one argument starts from the beginning. You would need the other find.
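However, by maintaining an index into the string being modified, one no longer needs find at all.
// encrypt and decrypt start out equal; encryptI tracks the position
// in encrypt that corresponds to position i in decrypt
int encryptI = 0;
for (int i = 0; i < decrypt.length(); i++, encryptI++){
for (int j=0; j<vowel.length(); j++){
if (decrypt[i] == vowel[j]){
//encryptI = encrypt.find(vowel[j], encryptI);
encrypt.insert(encryptI, 1, '5'); // 5 before the vowel
encrypt.insert(encryptI + 2, 1, '5'); // 5 after the vowel, which has moved to encryptI + 1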
encryptI += 2;
break;
}
}
}
The code could be nicer.
string encrypt;
for (int i = 0; i < decrypt.length(); i++){
char ch = decrypt[i];
if (vowel.find(ch) != string::npos) {
encrypt += '5';
encrypt += ch;
encrypt += '5';
} else {
encrypt += ch;
}
}
}
Your find is stopping at the first instance of the vowel, 'o'. You should include a starting position for your find so you ignore the part of the decrypt string you've already analyzed.
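
As a rough illustration of searching with a starting position, the sketch below uses std::string::find_first_of with its second argument so that each search resumes after the characters just inserted. The function name wrapVowelsInPlace is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inserts '5' before and after every character of
// text that appears in vowels, resuming each search past the last insert.
std::string wrapVowelsInPlace(std::string text, const std::string& vowels)
{
    std::size_t pos = text.find_first_of(vowels);
    while (pos != std::string::npos) {
        text.insert(pos, 1, '5');     // the vowel moves to pos + 1
        text.insert(pos + 2, 1, '5'); // right after the vowel
        // Skip "5", the vowel and "5" before searching again.
        pos = text.find_first_of(vowels, pos + 3);
    }
    return text;
}
```

With the vowel list from the question, wrapVowelsInPlace("pASSWoRod", "AOao") gives "p5A5SSW5o5R5o5d".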

Searching for an exact string match in a (arbitrary large) stream - C++

I am building a simple multi-server for string matching. I handle multiple clients at the same time using sockets and select. The server's only job is this: a client connects to the server and sends a needle (of size less than 10 GB) and a haystack (of arbitrary size) as a stream through a network socket. Both needle and haystack are arbitrary binary data.
The server needs to search the haystack for all occurrences of the needle (as an exact string match) and send the number of matches back to the client. It needs to process clients on the fly and handle any input in reasonable time (that is, the search algorithm has to have linear time complexity).
To do this I obviously need to split the haystack into small parts (possibly smaller than the needle) in order to process them as they arrive through the network socket. That is, I need a search algorithm that can handle a string split into parts and search it the same way strstr(...) does.
I could not find any standard C or C++ library function, nor a Boost library object, that could handle a string in parts. If I am not mistaken, the algorithms in strstr(), string.find() and Boost searching/knuth_morris_pratt.hpp can only search when the whole haystack is in one contiguous block of memory. Or is there some trick I am missing that I could use to search a string in parts? Do you know of any C/C++ library that can cope with such large needles and haystacks, or that can handle haystack streams or search a haystack in parts?
I did not find any useful library by googling, so I was forced to create my own variation of the Knuth-Morris-Pratt algorithm that is able to remember its own state (shown below). However, I do not find it to be an optimal solution: in my opinion a well-tuned string searching algorithm would surely perform better, and it would mean less worry about debugging later.
So my question is:
Is there some more elegant way to search a large haystack stream in parts, other than creating my own search algorithm? Is there any trick for using the standard C string library for this? Is there some C/C++ library that is specialized for this kind of task?
Here is (part of) the code of my modified KMP algorithm:
#include <cstdlib>
#include <cstring>
#include <cstdio>
class knuth_morris_pratt {
const char* const needle;
const size_t needle_len;
const int* const lps; // longest proper prefix that is also a suffix (skip table)
// suffix_len is the length of the longest suffix of the haystack seen so far
// that matches some prefix of the needle. suffix_len must be shorter than
// needle_len. It is carried over from one haystack part to the next.
size_t suffix_len;
size_t match_count; // a number of needles found in haystack
public:
inline knuth_morris_pratt(const char* needle, size_t len) :
needle(needle), needle_len(len),
lps( build_lps_array() ), suffix_len(0),
match_count(len == 0 ? 1 : 0) { }
inline ~knuth_morris_pratt() { free((void*)lps); }
void search_part(const char* haystack_part, size_t hp_len); // processes a given part of the haystack stream
inline size_t get_match_count() { return match_count; }
private:
const int* build_lps_array();
};
// Worst case complexity: linear space, linear time
// see: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
// see article: KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R., 1977, Fast pattern matching in strings
void knuth_morris_pratt::search_part(const char* haystack_part, size_t hp_len) {
if(needle_len == 0) {
match_count += hp_len;
return;
}
const char* hs = haystack_part;
size_t i = 0; // index for txt[]
size_t j = suffix_len; // index for pat[]
while (i < hp_len) {
if (needle[j] == hs[i]) {
j++;
i++;
}
if (j == needle_len) {
// a needle found
match_count++;
j = lps[j - 1];
}
else if (i < hp_len && needle[j] != hs[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
suffix_len = j;
}
const int* knuth_morris_pratt::build_lps_array() {
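// Allocate one int per needle character (at least one, since new_lps[0] is always written)
int* const new_lps = (int*)malloc((needle_len ? needle_len : 1) * sizeof(int));
// check_cond_fatal(new_lps != NULL, "Unable to allocate memory in knuth_morris_pratt(..)");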
// length of the previous longest prefix suffix
size_t len = 0;
new_lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
size_t i = 1;
while (i < needle_len) {
if (needle[i] == needle[len]) {
len++;
new_lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = new_lps[len - 1];
// Also, note that we do not increment
// i here
}
else // if (len == 0)
{
new_lps[i] = 0;
i++;
}
}
}
return new_lps;
}
int main()
{
const char* needle = "lorem";
const char* p1 = "sit voluptatem accusantium doloremque laudantium qui dolo";
const char* p2 = "rem ipsum quia dolor sit amet";
const char* p3 = "dolorem eum fugiat quo voluptas nulla pariatur?";
knuth_morris_pratt searcher(needle, strlen(needle));
searcher.search_part(p1, strlen(p1));
searcher.search_part(p2, strlen(p2));
searcher.search_part(p3, strlen(p3));
printf("%d \n", (int)searcher.get_match_count());
return 0;
}
You can have a look at BNDM, which has the same performance as KMP:
O(m) for preprocessing
O(n) for matching.
It is used in nrgrep, the sources of which can be found here and which contain C sources.
The C source for the BNDM algorithm is here.
See here for more information.
If I have understood your problem correctly, you want to check whether a large std::string received part by part contains a substring.
If that is the case, I think you can store, at each iteration, the overlapping section between two contiguous received packets. Then you just have to check at each iteration whether either the overlap or the packet contains the pattern.
In the example below, I consider the following contains() function to search a pattern in a std::string:
bool contains(const std::string & str, const std::string & pattern)
{
bool found(false);
if(!pattern.empty() && (pattern.length() <= str.length()))
{
for(size_t i = 0; !found && (i <= str.length()-pattern.length()); ++i)
{
if((str[i] == pattern[0]) && (str.substr(i, pattern.length()) == pattern))
{
found = true;
}
}
}
return found;
}
Example:
std::string pattern("something"); // The pattern we want to find
std::string end_of_previous_packet(""); // The first part of overlapping section
std::string beginning_of_current_packet(""); // The second part of overlapping section
std::string overlap; // The string to store the overlap at each iteration
bool found(false);
while(!found && !all_data_received()) // stop condition
{
// Get the current packet
std::string packet = receive_part();
// Set the beginning of the current packet
beginning_of_current_packet = packet.substr(0, pattern.length());
// Build the overlap
overlap = end_of_previous_packet + beginning_of_current_packet;
// If the overlap or the packet contains the pattern, we found a match
if(contains(overlap, pattern) || contains(packet, pattern))
found = true;
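// Keep the last pattern.length() characters seen so far for the next iteration
// (this also works when a packet is shorter than the pattern)
std::string tail = end_of_previous_packet + packet;
end_of_previous_packet = (tail.length() > pattern.length()) ? tail.substr(tail.length()-pattern.length()) : tail;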
}
Of course, in this example I made the assumption that the method receive_part() already exists. Same thing for the all_data_received() function. It is just an example to illustrate the idea.
I hope it will help you to find a solution.
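
For comparison with the overlap approach, here is a compact sketch of the state-carrying idea from the question, written with std::string and std::vector so that no manual memory management is needed. The class name StreamMatcher and its interface are invented for this illustration, it assumes C++17 for std::string_view, and treating an empty needle as "no matches" is an assumption that differs from the question's code.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical streaming KMP matcher: counts occurrences of a needle in a
// haystack that arrives in chunks of arbitrary size.
class StreamMatcher {
public:
    explicit StreamMatcher(std::string needle)
        : needle_(std::move(needle)), lps_(needle_.size(), 0)
    {
        // lps_[i] = length of the longest proper prefix of needle_[0..i]
        // that is also a suffix of it.
        std::size_t len = 0;
        for (std::size_t i = 1; i < needle_.size();) {
            if (needle_[i] == needle_[len]) {
                lps_[i++] = ++len;
            } else if (len != 0) {
                len = lps_[len - 1];
            } else {
                lps_[i++] = 0;
            }
        }
    }

    // Feed the next chunk; the match state survives between calls.
    void feed(std::string_view chunk)
    {
        if (needle_.empty()) return; // assumption: empty needle never matches
        for (char c : chunk) {
            while (matched_ > 0 && needle_[matched_] != c)
                matched_ = lps_[matched_ - 1];
            if (needle_[matched_] == c) ++matched_;
            if (matched_ == needle_.size()) {
                ++count_;
                matched_ = lps_[matched_ - 1]; // allow overlapping matches
            }
        }
    }

    std::size_t count() const { return count_; }

private:
    std::string needle_;
    std::vector<std::size_t> lps_;
    std::size_t matched_ = 0; // needle characters matched so far
    std::size_t count_ = 0;   // complete matches seen so far
};
```

Feeding it the three parts from the question's main() with the needle "lorem" gives a count of 3, including the match that straddles the first two parts. Each input byte is examined once, so the total work stays linear in the haystack size regardless of how the stream is chunked.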

Character pointers messed up in simple Boyer-Moore implementation

I am currently experimenting with a very simple Boyer-Moore variant.
In general my implementation works, but if I try to utilize it in a loop the character pointer containing the haystack gets messed up. And I mean that characters in it are altered, or mixed.
The result is consistent, i.e. running the same test multiple times yields the same screw up.
This is the looping code:
string src("This haystack contains a needle! needless to say that only 2 matches need to be found!");
string pat("needle");
const char* res = src.c_str();
while((res = boyerMoore(res, pat)))
++res;
This is my implementation of the string search algorithm (the above code calls a convenience wrapper which pulls the character pointer and length of the string):
unsigned char*
boyerMoore(const unsigned char* src, size_t srcLgth, const unsigned char* pat, size_t patLgth)
{
if(srcLgth < patLgth || !src || !pat)
return nullptr;
size_t skip[UCHAR_MAX + 1]; //this is the skip table, one entry per possible byte value
for(int i = 0; i <= UCHAR_MAX; ++i)
skip[i] = patLgth; //initialize it with default value
for(size_t i = 0; i < patLgth; ++i)
skip[(int)pat[i]] = patLgth - i - 1; //set skip value of chars in pattern
std::cout<<src<<"\n"; //just to see what's going on here!
size_t srcI = patLgth - 1; //our first character to check
while(srcI < srcLgth)
{
size_t j = 0; //char match ct
while(j < patLgth)
{
if(src[srcI - j] == pat[patLgth - j - 1])
++j;
else
{
//since the number of characters to skip may be negative, I just increment in that case
size_t t = skip[(int)src[srcI - j]];
if(t > j)
srcI = srcI + t - j;
else
++srcI;
break;
}
}
if(j == patLgth)
return (unsigned char*)&src[srcI + 1 - j];
}
return nullptr;
}
The loop produced this output (i.e. these are the haystacks the algorithm received):
This haystack contains a needle! needless to say that only 2 matches need to be found!
eedle! needless to say that only 2 matches need to be found!
eedless to say that eed 2 meed to beed to be found!
As you can see the input is completely messed up after the second run. What am I missing? I thought the contents could not be modified, since I'm passing const pointers.
Is the way of setting the pointer in the loop wrong, or is my string search screwing up?
Btw: This is the complete code, except for includes and the main function around the looping code.
EDIT:
The missing nullptr of the first return was due to a copy/paste error, in the source it is actually there.
For clarification, this is my wrapper function:
inline const char* boyerMoore(const string &src, const string &pat)
{
return (const char*) boyerMoore((const unsigned char*) src.c_str(), src.size(),
(const unsigned char*) pat.c_str(), pat.size());
}
In your boyerMoore() function, the first return isn't returning a value (you have just return; rather than return nullptr;). GCC doesn't always warn about missing return values, and not returning anything is undefined behavior. That means that when you store the return value in res and call the function again, there's no telling what will print out. You can see a related discussion here.
Also, you have omitted your convenience function that calculates the length of the strings that you are passing in. I would recommend double checking that logic to make sure the sizes are correct - I'm assuming you are using strlen or similar.
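
One more thing worth checking, offered as a likely cause rather than a confirmed diagnosis: the wrapper takes const string &src, but the loop passes res, which is a const char*. That call constructs a temporary std::string, and the pointer returned from the wrapper points into that temporary's buffer, which is destroyed at the end of the expression. A sketch of a loop that avoids this by always searching the original string and tracking offsets is shown below; std::string::find stands in for the search routine, and the name findAllOffsets is made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the offsets of all matches by searching
// the same std::string repeatedly, so no temporary copies are involved.
std::vector<std::size_t> findAllOffsets(const std::string& src, const std::string& pat)
{
    std::vector<std::size_t> offsets;
    if (pat.empty()) return offsets;
    std::size_t pos = src.find(pat);
    while (pos != std::string::npos) {
        offsets.push_back(pos);
        pos = src.find(pat, pos + 1); // resume one character past the match
    }
    return offsets;
}
```

The same pattern works with the four-argument boyerMoore() from the question by passing src.c_str() + offset and src.size() - offset directly instead of going through the wrapper.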

Do I need more space?

I have code that is supposed to separate a string into 3 length sections:
ABCDEFG should be ABC DEF G
However, I have an extremely long string and I keep getting the
terminate called without an active exception
When I cut the length of the string down, it seems to work. Do I need more space? I thought when using a string I didn't have to worry about space.
int main ()
{
string code, default_Code, start_C;
default_Code = "TCAATGTAACGCGCTACCCGGAGCTCTGGGCCCAAATTTCATCCACT";
start_C = "AUG";
code = default_Code;
for (double j = 0; j < code.length(); j++) { //insert spacing here
code.insert(j += 3, 1, ' ');
}
cout << code;
return 0;
}
Think about the case when code.length() == 2. You're inserting a space somewhere past the end of the string. I'm not sure, but it might be okay with for(int j=0; j+3 < code.length(); j++).
This is some fairly confusing code. You are looping through a string and looping until you reach the end of the string. However, inside the loop you are not only modifying the string you are looping through, but you also change the loop variable when you say j += 3.
It happens to work for any string with a multiple of 3 letters, but you are not correctly handling other cases.
Here is a working example of the for loop that is a bit clearer about what it's doing:
// We skip 4 each time because we added a space.
for (int j = 3; j < code.length(); j += 4)
{
code.insert(j, 1, ' ');
}
You are using an extremely inefficient method for this operation. Every time you insert a space you move the whole remaining part of the string forward, which means the total number of operations is on the order of O(n^2).
You can instead do this transformation in a single O(n) pass by using a read-write approach:
// input string is assumed to be non-empty
std::string new_string((old_string.size()*4-1)/3, ' ');
int writeptr = 0, count = 0;
for (int readptr=0,n=old_string.size(); readptr<n; readptr++) {
new_string[writeptr++] = old_string[readptr];
if (++count == 3 && readptr+1 < n) { // no separator after the last character
count = 0;
new_string[writeptr++] = ' ';
}
}
A similar algorithm can also be written to work in place instead of creating a new string: first enlarge the string, then work backward.
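
A minimal sketch of that in-place variant might look like this. The function name groupInPlace and the group parameter are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: inserts a space after every `group` characters,
// in place, by growing the string once and copying from the back.
void groupInPlace(std::string& s, std::size_t group = 3)
{
    const std::size_t n = s.size();
    if (n == 0 || group == 0) return;

    const std::size_t spaces = (n - 1) / group; // no trailing separator
    s.resize(n + spaces, ' ');

    std::size_t read = n;         // one past the next character to read
    std::size_t write = s.size(); // one past the next slot to write
    // Once write catches up with read, the remaining prefix is already in place.
    while (read > 0 && write > read) {
        s[--write] = s[--read];
        if (read % group == 0 && read != 0)
            s[--write] = ' '; // separator in front of each full group boundary
    }
}
```

Applied to "ABCDEFG" this produces "ABC DEF G". Working backward guarantees that the write position never overtakes characters that have not been read yet, so each character is moved at most once.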
Note also that, while it's true that with a string you don't need to care about allocation and deallocation, there are still limits on the size of a string object (even if you are probably not hitting them; your version is so slow that it would take forever to get to that point on a modern computer).