Horspool algorithm for multiple occurrences of the same pattern - c++

I've implemented the Horspool algorithm in C++ (based on Introduction to the Design and Analysis of Algorithms by Anany Levitin, 2nd edition, p. 258) to find the position of the first occurrence of a pattern in a text. However, I want to extend the algorithm to find multiple occurrences of the same pattern, and I got stuck on that part. You can see my code below:
The function calculates and returns the position of the first occurrence of the pattern in the text. The shift sizes are stored in ShiftTable, which is indexed by the characters of the alphabet. Additionally, the integer counter counts the total number of comparisons between pattern and text characters; it is initially zero. How could I extend this to find multiple occurrences of the same pattern?
I attempted the following in the body of main(); it works, but it's NOT EFFICIENT. When an occurrence of the pattern is found, its position is printed and the part of the text that ends with that occurrence is erased. The program then searches the remaining text for the pattern, and so on.
int counter=0;
while ((position = Find(pattern,text,ShiftTable,counter)) != -1) {
cout << position << endl;
text = text.erase(0,position+m);
}
Any ideas?

Currently you always start at the beginning (i = m - 1). If you want to resume a previous search, just pass in the last position to start from.
In the following I’ve removed the counter variable – what’s the use of that anyway?
int Find(string pattern, string text, int *ShiftTable, int start = 0)
… and …
i = start + m - 1,
… and just call the code as follows:
while ((position = Find(pattern,text,ShiftTable,position)) != -1) {
cout << position << endl;
++position;
}
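Since the original Find isn't shown, here is a minimal self-contained sketch of the same idea. The names BuildShiftTable and FindAll are made up for illustration, and the shift table is assumed to be indexed by byte value (one entry per possible unsigned char) rather than by a custom alphabet. Find takes a start offset, and FindAll resumes one position past each match so that overlapping matches are reported too.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: one shift entry per possible byte value.
using ShiftTable = std::array<std::size_t, UCHAR_MAX + 1>;

// For each byte, how far the pattern may slide when that byte is the text
// character aligned with the pattern's last position.
ShiftTable BuildShiftTable(const std::string& pattern)
{
    ShiftTable table;
    table.fill(pattern.size());
    for (std::size_t k = 0; k + 1 < pattern.size(); ++k)
        table[static_cast<unsigned char>(pattern[k])] = pattern.size() - 1 - k;
    return table;
}

// Returns the index of the first match at or after `start`, or -1 if none.
int Find(const std::string& pattern, const std::string& text,
         const ShiftTable& shiftTable, std::size_t start = 0)
{
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    if (m == 0 || m > n)
        return -1;
    std::size_t i = start + m - 1;  // text index aligned with the pattern's last char
    while (i < n) {
        std::size_t k = 0;          // characters matched so far, right to left
        while (k < m && pattern[m - 1 - k] == text[i - k])
            ++k;
        if (k == m)
            return static_cast<int>(i - m + 1);
        i += shiftTable[static_cast<unsigned char>(text[i])];
    }
    return -1;
}

// Collects every match position by resuming the search after each hit.
std::vector<int> FindAll(const std::string& pattern, const std::string& text)
{
    const ShiftTable table = BuildShiftTable(pattern);
    std::vector<int> positions;
    int position = 0;
    while ((position = Find(pattern, text, table,
                            static_cast<std::size_t>(position))) != -1) {
        positions.push_back(position);
        ++position;  // step past the match start; overlapping matches are kept
    }
    return positions;
}
```

If overlapping matches are not wanted, advancing by the pattern length instead of by one should do it. Unlike the erase-based loop, the reported positions stay relative to the original text.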

Related

Word search in an array of sentences in C++

How can I write a program that finds how many times a word entered from the keyboard is repeated in a sentence?
I am assuming you want to write a program that counts how many times a (case-sensitive) word has been repeated in an array of sentences. We'll go step-by-step.
#include <string>
#include <vector>
/**
 * Counts the number of times a word has been repeated in multiple sentences.
 *
 * @param sentences An array of string sentences.
 * @param word The word to be searched and counted.
 * @returns no. of occurrences of `word` in `sentences`.
 */
auto count(const std::vector<std::string>& sentences, const std::string& word) -> std::size_t
{
    // Initialize the no. of occurrences of `word` to 0
    std::size_t word_count{ 0 };
    // Store `word`'s C-string
    auto word_cstr = word.c_str();
    // Loop through all the sentences to check for `word`'s count in each sentence
    for (const auto& sentence : sentences)
    {
        // Find the first occurrence of `word` in `sentence`
        std::size_t found{ sentence.find(word_cstr) };
        // Increase the count as long as the word is occurring in the sentence
        while (found < std::string::npos)
        {
            ++word_count;
            // Search for the next occurrence by starting from the
            // current index's next index till the end of the word
            found = sentence.find(word_cstr, found+1, word.size());
        }
    }
    // Return the final total no. of occurrences of `word` in all `sentences`
    return word_count;
}
Of course, there is (probably) a (much) better solution out there. But this is what I could come up with.
PS: Do note that this search is case-sensitive, and counts all occurrences of the word in the sentences. For example, counting the occurrences of the word "code" in "Coders gonna code the entire codebase." will return 2 ("code" and "codebase").
Alternatively, you can loop through the sentences one by one, split each sentence into words, and increase the count only when the current word in the sentence is the same as the given word.
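A rough sketch of that split-and-compare alternative, assuming words are separated by whitespace only. The function name count_whole_words is made up here. Punctuation stays attached to a token, so "code." would not match "code" unless you strip it first.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts whitespace-separated tokens that are exactly equal to `word`.
std::size_t count_whole_words(const std::vector<std::string>& sentences,
                              const std::string& word)
{
    std::size_t word_count{ 0 };
    for (const auto& sentence : sentences)
    {
        // operator>> splits the sentence on whitespace
        std::istringstream stream{ sentence };
        std::string token;
        while (stream >> token)
        {
            if (token == word)
                ++word_count;
        }
    }
    return word_count;
}
```

With this version, "Coders gonna code the entire codebase." counts "code" once, since "codebase." is a different token.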

Does substr change the position where the find function starts searching?

I have a char * named search_text containing the following text:
ABC_NAME = 'XYZSomeone' AND ABC_CLASS = 'XYZSomething'
I want to display the "ABC_NAME" value from that string.
Here is what I am doing:
std::cout << std::string(search_text).substr ( 12, std::string( search_text ).find ("'", 13 )-1) << std::endl;
My logic for the substr call above is as follows:
The ABC_NAME value always begins at the 12th character, so start the substring there.
Search for the character ' (single quotation mark) starting from the 13th character (the second argument of find()). The resulting number will be the outer bound of the substr.
However, my code prints out the following:
XYZSomeone' AND ABC_C
However, when I try to display the value of the find() function directly, I do get the correct number for the location of the second ' (single quotation mark)
std::cout << std::string( search_text ).find ("'", 13 ) << std::endl;
This prints out:
22
So why is substr not treating the value 22 as the end of the substring?
It's a rather simple matter to evaluate your expression by hand, seeing how you already verified the result of find:
std::string(search_text).substr ( 12, std::string( search_text ).find ("'", 13 )-1)
std::string("ABC_NAME = 'XYZSomeone' AND ABC_CLASS = 'XYZSomething'").substr ( 12, 22-1)
Now check the documentation for substr: "Returns a substring [pos, pos+count)". The character at position 12 is the 'X' for the name portion, and the character at position 12+21 = 33 is the 'L' from the class portion. So we expect the substring starting at that 'X' and going up to just before that 'L', which is "XYZSomeone' AND ABC_C". Check.
(It is understandable to forget whether substr takes a length or a position at which to end. Different languages do disagree on this. Hence the link to the documentation.)
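To make the length-versus-end-position distinction concrete, here is a small sketch. The helper name slice is hypothetical; it assumes first <= last <= s.size() and turns a half-open index range into the (pos, count) pair that substr expects.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the characters in the half-open range
// [first, last). Assumes first <= last <= s.size().
std::string slice(const std::string& s, std::size_t first, std::size_t last)
{
    // substr takes a start position and a COUNT, so convert the end index
    return s.substr(first, last - first);
}
```

For the string in the question, slice(text, 12, 22) stops right before the closing quotation mark, whereas text.substr(12, 21) keeps going for 21 characters.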
Unsolicited commentary
Trying to do so much in one line makes your code harder to read and harder to debug. In this case, it also hurts performance. There is no need to convert search_text to a std::string twice.
std::string search_string{search_text};
std::size_t found = search_string.find('\'', 12);
if ( found != std::string::npos )
found -= 12;
std::cout << search_string.substr(12, found) << std::endl;
This cuts the number of times a string is constructed (hence the times the string data is copied) from three to two.
If you are using C++17, you can improve the performance even more by constructing no strings. Just use std::string_view instead of std::string. For this scenario, it has the same member functions taking the same parameters; all you have to change is the type of search_string. This puts the performance on par with C code.
Even better: since string views are so cheap to create, you could even write your code – without a performance hit – so that it doesn't matter whether substr takes a length or takes the past-the-end position.
std::string_view search_string{search_text};
std::string_view ltrimmed = search_string.substr(12);
std::size_t found = ltrimmed.find('\'');
std::cout << ltrimmed.substr(0, found) << std::endl;
Constructive laziness FTW!

Need help implementing a certain logic that will fill a text to a certain width.

The task is to justify text within a certain width.
user inputs: Hello my name is Harrry. This is a sample text input that nobody will enter.
output: What text width do you want?
user inputs: 15
output: |Hello  my  name|
|is Harrry. This|
|is   a   sample|
|text       that|
|nobody     will|
|enter.         |
Basically, each line has to be 15 characters wide, including blank spaces. Also, if the next word can't fit into the 15 characters, it is moved entirely to the next line. If there are multiple words in a line, the spaces are distributed evenly between the words. See the line that says "is a sample" for example.
I read the input with getline(...) and the entire text is saved in a vector. However, I'm stuck on moving forward. I tried using multiple for loops, but I just can't seem to break the lines or even out the spacing at all.
Again, I'm not looking for or expecting anyone to solve this, but I'd appreciate it if you could point me in the right direction in terms of the logic/algorithm I should think about.
You should consider this dynamic programming solution.
Split text into “good” lines
Since we don't know where to break the lines for good justification, we start by guessing where each break should go. (That is, for each pair of adjacent words we guess whether to break between them and make the second word the start of the next line.)
Notice something? We brute-force!
Also note that if no word is small enough to fit in the remaining space on the current line, we insert spaces between the words on that line. So the spacing in the current line depends on which words go into the next or previous line. That's dependency!
You are brute-forcing and you have dependencies: that calls for DP!
Now let's define a state that identifies our position on the path to solving this problem.
State: [i : j], which denotes the line made of the ith through jth words of the original input sequence.
Now that you have a state for the problem, let us define how these states are related.
Since all our subproblem states are just piles of words, we can't simply compare the words in each state to determine which one is better. Here, "better" means that the line uses its width to hold as many characters as possible, with as little space between the words as possible. So we define a parameter that measures how good the words from the ith to the jth are as a line (recall our definition of the subproblem state). This amounts to evaluating each subproblem state.
A simple comparison factor would be :
Define badness(i, j) for line of words[i : j].
For example,
Infinity if total length > page width,
else (page width − total length of words in current line)^3
To make things even simpler, consider only suffixes of the given text when applying this algorithm. This reduces the DP table size from N*N to N.
To finish, let's make clear what we want in DP terms:
subproblem = min. badness for suffix words[i :]
⇒ no. of subproblems = Θ(n), where n = no. of words
guessing = where to end the first line, say i : j
⇒ no. of choices for j = n − i = O(n)
recurrence relation between the subproblems:
• DP[i] = min(badness(i, j) + DP[j] for j in range(i + 1, n + 1))
• DP[n] = 0
⇒ time per subproblem = Θ(n)
so, total time = Θ(n^2).
Also, I'll leave it to you to work out how to insert spaces between the words once the words in each line are determined.
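The recurrence can be sketched roughly as follows. The function name BreakLines is made up, and one detail is an assumption on my part: the line length used for badness counts one space between adjacent words, so that a line is only considered feasible if it actually fits. The function returns the index of the first word of each line, or an empty vector if some word is wider than the page.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Returns the index of the first word on each line, chosen to minimise the
// sum of (width - line length)^3 over all lines (the last line included).
std::vector<std::size_t> BreakLines(const std::vector<std::string>& words,
                                    std::size_t width)
{
    const std::size_t n = words.size();
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> dp(n + 1, inf);      // dp[i] = min badness of words[i:]
    std::vector<std::size_t> next(n + 1, n); // next[i] = start of the following line
    dp[n] = 0.0;
    for (std::size_t i = n; i-- > 0;) {
        std::size_t length = 0;  // words[i..j) joined by single spaces
        for (std::size_t j = i + 1; j <= n; ++j) {
            length += words[j - 1].size() + (j - 1 > i ? 1 : 0);
            if (length > width)
                break;           // badness is infinite from here on
            const double slack = static_cast<double>(width - length);
            const double cost = slack * slack * slack + dp[j];
            if (cost < dp[i]) {
                dp[i] = cost;
                next[i] = j;
            }
        }
    }
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < n; i = next[i]) {
        if (dp[i] == inf)
            return {};           // some word does not fit on any line
        starts.push_back(i);
    }
    return starts;
}
```

For the words "aaa", "bb", "cc", "ddddd" and width 6, this prefers aaa / bb cc / ddddd (badness 27 + 1 + 1) over the greedy aaa bb / cc / ddddd (badness 0 + 64 + 1).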
Logic would be:
1) Put words in array
2) Loop through the array of words
3) Count the number of chars in each word, and keep adding words while the total is the text width or less (stop before the word that would exceed the text width). Remember how many words made up the total before going over 15 (for example, remember that it took 3 words to get 9 characters, leaving room for 6 spaces)
4) Divide the number of spaces required by (number of words - 1)
5) Write those words, writing the same number of spaces each time.
Should give the desired effect I hope.
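Those steps could look roughly like this. It is only a sketch: the name JustifyGreedy is made up, leftover spaces go to the leftmost gaps, and the last line is assumed to be left-aligned and padded on the right. A word longer than the width is emitted on its own line and simply overflows.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Greedily packs words into lines of `width` characters and pads the gaps.
std::vector<std::string> JustifyGreedy(const std::vector<std::string>& words,
                                       std::size_t width)
{
    std::vector<std::string> lines;
    std::size_t i = 0;
    while (i < words.size()) {
        // Step 3: take as many words as fit, counting one space per gap.
        std::size_t j = i + 1;
        std::size_t chars = words[i].size();  // characters used by words only
        while (j < words.size() && chars + words[j].size() + (j - i) <= width) {
            chars += words[j].size();
            ++j;
        }
        // Steps 4 and 5: spread the remaining spaces over the gaps.
        std::size_t spaces = width > chars ? width - chars : 0;
        const bool lastLine = (j == words.size());
        std::string line;
        for (std::size_t k = i; k < j; ++k) {
            line += words[k];
            if (k + 1 == j)
                break;                         // no gap after the last word
            const std::size_t gapsLeft = j - 1 - k;
            // Round up so that any leftover spaces go to the leftmost gaps.
            const std::size_t pad =
                lastLine ? 1 : (spaces + gapsLeft - 1) / gapsLeft;
            line.append(pad, ' ');
            spaces -= pad;
        }
        if (line.size() < width)
            line.append(width - line.size(), ' ');  // right-pad short lines
        lines.push_back(line);
        i = j;
    }
    return lines;
}
```

Wrapping each returned line in | characters gives the kind of output shown in the question.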
You obviously have some idea how to solve this, as you have already produced the sample output.
Perhaps re-solve your original problem writing down in words what you do in each step....
e.g.
Print text asking for sentence.
Take input
Split input into words.
Print text asking for width.
...
If you are stuck at any level, then expand the details into sub-steps.
I would look to separate the problem of working out a sequence of words which will fit onto a line.
Then how many spaces to add between each of the words.
Below is an example for printing one line after you find how many words to print and what is the starting word of the line.
std::cout << "|";
numOfSpaces = lineWidth - numOfCharsUsedByWords;
/*
 * If we have three words |word1 word2 word3| in a line,
 * the number of gaps between them is one less than the number of words
 */
int spaceChunks = numOfWordsInLine - 1;
/*
 * Print the words from the starting point up to the number of words
 * you can print in a line
 */
for (j = 0; j < numOfWordsInLine; ++j) {
    /*
     * Calculation for the number of spaces to print
     * after every word
     */
    int spacesToPrint = 0;
    if (spaceChunks <= 1) {
        /*
         * if there are one/two words then there is at most one
         * chunk of spaces between them, so fill it up
         */
        spacesToPrint = numOfSpaces;
    } else {
        /*
         * Here it's just segmenting a number into chunks
         * example: segment 7 into 3 parts, will become 3 + 2 + 2
         * 7 to 3 = (7%3) + (7/3) = 1 + 2 = 3
         * 4 to 2 = (4%2) + (4/2) = 0 + 2 = 2
         * 2 to 1 = (2%1) + (2/1) = 0 + 2 = 2
         */
        spacesToPrint = (numOfSpaces % spaceChunks) + (numOfSpaces / spaceChunks);
    }
    numOfSpaces -= spacesToPrint;
    spaceChunks--;
    std::cout << words[j + lineStartIdx];
    for (int space = 0; space < spacesToPrint; space++) {
        std::cout << " ";
    }
}
std::cout << "|" << std::endl;
Hope this code helps. You also need to consider what happens if the width is set to less than the length of the longest word.

getline() Adding Character to Front of String? -- Actually substr syntax error

I'm writing a program that will balance Chemistry Equations; I thought it'd be a good challenge and help reinforce the information I've recently learned.
My program is set up to use getline(cin, std::string) to receive the equation. From there it separates the equation into two halves: a left side and right side by making a substring when it encounters a =.
I'm having issues which only concerns the left side of my string, which is called std::string leftSide. My program then goes into a for loop that iterates over the length of leftSide. The first condition checks to see if the character is uppercase, because chemical formulas are written with the element symbols and a symbol consists of either one upper case letter, or an upper case and one lower case letter. After it checks to see if the current character is uppercase, it checks to see if the next character is lower case; if it's lower case then I create a temporary string, combine leftSide[index] with leftSide[index+1] in the temp string then push the string to my vector.
My problem lies on the first iteration; I've been using CuFe3 = 8 (right side doesn't matter right now) to test it out. The only thing stored in std::string temp is C. I'm not sure why this happening; also, I'm still getting numbers in my final answer and I don't understand why. Some help fixing these two issues, along with an explanation, would be greatly appreciated.
[CODE]
int index = 0;
for (it = leftSide.begin(); it!=leftSide.end(); ++it, index++)
{
    bool UPPER_LETTER = isupper(leftSide[index]);
    bool NEXT_LOWER_LETTER = islower(leftSide[index+1]);
    if (UPPER_LETTER)// if the character is an uppercase letter
    {
        if (NEXT_LOWER_LETTER)
        {
            string temp = leftSide.substr(index, (index+1));//add THIS capital and next lowercase
            elementSymbol.push_back(temp); // add temp to vector
            temp.clear(); //used to try and fix problem initially
        }
        else if (UPPER_LETTER && !NEXT_LOWER_LETTER) //used to try and prevent number from getting in
        {
            string temp = leftSide.substr(index, index);
            elementSymbol.push_back(temp);
        }
    }
    else if (isdigit(leftSide[index])) // if it's a number
        num++;
}
[EDIT] When I entered in only ASDF, *** ***S ***DF ***F was the output.
string temp = leftSide.substr(index, (index+1));
substr takes the first index and then a length, rather than first and last indices. You want substr(index, 2). Since in your example index is 0 you're doing: substr(index, 1) which creates a string of length 1, which is "C".
string temp = leftSide.substr(index, index);
Since index is 0 this is substr(index, 0), which creates a string of length 0, that is, an empty string.
When you're processing parts of the string with a higher index, such as Fe in "CuFe3" the value you pass in as the length parameter is higher and so you're creating strings that are longer. F is at index 2 and you call substr(index, 3), which creates the string "Fe3".
Also the standard library usually uses half open ranges, so even if substr took two indices (which, again, it doesn't) you would do substr(index, index+2) to get a two character string.
bool NEXT_LOWER_LETTER = islower(leftSide[index+1]);
You might want to check that index+1 is a valid index. If you don't want to do that manually you might at least switch to using the bounds checked function at() instead of operator[].
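Putting those points together, a corrected loop might look something like the sketch below. The function name ExtractSymbols is made up; it assumes a symbol is one uppercase letter optionally followed by one lowercase letter, and it simply skips digits and everything else.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Collects element symbols (e.g. "Cu", "Fe", "O") from one side of an equation.
std::vector<std::string> ExtractSymbols(const std::string& formula)
{
    std::vector<std::string> symbols;
    for (std::size_t index = 0; index < formula.size(); ++index)
    {
        const unsigned char current = static_cast<unsigned char>(formula[index]);
        if (!std::isupper(current))
            continue;  // digits, lowercase letters and spaces are skipped
        // Only look at index + 1 when it is a valid index.
        const bool nextIsLower =
            index + 1 < formula.size() &&
            std::islower(static_cast<unsigned char>(formula[index + 1]));
        // substr takes (position, length): 2 chars for "Cu", 1 for "C".
        symbols.push_back(formula.substr(index, nextIsLower ? 2 : 1));
    }
    return symbols;
}
```

The cast to unsigned char is there because passing a negative char value to isupper/islower is undefined behaviour.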

Finding all possible common substrings from a file consisting of strings using c++

I am trying to find all possible common strings from a file consisting of strings of various lengths. Can anybody help me out?
E.g input file is sorted:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAATTAGGCTGGG
AAAAAAAATTGAAACATCTATAGGTC
AAAAAAACTCTACCTCTCT
AAAAAAACTCTACCTCTCTATACTAATCTCCCTACA
and my desired output is:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAATTAGGCTGGG
AAAAAAAATTGAAACATCTATAGGTC
AAAAAAACTCTACCTCTCTATACTAATCTCCCTACA
[EDIT] Each line which is a substring of any other line should be removed.
Basically, compare each line with the next line: if the next line is shorter, or if the next line truncated to the current line's length is not equal to the current line, then the current line is unique. This can be done in a single linear pass because the list is sorted: any entry that begins with the current entry will come right after it.
A non-algorithmic optimization (micro-optimization) is to avoid substr, which creates a new string. We can compare against the other string as though it were truncated, without actually creating a truncated string.
vector<string> unique_lines;
// Stop one before the end: each iteration looks at lines[j + 1].
for (size_t j = 0; j + 1 < lines.size(); ++j)
{
    const string& line = lines[j];
    const string& next_line = lines[j + 1];
    // If the line is not a prefix of the next line,
    // add it to the list of unique lines. compare() checks the first
    // line.size() characters of next_line without building a new string.
    if (line.size() >= next_line.size() ||
        next_line.compare(0, line.size(), line) != 0)
        unique_lines.push_back(line);
}
// The last line is guaranteed to not be a substring of any
// previous line as the lines are sorted.
if (!lines.empty())
    unique_lines.push_back(lines.back());
// The desired output will be contained in 'unique_lines'.
As I understand it, you want to remove every string that is a substring of another string.
For that, you can use the strstr function to check whether one string is a substring of another.
Hope this helps.
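A brute-force sketch of that idea, using std::string::find in place of strstr (the two serve the same purpose here). The function name RemoveContainedLines is made up. It compares every pair of lines, so it is quadratic in the number of lines, but it does not rely on the input being sorted. Of two identical lines, it keeps the first.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Keeps only the lines that are not a substring of some other line.
std::vector<std::string> RemoveContainedLines(const std::vector<std::string>& lines)
{
    std::vector<std::string> kept;
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        bool contained = false;
        for (std::size_t j = 0; j < lines.size() && !contained; ++j)
        {
            if (i == j)
                continue;
            const bool found = lines[j].find(lines[i]) != std::string::npos;
            // Of two identical lines, only the later one counts as contained.
            const bool duplicate = lines[i] == lines[j];
            contained = found && (!duplicate || j < i);
        }
        if (!contained)
            kept.push_back(lines[i]);
    }
    return kept;
}
```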
Well, this is probably not the fastest way to solve your problem, but it seems easy to implement. You keep a histogram of chars that serves as the signature of a string. For each string that you read (separated by spaces), you count the occurrences of each char and store the string in your answer only if no other string has the same count for every char. Let me illustrate:
aaa bbb aabb ab aaa
Here we have just two possible input letters, so we just need a histogram of size 2.
aaa - hist[0] = 3, hist[1] = 0 : New one - add to the answer
bbb - hist[0] = 0, hist[1] = 3 : New one - add to the answer
aabb - hist[0] = 2, hist[1] = 2 : New one - add to the answer
ab - hist[0] = 1, hist[1] = 1 : New one - add to the answer
aaa - hist[0] = 3, hist[1] = 0 : Already exists! Don't add to the answer.
The bottleneck of your implementation will be the histogram comparisons, and there are many possible implementations for them.
The simplest one would be a linear search, iterating through all your previous answers and comparing each with the current histogram, which would be O(1) to store and O(n) to search. If you have a big file, it could take hours to finish.
A faster one, but a lot more troublesome to implement, would use a hash table to store your answers, and use the histogram signature to generate the hash code. It would be too troublesome to explain this approach here.
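For what it's worth, a hash-table version can be fairly short with std::unordered_set. The sketch below makes a few assumptions: the histogram has one slot per byte value, the hash is a simple multiply-and-add I picked for illustration, and the names Histogram, HistogramHash and UniqueBySignature are invented. Note that two strings with the same character counts (for example "ab" and "ba") share a signature, so only the first of them is kept.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// One counter per possible byte value.
using Histogram = std::array<int, UCHAR_MAX + 1>;

// Illustrative hash: folds all counters into one value.
struct HistogramHash
{
    std::size_t operator()(const Histogram& hist) const
    {
        std::size_t seed = 0;
        for (int count : hist)
            seed = seed * 31 + static_cast<std::size_t>(count);
        return seed;
    }
};

// Keeps the first string seen for each distinct character histogram.
std::vector<std::string> UniqueBySignature(const std::vector<std::string>& words)
{
    std::unordered_set<Histogram, HistogramHash> seen;
    std::vector<std::string> answer;
    for (const auto& word : words)
    {
        Histogram hist{};  // all counters start at zero
        for (unsigned char c : word)
            ++hist[c];
        // insert() reports whether this signature was new
        if (seen.insert(hist).second)
            answer.push_back(word);
    }
    return answer;
}
```

Lookups are then O(1) on average instead of O(n) per string.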