How to search a string for multiple substrings - c++

I need to check a short string for matches with a list of substrings. Currently, I do this like shown below (working code on ideone)
bool ContainsMyWords(const std::wstring& input)
{
if (std::wstring::npos != input.find(L"white"))
return true;
if (std::wstring::npos != input.find(L"black"))
return true;
if (std::wstring::npos != input.find(L"green"))
return true;
// ...
return false;
}
int main() {
std::wstring input1 = L"any text goes here";
std::wstring input2 = L"any text goes here black";
std::cout << "input1 " << ContainsMyWords(input1) << std::endl;
std::cout << "input2 " << ContainsMyWords(input2) << std::endl;
return 0;
}
I have 10-20 substrings that I need to match against an input. My goal is to optimize code for CPU utilization and reduce time complexity for an average case. I receive input strings at a rate of 10 Hz, with bursts to 10 kHz (which is what I am worried about).
There is agrep library with source code written in C, I wonder if there is a standard equivalent in C++. From a quick look, it may be a bit difficult (but doable) to integrate it with what I have.
Is there a better way to match an input string against a set of predefined substrings in C++?

The best thing is to use a regular expression search, if you use the following regular expression:
"(white)|(black)|(green)"
that way, with only one pass over the string, you'll get in group 1 if a match was found for the "white" substring (and beginning and end points), in group 2 if a match of the "black" substring (and beginning and end points), and in group 3 if a match of the "green" substring. As you get, from group 0 the position of the end of the match, you can begin a new search to look for more matches, and everything in one pass over the string!!!

You could use one big if, instead of several if statements. However, Nathan's Oliver solution with std::any_of is faster than that though, when making the array of the substrings static (so that they do not get to be recreated again and again), as shown below.
bool ContainsMyWordsNathan(const std::wstring& input)
{
// do not forget to make the array static!
static std::wstring keywords[] = {L"white",L"black",L"green", ...};
return std::any_of(std::begin(keywords), std::end(keywords),
[&](const std::wstring& str){return input.find(str) != std::string::npos;});
}
PS: As discussed in Algorithm to find multiple string matches:
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

Related

How do I make an alphabetized list of all distinct words in a file with the number of times each word was used?

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.
I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.
// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows
that, once his program is working correctly, good output is a must. Few people
really care how much time and trouble a programmer has spent in designing and
debugging a program. Most people see only the results. Often, by the time a
programmer has finished tackling a difficult problem, any output may look
great. The programmer knows what it means and how to interpret it. However,
the same cannot be said for others, or even for the programmer himself six
months hence.
string lines;
getline(input, lines); // Stores what is in file into the string
I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.
It's rather simple, std::map automatically sorts based on key in the key/value pair you get. The key/value pair represents word/count which is what you need. You need to do some filtering for special characters and such.
EDIT: std::stringstream is a nice way of splitting std::string using whitespace delimiter as it's the default delimiter. Therefore, using stream >> word you will get whitespace-separated words. However, this might not be enough due to punctuation. For example: Often, has comma which we need to filter out. Therefore, I used std::replaceif which replaces puncts and digits with whitespaces.
Now a new problem arises. In your example, you have: "must.Few" which will be returned as one word. After replacing . with we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only words in the final result.
In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.
Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
The program:
#include <iostream>
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>
std::string to_lower(const std::string& str) {
std::string ret;
for (char c : str)
ret.push_back(tolower(c));
return ret;
}
std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
std::map<std::string, size_t> counts;
std::stringstream stream(str);
while (stream.good()) {
// wordW may have multiple words connected by special chars/digits
std::string wordW;
stream >> wordW;
// filter special chars and digits
std::replace_if(wordW.begin(), wordW.end(),
[](const char& c) { return std::ispunct(c) || std::isdigit(c); }, ' ');
// now wordW may have multiple words seperated by whitespaces, extract them
std::stringstream word_stream(wordW);
while (word_stream.good()) {
std::string word;
word_stream >> word;
// ignore empty words
if (word == "") continue;
// add to count.
ignorecase ? counts[to_lower(word)]++ : counts[word]++;
}
}
return counts;
}
void print_counts(const std::map<std::string, size_t>& counts) {
for (auto pair : counts)
std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}
int main() {
std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
that, once his program is working correctly, good output is a must.Few people \
really care how much time and trouble a programmer has spent in designing and \
debugging a program.Most people see only the results.Often, by the time a \
programmer has finished tackling a difficult problem, any output may look \
great.The programmer knows what it means and how to interpret it.However, \
the same cannot be said for others, or even for the programmer himself six \
months hence.";
auto counts = count_words(input);
print_counts(counts);
return 0;
}
I have tested this with Visual Studio 2017 and here is the part of the output:
a : 5
and : 3
any : 2
be : 1
by : 2
cannot : 1
care : 1
correctly : 1
debugging : 1
designing : 1
As others have already noted, an std::map handles the counting you care about quite easily.
Iostreams already have a tokenize to break an input stream up into words. In this case, we want to to only "think" of letters as characters that can make up words though. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.
struct alpha_only: std::ctype<char> {
alpha_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
// everything is white space
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
// except lower- and upper-case letters, which are classified accordingly:
std::fill(&rc['a'], &rc['z'], std::ctype_base::lower);
std::fill(&rc['A'], &rc['Z'], std::ctype_base::upper);
return &rc[0];
}
};
With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:
std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;
std::string word;
while (std::cin >> word)
++counts[to_lower(word)];
...and when we're done with that, we can print out the results:
for (auto w : counts)
std::cout << w.first << ": " << w.second << "\n";
Id probably start by inserting all of those words into an array of strings, then start with the first index of the array and compare that with all of the other indexes if you find matches, add 1 to a counter and after you went through the array you could display the word you were searching for and how many matches there were and then go onto the next element and compare that with all of the other elements in the array and display etc. Or maybe if you wanna make a parallel array of integers that holds the number of matches you could do all the comparisons at one time and the displays at one time.
EDIT:
Everyone's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser, that later sorts the tokens. Therefore my answer is only useful to the extent of a tokenizer or lexer, whereas Everyone's answer is only good for sorted data.
You first probably want to read in the text file. You want to use a streambuf iterator to read in the file(found here).
You will now have a string called content, which is the content of you file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string, and push it back into the wordString vector. (Note: that means that this will ignore non-letter characters, and that only spaces denote word separation.)
Now that we have a vector of all of our words in strings, we can use std::sort, to sort the vector in alphabetical order.(Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over our vector of stringWords and convert them into Word objects (this is a little heavy-weight), that will store their appearances and the word string. We will push these Word objects into a Word vector, but if we discover a repeat word string, instead of adding it into the Word vector, we'll grab the previous entry and increment its appearance count.
Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.
Full Code:
#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
class Word //define word object
{
public:
Word(){appearances = 1;}
~Word(){}
int appearances;
std::string mWord;
};
bool isLetter(const char x)
{
return((x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z'));
}
int main()
{
std::string srcFile = "myTextFile.txt"; //what file are we reading
std::ifstream ifs(srcFile);
std::string content( (std::istreambuf_iterator<char>(ifs) ),
( std::istreambuf_iterator<char>() )); //read in the file
std::vector<std::string> wordStringV; //create a vector of word strings
std::string current = ""; //define our current word
for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
{
const char currentChar = *it; //make life easier
if(currentChar == ' ')
{
wordStringV.push_back(current);
current = "";
continue;
}
else if(isLetter(currentChar))
{
current += *it;
}
}
std::sort(wordStringV.begin(), wordStringV.end(), std::less<std::string>());
std::vector<Word> wordVector;
for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
{
std::vector<Word>::iterator wordIt;
//see if the current word string has appeared before...
for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt)
{
if((*wordIt).mWord == *it)
break;
}
if(wordIt == wordVector.end()) //...if not create a new Word obj
{
Word theWord;
theWord.mWord = *it;
wordVector.push_back(theWord);
}
else //...otherwise increment the appearances.
{
++((*wordIt).appearances);
}
}
//print the words out
for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
{
Word theWord = *it;
std::cout << theWord.mWord << " " << theWord.appearances << "\n";
}
return 0;
}
Side Notes
Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.
If you don't like iterators you can instead do
for(int i = 0; i < v.size(); ++i)
{
char currentChar = vector[i];
}
It's important to note that if you are capitalization agnostic simply use std::tolower on the current += *it; statement (ie: current += std::tolower(*it);).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.

How can I replace all words in a string except one

So, I would like to change all words in a string except one, that stays in the middle.
#include <boost/algorithm/string/replace.hpp>
int main()
{
string test = "You want to join player group";
string find = "You want to join group";
string replace = "This is a test about group";
boost::replace_all(test, find, replace);
cout << test << endl;
}
The output was expected to be:
This is a test about player group
But it doesn't work, the output is:
You want to join player group
The problem is on finding out the words, since they are a unique string.
There's a function that reads all words, no matter their position and just change what I want?
EDIT2:
This is the best example of what I want to happen:
char* a = "This is MYYYYYYYYY line in the void Translate"; // This is the main line
char* b = "This is line in the void Translate"; // This is what needs to be find in the main line
char* c = "Testing - is line twatawtn thdwae voiwd Transwlate"; // This needs to replace ALL the words in the char* b, perserving the MYYYYYYYYY
// The output is expected to be:
Testing - is MYYYYYYYY is line twatawtn thdwae voiwd Transwlate
You need to invert your thinking here. Instead of matching "All words but one", you need to try to match that one word so you can extract it and insert it elsewhere.
We can do this with Regular Expressions, which became standardized in C++11:
std::string test = "You want to join player group";
static const std::regex find{R"(You want to join (\S+) group)"};
std::smatch search_result;
if (!std::regex_search(test, search_result, find))
{
std::cerr << "Could not match the string\n";
exit(1);
}
else
{
std::string found_group_name = search_result[1];
auto replace = boost::format("This is a test about %1% group") % found_group_name;
std::cout << replace;
}
Live Demo
To match the word "player" I used a pretty simply regular expression (\S+) which means "match one or more non-whitespace characters (greedily) and put that into a group"
"Groups" in regular expressions are enclosed by parentheses. The 0th group is always the entire match, and since we only have one set of parentheses, your word is therefore in group 1, hence the resulting access of the match result at search_result[1].
To create the regular expression, you'll notice I used the perhaps-unfamiliar string literal syntaxR"(...)". This is called a raw string literal and was also standardized in C++11. It was basically made for describing regular expressions without needing to escape backslashes. If you've used Python, it's the same as r'...'. If you've used C#, it's the same as #"..."
I threw in some boost::format to print the result because you were using Boost in the question and I thought you'd like to have some fun with it :-)
In your example, find is not a substring of test, so boost::replace_all(test, find, replace); has no effect.
Removing group from find and replace solves it:
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
int main()
{
std::string test = "You want to join player group";
std::string find = "You want to join";
std::string replace = "This is a test about";
boost::replace_all(test, find, replace);
std::cout << test << std::endl;
}
Output: This is a test about player group.
In this case, there is just one replace of the beginning of the string because the end of the string is already the right one. You could have another call of replace_all to change the end if needed.
Some other options:
one is in the other answer.
split the strings into a vector (or array) of words, then insert the desired word (player) at the right spot of the replace vector, then build your output string from it.

c++ Is there a way to find sentences within strings?

I'm trying to recognise certain phrases within a user defined string but so far have only been able to get a single word.
For example, if I have the sentence:
"What do you think of stack overflow?"
is there a way to search for "What do you" within the string?
I know you can retrieve a single word with the find function but when attempting to get all three it gets stuck and can only search for the first.
Is there a way to search for the whole string in another string?
Use str.find()
size_t find (const string& str, size_t pos = 0)
Its return value is the starting position of the substring. You can test if the string you are looking for is contained in the main string by performing the simple boolean test of returning str::npos:
string str = "What do you think of stack overflow?";
if (str.find("What do you") != str::npos) // is contained
The second argument can be used to limit your search from certain string position.
The OP question mentions it gets stuck in the attempt to find a three word string. Actually, I believe you are misinterpreting the return value. It happens that the return for the single word search "What" and the string "What do you" have coincidental starting positions, therefore str.find() returns the same. To search for individual words positions, use multiple function calls.
Use regular expressions
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("What do you think of stack overflow?");
std::smatch m;
std::regex e ("\\bWhat do you think\\b");
std::cout << "The following matches and submatches were found:" << std::endl;
while (std::regex_search (s,m,e)) {
for (auto x:m) std::cout << x << " ";
std::cout << std::endl;
s = m.suffix().str();
}
return 0;
}
Also you can find wildcards implementing with boost (regex in std library was boost::regex library before c++11) there

How to match absolute value using regex

I am having trouble with absolute value in regex in C++. This is what I have as the pattern:
std::tr1::regex loadAbsNM("load -|M\\((\\d+)\\)|"); // load -|M(x)|
I am trying to use std::tr1::regex_match( IR, result, loadNM ) to match. But it is not matching anything, even though it should be.
I'm using Visual Stuido 2010 compilier
shortened version of program (included above is iostream and regex)
int main()
{
std::string IR = "load -|M(x)|";
std::smatch result;
std::tr1::regex loadAbsNM("load -|M\\((\\d+)\\)|");
if( std::tr1::regex_match( IR , result, loadAbsNM ) )
{
int x = 2;
std::cout << "matched!" << std::endl;
}
else
{
std::cout << "!UNABLE TO DECODE INSTRUCTION!" << std::endl;
}
}
output produced
!UNABLE TO DECODE INSTRUCTION!
Note that from your code, you're not going to have a match. The letter x won't match the regex \d+.
Also, I'm not too sure whether you need a backslash in front of the pipe character. As you may know, pipe (|) is used to separate possible entries: (a|b) means a or b.
Finally, since their is a pipe at the end, the expression matches the empty string which is often a bad idea.
I would suggest something like this:
"load -\\|M\\((\\d+)\\)\\|"
But that won't match:
"load -|M(x)|"
You'd need to use a number instead of 'x' as in:
"load -|M(123)|"

Equality of two strings

What is the easiest way, with the least amount of code, to compare two strings, while ignoring the following:
"hello world" == "hello world" // spaces
"hello-world" == "hello world" // hyphens
"Hello World" == "hello worlD" // case
"St pierre" == "saint pierre" == "St. Pierre" // word replacement
I'm sure this has been done before, and there are some libraries to do this kind of stuff, but I don't know any. This is in C++ preferably, but if there's a very short option in whatever other language, I'll want to hear about it too.
Alternatively, I'd also be interested in any library that could give a percentage of matching. Say, hello-world and hello wolrd are 97% likely to be the same meaning, just a hyphen and a mispelling.
Remove spaces from both strings.
Remove hyphens from both strings.
Convert both strings to lower case.
Convert all occurrences of “saint” and “st.” to “st”.
Compare strings like normal.
For example:
#include <cctype>
#include <string>
#include <algorithm>
#include <iostream>
static void remove_spaces_and_hyphens(std::string &s)
{
s.erase(std::remove_if(s.begin(), s.end(), [](char c) {
return c == ' ' || c == '-';
}), s.end());
}
static void convert_to_lower_case(std::string &s)
{
for (auto &c : s)
c = std::tolower(c);
}
static void
replace_word(std::string &s, const std::string &from, const std::string &to)
{
size_t pos = 0;
while ((pos = s.find(from, pos)) != std::string::npos) {
s.replace(pos, from.size(), to);
pos += to.size();
}
}
static void replace_words(std::string &s)
{
replace_word(s, "saint", "st");
replace_word(s, "st.", "st");
}
int main()
{
// Given two strings:
std::string s1 = "Hello, Saint Pierre!";
std::string s2 = "hELlO,St.PiERRe!";
// Remove spaces and hyphens.
remove_spaces_and_hyphens(s1);
remove_spaces_and_hyphens(s2);
// Convert to lower case.
convert_to_lower_case(s1);
convert_to_lower_case(s2);
// Replace words...
replace_words(s1);
replace_words(s2);
// Compare.
std::cout << (s1 == s2 ? "Equal" : "Doesn't look like equal") << std::endl;
}
There is a way, of course, to code this more efficiently, but I recommend you start with something working and optimize it only when it proves to be a bottleneck.
It also sounds like you might be interested in string similarity algorithms like “Levenshtein distance”. Similar algorithms are used, for example, by search engine or editors to offer suggestion on spell correction.
I dont know any library, but for equlity, if speed is not rpoblem, you can do char-by-char compare and ignore "special" characters (respectively move iterator further in text).
As for comparing texts, you can use simple Levenshtein distance.
For spaces and hyphens, just replace all spaces/hyphens in the string and do a comparison. For case, convert all text to upper or lower case and do the comparison. For word replacement, you would need a dictionary of words with the key being the abbreviation and the value being the replacement word. You may also consider using the Levenshtein Distance algorithm for showing how similar one phrase is to another. If you want statistical probablility of how close a word/phrase is to another word/phrase, you will need sample data to do a comparison.
QRegExp is what you are looking for. It won't print out the percentages, but you can make some pretty slick ways of comparing one string to another, and finding the number of matches of one string to another.
Regular Expressions are available with almost ever language out there. I like GSkinner's RegEx page for learning regular expressions.
http://qt-project.org/doc/qt-4.8/qregexp.html
Hope that helps.
for the first 3 requirments,
remove all spaces/hypens of string (or replace it to a char, e.g'')
"hello world" --> "helloworld"
compare them ignore case.
Case insensitive string comparison in C++
for the last requirment, it is more compliate.
first you need a dictionary, which in KV structure:
'St.': 'saint'
'Mr.': 'mister'
second use boost token to seperate the string, and fetch then in the KV Store
then replace the token to the string, but it may in low performance:
http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/tokenizer.htm