I'm reading in several documents and indexing the words I read. However, I want to ignore common words (a, an, the, and, is, or, are, etc).
Is there a shortcut for doing this, something better than just writing...
if(word=="and" || word=="is" || etc etc....) ignore word;
For example, can I put them into a const string somehow, and have it just check against the string? Not sure... thank you!
Create a set<string> with the words that you would like to exclude, and use mySet.count(word) to determine if the word is in the set. If it is, the count will be 1; it will be 0 otherwise.
#include <iostream>
#include <set>
#include <string>
using namespace std;
int main() {
const char *words[] = {"a", "an", "the"};
set<string> wordSet(words, words+3);
cerr << wordSet.count("the") << endl;
cerr << wordSet.count("quick") << endl;
return 0;
}
You can use an array of strings, looping through and matching against each, or use a more optimal data structure such as a set, or trie.
Here's an example of how to do it with a normal array:
const char *commonWords[] = {"and", "is" ...};
int commonWordsLength = 2; // number of words in the array
for (int i = 0; i < commonWordsLength; ++i)
{
if (!strcmp(word, commonWords[i]))
{
//ignore word;
break;
}
}
Note that this example doesn't use the C++ STL, but you should.
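For what it's worth, a minimal sketch of the same linear search written with the standard library could look like this. The function name isCommonWord and the particular word list are just assumptions for illustration:

```cpp
#include <algorithm>
#include <array>
#include <string>

// Hypothetical helper: returns true if `word` is in a small fixed list
// of common words. The list itself is only an example.
bool isCommonWord(const std::string& word) {
    static const std::array<std::string, 5> commonWords = {
        {"a", "an", "the", "and", "is"}};
    // std::find performs the same linear scan as the hand-written loop.
    return std::find(commonWords.begin(), commonWords.end(), word) !=
           commonWords.end();
}
```

This is still O(n) in the number of common words, so for a long list a set is probably the better choice.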
If you want to maximize performance you should create a trie....
http://en.wikipedia.org/wiki/Trie
...of stopwords....
http://en.wikipedia.org/wiki/Stop_words
There is no standard C++ trie data structure; however, see this question for third-party implementations...
Trie implementation
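To give an idea of what such a structure involves, here is a minimal sketch of a trie that only supports insert and exact lookup. The class name StopwordTrie and its layout are assumptions of mine, not taken from any particular library, and it is not tuned for performance:

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal trie: one node per character, with a flag that
// marks where a stored word ends.
class StopwordTrie {
public:
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->isWord = true;
    }

    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        // A prefix of a stored word is not itself a match.
        return node->isWord;
    }

private:
    struct Node {
        bool isWord = false;
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;
};
```

Usage would be along the lines of calling insert("the") once for each stopword and then contains(word) for each word read.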
If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string> (C++11, header <unordered_set>), which stores the stopwords in a hash table.
bool filter(const string& word)
{
static unordered_set<string> stopwords({"a", "an", "the"});
return !stopwords.count(word);
}
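A rough sketch of how a stopword check like this might sit inside the indexing loop follows. The function name indexWords, the word list, and the choice of a word-count map as the "index" are all assumptions for illustration:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical indexer: counts every whitespace-separated word in `text`
// that is not a stopword.
std::map<std::string, int> indexWords(const std::string& text) {
    static const std::unordered_set<std::string> stopwords = {
        "a", "an", "the", "and", "is", "or", "are"};
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (stopwords.count(word) == 0) {
            ++counts[word];
        }
    }
    return counts;
}
```

Note that this compares words exactly, so "The" and "the" are treated as different words unless you normalize the case first.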
I am working on a project to read a context-free grammar and represent it as 1. Vectorial, 2. Branched chained lists, 3. Array (Table). I have run into a problem with string comparison. When I read a string from the keyboard representing the right side of a production rule, I want to check whether that string exists in a vector of strings. The problem is that the comparison does not work correctly. If the string I compare is the first string in the vector, the comparison works fine. But if it is any other string in the vector, the comparison fails, as if the string were not in the vector at all. Sorry for my English; I'd better let the code explain.
bool is(string &s, vector<string> &v) {
for (auto i : v) {
return (i.compare(s)==0) ? true : false;
}
}
This function returns true only if s=v[0], otherwise returns false even if s=v[2]
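Your function returns during the first iteration of the loop, so only v[0] is ever compared. To do this properly, you have to loop over the whole vector, compare every string in it, and return false only after the loop has finished. It would be something like: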
#include <iostream>
#include <string>
#include <vector>
using namespace std;
bool doExists(const string& s, const vector<string>& v) {
for (size_t i = 0; i < v.size(); i++) {
if (s.compare(v[i]) == 0) {
return true;
}
}
return false;
}
int main(){
vector<string> phrases = {"Hello!", "I like potatos!","I like fries.","How are you today?","I'm good.","Hello!"};
bool helloLocated = doExists("Hello!", phrases);
cout << helloLocated;
}
The console would print 1.
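As an alternative, the same check can be sketched with std::find from <algorithm>, which avoids writing the loop by hand. The function name containsString is just a placeholder:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: true if `s` equals any element of `v`.
bool containsString(const std::vector<std::string>& v, const std::string& s) {
    // std::find returns v.end() when no element compares equal to s.
    return std::find(v.begin(), v.end(), s) != v.end();
}
```

Taking both parameters by const reference also avoids copying the vector on every call.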
Trying to compare strings using:
!(stringvector[i]).compare(vector[j][k])
only works for some entries of
vector[j][k]
-- namely ones that are a case sensitive string match.
How do I get case-insensitive matching from this functionality?
Here is a bit of code I was working on
#include <iostream>
#include <vector>
#include <string>
using namespace std; //poor form
vector<string> stringvector = {"Yo", "YO", "babbybabby"};
vector<string> vec1 = {"yo", "Yo" , "these"};
vector<string> vec2 = {"these", "checked" , "too" , "Yo", "babbybabby"};
vector<vector<string>> vvs = {vec1, vec2};
for (int v = 0; v < vvs.size(); v++) //first index of vector
{
for(int s = 0; s < vvs[v].size(); s++) //second index of vector
{
for(int w = 0; w < stringvector.size(); w++)
{
if (stringvector[w] == vvs[v][s])
{cout << "******FOUND******";}
}
}
}
This doesn't print out FOUND for the case-insensitive matches.
stringvector[w] == vvs[v][s] does not do a case-insensitive comparison. Is there a way to add this functionality easily?
--Prof D
tl;dr
Use the ICU library.
"The easy way", when it comes to natural language strings, is usually fraught with problems.
As I pointed out in my answer to that "lowercase conversion" question @Armando linked to, if you want to actually do it right, you're currently best off using the ICU library, because nothing in the standard gives you actual Unicode support at this point.
If you look at the docs for std::tolower as used by @NutCracker, you will find that...
Only 1:1 character mapping can be performed by this function, e.g. the Greek uppercase letter 'Σ' has two lowercase forms, depending on the position in a word: 'σ' and 'ς'. A call to std::tolower cannot be used to obtain the correct lowercase form in this case.
If you want to do this correctly, you need full Unicode support, and that means the ICU library until some later revision of the C++ standard actually introduces that to the standard library.
Using icu::UnicodeString -- clunky as it might be at first -- for storing your language strings gives you access to caseCompare(), which does a proper case-insensitive comparison.
You can implement a function for this purpose, for example:
bool areEqualsCI(const string &x1, const string &x2){
if(x1.size() != x2.size()) return false;
for(unsigned int i=0; i<x2.size(); ++i)
if(tolower((unsigned char)x1[i]) != tolower((unsigned char)x2[i])) return false;
return true;
}
I recommend you see this post: How to convert std::string to lower case?
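If you go the lowercase route, a minimal sketch could look like this. The name toLowerAscii is my own, and like the function above it only handles single-byte characters, not Unicode:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns a lowercased copy of `s`.
// Only meaningful for single-byte character sets.
std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        // Taking the character as unsigned char avoids undefined
        // behavior in std::tolower for negative char values.
        return static_cast<char>(std::tolower(c));
    });
    return s;
}
```

You could then compare toLowerAscii(a) == toLowerAscii(b), at the cost of building two temporary strings per comparison.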
First, I gave myself some freedom to pretty up your code a bit. For that purpose I replaced the ordinary for loops with range-based for loops. Furthermore, I have changed the names of your variables. They are not perfect yet, though, since I don't know the purpose of the code. Here is the refactored code:
#include <iostream>
#include <vector>
#include <string>
int main() {
std::vector<std::string> vec1 = { "Yo", "YO", "babbybabby" };
std::vector<std::string> vec2 = { "yo", "Yo" , "these" };
std::vector<std::string> vec3 = { "these", "checked", "too", "Yo", "babbybabby" };
std::vector<std::vector<std::string>> vec2_vec3 = { vec2, vec3 };
for (auto const& i : vec2_vec3) {
for (auto const& j : i) {
for (auto const& k : vec1) {
if (k == j) {
std::cout << k << " == " << j << std::endl;
}
}
}
}
return 0;
}
Now, if you want to compare strings case-insensitively and if you have access to Boost library, you could use boost::iequals in the following manner:
#include <boost/algorithm/string.hpp>
std::string str1 = "yo";
std::string str2 = "YO";
if (boost::iequals(str1, str2)) {
// identical strings
}
On the other hand, if you don't have access to Boost library, you can make your own iequals function by using STL algorithms (C++14 required):
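#include <algorithm>
#include <locale>
#include <string>
bool iequals(const std::string& str1, const std::string& str2) {
// the four-iterator overload of std::equal also compares the lengths
return std::equal(str1.begin(), str1.end(),
str2.begin(), str2.end(),
[](char a, char b) {
return std::tolower(a, std::locale()) == std::tolower(b, std::locale());
});
}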
std::string str1 = "yo";
std::string str2 = "YO";
if (iequals(str1, str2)) {
// identical strings
}
Note that this would only work for Single-Byte Character Sets (SBCS).
I want to check if a string is strictly a subset of another string.
To this end I used boost::contains, and I compare the sizes of the strings as follows:
#include <boost/algorithm/string.hpp>
#include <iostream>
using namespace std;
using namespace boost::algorithm;
int main()
{
string str1 = "abc news";
string str2 = "abc";
//trim strings using boost
trim(str1);
trim(str2);
//if str2 is a subset of str1 and its size is less than the size of str1 then it is strictly contained in str1
if(contains(str1,str2) && (str2.size() < str1.size()))
{
cout <<"contains" << endl;
}
return 0;
}
Is there a better way to solve this problem? Instead of also comparing the size of strings?
Example
ABC is a proper subset of ABC NEWS
ABC is not a proper subset of ABC
I would use the following:
bool is_substr_of(const std::string& sub, const std::string& s) {
return sub.size() < s.size() && s.find(sub) != s.npos;
}
This uses the standard library only, and does the size check first which is cheaper than s.find(sub) != s.npos.
You can just use == or != to compare the strings:
if(contains(str1, str2) && (str1 != str2))
...
If one string contains the other and the two are not equal, you have a proper subset.
Whether this is better than your method is for you to decide. It is less typing and very clear (IMO), but probably a little bit slower if both strings are long and equal, or if both start with the same long sequence.
Note: If you really care about performance, you might want to try the Boyer-Moore search and the Boyer-Moore-Horspool search. They are way faster than any trivial string search (as apparently used by the string search in libstdc++); I do not know whether boost::contains uses them.
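If you can use C++17, the standard library ships a Boyer-Moore searcher in <functional>. A minimal sketch, assuming a C++17 compiler (the function name isProperSubstringBM is just a placeholder):

```cpp
#include <algorithm>
#include <functional>
#include <string>

// Hypothetical helper: true if `sub` occurs in `s` and is shorter than `s`.
// Requires C++17 for std::boyer_moore_searcher.
bool isProperSubstringBM(const std::string& sub, const std::string& s) {
    if (sub.size() >= s.size()) return false;  // cheap check first
    auto it = std::search(s.begin(), s.end(),
                          std::boyer_moore_searcher(sub.begin(), sub.end()));
    return it != s.end();
}
```

Building the searcher has its own cost, so this probably only pays off for long strings or when the same pattern is searched for many times.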
About comparison operations
TL;DR: Be sure about the format of what you're comparing.
Be wary of how you define "strictly".
For example, you did not point out these issues in your question, but suppose I submit, let's say:
"ABC " //IE whitespaces
"ABC\n"
What is your take on it? Do you accept it or not? If you don't, you'll have to either trim or clean your strings before comparing (just a general note on comparison operations).
Anyway, as Baum pointed out, you can either check the equality of your strings using == or compare their lengths (which is more efficient, given that you have already checked for the substring) with either size() or length().
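To make the trimming point concrete, here is a minimal sketch using only the standard library. The helper names trimWhitespace and isProperSubstringTrimmed, and the set of characters treated as whitespace, are assumptions for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips leading and trailing whitespace.
std::string trimWhitespace(const std::string& s) {
    const char* ws = " \t\n\r\f\v";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";  // all whitespace
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: proper-substring test after trimming both inputs.
bool isProperSubstringTrimmed(const std::string& sub, const std::string& s) {
    const std::string a = trimWhitespace(sub);
    const std::string b = trimWhitespace(s);
    return a.size() < b.size() && b.find(a) != std::string::npos;
}
```

With this, "ABC " and "ABC\n" are treated as the same string, so neither counts as a proper subset of the other.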
another approach, using only the standard library:
#include <algorithm>
#include <string>
#include <iostream>
using namespace std;
int main()
{
string str1 = "abc news";
string str2 = "abc";
if (str2 != str1
&& search(begin(str1), end(str1),
begin(str2), end(str2)) != end(str1))
{
cout <<"contains" << endl;
}
return 0;
}
I was implementing a method to remove certain characters from a string txt, in place. The following is my code. The expected result is "bdeg"; however, the actual result is "bdegfg", which suggests the null terminator is not being set. The weird thing is that when I use gdb to debug, after setting the null terminator
(gdb) p txt
$5 = (std::string &) @0xbffff248: {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x804b014 "bdeg"}}
it looks right to me. So what is the problem here?
#include <iostream>
#include <string>
using namespace std;
void censorString(string &txt, string rem)
{
// create look-up table
bool lut[256]={false};
for (int i=0; i<rem.size(); i++)
{
lut[rem[i]] = true;
}
int i=0;
int j=0;
// iterate txt to remove chars
for (i=0, j=0; i<txt.size(); i++)
{
if (!lut[txt[i]]){
txt[j]=txt[i];
j++;
}
}
// set null-terminator
txt[j]='\0';
}
int main(){
string txt="abcdefg";
censorString(txt, "acf");
// expect: "bdeg"
std::cout << txt <<endl;
}
follow-up question:
if the string is not truncated like a C string, what happens with txt[j]='\0',
and why is the result "bdegfg" and not 'bdeg'\0'g' or some corrupted string?
another follow-up:
if I use txt.erase(txt.begin()+j, txt.end());
it works fine, so I'd better use the string-related API. The point is that I do not know the time complexity of the underlying code of these APIs.
std::string does not use a null terminator to determine its length as you think; the length is stored separately, so you have to do this another way.
Modify the function to:
void censorString(string &txt, string rem)
{
// create look-up table
bool lut[256]={false};
for (int i=0; i<rem.size(); i++)
{
lut[rem[i]] = true;
}
// iterate txt to remove chars
for (std::string::iterator it=txt.begin();it!=txt.end();)
{
if(lut[*it]){
it=txt.erase(it);//erase the character pointed by it and returns the iterator to next character
continue;
}
//increment iterator here to avoid increment after erasing the character
it++;
}
}
Basically, you have to use the std::string::erase function to erase a character from the string; it takes an iterator as input and returns an iterator to the next character.
http://en.cppreference.com/w/cpp/string/basic_string/erase
http://www.cplusplus.com/reference/string/string/erase/
The complexity of the erase function is O(n), so the whole function has a complexity of O(n^2). Space complexity for a very long string, i.e. >256 chars, would be O(n).
Well, there is another way which has only O(n) time complexity:
create another string and, while iterating over the txt string, append the characters that are not censored.
The new function would be:
void censorString(string &txt, string rem)
{
// create look-up set
std::unordered_set<char> lookUpSet(rem.begin(),rem.end());
std::string newString;
// iterate txt to remove chars
for (std::string::iterator it=txt.begin();it!=txt.end();it++)
{
if(lookUpSet.find(*it)==lookUpSet.end()){
newString.push_back(*it);
}
}
txt=std::move(newString);
}
Now this function has a complexity of O(n), since std::unordered_set::find has an average complexity of O(1) and std::string::push_back is amortized O(1).
If you use a normal std::set, whose find has a complexity of O(log n), then the complexity of the whole function becomes O(n log n).
Embedding null characters inside a std::string is completely valid and will not change the length of the string. It will give you unexpected results if you, for example, try to output it using stream insertion (operator<<), though.
The goal you are attempting to reach can be achieved much more easily:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
int main()
{
std::string txt="abcdefg";
std::string filter = "acf";
txt.erase(std::remove_if(txt.begin(), txt.end(), [&](char c)
{
return std::find(filter.begin(), filter.end(), c) != filter.end();
}), txt.end());
// expect: "bdeg"
std::cout << txt << std::endl;
}
In the same vein as Himanshu's answer, you can accomplish an O(N) complexity (using additional memory) like so:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_set>
int main()
{
std::string txt="abcdefg";
std::string filter = "acf";
std::unordered_set<char> filter_set(filter.begin(), filter.end());
std::string output;
std::copy_if(txt.begin(), txt.end(), std::back_inserter(output), [&](char c)
{
return filter_set.find(c) == filter_set.end();
});
// expect: "bdeg"
std::cout << output << std::endl;
}
You have not told the string that you have changed its size. You need to use the resize method to update the size if you remove any characters from the string.
The problem is that you can't treat a C++ string like a C-style string, i.e. you can't just insert a 0 like in C. To convince yourself of this, add "cout << txt.length() << endl;" to your code: you'll get 7. You want to use the erase() method:
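A minimal sketch of the original in-place approach with resize in place of the null terminator could look like this. The name censorStringInPlace is mine, and the casts to unsigned char are an addition to keep the table index non-negative:

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant of censorString: compacts the kept characters to
// the front of txt, then shrinks the string to the new length.
void censorStringInPlace(std::string& txt, const std::string& rem) {
    bool lut[256] = {false};
    for (unsigned char c : rem) {
        lut[c] = true;
    }
    std::size_t j = 0;
    for (std::size_t i = 0; i < txt.size(); ++i) {
        if (!lut[static_cast<unsigned char>(txt[i])]) {
            txt[j++] = txt[i];
        }
    }
    txt.resize(j);  // tells the string its new length
}
```

This stays a single O(n) pass over txt with no extra string allocated.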
Removes specified characters from the string.
1) Removes min(count, size() - index) characters starting at index.
2) Removes the character at position.
3) Removes the character in the range [first; last).
txt is a std::string, not a character array.
This code
// set null-terminator
txt[j]='\0';
Will not truncate the string at the j-th position.
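To illustrate, here is a small sketch of the state the string is in right after that assignment. The function name withEmbeddedNull is hypothetical, and the literal "bdegefg" is what txt holds once the loop in the question has finished with j equal to 4:

```cpp
#include <string>

// Hypothetical demo: reproduces txt after the compaction loop and the
// txt[j] = '\0' assignment from the question.
std::string withEmbeddedNull() {
    std::string txt = "bdegefg";  // kept chars first, leftovers after
    txt[4] = '\0';                // overwrites the 'e', length unchanged
    return txt;
}
```

The returned string still has size() 7, with "fg" stored after the null character, while anything that reads it as a C string through c_str() stops at the null and sees only "bdeg". That would explain why gdb shows "bdeg" but std::cout, which writes all 7 characters, shows "bdegfg" with the null character being invisible in most terminals.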
I have comma delimited strings I need to pull values from. The problem is these strings will never be a fixed size. So I decided to iterate through the groups of commas and read what is in between. In order to do that I made a function that returns every occurrence's position in a sample string.
Is this a smart way to do it? Is this considered bad code?
#include <string>
#include <iostream>
#include <vector>
#include <Windows.h>
using namespace std;
vector<int> findLocation(string sample, char findIt);
int main()
{
string test = "19,,112456.0,a,34656";
char findIt = ',';
vector<int> results = findLocation(test,findIt);
return 0;
}
vector<int> findLocation(string sample, char findIt)
{
vector<int> characterLocations;
for(int i =0; i < sample.size(); i++)
if(sample[i] == findIt)
characterLocations.push_back(sample[i]);
return characterLocations;
}
vector<int> findLocation(string sample, char findIt)
{
vector<int> characterLocations;
for(int i =0; i < sample.size(); i++)
if(sample[i] == findIt)
characterLocations.push_back(sample[i]);
return characterLocations;
}
As currently written, this will simply return a vector containing the int representations of the characters themselves, not their positions, which is what you really want, if I read your question correctly.
Replace this line:
characterLocations.push_back(sample[i]);
with this line:
characterLocations.push_back(i);
And that should give you the vector you want.
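For comparison, a minimal sketch of the same idea built on std::string::find, which returns positions directly. The name findLocations and the use of std::size_t for positions are my own choices:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant: collects the index of every occurrence of findIt.
std::vector<std::size_t> findLocations(const std::string& sample, char findIt) {
    std::vector<std::size_t> positions;
    for (std::size_t pos = sample.find(findIt); pos != std::string::npos;
         pos = sample.find(findIt, pos + 1)) {
        positions.push_back(pos);
    }
    return positions;
}
```

For the sample string "19,,112456.0,a,34656" this yields the positions 2, 3, 12 and 14.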
If I were reviewing this, I would see this and assume that what you're really trying to do is tokenize a string, and there are already good ways to do that.
The best way I've seen to do this is with boost::tokenizer. It lets you specify how the string is delimited and then gives you a nice iterator interface to iterate through each value.
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>
using namespace boost;
std::string sample = "Hello,My,Name,Is,Doug";
escaped_list_separator<char> sep("" /*escape chars*/, "," /*separators*/, "" /*quote chars*/);
tokenizer<escaped_list_separator<char> > myTokens(sample, sep);
//iterate through the contents
for (tokenizer<escaped_list_separator<char> >::iterator iter = myTokens.begin();
iter != myTokens.end();
++iter)
{
std::cout << *iter << std::endl;
}
Output:
Hello
My
Name
Is
Doug
Edit If you don't want a dependency on boost, you can also use getline with an istringstream as in this answer. To copy somewhat from that answer:
std::string str = "Hello,My,Name,Is,Doug";
std::istringstream stream(str);
std::string tok1;
// testing getline itself avoids printing an extra empty token at end of input
while (std::getline(stream, tok1, ','))
{
std::cout << tok1 << std::endl;
}
Output:
Hello
My
Name
Is
Doug
This may not be directly what you're asking but I think it gets at your overall problem you're trying to solve.
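Applied to the comma-delimited data from the question, a minimal sketch of the getline approach that collects the fields into a vector might look like this. The function name splitFields is just a placeholder:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `line` on `delimiter`, keeping empty fields
// that appear between two consecutive delimiters.
std::vector<std::string> splitFields(const std::string& line, char delimiter) {
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, delimiter)) {
        fields.push_back(field);
    }
    return fields;
}
```

For "19,,112456.0,a,34656" this gives five fields, the second of which is empty. One caveat: an empty field after a trailing delimiter is not reported by this loop.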
Looks good to me too, one comment is with the naming of your variables and types. You call the vector you are going to return characterLocations which is of type int when really you are pushing back the character itself (which is type char) not its location. I am not sure what the greater application is for, but I think it would make more sense to pass back the locations. Or do a more cookie cutter string tokenize.
Well, if your purpose is to find the indices of the occurrences, the following code will be more efficient: in C++, passing objects as parameters by value causes them to be copied, which is less efficient. Here the string is taken by const reference and the results are written into a vector passed by reference. (With C++11 move semantics and copy elision, returning the vector by value is usually cheap as well.)
#include <string>
#include <iostream>
#include <vector>
#include <Windows.h>
using namespace std;
void findLocation(const string& sample, const char findIt, vector<int>& resultList);
int main()
{
string test = "19,,112456.0,a,34656";
char findIt = ',';
vector<int> results;
findLocation(test,findIt, results);
return 0;
}
void findLocation(const string& sample, const char findIt, vector<int>& resultList)
{
const int sz = sample.size();
for(int i =0; i < sz; i++)
{
if(sample[i] == findIt)
{
resultList.push_back(i);
}
}
}
How smart it is also depends on what you do with those substrings delimited by commas. In some cases it may be better (e.g. faster, with smaller memory requirements) to avoid searching and splitting and just parse and process the string at the same time, possibly using a state machine.
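As a rough illustration of the state-machine idea, here is a sketch that processes the string in a single pass without storing positions or substrings. What "processing" means depends on your application; counting the non-empty fields is only a stand-in, and the name countNonEmptyFields is hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical single-pass parser with two states: between fields, or
// inside a field. It counts fields containing at least one character.
std::size_t countNonEmptyFields(const std::string& line, char delimiter) {
    enum class State { BetweenFields, InField };
    State state = State::BetweenFields;
    std::size_t count = 0;
    for (char c : line) {
        if (c == delimiter) {
            state = State::BetweenFields;
        } else if (state == State::BetweenFields) {
            state = State::InField;  // first character of a new field
            ++count;
        }
    }
    return count;
}
```

For "19,,112456.0,a,34656" this returns 4, since the second field is empty.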