Is there a concise opposite of "empty"? - c++

Interfaces to string classes typically have a method named IsEmpty (VCL) or empty (STL). That's absolutely reasonable because it's a special case, but the code that uses these methods often has to negate this predicate, which leads to an "optical (and even psychological) overhead" (the exclamation mark is not very obvious, especially after an opening parenthesis). See for instance this (simplified) code:
/// format an optional time specification for output
std::string fmtTime(const std::string& start, const std::string& end)
{
std::string time;
if (!start.empty() || !end.empty()) {
if (!start.empty() && !end.empty()) {
time = "from "+start+" to "+end;
} else {
if (end.empty()) {
time = "since "+start;
} else {
time = "until "+end;
}
}
}
return time;
}
It has four negations, because the empty cases are those to be skipped. I often observe this kind of negation, also when designing interfaces, and it's not a big problem but it's annoying. I only wish to support writing understandable and easy-to-read code. I hope you'll understand my point.
Maybe I'm only struck with blindness: How would you solve the above problem?
Edit: After reading some comments, I think it's necessary to say that the original code uses the class System::AnsiString of the VCL. This class provides an IsEmpty method, which is very readable:
if (text.IsEmpty()) { /* ... */ } // read: if text is empty ...
It reads less well when negated:
if (!text.IsEmpty()) { /* ... */} // read: if not text is empty ...
...instead of "if text is not empty". I think the literal "Is" would have been better left to the reader's imagination, so that the negation reads well too. OK, maybe not a widespread problem...

In most cases you can reverse the order of the if and the else to clean up the code:
const std::string fmtTime(const std::string& start, const std::string& end)
{
std::string time;
if (start.empty() && end.empty()) {
return time;
}
if (start.empty() || end.empty()) {
if (end.empty()) {
time = "since "+start;
} else {
time = "until "+end;
}
} else {
time = "from "+start+" to "+end;
}
return time;
}
Or even cleaner after some more refactoring:
std::string fmtTime(const std::string& start, const std::string& end)
{
if (start.empty() && end.empty()) {
return std::string();
}
if (start.empty()) {
return "until "+end;
}
if (end.empty()) {
return "since "+start;
}
return "from "+start+" to "+end;
}
And for the ultimate compactness (although I prefer the previous version, for its readability):
std::string fmtTime(const std::string& start, const std::string& end)
{
return start.empty() && end.empty() ? std::string()
: start.empty() ? "until "+end
: end.empty() ? "since "+start
: "from "+start+" to "+end;
}
Another possibility is to create a helper function:
inline bool non_empty(const std::string &str) {
return !str.empty();
}
if (non_empty(start) || non_empty(end)) {
...
}

I think I'd eliminate the conditions in favor of a little math:
const std::string fmtTime(const std::string& start, const std::string& end) {
typedef std::string const &s;
static const std::function<std::string(s, s)> f[] = {
[](s a, s b) { return "from " + a + " to " + b; },
[](s a, s b) { return "since " + a; },
[](s a, s b) { return "until " + b; },
[](s a, s b) { return ""; },
};
return f[start.empty() * 2 + end.empty()](start, end);
}
Edit: if you prefer, you can express the math as start.empty() * 2 + end.empty(). To understand what's going on, perhaps it's best if I expound on how I thought of things to start with. I thought of things as a 2D array:
(Feel free to swap the "start empty" and "end empty", depending on whether you prefer to think in row-major or column-major order).
The start.empty() and end.empty() (or the logical not of them, if you prefer) each act as an index along one dimension of this 2D matrix. The math involved simply "linearizes" that addressing, so instead of two rows and two columns, we get one long row, something like this:
In mathematical terms, that's a simple matter of "row * columns + column" (or, again, vice versa, depending on whether you prefer row-major or column-major ordering). I originally expressed the * 2 part as a bit-shift and the addition as a bit-wise or (knowing the least significant bit is empty, because of the previous left-shift). I find that easy to deal with, but I guess I can understand where others might not.
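To make the "row * columns + column" arithmetic concrete, here is a minimal sketch. The names kLabels, linearIndex and classify are invented for this illustration and are not part of the answer's code; the table simply holds a label for each of the four cases.

```cpp
#include <cstddef>
#include <string>

// Hypothetical 2x2 table stored row by row (row-major):
// row = start.empty(), column = end.empty().
static const char* const kLabels[4] = {
    "from-to",  // row 0, column 0: both present
    "since",    // row 0, column 1: only start present
    "until",    // row 1, column 0: only end present
    "nothing"   // row 1, column 1: both empty
};

// row * columns + column, with columns == 2.
inline std::size_t linearIndex(bool startEmpty, bool endEmpty)
{
    return static_cast<std::size_t>(startEmpty) * 2
         + static_cast<std::size_t>(endEmpty);
}

// Picks the label for a pair of strings through the linearized index.
inline std::string classify(const std::string& start, const std::string& end)
{
    return kLabels[linearIndex(start.empty(), end.empty())];
}
```

Swapping the two arguments of linearIndex gives the column-major variant; the table then has to be reordered to match.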
I should probably add: although I've already mentioned row-major vs. column-major, it should be fairly obvious that the mapping from the two "x.empty" values to positions in the array is basically arbitrary. The value we get from .empty() means that we get a 1 when the value is not present, and a 0 when it is. As such, a direct mapping from the original values to the array positions is probably like this:
Since we're linearizing the value, we have a few choices for how we do the mapping:
Simply arrange the array to suit the values as we get them.
Invert the value for each dimension individually (this is basically what led to the original question: the constant use of !x.empty()).
Combine the two inputs into a single linear address, then "invert" by subtracting from 3.
For those who doubt the efficiency of this, it actually compiles down to this (with VC++):
mov eax, ebx
cmp QWORD PTR [rsi+16], rax
sete al
cmp QWORD PTR [rdi+16], 0
sete bl
lea eax, DWORD PTR [rbx+rax*2]
movsxd rcx, eax
shl rcx, 5
add rcx, r14
mov r9, rdi
mov r8, rsi
mov rdx, rbp
call <ridiculously long name>::operator()
Even the one-time construction for f isn't nearly as bad as some might think. It doesn't involve dynamic allocation, or anything on that order. The names are long enough that it looks a little scary initially, but in the end, it's mostly four repetitions of:
lea rax, OFFSET FLAT:??_7?$_Func_impl#U?$_Callable_obj#V<lambda_f466b26476f0b59760fb8bb0cc43dfaf>##$0A##std##V?$allocator#V?$_Func_class#V?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##std##AEBV12#AEBV12##std###2#V?$basic_string#DU?$char_traits#D#std##V?$allocator#D#2##2#AEBV42#AEBV42##std##6B#
mov QWORD PTR f$[rsp], rax
Leaving out the static const doesn't really seem to affect execution speed much. Since the table is static, I think it should be there, but as far as execution speed goes, it's not the kind of massive win we might expect if the table initialization involved four separate dynamic allocations, or anything like that.

You could say
if (theString.size()) { .... }
Whether that is more readable is a different matter. Here you are calling a method whose primary purpose is not to tell you if the thing is empty, and relying on an implicit conversion to bool. I would prefer the !s.empty() version. I might use not instead for fun:
if (not theString.empty()) { .... }
It might be interesting to see the correlation between people who find the ! and not versions confusing.

I have to refactor this, purely out of anal retentive disorder…
std::string fmtTime( const std::string & start, const std::string & end ) {
if ( start.empty() ) {
if ( end.empty() ) return ""; // should diagnose an error here?
return "until " + end;
}
if ( end.empty() ) return "since " + start;
return "from " + start + " to " + end;
}
There… clean clean clean. If something here is difficult to read, add a comment, not another if clause.

Usually it's just better to not use such complicated conditional code. Why not keep it simple?
const std::string fmtTime(const std::string& start, const std::string& end)
{
if (start.empty() && end.empty())
{
return "";
}
// either start or end or both are not empty here.
std::string time;
if (start.empty())
{
time = "until "+end;
}
else if (end.empty())
{
time = "since "+start;
}
else // both are not empty
{
time = "from "+start+" to "+end;
}
return time;
}

Globally, I have no problem with the way you've written it; it's
certainly cleaner than the alternatives that others are
proposing. If you're worried about the ! disappearing (which
is a legitimate worry), use more white space.
if ( ! start.empty() || ! end.empty() ) ...
Or try using the keyword not instead:
if ( not start.empty() || not end.empty() ) ...
(With most editors, the not will be highlighted as a keyword,
which will draw even more attention to it.)
Otherwise, two helper functions:
template <typename Container>
bool
isEmpty( Container const& container )
{
return container.empty();
}
template <typename Container>
bool
isNotEmpty( Container const& container )
{
return !container.empty();
}
This has the added advantage of giving the functionality
a better name. (Function names are verbs, so c.empty()
logically means "empty the container", and not "is the container
empty". But if you start wrapping all of the functions in the
standard library that have poor names, you've got your work cut
out for you.)

Without using negation.. ;)
const std::string fmtTime(const std::string& start, const std::string& end)
{
std::string ret;
if (start.empty() == end.empty())
{
ret = (start.empty()) ? "" : "from "+start+" to "+end;
}
else
{
ret = (start.empty()) ? "until "+end : "since "+start;
}
return ret;
}
EDIT: okay cleaned up a little more...

Since no one cared to type the complete answer with my comment, here it goes:
Create local variables that simplify the reading of expressions:
std::string fmtTime(const std::string& start, const std::string& end)
{
std::string time;
const bool hasStart = !start.empty();
const bool hasEnd = !end.empty();
if (hasStart || hasEnd) {
if (hasStart && hasEnd) {
time = "from "+start+" to "+end;
} else {
if (hasStart) {
time = "since "+start;
} else {
time = "until "+end;
}
}
}
return time;
}
The compiler is smart enough to elide those variables, and even if it did not, the code would not be less efficient than the original (I expect both to be a single test of a variable). The code is now a bit more readable for a human, who can just read the conditions:
if has start or end then
Of course you might also do different refactors to further simplify the number of nested operations, like singling out when there is no start or end and bailing out early...
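As a sketch of that further refactor (one possible shape, not the only one), the named flags can be combined with early returns so that no nesting is left:

```cpp
#include <string>

std::string fmtTime(const std::string& start, const std::string& end)
{
    // Name the positive conditions once; the rest of the function reads without '!'.
    const bool hasStart = !start.empty();
    const bool hasEnd = !end.empty();

    if (hasStart && hasEnd) return "from " + start + " to " + end;
    if (hasStart) return "since " + start;
    if (hasEnd) return "until " + end;
    return std::string();  // neither start nor end was given
}
```

The two negations are still there, but they appear exactly once and are hidden behind descriptive names.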

I struggle with the psychological overhead of negative logic as well.
One solution to this (when it cannot be avoided) is to check for the explicit condition, consider:
if (!container.empty())
vs
if (container.empty() == false)
The second version is easier to read because it flows as you would read it out loud. It also makes it clear that you're checking a false condition.
Now if that is still not good enough for you, my advice would be to create a thin wrapper class that inherits from whatever container you're using and then create your own method for that particular check.
For example with strings:
class MyString : public std::string
{
public:
bool NotEmpty() const
{
return (empty() == false);
}
};
Now it becomes just:
if (container.NotEmpty())...

If all you're concerned about is the ease with which ! can be overlooked, you can use the standard C++ alternative token not instead:
const std::string fmtTime(const std::string& start, const std::string& end)
{
std::string time;
if (not start.empty() or not end.empty()) {
if (not start.empty() and not end.empty()) {
time = "from "+start+" to "+end;
} else {
if (end.empty()) {
time = "since "+start;
} else {
time = "until "+end;
}
}
}
return time;
}
(Refer to [lex.digraph] in the standard for alternative tokens)

Would you consider assigned a good opposite?
#include <string>
template <typename CharType>
bool assigned(const std::basic_string<CharType>& s)
{
return !s.empty();
}
std::string fmtTimeSpec(const std::string& from, const std::string& to)
{
if (assigned(from)) {
if (assigned(to)) {
return "from "+from+" to "+to;
}
return "since "+from;
}
if (assigned(to)) {
return "until "+to;
}
return std::string();
}
Structural improvements of the "test function" came from numerous useful answers. Special thanks to:
C. E. Gesser
Potatoswatter

To express the opposite of ".isEmpty()", I prefer this way:
if (textView.getText().toString().isEmpty()){
//do the thing if textView has nothing typed inside.
}else if (!textView.getText().toString().equals("")){
// do the thing if textView has something typed inside.
}
Note that "!=" on Java strings compares references, not contents, so use the negated ".equals("")" form instead, as recommended by Android Studio:
!textView.getText().toString().equals("")

Coming back to the API design aspect
(it may not be applicable to strings, but on container classes in general)
By pure chance I found an excellent answer to this old question (emphasis mine)
What about using any()? [...]
in a completely unrelated post being the answer to the question
How do I know if a generator is empty from the start?
To contrast empty and any might be poor in English but it absolutely makes sense in API design.
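Transferred to C++, the idea could look like the sketch below. The free function any is hypothetical; nothing with this name and meaning exists in the standard library, and it only assumes that the container offers empty().

```cpp
#include <string>
#include <vector>

// Hypothetical counterpart to empty(): true if the container holds at least one element.
template <typename Container>
bool any(const Container& container)
{
    return !container.empty();
}
```

A call site would then read if (any(start) || any(end)) { ... }, with the single negation tucked away inside the helper.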

A better way to express options
To be quite honest: Until now, I didn't even realize that I misused string type to negatively(!) express the presence of boundaries of a range. And that was obviously the real cause of my headache.
C++17 introduced optional. So there is little reason left to complain about the shortcomings (in terms of expressiveness) of empty() and negation.
Let's have a look at a working example,[1] [2] that uses the original string type and – as a proof of concept – another type (int for simplicity, this should better be some date type):
#include <iostream>
#include <optional>
#include <string>
#include <sstream>
template <typename T>
std::string format_range(const std::optional<T>& start,
const std::optional<T>& end)
{
std::stringstream range;
if (start) {
if (end) {
range << "from " << *start << " to " << *end;
} else {
range << "since " << *start;
}
} else if (end) {
range << "until " << *end;
}
return range.str();
}
template <typename T>
void invoke_format_range(const T& start, const T& end)
{
using namespace std;
optional<T> NONE;
cout << format_range<T>(NONE, NONE) << endl;
cout << format_range<T>(start, NONE) << endl;
cout << format_range<T>(NONE, end) << endl;
cout << format_range<T>(start, end) << endl;
}
int main()
{
invoke_format_range(std::string("START"), std::string("END"));
invoke_format_range(1, 12);
return 0;
}
[1]
If you cannot use a C++17 compatible compiler, it is relatively easy to adapt optional using your own rudimentary implementation (or try boost::optional of course).
[2]
See online demo at https://onlinegdb.com/OCw2c5mkO

Related

fast way to compare two vector containing strings

I have a vector of strings that I pass to my function, and I need to compare it with some predefined values. What is the fastest way to do this?
The following code snippet shows what I need to do (This is how I am doing it, but what is the fastest way of doing this):
bool compare(vector<string> input1,vector<string> input2)
{
if(input1.size() != input2.size())
{
return false;
}
for(int i=0;i<input1.size();i++)
{
if(input1[i] != input2[i])
{
return false;
}
}
return true;
}
int compare(vector<string> inputData)
{
if (compare(inputData,{"Apple","Orange","three"}))
{
return 129;
}
if (compare(inputData,{"A","B","CCC"}))
{
return 189;
}
if (compare(inputData,{"s","O","quick"}))
{
return 126;
}
if (compare(inputData,{"Apple","O123","three","four","five","six"}))
{
return 876;
}
if (compare(inputData,{"Apple","iuyt","asde","qwe","asdr"}))
{
return 234;
}
return 0;
}
Edit1
Can I compare two vectors like this:
if(inputData=={"Apple","Orange","three"})
{
return 129;
}
You are asking what is the fastest way to do this, and you are indicating that you are comparing against a set of fixed and known strings. I would argue that you would probably have to implement it as a kind of state machine. Not that this is very beautiful...
if (inputData.size() != 3) return 0;
if (inputData[0].size() == 0) return 0;
const char inputData_0_0 = inputData[0][0];
if (inputData_0_0 == 'A') {
// possibly "Apple" or "A"
...
} else if (inputData_0_0 == 's') {
// possibly "s"
...
} else {
return 0;
}
The weakness of your approach is its linearity. You want a binary search for teh speedz.
By utilising the sortedness of a map, the binaryness of finding in one, and the fact that equivalence between vectors is already defined for you (no need for that first compare function!), you can do this quite easily:
std::map<std::vector<std::string>, int> lookup{
{{"Apple","Orange","three"}, 129},
{{"A","B","CCC"}, 189},
// ...
};
int compare(const std::vector<std::string>& inputData)
{
auto it = lookup.find(inputData);
if (it != lookup.end())
return it->second;
else
return 0;
}
Note also the reference passing for extra teh speedz.
(I haven't tested this for exact syntax-correctness, but you get the idea.)
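For reference, here is a self-contained variant of the snippet above with the headers it needs and all five patterns from the question filled in. The alias Key is introduced only for this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

using Key = std::vector<std::string>;

// Built once at start-up. std::map orders the keys with operator< on vectors,
// which compares them lexicographically, element by element.
static const std::map<Key, int> lookup{
    {{"Apple", "Orange", "three"}, 129},
    {{"A", "B", "CCC"}, 189},
    {{"s", "O", "quick"}, 126},
    {{"Apple", "O123", "three", "four", "five", "six"}, 876},
    {{"Apple", "iuyt", "asde", "qwe", "asdr"}, 234},
};

// Returns the code for a known pattern, or 0 if the input matches none.
int compare(const Key& inputData)
{
    const auto it = lookup.find(inputData);
    return it != lookup.end() ? it->second : 0;
}
```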
However! As always, we need to be context-aware in our designs. This sort of approach is more useful at larger scale. At the moment you only have a few options, so the addition of some dynamic allocation and sorting and all that jazz may actually slow things down. Ultimately, you will want to take my solution, and your solution, and measure the results for typical inputs and whatnot.
Once you've done that, if you still need more speed for some reason, consider looking at ways to reduce the dynamic allocations inherent in both the vectors and the strings themselves.
To answer your follow-up question: almost; you do need to specify the type:
// new code is here
// ||||||||||||||||||||||||
if (inputData == std::vector<std::string>{"Apple","Orange","three"})
{
return 129;
}
As explored above, though, let std::map::find do this for you instead. It's better at it.
One key to efficiency is eliminating needless allocation.
Thus, it becomes:
bool compare(
std::vector<std::string> const& a,
std::initializer_list<const char*> b
) noexcept {
return std::equal(begin(a), end(a), begin(b), end(b));
}
Alternatively, make them static const, and accept the slight overhead.
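As an aside, C++17 std::string_view (or the Boost equivalent before that) and C++20 std::span (or the span from the Guidelines Support Library, GSL) also allow a nicer alternative. Note that the standardized std::span has no operator==, so the comparison still goes through std::equal:
bool compare(std::span<const std::string> a, std::span<const std::string_view> b) noexcept {
return std::equal(a.begin(), a.end(), b.begin(), b.end());
}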
The other is minimizing the number of comparisons. You can either use hashing, binary search, or manual ordering of comparisons.
Unfortunately, transparent comparators are a C++14 thing, so you cannot use std::map.
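As an illustration of the hashing option, here is a rough sketch built on std::unordered_map. The names KeyHash, table and lookupCode are made up for this example, and the mixing step merely imitates the style of boost::hash_combine; it is an assumption of this sketch, not a tuned hash.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Key = std::vector<std::string>;

// Hypothetical hash for a vector of strings: folds the per-string hashes together.
struct KeyHash {
    std::size_t operator()(const Key& key) const noexcept
    {
        std::size_t seed = key.size();
        for (const std::string& s : key) {
            // Simple mixing step in the style of boost::hash_combine.
            seed ^= std::hash<std::string>{}(s) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Built once; a lookup hashes the input and then compares against at most a few candidates.
static const std::unordered_map<Key, int, KeyHash> table{
    {{"Apple", "Orange", "three"}, 129},
    {{"A", "B", "CCC"}, 189},
    {{"s", "O", "quick"}, 126},
};

// Returns the code for a known pattern, or 0 if the input matches none.
int lookupCode(const Key& input)
{
    const auto it = table.find(input);
    return it != table.end() ? it->second : 0;
}
```

Whether this beats a handful of direct comparisons for so few patterns is something only a measurement can tell.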
If you want a fast way to do it where the vectors to compare to are not known in advance, but are reused so can have a little initial run-time overhead, you can build a tree structure similar to the compile time version Dirk Herrmann has. This will run in O(n) by just iterating over the input and following a tree.
In the simplest case, you might build a tree for each letter/element. A partial implementation could be:
typedef std::vector<std::string> Vector;
typedef Vector::const_iterator Iterator;
typedef std::string::const_iterator StrIterator;
struct Node
{
std::unique_ptr<Node> children[256];
std::unique_ptr<Node> new_str_child;
int result;
bool is_result;
};
Node root;
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node);
int compare(const Vector &input)
{
return compare(input.begin(), input.end(), input.front().begin(), input.front().end(), &root);
}
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node)
{
if (str_it != str_end)
{
// Check next character
auto next_child = node->children[(unsigned char)*str_it].get();
if (next_child)
return compare(vec_it, vec_end, str_it + 1, str_end, next_child);
else return -1; // No string matched
}
// At end of input string
++vec_it;
if (vec_it != vec_end)
{
auto next_child = node->new_str_child.get();
if (next_child)
return compare(vec_it, vec_end, vec_it->begin(), vec_it->end(), next_child);
else return -1; // Have another string, but not in tree
}
// At end of input vector
if (node->is_result)
return node->result; // Got a match
else return -1; // Run out of input, but all possible matches were longer
}
Which can also be done without recursion. For use cases like yours you will find most nodes only have a single success value, so you can collapse those into prefix substrings, to use the OP example:
"A"
|-"pple" - new vector - "O" - "range" - new vector - "three" - ret 129
| |- "i" - "uyt" - new vector - "asde" ... - ret 234
| |- "O" - "123" - new vector - "three" ... - ret 876
|- new vector "B" - new vector - "CCC" - ret 189
"s" - new vector "O" - new vector "quick" - ret 126
You could make use of the std::equal function, like below:
bool compare(const vector<string>& input1, const vector<string>& input2)
{
if(input1.size() != input2.size())
{
return false;
}
return std::equal(input1.begin(), input1.end(), input2.begin());
}
Can I compare two vector like this
The answer is no; you need to compare a vector with another vector, like this:
vector<string>data = {"ab", "cd", "ef"};
if(data == vector<string>{"ab", "cd", "efg"})
cout << "Equal" << endl;
else
cout << "Not Equal" << endl;
What is the fastest way to do this?
I'm not an expert in asymptotic analysis, but:
Using the equality operator (==) you have a shortcut to compare two vectors: it first checks the sizes and then compares the elements one by one. This gives linear execution time (T(n), where n is the size of the vector), but each string must be compared too, and that is generally another linear comparison (T(m), where m is the size of the string).
Suppose that each string has the same size (m) and you have a vector of size n; then each vector comparison could take T(nm).
So:
If you want a shortcut to compare two vectors, you can use the
equality operator.
If you want a program that performs a fast comparison, you should look for an algorithm for comparing strings.

To find duplicate entry in c++ using 2D Vector (std::vector)

I wrote a program to find duplicate entries in a table. I am a beginner in C++, so I don't know whether this program works efficiently. Is there a better way to write it? I have three tables (2D vectors): 1) aRecord_arr, 2) mainTable and 3) idxTable. idxTable is used to identify the key columns to check for duplicate entries. The records in aRecord_arr are to be added to mainTable. If a record already exists in mainTable, the program shows the error "Duplicate Entry". Please check this program and give your suggestions.
typedef vector<string> rec_t;
typedef vector<rec_t> tab_t;
typedef vector<int> cn_t;
int main()
{
tab_t aRecord_arr= { {"a","apple","fruit"},
{"b","banana","fruit"} };
tab_t mainTable = { {"o","orange","fruit"},
{"p","pineapple","fruit"},
{"b","banana","fruit"},
{"m","melon","fruit"},
{"a","apple","fruit"},
{"g","guava","fruit"} };
tab_t idxTable = { {"code","k"},
{"name","k"},
{"category","n"}};
size_t Num_aRecords = aRecord_arr.size();
int idxSize = idxTable.size();
int mainSize = mainTable.size();
rec_t r1;
rec_t r2;
tab_t t1,t2;
cn_t idx;
for(int i=0;i<idxSize;i++)
{
if(idxTable[i][1]=="k")
{
idx.push_back(i);
}
}
for(size_t j=0;j<Num_aRecords;j++)
{
for(unsigned int id=0;id<idx.size();id++)
{
r1.push_back(aRecord_arr[j][idx[id]]);
}
t1.push_back(std::move(r1));
}
for(int j=0;j<mainSize;j++)
{
for(unsigned int id=0;id<idx.size();id++)
{
r2.push_back(mainTable[j][idx[id]]);
}
t2.push_back(std::move(r2));
}
for(size_t i=0;i<t1.size();i++)
{
for(size_t j=0;j<t2.size();j++)
{
if(t1[i]==t2[j])
{
cout<<"Duplicate Entry"<<endl;
exit(0);
}
}
}
}
If you want to avoid duplicate entries in an array, you should consider using a std::set instead.
What you want is probably a std::map or a std::set
Don't reinvent the wheel, the STL is full of goodies.
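A rough sketch of how the std::set suggestion could be applied to the tables from the question. The helpers keyOf and hasDuplicate are hypothetical names, and the key columns are passed as plain indices instead of being read from idxTable.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::vector<std::string> rec_t;
typedef std::vector<rec_t> tab_t;

// Extracts the key columns (given by index) from one record.
rec_t keyOf(const rec_t& rec, const std::vector<std::size_t>& keyCols)
{
    rec_t key;
    key.reserve(keyCols.size());
    for (std::size_t col : keyCols) {
        key.push_back(rec.at(col));  // at() throws if a record is too short
    }
    return key;
}

// True if any record in newRecords has the same key as a record in mainTable.
bool hasDuplicate(const tab_t& mainTable, const tab_t& newRecords,
                  const std::vector<std::size_t>& keyCols)
{
    std::set<rec_t> seen;
    for (const rec_t& rec : mainTable) {
        seen.insert(keyOf(rec, keyCols));
    }
    for (const rec_t& rec : newRecords) {
        if (seen.count(keyOf(rec, keyCols)) != 0) {
            return true;
        }
    }
    return false;
}
```

Each lookup in the set costs O(log n) key comparisons, instead of the full scan over mainTable that the nested loops in the question perform for every new record.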
You seem to be rooted in a weakly typed language - but C++ is strongly typed.
You will 'pay' the disadvantage of strong typing almost no matter what you do, but you almost painstakingly avoid the advantage.
Let me start with the field that always says 'fruit' - my suggestion is to make this an enum, like:
enum PlantType { fruit, veggie };
Second, you have a vector that always contains 3 strings, all with the same meaning. This seems to be a job for a struct, like:
struct Post {
PlantType kind;
char firstchar;
string name;
// possibly other characteristics
};
The 'firstchar' is probably premature optimization, but let's keep that for now.
Now you want to add a new Post, to an existing vector of Posts, like:
vector<Post> mainDB;
bool AddOne( const Post& p )
{
for( auto& pp : mainDB )
if( pp.name == p.name )
return false;
mainDB.push_back(p);
return true;
}
Now you can use it like:
if( ! AddOne( Post{ fruit, 'b', "banana" } ) )
cerr << "duplicate entry";
If you need speed (at the cost of memory), switch your mainDB to map, like:
map<string,Post> mainDB;
bool AddOne( const Post& p )
{
if( mainDB.find(p.name) != mainDB.end() )
return false;
mainDB[p.name]=p;
return true;
}
this also makes it easier (and faster) to find and use a specific post, like
cout << "the fruit is called " << mainDB["banana"].name ;
Beware that operator[] inserts a default-constructed Post if the entry does not exist; use find() or at() if you do not want that.
As you can see, firstchar was never used, and could be omitted. std::map
keeps its keys in a sorted tree, so a lookup takes O(log n) string comparisons
(std::unordered_map is the hash-based alternative), and it will probably be
faster than anything you or I could whip up by hand.
All of the above assumed inclusion of the correct headers, and
using namespace std;
If you don't like using namespace, prepend std:: in all the right places.
hope it helps :)
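A small sketch of the map-based version that does the duplicate check and the insertion in a single lookup. firstchar is left out, and FindOne is a hypothetical helper added here to show a lookup that never inserts anything.

```cpp
#include <map>
#include <string>
#include <utility>

enum PlantType { fruit, veggie };

struct Post {
    PlantType kind;
    std::string name;
};

std::map<std::string, Post> mainDB;

// insert() leaves the map untouched if the key already exists and reports
// whether it inserted through the bool in the returned pair.
bool AddOne(const Post& p)
{
    return mainDB.insert(std::make_pair(p.name, p)).second;
}

// find() does not create an entry for a missing key, unlike operator[].
const Post* FindOne(const std::string& name)
{
    const auto it = mainDB.find(name);
    return it != mainDB.end() ? &it->second : nullptr;
}
```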

Alternatives to standard functions of C++ to get speed optimization

Just to clarify, I also think the title is a bit silly. We all know that most built-in functions of the language are really well written and fast (some are even written in assembly). Still, maybe there is some advice for my situation. I have a small project which demonstrates the workings of a search engine. In the indexing phase, I have a filter method to filter out unnecessary things from the keywords. Here it is:
bool Indexer::filter(string &keyword)
{
// Remove all characters defined in isGarbage method
keyword.resize(std::remove_if(keyword.begin(), keyword.end(), isGarbage) - keyword.begin());
// Transform all characters to lower case
std::transform(keyword.begin(), keyword.end(), keyword.begin(), ::tolower);
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}
At first sight, these functions (all of them are member functions of STL containers or standard functions) should be fast and not take much time in the indexing phase. But after profiling with Valgrind, the inclusive cost of this filter is ridiculously high: 33.4%. Three standard functions account for most of that percentage: std::remove_if takes 6.53%, std::set::find takes 15.07% and std::transform takes 7.71%.
So if there is anything I can do (or change) to reduce the instruction count of this filter (such as parallelizing or something like that), please give me your advice. Thanks in advance.
UPDATE: Thanks for all your suggestions. In brief, what I need to do is:
1) Merge tolower and remove_if into one by constructing my own loop.
2) Use unordered_set instead of set for a faster find method.
Thus I've chosen Mark_B's as the right answer.
First, are you certain that optimization and inlining are enabled when you compile?
Assuming that's the case, I would first try writing my own transformer that combines removing garbage and lower-casing into one step to prevent iterating over the keyword that second time.
There's not a lot you can do about the find without using a different container such as unordered_set as suggested in a comment.
Is it possible for your application that doing the filtering really just is a really CPU-intensive part of the operation?
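A minimal sketch of both suggestions combined: one pass that removes garbage and lower-cases in place, followed by a lookup in a std::unordered_set. The isGarbage shown here is only a stand-in (the question does not show the real one), and the stop-word set is passed as a parameter instead of being the stopwords_ member.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// Assumption: stand-in for the question's isGarbage(), which is not shown there.
inline bool isGarbage(char c)
{
    return !std::isalnum(static_cast<unsigned char>(c));
}

// Single pass: drop garbage characters and lower-case the rest in place,
// then check the result against a hash set of stop words.
bool filter(std::string& keyword, const std::unordered_set<std::string>& stopwords)
{
    std::string::size_type out = 0;
    for (char c : keyword) {
        if (!isGarbage(c)) {
            // 'out' never runs ahead of the read position, so writing in place is safe.
            keyword[out++] = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
    }
    keyword.resize(out);
    return !keyword.empty() && stopwords.count(keyword) == 0;
}
```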
If you use a boost filter iterator you can merge the remove_if and transform into one, something like (untested):
keyword.erase(std::transform(boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.begin(), keyword.end()),
boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.end(), keyword.end()),
keyword.begin(),
::tolower), keyword.end());
This is assuming you want the side effect of modifying the string to still be visible externally, otherwise pass by const reference instead and just use count_if and a predicate to do all in one.
You can build a hierarchical data structure (basically a tree) for the list of stop words that makes "in-place" matching possible, for example if your stop words are SELECT, SELECTION, SELECTED you might build a tree:
|- (other/empty accept)
\- S-E-L-E-C-T- (empty, fail)
|- (other, accept)
|- I-O-N (fail)
\- E-D (fail)
You can traverse a tree structure like that simultaneously whilst transforming and filtering without any modifications to the string itself. In reality you'd want to compact the multi-character runs into a single node in the tree (probably).
You can build such a data structure fairly trivially with something like:
#include <iostream>
#include <map>
#include <memory>
#include <string>
class keywords {
struct node {
node() : end(false) {}
std::map<char, std::unique_ptr<node>> children;
bool end;
} root;
void add(const std::string::const_iterator& stop, const std::string::const_iterator c, node& n) {
if (!n.children[*c])
n.children[*c] = std::unique_ptr<node>(new node);
if (stop == c+1) {
n.children[*c]->end = true;
return;
}
add(stop, c+1, *n.children[*c]);
}
public:
void add(const std::string& str) {
add(str.end(), str.begin(), root);
}
bool match(const std::string& str) const {
const node *current = &root;
std::string::size_type pos = 0;
while(current && pos < str.size()) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(str[pos++]);
current = it != current->children.end() ? it->second.get() : nullptr;
}
if (!current) {
return false;
}
return current->end;
}
};
int main() {
keywords list;
list.add("SELECT");
list.add("SELECTION");
list.add("SELECTED");
std::cout << list.match("TEST") << std::endl;
std::cout << list.match("SELECT") << std::endl;
std::cout << list.match("SELECTOR") << std::endl;
std::cout << list.match("SELECTED") << std::endl;
std::cout << list.match("SELECTION") << std::endl;
}
This worked as you'd hope and gave:
0
1
0
1
1
Which then just needs to have match() modified to call the transformation and filtering functions appropriately e.g.:
const char c = str[pos++];
if (filter(c)) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(transform(c));
}
You can optimise this a bit (compact long single string runs) and make it more generic, but it shows how doing everything in-place in one pass might be achieved and that's the most likely candidate for speeding up the function you showed.
(Benchmark changes of course)
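To show how the filtering and transformation could be folded into the tree walk, here is a compact sketch with free functions instead of the class above. The helpers keep and lower are hypothetical stand-ins for the real filter and transform, and the stop words are assumed to be stored in lower case.

```cpp
#include <cctype>
#include <map>
#include <memory>
#include <string>

struct Node {
    std::map<char, std::unique_ptr<Node>> children;
    bool end = false;  // true if a stop word ends at this node
};

// Inserts a (lower-case) stop word into the tree.
void add(Node& root, const std::string& word)
{
    Node* current = &root;
    for (char c : word) {
        std::unique_ptr<Node>& child = current->children[c];
        if (!child) child.reset(new Node);
        current = child.get();
    }
    current->end = true;
}

// Hypothetical filter and transform steps.
inline bool keep(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
inline char lower(char c) { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }

// Walks the tree over the raw input, skipping filtered characters and
// transforming the rest on the fly. The input string is never modified.
bool match(const Node& root, const std::string& raw)
{
    const Node* current = &root;
    for (char c : raw) {
        if (!keep(c)) continue;
        const auto it = current->children.find(lower(c));
        if (it == current->children.end()) return false;
        current = it->second.get();
    }
    return current->end;
}
```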
If a call to isGarbage() does not require synchronization, then parallelization should be the first optimization to consider (given of course that filtering one keyword is a big enough task, otherwise parallelization should be done one level higher). Here's how it could be done - in one pass through the original data, multi-threaded using Threading Building Blocks:
#include <cctype>
#include <iostream>
#include <string>
#include <tbb/tbb.h>
bool isGarbage(char c) {
return c == 'a';
}
struct RemoveGarbageAndLowerCase {
std::string result;
const std::string& keyword;
RemoveGarbageAndLowerCase(const std::string& keyword_) : keyword(keyword_) {}
RemoveGarbageAndLowerCase(RemoveGarbageAndLowerCase& r, tbb::split) : keyword(r.keyword) {}
void operator()(const tbb::blocked_range<size_t> &r) {
for(size_t i = r.begin(); i != r.end(); ++i) {
if(!isGarbage(keyword[i])) {
result.push_back(tolower(keyword[i]));
}
}
}
void join(RemoveGarbageAndLowerCase &rhs) {
result.insert(result.end(), rhs.result.begin(), rhs.result.end());
}
};
void filter_garbage(std::string &keyword) {
RemoveGarbageAndLowerCase res(keyword);
tbb::parallel_reduce(tbb::blocked_range<size_t>(0, keyword.size()), res);
keyword = res.result;
}
int main() {
std::string keyword = "ThIas_iS:saome-aTYpe_Ofa=MoDElaKEYwoRDastrang";
filter_garbage(keyword);
std::cout << keyword << std::endl;
return 0;
}
Of course, the final code could be improved further by avoiding data copying, but the goal of the sample is to demonstrate that it's an easily threadable problem.
You might make this faster by making a single pass through the string, ignoring the garbage characters. Something like this (pseudo-code):
std::string normalizedKeyword;
normalizedKeyword.reserve(keyword.size());
for (auto p = keyword.begin(); p != keyword.end(); ++p)
{
char ch = *p;
if (!isGarbage(ch))
normalizedKeyword.push_back(static_cast<char>(tolower(static_cast<unsigned char>(ch)))); // push_back: append() has no single-char overload
}
// then search for normalizedKeyword in stopwords
This should eliminate the overhead of std::remove_if, although there is a memory allocation and some new overhead of copying characters to normalizedKeyword.
The problem here isn't the standard functions, it's your use of them. You are making multiple passes over your string when you obviously need to be doing only one.
What you need to do probably can't be done with the standard algorithms alone; you'll need help from Boost or to roll your own.
You should also carefully consider whether resizing the string is actually necessary. Yeah, you might save some space but it's going to cost you in speed. Removing this alone might account for quite a bit of your operation's expense.
Here's a way to combine the garbage removal and lower-casing into a single step. It won't work for multi-byte encoding such as UTF-8, but neither did your original code. I assume 0 and 1 are both garbage values.
bool Indexer::filter(string &keyword)
{
static char replacements[256] = {1}; // initialize with an invalid char
if (replacements[0] == 1)
{
for (int i = 0; i < 256; ++i)
replacements[i] = isGarbage(i) ? 0 : ::tolower(i);
}
string::iterator tail = keyword.begin();
for (string::iterator it = keyword.begin(); it != keyword.end(); ++it)
{
unsigned int index = (unsigned int) *it & 0xff;
if (replacements[index])
*tail++ = replacements[index];
}
keyword.resize(tail - keyword.begin());
    // After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
    if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
        return false;
    return true;
}
The largest part of your timing is the std::set::find so I'd also try std::unordered_set to see if it improves things.
I would implement it with lower level C functions, something like this maybe (not checking this compiles), doing the replacement in place and not resizing the keyword.
Instead of using a set for garbage characters, I'd add a static table covering all 256 char values (yes, this works for single-byte encodings only), with 0 for characters that are OK and 1 for those that should be filtered out. Something like:
static const char GARBAGE[256] = { 1, 1, 1, 1, 1, ...., 0, 0, 0, 0, 1, 1, ... };
Then for each character at offset pos in const char *str you can just check if (GARBAGE[(unsigned char)str[pos]] == 1); the cast matters because plain char may be signed, and a negative index would read outside the table.
This is more or less what an unordered set does, but with far fewer instructions. stopwords should be an unordered set if it isn't already.
Now the filtering function (I'm assuming ASCII/UTF-8 and null-terminated strings here):
bool Indexer::filter(char *keyword)
{
char *head = keyword;
char *tail = keyword;
while (*head != '\0') {
//copy non garbage chars from head to tail, lowercasing them while at it
if (!GARBAGE[(unsigned char)*head]) {
*tail = tolower((unsigned char)*head);
++tail; //we only advance tail if the char was not garbage
}
}
//head always advances
++head;
}
*tail = '\0';
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (tail == keyword || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}
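For reference, here is a minimal self-contained sketch of the table-driven, in-place approach described above. It is not the original Indexer code: the isGarbage rule (anything that is not an ASCII letter or digit), the makeGarbageTable helper, and the free function filterKeyword are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_set>

// Assumption: "garbage" is any byte that is not a letter or digit
// (in the default "C" locale this means ASCII alphanumerics only).
static bool isGarbage(unsigned char c) {
    return !std::isalnum(c);
}

// Build the 256-entry lookup table once instead of writing it out by hand.
static std::array<bool, 256> makeGarbageTable() {
    std::array<bool, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[i] = isGarbage(static_cast<unsigned char>(i));
    }
    return table;
}

// Filters and lower-cases `keyword` in place (one pass, no reallocation).
// Returns false if nothing is left or the result is a stop word.
bool filterKeyword(char* keyword, const std::unordered_set<std::string>& stopwords) {
    assert(keyword != nullptr);
    static const std::array<bool, 256> garbage = makeGarbageTable();
    char* tail = keyword;
    for (const char* head = keyword; *head != '\0'; ++head) {
        // Convert to unsigned char before indexing: plain char may be signed.
        const unsigned char c = static_cast<unsigned char>(*head);
        if (!garbage[c]) {
            *tail++ = static_cast<char>(std::tolower(c));
        }
    }
    *tail = '\0';
    return tail != keyword && stopwords.find(keyword) == stopwords.end();
}
```

Called on a writable buffer such as char word[] = "He-LLo_1!";, this leaves "hello1" in the buffer. The table is built lazily on the first call, so the per-character cost is one array lookup.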

Is this the right way to use recursion?

Given strings s and t, determine recursively whether t is contained in s; return true if it is.
Example: bool find("Names Richard", "Richard") == true;
I have written the code below, but I'm not sure if it's the right way to use recursion in C++; I just learned recursion today in class.
#include <iostream>
using namespace std;
bool find(string s, string t)
{
if (s.empty() || t.empty())
return false;
int find = static_cast<int>(s.find(t));
if (find > 0)
return true;
}
int main()
{
bool b = find("Mississippi", "sip");
string s;
if (b == 1) s = "true";
else
s = "false";
cout << s;
}
If anyone finds an error in my code, please tell me so I can fix it, or point me to where I can learn more about this topic. I need to get ready for a test on recursion this Wednesday.
The question has changed since I wrote my answer.
My comments are on the code that looked like this (and could recurse)...
#include <iostream>
using namespace std;
bool find(string s, string t)
{
if (s.empty() || t.empty())
return false;
string start = s.substr(0, 2);
if (start == t && find(s.substr(3), t));
return true;
}
int main()
{
bool b = find("Mississippi", "sip");
string s;
if (b == 1) s = "true";
else
s = "false";
cout << s;
}
Watch out for this:
if (start == t && find(s.substr(3), t));
return true;
This does not do what you think it does.
The ; at the end of the if-statement leaves an empty body. Your find() function will return true regardless of the outcome of that test.
I recommend you turn up the warning levels on your compiler to catch this kind of issue before you have to debug it.
As an aside, I find using braces around every code-block, even one-line blocks, helps me avoid this kind of mistake.
There are other errors in your code, too. Removing the magic numbers 2 and 3 from find() will encourage you to think about what they represent and point you on the right path.
How would you expect start == t && find(s.substr(3), t) to work? If you can express an algorithm in plain English (or your native tongue), you have a much higher chance of being able to express it in C++.
Additionally, I recommend adding test cases that should return false (such as find("satsuma", "onion")) to ensure that your code handles them as well as it handles calls that should return true.
The last piece of advice is stylistic: laying your code out like this makes the boolean expression that you are testing more obvious, without resorting to a temporary and comparing it to 1:
int main()
{
std::string s;
if (find("Mississippi", "sip"))
{
s = "true";
}
else
{
s = "false";
}
std::cout << s << std::endl;
}
Good luck with your class!
Your recursive function needs two things:
Definite conditions of failure and success (there may be more than one of each).
A call to itself that processes a simpler version of the problem (getting closer to the answer).
Here's a quick analysis:
bool find(string s, string t)
{
if (s.empty() || t.empty()) //definite condition of failure. Good
return false;
string start = s.substr(0, 2);
if (start == t && find(s.substr(3), t)); //mixed up definition of success and recursive call
return true;
}
Try this instead:
bool find(string s, string t)
{
if (s.empty() || t.empty()) //definite condition of failure. Done!
return false;
string start = s.substr(0, t.size()); //as many characters as t has, not a fixed 2
if (start == t) //definite condition of success. Done!
return true;
else
return find(s.substr(1), t); //simplify the problem (drop one character) and return whatever it finds
}
You're on the right lines - so long as the function calls itself you can say that it's recursive - but even the simplest testing should tell you that your code doesn't work correctly. Change "sip" to "sipx", for example, and it still outputs true. Have you compiled and run this program? Have you tested it with various different inputs?
You are not using recursion. Using std::string::find in your function feels like cheating (this will most likely not earn points).
The only reasonable interpretation of the task is: Check if t is an infix of s without using loops or string functions.
Let's look at the trivial case: Epsilon (the empty word) is an infix of every word, so if t.empty() holds, you must return true.
Otherwise you have two choices to make:
t might be a prefix of s which is simple to check using recursion; simply check if the first character of t equals the first character of s and call isPrefix with the remainder of the strings. If this returns true, you return true.
Otherwise you pop the first character of s (and not of t) and proceed recursively (calling find this time).
If you follow this recipe (which btw. is easier to implement with char const* than with std::string if you ask me) you get a recursive function that only uses conditionals and no library support.
Note: this is not at all the most efficient implementation, but you didn't ask for efficiency but for a recursive function.
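A minimal sketch of that recipe, using char const* as suggested. The function is named contains here (an assumption, chosen to avoid clashing with the asker's find), and isPrefix is the helper the recipe mentions; neither uses loops or string library functions.

```cpp
#include <cassert>

// True if `t` is a prefix of `s`; compares one character per call.
bool isPrefix(const char* s, const char* t) {
    assert(s != nullptr && t != nullptr);
    if (*t == '\0') return true;                // the empty word is a prefix of anything
    if (*s == '\0' || *s != *t) return false;   // ran out of s, or mismatch
    return isPrefix(s + 1, t + 1);              // first characters match: check the rest
}

// True if `t` occurs anywhere in `s`.
bool contains(const char* s, const char* t) {
    if (isPrefix(s, t)) return true;   // t starts at the current position
    if (*s == '\0') return false;      // nothing left to search
    return contains(s + 1, t);         // drop the first character of s, keep t
}
```

With this, contains("Mississippi", "sip") is true, while contains("Mississippi", "sipx") and contains("satsuma", "onion") are both false. An empty t is reported as contained in every s, matching the trivial case above.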

Evaluating expressions inside C++ strings: "Hi ${user} from ${host}"

I'm looking for a clean C++ way to parse a string containing expressions wrapped in ${} and build a result string from the programmatically evaluated expressions.
Example: "Hi ${user} from ${host}" will be evaluated to "Hi foo from bar" if I implement the program to let "user" evaluate to "foo", etc.
The current approach I'm thinking of consists of a state machine that eats one character at a time from the string and evaluates the expression after reaching '}'. Any hints or other suggestions?
Note: boost:: is most welcome! :-)
Update: Thanks for the first three suggestions! Unfortunately I made the example too simple! I need to be able to examine the contents within ${} so it's not a simple search and replace. Maybe it will say ${uppercase:foo} and then I have to use "foo" as a key in a hashmap and then convert it to uppercase, but I tried to avoid the inner details of ${} when writing the original question above... :-)
#include <iostream>
#include <conio.h>
#include <string>
#include <map>
using namespace std;
struct Token
{
enum E
{
Replace,
Literal,
Eos
};
};
class ParseExp
{
private:
enum State
{
State_Begin,
State_Literal,
State_StartRep,
State_RepWord,
State_EndRep
};
string m_str;
int m_char;
unsigned int m_length;
string m_lexme;
Token::E m_token;
State m_state;
public:
void Parse(const string& str)
{
m_char = 0;
m_str = str;
m_length = str.size();
}
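Token::E NextToken()
{
if (m_char >= m_length)
{
m_token = Token::Eos;
return m_token; // nothing left to scan
}
m_lexme = "";
m_state = State_Begin;
bool stop = false;
while (m_char < m_length && !stop) // '<' so we never read the terminator past the last character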
{
char ch = m_str[m_char++];
switch (m_state)
{
case State_Begin:
if (ch == '$')
{
m_state = State_StartRep;
m_token = Token::Replace;
continue;
}
else
{
m_state = State_Literal;
m_token = Token::Literal;
}
break;
case State_StartRep:
if (ch == '{')
{
m_state = State_RepWord;
continue;
}
else
continue;
break;
case State_RepWord:
if (ch == '}')
{
stop = true;
continue;
}
break;
case State_Literal:
if (ch == '$')
{
stop = true;
m_char--;
continue;
}
}
m_lexme += ch;
}
return m_token;
}
const string& Lexme() const
{
return m_lexme;
}
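// Named CurrentToken rather than Token: a member called Token would change the meaning of the Token struct used above, which GCC rejects.
Token::E CurrentToken() const
{
return m_token;
}
};
string DoReplace(const string& str, const map<string, string>& dict)
{
ParseExp exp;
exp.Parse(str);
string ret = "";
while (exp.NextToken() != Token::Eos)
{
if (exp.CurrentToken() == Token::Literal)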
ret += exp.Lexme();
else
{
map<string, string>::const_iterator iter = dict.find(exp.Lexme());
if (iter != dict.end())
ret += (*iter).second;
else
ret += "undefined(" + exp.Lexme() + ")";
}
}
return ret;
}
int main()
{
map<string, string> words;
words["hello"] = "hey";
words["test"] = "bla";
cout << DoReplace("${hello} world ${test} ${undef}", words);
_getch();
}
I will be happy to explain anything about this code :)
How many evaluation expressions do you intend to have? If it's small enough, you might just want to use brute force.
For instance, if you have a std::map<string, string> that maps each key to its value (say, user to Matt Cruikshank), you might just iterate over the entire map and replace every "${" + key + "}" in your string with its value.
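A rough sketch of that brute-force idea, assuming plain ${key} placeholders only. The function name expandAll is made up for illustration.

```cpp
#include <map>
#include <string>

// Replaces every "${key}" in `text` with the matching value from `vars`.
// Placeholders whose key is not in the map are left untouched.
std::string expandAll(std::string text, const std::map<std::string, std::string>& vars) {
    for (const auto& entry : vars) {
        const std::string placeholder = "${" + entry.first + "}";
        std::string::size_type pos = 0;
        while ((pos = text.find(placeholder, pos)) != std::string::npos) {
            text.replace(pos, placeholder.size(), entry.second);
            pos += entry.second.size();  // continue after the text just inserted
        }
    }
    return text;
}
```

With a map holding user -> foo and host -> bar, expandAll("Hi ${user} from ${host}", vars) yields "Hi foo from bar". One caveat: the string is rescanned once per key, and a value that itself contains a placeholder for a later key would be expanded again, so this only suits small maps and trusted values.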
Boost::Regex would be the route I'd suggest. The regex_replace algorithm should do most of your heavy lifting.
If you don't like my first answer, then dig in to Boost Regex - probably boost::regex_replace.
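A sketch of the regex route that also handles the ${uppercase:foo} case from the question's update. To keep it dependency-free it uses std::regex (the standard-library descendant of Boost.Regex) rather than boost::regex_replace itself, and it walks the matches with an iterator so each expression can be inspected. The evaluate helper and its "uppercase:" prefix handling are assumptions about what the asker wants.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <regex>
#include <string>

// Hypothetical evaluator: "name" looks the key up; "uppercase:name" looks it
// up and upper-cases the value. Unknown keys evaluate to an empty string.
std::string evaluate(const std::string& expr, const std::map<std::string, std::string>& vars) {
    const std::string prefix = "uppercase:";
    const bool upper = expr.compare(0, prefix.size(), prefix) == 0;
    const std::string key = upper ? expr.substr(prefix.size()) : expr;
    const auto it = vars.find(key);
    std::string value = (it != vars.end()) ? it->second : std::string();
    if (upper) {
        std::transform(value.begin(), value.end(), value.begin(),
                       [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    }
    return value;
}

// Copies literal text through and replaces each ${...} with its evaluation.
std::string expand(const std::string& text, const std::map<std::string, std::string>& vars) {
    static const std::regex placeholder(R"(\$\{([^}]*)\})");
    std::string result;
    std::string::const_iterator last = text.begin();
    for (std::sregex_iterator it(text.begin(), text.end(), placeholder), end; it != end; ++it) {
        const std::smatch& m = *it;
        result.append(last, m[0].first);        // literal text before the match
        result += evaluate(m[1].str(), vars);   // m[1] is the text between the braces
        last = m[0].second;                     // continue after the closing brace
    }
    result.append(last, text.end());            // trailing literal text
    return result;
}
```

Because each match is handed to evaluate, adding another operation (say, a lowercase: prefix) only touches that one function. Nested expressions such as ${a${b}} are not handled by this pattern.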
How complex can the expressions get? Are they just identifiers, or can they be actual expressions like "${numBad/(double)total*100.0}%"?
Do you have to use the ${ and } delimiters or can you use other delimiters?
You don't really care about parsing. You just want to generate and format strings with placeholder data in it. Right?
For a platform neutral approach, consider the humble sprintf function. It is the most ubiquitous and does what I am assuming that you need. It works on "char stars" so you are going to have to get into some memory management.
Are you using the STL? Then consider the std::basic_string::replace member function. It doesn't do exactly what you want but you could make it work.
If you are using ATL/MFC, then consider the CStringT::Format method.
If you are managing the variables separately, why not go the route of an embeddable interpreter? I have used Tcl in the past, but you might try Lua, which is designed for embedding. Ruby and Python are two other interpreters that are easy to embed, but aren't quite as lightweight. The strategy is to instantiate an interpreter (a context), add variables to it, then evaluate strings within that context. An interpreter will properly handle malformed input that could otherwise lead to security or stability problems for your application.