How could I speed up comparison of std::string against string literals?

I have a bunch of code where objects of type std::string are compared for equality against string literals. Something like this:
//const std::string someString = //blahblahblah;
if( someString == "(" ) {
//do something
} else if( someString == ")" ) {
//do something else
} else if// this chain can be very long
The comparison time accumulates to a serious amount (yes, I profiled) and so it'd be nice to speed it up.
The code compares the string against numerous short string literals, and this comparison can hardly be avoided. Leaving the string declared as std::string is most likely inevitable: there are thousands of lines of code like that. Leaving the string literals and the comparison with == in place is also likely inevitable, since rewriting the whole code would be a pain.
The problem is that the STL implementation that comes with Visual C++ 11 uses a somewhat strange approach. == is mapped onto std::operator==(const basic_string&, const char*), which calls basic_string::compare( const char* ), which in turn calls std::char_traits<char>::length( const char* ), which calls strlen() to compute the length of the string literal. Then the comparison runs for the two strings, and the lengths of both strings are passed into that comparison.
The compiler has a hard time analyzing all this and emits code that traverses the string literal twice. With short literals that's not much time but every comparison involves traversing the literal twice instead of once. Simply calling strcmp() would most likely be faster.
Is there anything I could do like perhaps writing a custom comparator class that would help avoid traversing the string literals twice in this scenario?

Similar to Dietmar's solution, but with slightly less editing: you can wrap the string (once) instead of each literal
#include <string>
#include <cstring>
struct FastLiteralWrapper {
std::string const &s;
explicit FastLiteralWrapper(std::string const &s_) : s(s_) {}
template <std::size_t ArrayLength>
bool operator== (char const (&other)[ArrayLength]) const {
std::size_t const StringLength = ArrayLength - 1;
return StringLength == s.size()
&& std::memcmp(s.data(), other, StringLength) == 0;
}
};
and your code becomes:
const std::string someStdString = "blahblahblah";
// just for the context of the comparison:
FastLiteralWrapper someString(someStdString);
if( someString == "(" ) {
//do something
} else if( someString == ")" ) {
//do something else
} else if// this chain can be very long
NB. the fastest solution - at the cost of more editing - is probably to build a (perfect) hash or trie mapping string literals to enumerated constants, and then just switch on the looked-up value. Long if/else if chains usually smell bad IMO.
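The lookup-then-switch idea from the note above could be sketched roughly as follows. This is only an illustration: it uses std::unordered_map rather than a true perfect hash or trie, and the names Token, classify and nestingDelta are hypothetical, not taken from the question.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token set; the names are illustrative only.
enum class Token { Unknown, LeftParen, RightParen, NotEqual, Equal };

// Map each literal to its token once; every later classification is one lookup.
inline Token classify(const std::string& s) {
    static const std::unordered_map<std::string, Token> table = {
        {"(", Token::LeftParen},
        {")", Token::RightParen},
        {"!=", Token::NotEqual},
        {"==", Token::Equal},
    };
    const auto it = table.find(s);
    return it == table.end() ? Token::Unknown : it->second;
}

// Example dispatch: the if/else-if chain becomes a switch on the token.
inline int nestingDelta(const std::string& s) {
    switch (classify(s)) {
        case Token::LeftParen:  return +1;
        case Token::RightParen: return -1;
        default:                return 0;
    }
}
```

Whether this beats the wrapper above depends on the number of literals and on the hash implementation, so it is worth profiling both.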

Well, aside from C++14's string_literal, you could easily code up a solution:
For comparison with a single character, use a character literal and:
bool operator==(const std::string& s, char c)
{
return s.size() == 1 && s[0] == c;
}
For comparison with a string literal, you can use something like this:
template<std::size_t N>
bool operator==(const std::string& s, char const (&literal)[N])
{
return s.size() == N - 1 && std::memcmp(s.data(), literal, N - 1) == 0; // N counts the terminating '\0'
}
Disclaimer:
The first might even be superfluous,
Only do this if you measure an improvement over what you had.

If you have a long chain of string literals to compare against, there is likely some potential to compare prefixes and group common processing. Especially when comparing a known set of strings for equality with an input string, there is also the option to use a perfect hash and key the operations off the integer it produces.
Since the use of a perfect hash will probably have the best performance but also requires major changes of the code layout, an alternative could be to determine the size of the string literals at compile time and use this size while comparing. For example:
class Literal {
char const* d_base;
std::size_t d_length;
public:
template <std::size_t Length>
Literal(char const (&base)[Length]): d_base(base), d_length(Length - 1) {}
bool operator== (std::string const& other) const {
return other.size() == this->d_length
&& !std::memcmp(this->d_base, other.c_str(), this->d_length); // needs <cstring>
}
bool operator!=(std::string const& other) const { return !(*this == other); }
};
bool operator== (std::string const& str, Literal const& literal) {
return literal == str;
}
bool operator!= (std::string const& str, Literal const& literal) {
return !(str == literal);
}
Obviously, this assumes that your literals don't embed null characters ('\0') other than the implicitly added terminating null character as the static length would otherwise be distorted. Using C++11 constexpr it would be possible to guard against that possibility but the code gets somewhat more complicated without any good reason. You'd then compare your strings using something like
if (someString == Literal("(")) {
...
}
else if (someString == Literal(")")) {
...
}

The fastest string comparison you can get is by interning the strings: Build a large hash table that contains all strings that are ever created. Ensure that whenever a string object is created, it is first looked up from the hash table, only creating a new object if no preexisting object is found. Naturally, this functionality should be encapsulated in your own string class.
Once you have done this, string comparison is equivalent to comparing their addresses.
This is actually quite an old technique first popularized with the LISP language.
The point, why this is faster, is that every string only has to be created once. If you are careful, you'll never generate the same string twice from the same input bytes, so string creation overhead is controlled by the amount of input data you work through. And hashing all your input data once is not a big deal.
The comparisons, on the other hand, tend to involve the same strings over and over again (like your comparing to literal strings) when you write some kind of a parser or interpreter. And these comparisons are reduced to a single machine instruction.
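A minimal sketch of such an intern table is shown below, assuming a std::unordered_set as the hash table. The class name InternPool and its interface are hypothetical; a real implementation would hide the pool behind its own string class, as described above.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical intern pool: stores each distinct string exactly once.
class InternPool {
public:
    // Returns a pointer to the pooled copy of s, inserting it on first sight.
    // Element addresses in std::unordered_set stay valid across rehashing,
    // so the returned pointer can be kept for as long as the pool lives.
    const std::string* intern(const std::string& s) {
        return &*pool_.insert(s).first;
    }

private:
    std::unordered_set<std::string> pool_;
};
```

Two interned strings are then equal exactly when their pointers are equal, so a comparison such as `a == b` on the `const std::string*` values replaces the character-by-character comparison.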

Two other ideas:
A) Build an FSA using a lexical analyser tool like flex, so the string is converted to an integer token value depending on what it matches.
B) Use the length to break up long else-if chains, possibly making them partly table-driven.
Why not get the length of the string (something, below) at the top, then compare only against the literals it could possibly match?
If there are a lot of them, it may be worth making the code table-driven, using a map and function pointers. You could special-case the single-character literals, for example with a function lookup table.
Finding non-matches fast and handling the common lengths may suffice without requiring too much code restructuring, and the result should be more maintainable as well as faster.
int len = strlen (something);
if ( ! knownliterallength[ len]) {
// not match
...
} else {
// First char may be used to index search, or literals are stored in map with *func()
switch (len)
{
case 1: // Could use a look table index by char and *func()
processchar( something[0]);
break;
case 2: // Short strings
case 3:
case 4:
processrunts( something);
break;
default:
// First char used to index search, or literals are stored in map with *func()
processlong( something);
break;
}
}

This is not the prettiest solution but it has proved quite fast when there is a lot of short strings to be compared (like operators and control characters/keywords in a script parser?).
Create a search tree based on string length and only compare characters. Try to represent known strings as an enumeration if this makes it cleaner in the particular implementation.
Short example:
#include <cstdio>
#include <string>
enum StrE {
UNKNOWN = 0 ,
RIGHT_PAR ,
LEFT_PAR ,
NOT_EQUAL ,
EQUAL
};
StrE strCmp(const std::string& str) // take by reference to avoid a copy per call
{
size_t l = str.length();
switch(l)
{
case 1:
{
if(str[0] == ')') return RIGHT_PAR;
if(str[0] == '(') return LEFT_PAR;
// ...
break;
}
case 2:
{
if(str[0] == '!' && str[1] == '=') return NOT_EQUAL;
if(str[0] == '=' && str[1] == '=') return EQUAL;
// ...
break;
}
// ...
}
return UNKNOWN;
}
int main()
{
std::string input = "==";
switch(strCmp(input))
{
case RIGHT_PAR:
printf("right par");
break;
case LEFT_PAR:
printf("left par");
break;
case NOT_EQUAL:
printf("not equal");
break;
case EQUAL:
printf("equal");
break;
case UNKNOWN:
printf("unknown");
break;
}
}

Related

What is the C++ convention when I need to add a useless return statement?

I was trying to write a function that returns the first non-repeated character in a string. The algorithm I made was:
Assert that the string is non-empty
Iterate through the string and add all non-repeated characters to a set
Assert that the set be non-empty
Iterate through string again and return the first character that's in the set
Add a useless return statement to make the compiler happy. (Arbitrarily return 'F')
Obviously my algorithm is very "brute force" and could be improved on. It runs, anyhow. I was wondering if there's a better way to do this and was also wondering what the convention is for useless return statements. Don't be afraid to criticize me harshly. I'm trying to become a C++ stiffler. ;)
#include <iostream>
#include <string>
#include <set>
#include <cassert>
char first_nonrepeating_char(const std::string&);
int main() {
std::string S = "yodawgIheardyoulike";
std::cout << first_nonrepeating_char(S);
}
// Finds that first non-repeated character in the string
char first_nonrepeating_char(const std::string& str) {
assert (str.size() > 0);
std::set<char> nonRepChars;
std::string::const_iterator it = str.begin();
while (it != str.end()) {
if (nonRepChars.count(*it) == 0) {
nonRepChars.insert(*it);
} else {
nonRepChars.erase(*it);
}
++it;
}
assert (nonRepChars.size() != 0);
it = str.begin();
while (it != str.end()) {
if (nonRepChars.count(*it) == 1) return (*it);
++it;
}
return ('F'); // NEVER HAPPENS
}

The main problem is just getting rid of warnings.
Ideally you should be able to just say
assert( false ); // Should never get here
but unfortunately that does not get rid of all warnings with the compilers I use most, namely Visual C++ and g++.
Instead I do this:
xassert_should_never_get_here();
where xassert_should_never_get_here is a function that
is declared as "noreturn" by compiler-specific means, e.g. __declspec for Visual C++,
has an assert(false) to handle debug builds,
then throws a std::logic_error.
The last two points are accomplished by a macro XASSERT (its actual name in my code is CPPX_XASSERT, it's always a good idea to use prefixes for macro names so as to reduce name conflict probability).
Of course, the assertion that you should not get to the end, is equivalent to an assertion that the argument string does contain at least one non-repeated character, which therefore is a precondition of the function (part of its contract), which I think should be documented by a comment. :-)
There are three main "modern C++" ways of coding things up when you do not have that precondition, namely
choose one char value to signify "no such", e.g. '\0', or
throw an exception in the case of no such, or
return a boxed result which can be logically "empty", e.g. the Boost class corresponding to Barton and Nackmann's Fallible.
About the algorithm: when you're not interested in where the first non-repeating char is, you can avoid the rescan of the string by maintaining a count per character, e.g. by using a map<char, int> instead of a set<char>.
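As one possible illustration of the "throw an exception" option listed above, here is a sketch that drops the precondition and needs no unreachable return. The function name first_nonrepeating_char_checked and the choice of std::invalid_argument (a subclass of std::logic_error) are assumptions made for this example, not the answer's actual code.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the first character that occurs exactly once in str.
// Throws std::invalid_argument when no such character exists,
// so every path either returns a real result or throws.
inline char first_nonrepeating_char_checked(const std::string& str) {
    std::size_t counts[UCHAR_MAX + 1] = {};  // one slot per unsigned char value
    for (std::size_t i = 0; i < str.size(); ++i)
        ++counts[static_cast<unsigned char>(str[i])];
    for (std::size_t i = 0; i < str.size(); ++i)
        if (counts[static_cast<unsigned char>(str[i])] == 1)
            return str[i];
    throw std::invalid_argument("no non-repeated character");
}
```

Because the function ends in a throw, compilers generally do not warn about a missing return here.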

There is a simpler and "cleaner" way of doing it, but it is not computationally faster than "brute force".
Use a table that counts the number of occurrences of each character in the input string.
Then go over the input string one more time, and return the first character whose count is 1.
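char GetFirstNonRepeatedChar(const char* s)
{
int table[256] = {0};
// Index through unsigned char: plain char may be signed, which would give a negative index.
for (int i=0; s[i]!=0; i++)
table[(unsigned char)s[i]]++;
for (int i=0; s[i]!=0; i++)
if (table[(unsigned char)s[i]] == 1)
return s[i];
return 0; // no non-repeated character found
}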
Note: the above will work for ASCII strings.
If you're using a different format, then you'll need to change the 256 (and the char of course).

Switch Case in c++ [duplicate]

This question already has answers here:
C/C++: switch for non-integers
(17 answers)
Closed 9 years ago.
How can I compare an array of char in c++ using switch-case?
Here is a part of my code:
char buff[256];
switch(buff){
case "Name": printf("%s",buff);
break;
case "Number": printf("Number \"%s\"",buff);
break;
default : break;
}
I receive the error: "error: switch quantity not an integer". How can I resolve it?

If you really need a switch statement, you will need to convert your buff variable to an integer. To do so, you could use a hash function or a std::map.
The easy approach would be to make a std::map<std::string,int> containing the keys you want to use in the switch associated with unique int values. You would get something like:
std::map<std::string,int> switchmap;
...
switch(switchmap.find(std::string(buff))->second){
...
}
The std::map approach is very readable and shouldn't cause much confusion.
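A slightly fuller sketch of this approach follows. It checks the result of find before using it, because dereferencing the end iterator for an unknown key is undefined behaviour. The names FieldCode, fieldCode and describe are hypothetical and only serve the example.

```cpp
#include <map>
#include <string>

// Hypothetical integer codes for the strings of interest.
enum FieldCode { FIELD_UNKNOWN = 0, FIELD_NAME = 1, FIELD_NUMBER = 2 };

// Looks up buff in the map; returns FIELD_UNKNOWN when the key is absent
// instead of dereferencing the end iterator.
inline int fieldCode(const char* buff) {
    static const std::map<std::string, int> switchmap = {
        {"Name", FIELD_NAME},
        {"Number", FIELD_NUMBER},
    };
    const std::map<std::string, int>::const_iterator it = switchmap.find(buff);
    return it == switchmap.end() ? FIELD_UNKNOWN : it->second;
}

// Example use of the looked-up code in a switch.
inline std::string describe(const char* buff) {
    switch (fieldCode(buff)) {
        case FIELD_NAME:   return std::string(buff);
        case FIELD_NUMBER: return "Number " + std::string(buff);
        default:           return "unrecognised";
    }
}
```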

You just can't use an array as the expression in a switch construct.

In C++, case labels require a constant integer value and cannot be used with values calculated at runtime. However, if you are using C++11, you can use a constexpr function to generate the case values and so simulate using strings with a switch statement.
This uses a hash function that takes a pointer to a string and generates a value at compile time instead of runtime. If more than one string generates the same value (a hash collision) you get the familiar error message about multiple case statements using the same value.
constexpr unsigned int djb2Hash(const char* str, int index = 0)
{
return !str[index] ? 0x1505 : (djb2Hash(str, index + 1) * 0x21) ^ str[index];
}
The djb2Hash function can then be used directly in both the switch and case statements. There is one caveat however, the hash function can result in a collision at runtime. The probability of this happening is driven primarily by the quality of the hash function. The solution presented here does not attempt to address this problem (but may in the future).
void DoSomething(const char *str)
{
switch(djb2Hash(str))
{
case djb2Hash("Hello"): SayHello(); break;
case djb2Hash("World"): SayWorld(); break;
}
}
This works very well but might be considered ugly. You can simplify this further by declaring a user defined literal that handles invoking the hash function.
// Create a literal type for short-hand case strings
constexpr unsigned int operator"" _C ( const char str[], size_t size)
{
return djb2Hash(str);
}
void DoSomething(const char *str)
{
switch(djb2Hash(str))
{
case "Hello"_C: SayHello(); break;
case "World"_C: SayWorld(); break;
}
}
This provides a more intuitive usage of strings in a switch statements but may also be considered slightly confusing because of the user defined literal.
[Edit: Added note about runtime hash collisions. Much Kudos to R. Martinho Fernandes for bringing it to my attention!]
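One way to deal with the runtime-collision caveat, sketched here under the assumption that the set of case strings is small, is to confirm each hash match with strcmp before acting on it. The function classifyWord and its return codes are hypothetical; djb2Hash is repeated from above only so the sketch compiles on its own.

```cpp
#include <cstring>

// Same hash as above, repeated so this sketch is self-contained.
constexpr unsigned int djb2Hash(const char* str, int index = 0)
{
    return !str[index] ? 0x1505 : (djb2Hash(str, index + 1) * 0x21) ^ str[index];
}

// Returns 1 for "Hello", 2 for "World", 0 for anything else.
// The strcmp inside each case rejects inputs that merely share a hash value.
inline int classifyWord(const char* str)
{
    switch (djb2Hash(str))
    {
    case djb2Hash("Hello"): return std::strcmp(str, "Hello") == 0 ? 1 : 0;
    case djb2Hash("World"): return std::strcmp(str, "World") == 0 ? 2 : 0;
    default:                return 0;
    }
}
```

The extra comparison runs only after the hash has already matched, so it costs at most one strcmp per call.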

You cannot use a non-integral type in a switch statement. Your problem would require something like:
char buff[256];
if(!strcmp(buff, "Name")) printf("%s",buff);
if(!strcmp(buff, "Number")) printf("%s",buff);
To get the results you are looking for - basically a bunch of if statements to replace the switch.

You are trying to do something we all dearly wish we could, but not in C/C++ :) The case in a switch statement must be integral values. One easy alternative is to have an enumeration that matches the set of strings you want to act on.

In C++ you can use a switch-case only with integers (char, int, ...) but not with C strings (char *).
In your case you have to use an if-then-else construct:
if (strcmp(buff, "Name") == 0) {
...
} else if (...) {
...
}

As the error says, switch only works for integers. The simplest resolution is to use a chain of if...else if... tests.
However, using a char array rather than a string is awkward, since you need quirky C-style functions to compare them. I suggest you use std::string instead.
std::string buff;
if (buff == "Name") {
// deal with name
} else if (buff == "Number") {
// deal with number
} else {
// none of the above
}
More complex approaches, perhaps mapping strings to numbers for use in a switch or to functors to call, are possible and may be more efficient if you have a huge number of cases; but you should get the simple version working before worrying about such optimisations.

Unlike many other languages that allow strings and other objects to be used in a switch-case, C++ requires that the underlying value be an integer. If you want to use more complex object types, you will have to use an if/else-if construct.

You can't use a switch directly for this situation.
Typically, you'd want to use a std::map (or std::unordered_map) to store the action to associate with each input. You might (for example) use a std::map<std::string, std::function<void(std::string const &)>>, and then store the addresses of functions/function objects in the map, so your final construct would be something like:
std::map<std::string, std::function<void(std::string const &)>> funcs;
funcs["Name"] = [](std::string const &n) {std::cout << n;};
funcs["Number"] = [](std::string const &n) {std::cout << "Number: " << n;};
// ...
auto f = funcs[buff];
f(buff);
// or combine lookup and call: funcs[buff](buff);
Two notes: first, you probably really want to use map::find for the second part, so you can detect when/if the string you're looking for isn't present in the map.
Second, as it stands, your code doesn't seem to make much sense: you're both switching on buff and printing out buff's value, so (for example) if your buff contains Number, your output will be "Number Number". I'd guess you intend to use buff and some other variable that holds the value you care about.
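The first note could be sketched along these lines. The helper name dispatch, the Handler typedef, and the choice to return the handler's output as a string (instead of printing it) are assumptions made so that the example is easy to check.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical handler type: takes the value and returns the text to emit.
typedef std::function<std::string(const std::string&)> Handler;

// Calls the handler registered for key, passing value.
// Returns false when no handler exists; unlike operator[], find never
// inserts an empty entry into the map.
inline bool dispatch(const std::map<std::string, Handler>& funcs,
                     const std::string& key, const std::string& value,
                     std::string& out) {
    const std::map<std::string, Handler>::const_iterator it = funcs.find(key);
    if (it == funcs.end()) return false;  // unknown key: nothing is called
    out = it->second(value);
    return true;
}
```

Keeping the key and the value as separate arguments also addresses the second note, since the string that selects the action is no longer the string being printed.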

You can partially do a "string" compare.
The code below does not specifically satisfy your query (as C won't ride that pony), nor is it elegant, but a variation on it may get you through your need. I do not recommend you do this if you are learning C/C++, but this construct has worked well in limited programming environments.
(I use it in PIC programming where strlen(buff)==1 or 2 and sizeof(int)==2.)
Let's assume sizeof(int) == 4 and strlen(buff) >= 3.
char buff[256];
// buff assignment code is TBD.
// Form a switch 4-byte key from the string "buff".
// assuming a big endian CPU.
int key = (buff[0] << 3*8) | (buff[1] << 2*8) | (buff[2] << 1*8) | (buff[3] << 0*8);
// if on a little endian machine use:
// int key = (buff[0] << 0*8) | (buff[1] << 1*8) | (buff[2] << 2*8) | (buff[3] << 3*8);
switch (key) {
// Notice the single quote vs. double quote use of constructing case constants.
case 'Name': printf("%s",buff); break;
case 'Numb': printf("Number \"%s\"",buff); break;
default : ;
}

!strcmp as substitute for ==

I'm working with rapidxml, so I would like to have comparisons like this in the code:
if ( searchNode->first_attribute("name")->value() == "foo" )
This gives the following warning:
comparison with string literal results in unspecified behaviour [-Waddress]
Is it a good idea to substitute it with:
if ( !strcmp(searchNode->first_attribute("name")->value() , "foo") )
Which gives no warning?
The latter looks ugly to me, but is there anything else?

You cannot in general use == to compare strings in C, since that only compares the address of the first character which is not what you want.
You must use strcmp(), but I would endorse this style:
if( strcmp(searchNode->first_attribute("name")->value(), "foo") == 0) { }
rather than using !, since that operator is a boolean operator and strcmp()'s return value is not boolean. I realize it works and is well-defined, I just consider it ugly and confused.
Of course you can wrap it:
#include <stdbool.h>
static bool first_attrib_name_is(const Node *node, const char *string)
{
return strcmp(node->first_attribute("name")->value(), string) == 0;
}
then your code becomes the slightly more palatable:
if( first_attrib_name_is(searchNode, "foo") ) { }
Note: I use the bool return type, which is standard from C99.

If the value() returns char* or const char*, you have little choice - strcmp or one of its length-limiting alternatives is what you need. If value() can be changed to return std::string, you could go back to using ==.

When comparing char* types with "==" you just compare the pointers. Use the C++ string type if you want to do the comparison with "=="

You have a few options:
You can use strcmp, but I would recommend wrapping it. e.g.
bool equals(const char* a, const char* b) {
return strcmp(a, b) == 0;
}
then you could write: if (equals(searchNode->first_attribute("name")->value(), "foo"))
You can convert the return value to a std::string and use the == operator
if (std::string(searchNode->first_attribute("name")->value()) == "foo")
That will introduce a string copy operation which, depending on context, may be undesirable.
You can use a string reference class. The purpose of a string reference class is to provide a string-like object which does not own the actual string contents. I've seen a few of these and it's simple enough to write your own, but since Boost has a string reference class, I'll use that for an example.
#include <boost/utility/string_ref.hpp>
using namespace boost;
if (string_ref(searchNode->first_attribute("name")->value()) == string_ref("foo"))
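For readers without Boost, a hand-written string reference might look roughly like the sketch below. The class name CStrRef is hypothetical and this is not Boost's implementation; it only illustrates the non-owning idea for null-terminated strings.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical minimal non-owning string reference (not Boost's class).
class CStrRef {
public:
    // Wraps a null-terminated string; the caller keeps ownership of the bytes.
    CStrRef(const char* s) : data_(s), size_(std::strlen(s)) {}
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

// Equal when the lengths match and the bytes match; nothing is allocated or copied.
inline bool operator==(const CStrRef& a, const CStrRef& b) {
    return a.size() == b.size() && std::memcmp(a.data(), b.data(), a.size()) == 0;
}
inline bool operator!=(const CStrRef& a, const CStrRef& b) { return !(a == b); }
```

With the implicit constructor, a comparison such as `CStrRef(node_value) == "foo"` reads much like the original ==, where node_value stands for whatever value() returned.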

C++ Trie search performance

So I made a trie that holds quite a large amount of data, my search algorithm is quite fast but I wanted to see if anyone had any insight as to how I could make it any faster.
bool search (string word)
{
int wordLength = word.length();
node *current = head;
for (unsigned int i=0; i<wordLength; ++i)
{
if (current->child[((int)word[i]+(int)'a')] == NULL)
return false;
else
current = current->child[((int)word[i]+(int)'a')];
}
return current->is_end;
}

Looks good performance-wise, except these tidbits:
Declare the function parameter as const string& (instead of just string), to avoid unnecessary copying.
You could extract the common subexpression current->child[((int)word[i]+(int)'a')] in front of the if, to avoid repetition and make the code slightly smaller, but any compiler worth its salt will do that optimization for you anyway.
"Style" suggestions:
What if word contains character below 'a' (such as capital letter, digit, punctuation mark, new line etc...)? You'll need to validate input to avoid accessing the wrong memory location and crashing. Also shouldn't this be -(int)'a' instead of + (I'm assuming you just want to support a limited set of characters: 'a' and above)?
Declare wordLength as size_t (or better yet auto), but this is not important for strings of any practical length (might even hurt the performance slightly if size_t is greater than int). Ditto for i.
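Putting these suggestions together, a sketch might look as follows. The question does not show its node type, so the Node layout, the slot helper, the insert function, and the explicit head parameter are all assumptions. The sketch supports only 'a' to 'z' and assumes those letters are contiguous in the character set, as they are in ASCII.

```cpp
#include <cstddef>
#include <string>

// Hypothetical node layout for a trie over 'a'..'z'.
struct Node {
    Node* child[26];
    bool is_end;
    Node() : child(), is_end(false) {}  // child() sets every pointer to null
};

// Maps a character to its child index, or -1 if it is outside 'a'..'z'.
inline int slot(char c) { return (c >= 'a' && c <= 'z') ? c - 'a' : -1; }

// Inserts word; rejects it before touching the trie if any character is invalid.
// Nodes are never freed here; cleanup is omitted to keep the sketch short.
inline bool insert(Node* head, const std::string& word) {
    for (std::size_t i = 0; i < word.size(); ++i)
        if (slot(word[i]) < 0) return false;
    Node* current = head;
    for (std::size_t i = 0; i < word.size(); ++i) {
        Node*& next = current->child[slot(word[i])];
        if (next == NULL) next = new Node();
        current = next;
    }
    current->is_end = true;
    return true;
}

// Takes the word by const reference, computes the child once per step,
// and returns false for invalid characters instead of indexing out of bounds.
inline bool search(const Node* head, const std::string& word) {
    const Node* current = head;
    for (std::size_t i = 0; i < word.size(); ++i) {
        const int s = slot(word[i]);
        if (s < 0) return false;
        const Node* next = current->child[s];
        if (next == NULL) return false;
        current = next;
    }
    return current->is_end;
}
```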

bool search (string word)
With this signature, word is copied on every call.
Either of the signatures below will be faster:
bool search (const string &word)
or
bool search (const char *word)

C++: is string.empty() always equivalent to string == ""?

Can I make an assumption that given
std::string str;
... // do something to str
Is the following statement always true?
(str.empty() == (str == ""))

Answer
Yes. Here is the relevant implementation from bits/basic_string.h, the code for basic_string<_CharT, _Traits, _Alloc>:
/**
* Returns true if the %string is empty. Equivalent to *this == "".
*/
bool
empty() const
{ return this->size() == 0; }
Discussion
Even though the two forms are equivalent for std::string, you may wish to use .empty() because it is more general.
Indeed, J.F. Sebastian comments that if you switch to using std::wstring instead of std::string, then =="" won't even compile, because you can't compare a string of wchar_t with one of char. This, however, is not directly relevant to your original question, and I am 99% sure you will not switch to std::wstring.

It should be. The ANSI/ISO standard states in 21.3.3 basic_string capacity:
size_type size() const;
Returns: a count of char-like objects currently in the string.
bool empty() const;
Returns: size() == 0
However, in clause 18 of 21.3.1 basic_string constructors it states that the character-type assignment operator uses traits::length() to establish the length of the controlled sequence so you could end up with something strange if you are using a different specialization of std::basic_string<>.
I think that the 100% correct statement is that
(str.empty() == (str == std::string()))
or something like that. If you haven't done anything strange, then std::string("") and std::string() should be equivalent
They are logically similar but they are testing for different things. str.empty() is checking if the string is empty where the other is checking for equality against a C-style empty string. I would use whichever is more appropriate for what you are trying to do. If you want to know if a string is empty, then use str.empty().

str.empty() is never slower, but might be faster than str == "". This depends on implementation. So you should use str.empty() just in case.
This is a bit like using ++i instead of i++ to increase a counter (assuming you do not need the result of the increment operator itself). Your compiler might optimise, but you lose nothing using ++i, and might win something, so you are better off using ++i.
Apart from performance issues, the answer to your question is yes; both expressions are logically equivalent.

Yes (str.empty() == (str == "")) is always* true for std::string. But remember that a string can contain '\0' characters. So even though the expression s == "" may be false, s.c_str() may still return an empty C-string. For example:
#include <string>
#include <iostream>
using namespace std;
void test( const string & s ) {
bool bempty = s.empty();
bool beq = std::operator==(s, ""); // avoid global namespace operator==
const char * res = (bempty == beq ) ? "PASS" : "FAIL";
const char * isempty = bempty ? " empty " : "NOT empty ";
const char * iseq = beq ? " == \"\"" : "NOT == \"\"";
cout << res << " size=" << s.size();
cout << " c_str=\"" << s.c_str() << "\" ";
cout << isempty << iseq << endl;
}
int main() {
string s; test(s); // PASS size=0 c_str="" empty == ""
s.push_back('\0'); test(s); // PASS size=1 c_str="" NOT empty NOT == ""
s.push_back('x'); test(s); // PASS size=2 c_str="" NOT empty NOT == ""
s.push_back('\0'); test(s); // PASS size=3 c_str="" NOT empty NOT == ""
s.push_back('y'); test(s); // PASS size=4 c_str="" NOT empty NOT == ""
return 0;
}
* barring an overload of operator== in the global namespace, as others have mentioned

Some implementations might test for the null character as the first character in the string resulting in a slight speed increase over calculating the size of the string.
I believe that this is not common however.

Normally, yes.
But if someone decides to redefine an operator then all bets are off:
bool operator == (const std::string& a, const char b[])
{
return a != b; // paging www.thedailywtf.com
}

Yes, it is equivalent, but using empty() allows the core code to change the implementation of what empty() actually means depending on OS/hardware/anything without affecting your code at all. There is similar practice in Java and .NET.
