C++ code performance strings compare


I have an array of structs (arrBoards), each holding some integer values, a vector and a string.
I want to check whether a certain string in the struct is equal to an input parameter (string p1).
Which approach is faster: comparing the input string with every string element in the array, or first checking whether string.length() of the current array element is greater than 0 and only then comparing the strings?
if (p1.length())
{
transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
for (int i=0; i<arrSize; i++) //check if string element already exists
if ( rdPtr->arrBoards[i].sName == p1 )
{
/* some code */
break;
}
}
if (p1.length())
{
transform(p1.begin(), p1.end(), p1.begin(), ::tolower); //to lowercase
for (int i=0; i<arrSize; i++) //check if string element already exists
if ( rdPtr->arrBoards[i].sName.length() ) //check length of the string in the element of the array
if ( rdPtr->arrBoards[i].sName == p1 )
{
/* some code */
break;
}
}
I think the second approach is better because it doesn't need to compare the names every time, but I may be wrong because the extra if could slow the code down.
Thanks for the answers

I'm sure the comparison operator (==) of the string class is already optimized enough. Just use it.
operator==(...) returns a bool based on a short-circuit comparison
return __x.size() == __n && _Traits::compare(__x.data(), __s, __n) == 0;
It checks the size of the strings before calling compare(), so, there is no need for further optimization.
Always remember one of the principles of Software Engineering: KISS :P
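To illustrate the "just use ==" advice, here is a minimal sketch of the lookup without a separate length check. The Board struct and the to_lower and find_board helpers are hypothetical stand-ins for the asker's code, not part of it.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the asker's struct; only the name field matters here.
struct Board {
    int id;
    std::string sName;
};

// Lowercase a copy of the input. Going through unsigned char avoids
// undefined behaviour in std::tolower for negative char values.
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the index of the first board whose name equals the lowercased
// input, or -1 if there is none. No explicit length test is made: the
// string comparison itself is left to reject mismatches.
int find_board(const std::vector<Board>& boards, const std::string& name) {
    if (name.empty()) return -1;
    const std::string key = to_lower(name);
    auto it = std::find_if(boards.begin(), boards.end(),
                           [&key](const Board& b) { return b.sName == key; });
    return it == boards.end() ? -1 : static_cast<int>(it - boards.begin());
}
```

An empty sName can never equal a non-empty key, so the extra if in the second variant of the question does not change the result.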

What you want to do is play percentages.
Since the strings are highly likely to be different, you want to find that out as quickly as possible.
You're comparing length first, but don't assume length is cheap to compute, compared to whatever else you're doing.
Here's the kind of thing I've done (in C):
if (a[0]==b[0] && strcmp(a, b)==0)
so if the leading characters are different, it never gets to the string compare.
If the dataset is such that the leading characters are likely to be different, it saves a lot of time.
(strcmp also has this kind of optimization, but you still have to pay the price of setting up the arguments and getting in and out of the function. We're talking about small numbers of cycles here.)
If you do something like that, then you may find the loop iteration overhead is costing a significant fraction of time.
If so, you might consider unrolling it.
(The compiler might unroll it for you, but I wouldn't depend on it.)
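The first-character test above can be wrapped in a small helper. This is only a sketch; fast_equal is a hypothetical name, and whether it actually helps depends on the data and the compiler, so measure before relying on it.

```cpp
#include <cstring>

// Returns true when the two NUL-terminated strings are equal.
// The first-character comparison rejects many mismatches before
// std::strcmp is called at all. Empty strings are handled correctly
// because their first character is the terminating '\0'.
bool fast_equal(const char* a, const char* b) {
    return a[0] == b[0] && std::strcmp(a, b) == 0;
}
```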

Comparing numbers is faster than comparing strings. Try comparing the strings' lengths before comparing the strings themselves.

Related

Why is vector faster than unordered_map?

I am solving a problem on LeetCode, but nobody has yet been able to explain my issue.
The problem is as such:
Given an arbitrary ransom note string and another string containing letters from all the magazines, write a function that will return true if the ransom note can be constructed from the magazines ; otherwise, it will return false.
Each letter in the magazine string can only be used once in your ransom note.
Note:
You may assume that both strings contain only lowercase letters.
canConstruct("a", "b") -> false
canConstruct("aa", "ab") -> false
canConstruct("aa", "aab") -> true
My code (which takes 32ms):
class Solution {
public:
bool canConstruct(string ransomNote, string magazine) {
if(ransomNote.size() > magazine.size()) return false;
unordered_map<char, int> m;
for(int i = 0; i < magazine.size(); i++)
m[magazine[i]]++;
for(int i = 0; i < ransomNote.size(); i++)
{
if(m[ransomNote[i]] <= 0) return false;
m[ransomNote[i]]--;
}
return true;
}
};
The code below is faster (it takes 19ms), and I don't know why:
bool canConstruct(string ransomNote, string magazine) {
int lettersLeft = ransomNote.size(); // Remaining # of letters to be found in magazine
int arr[26] = {0};
for (int j = 0; j < ransomNote.size(); j++) {
arr[ransomNote[j] - 'a']++; // letter - 'a' gives a value of 0 - 25 for each lower case letter a-z
}
int i = 0;
while (i < magazine.size() && lettersLeft > 0) {
if (arr[magazine[i] - 'a'] > 0) {
arr[magazine[i] - 'a']--;
lettersLeft--;
}
i++;
}
if (lettersLeft == 0) {
return true;
} else {
return false;
}
}
Both of these have the same complexity and use the same structure to solve the problem, but I don't understand why one takes almost twice as long as the other. The time to query a vector is O(1), but it's the same for an unordered_map. The same goes for adding an entry/key to either of them.
Please, could someone explain why the run time varies so much?

First thing to note is, although the average time to query an unordered_map is constant, the worst case is not O(1). As you can see here it actually rises to the order of O(N), N denoting the size of the container.
Secondly, because a vector stores its elements in one contiguous block of memory, accessing an element is highly efficient and takes constant time even in the worst case (simple pointer arithmetic, as opposed to computing a more complex hash function). Contiguous memory also benefits from the various levels of caching available on most platforms, which can make code using a vector even faster than code using an unordered_map.
In essence, the worst-case complexity of element access in a vector is better than in an unordered_map. On top of that, most hardware offers features such as caching that give a vector an even bigger edge (i.e. smaller constant factors in the O(1) operations).

Your second approach uses plain C array where accessing an element is a simple pointer dereference. But that is not the case with unordered_map. There are two points to note:
First, accessing an element is not a simple pointer dereference. The container has to do additional work to maintain its internal structure. An unordered_map is a hash table under the hood, and the C++ standard indirectly mandates that it be implemented using separate chaining (its bucket interface and reference-stability guarantees effectively rule out open addressing), which is far more complex than a simple array access.
Second, O(1) access is on average but not on worst case.
For these reasons it is no wonder that the array version runs faster than the unordered_map version even though they have the same run-time complexity. This is another example of two pieces of code with the same complexity performing differently.
You will see the benefit of unordered_map only when you have a large number of possible keys (as opposed to the fixed 26 here).
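To see the constant-factor difference on your own machine, a rough timing sketch along these lines may help. The function names and the time_us helper are hypothetical, the array version assumes lowercase a-z input only, and absolute numbers will vary with compiler, flags and hardware.

```cpp
#include <array>
#include <chrono>
#include <string>
#include <unordered_map>

// Hash-table version: every ++ and -- goes through hashing and bucket lookup.
bool can_construct_map(const std::string& note, const std::string& magazine) {
    std::unordered_map<char, int> counts;
    for (char c : magazine) ++counts[c];
    for (char c : note)
        if (--counts[c] < 0) return false;
    return true;
}

// Array version: every ++ and -- is a direct index into 26 ints.
bool can_construct_array(const std::string& note, const std::string& magazine) {
    std::array<int, 26> counts{};  // zero-initialised; assumes lowercase a-z only
    for (char c : magazine) ++counts[c - 'a'];
    for (char c : note)
        if (--counts[c - 'a'] < 0) return false;
    return true;
}

// Hypothetical helper: wall-clock microseconds for reps calls of f(note, magazine).
template <typename F>
long long time_us(F f, const std::string& note, const std::string& magazine, int reps) {
    const auto start = std::chrono::steady_clock::now();
    volatile bool sink = false;  // discourages the optimiser from dropping the calls
    for (int i = 0; i < reps; ++i) sink = f(note, magazine);
    (void)sink;
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_us(can_construct_map, note, magazine, 100000) and the same with can_construct_array on identical inputs gives two numbers to compare; both functions remain O(n) in the input length.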

"O(1)" means "constant time" -- that is, an algorithm that is (truly) O(1) will not get slower when there is more data (in this case, when there are more items in the map or array). It does not indicate how fast the algorithm runs -- it only indicates that it won't slow down if there is more data. Seeing different times for one O(1) algorithm vs. another does not mean that they are not O(1). You should not expect that one O(1) algorithm will run exactly as fast as another. But, if there is a difference, you should see the same difference if the maps/arrays have more data in them.

efficient is_in(vector<string>& S, string P) function

Given a set S of strings { S0, S1, S2,..., Sn-1 } and a string P, how can I implement the function bool is_in( string, vector ) without doing the obvious loop?
Meaning that I don't want to do this:
bool is_in(vector<string>& S, string P)
{
for(int i=0; i<S.size(); i++)
if(P == S[i]) return true;
return false;
}
Ideally, I would like to have a sort of hash function, that I could compute a priori. Something like this:
bool is_in(vector<string>& S, string P)
{
someHashType h = hash( S );
if( someFunction( h, P ) ) return true;
return false;
}
Note:
S is a static vector (in my case, size 1000, unsorted)
P an entry of a collection of strings I'm testing against S (also unsorted) (in my case, 10M) -
So that's why I need to be fast.
This is NOT a homework problem - But part of a large scale software.

The problem with "I want this function to be faster" is that it nearly always involves SOME extra work somewhere else, and that may or may not mean the improvement is "worth it". It all depends on what the collection of strings you are searching is used for in the rest of the code. If it's just "if the word is in this list then do X" (e.g. a bad-word check for commit messages, which must not contain swear words or company names), then I would change the vector to an unordered_set. That has an average O(1) search time, and would look something like:
bool is_in(unordered_set<string>& S, string P)
{
auto it = S.find(P);
return (it != S.end());
}
But this will of course have consequences elsewhere, and if you rely on the list being a vector so that for example iterating over it is fast somewhere else in the code, this will probably slow that part down.
Edit: You have, I take it, profiled your code in a real use-case and found this particular function to take a significant amount of time. Otherwise, you'd be better off measuring that FIRST.
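Since S is static, the set can be built once and then reused for all 10M queries. A minimal sketch, assuming the original vector stays around for any code that still needs it; make_lookup is a hypothetical helper name.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Build the lookup table once, up front, from the static vector.
std::unordered_set<std::string> make_lookup(const std::vector<std::string>& S) {
    return std::unordered_set<std::string>(S.begin(), S.end());
}

// Average-case O(1) membership test. P is taken by const reference
// so that no string copy is made per query.
bool is_in(const std::unordered_set<std::string>& lookup, const std::string& P) {
    return lookup.find(P) != lookup.end();
}
```

The cost of building the set is paid once; each of the later queries then hashes P and inspects a single bucket on average.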

Finally I found what I was looking for:
There is a data structure called a Bloom filter, which acts as a precomputed hash of a collection of strings.
I developed my solution around the code located at C++Bloom Filter Library
The code would go like this:
insert all strings into the Bloom filter
check if a given string is in the filter.
The advantage is that the strings don't need to be stored in memory, as they would be in a set, unordered_set or any similar container.
In my particular case, I had a table of 10M strings (800MB).
The size of the filter in memory is around 20M, and the search is much faster.
The Bloom filter is a probabilistic data structure, so it can return a few false positives, but the probability of that is quite low (and is controlled by a parameter).
Note that there is no false negative.
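The two steps above (insert everything, then query) can be sketched with a toy filter. This is not the API of the library mentioned above; the BloomFilter class, its parameters and the double-hashing scheme built on std::hash are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter. The bits argument must be greater than zero.
class BloomFilter {
public:
    BloomFilter(std::size_t bits, std::size_t hashes)
        : bits_(bits, false), hashes_(hashes) {}

    // Sets one bit per hash function for this string.
    void insert(const std::string& s) {
        for (std::size_t i = 0; i < hashes_; ++i) bits_[index(s, i)] = true;
    }

    // false means "definitely absent"; true means "probably present".
    bool maybe_contains(const std::string& s) const {
        for (std::size_t i = 0; i < hashes_; ++i)
            if (!bits_[index(s, i)]) return false;
        return true;
    }

private:
    // Double hashing: the i-th probe position is derived from two base hashes.
    std::size_t index(const std::string& s, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(s);
        const std::size_t h2 = std::hash<std::string>{}(s + '#') | 1;  // salted, forced odd
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t hashes_;
};
```

Only the bit array is kept, never the strings themselves, which is where the memory saving comes from. More bits and a suitable number of hash functions lower the false-positive rate.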

String Comparison return value (Is it used in applications that sort characters?)

When we use strcmp(str1, str2); or str1.compare(str2); the return values are like -1, 0 and 1, for str1 < str2, str1 == str2 or str1 > str2 respectively.
The question is, is it defined like this for a specific reason?
For instance, in a binary tree sorting algorithm, we push smaller values to the left child and larger values to the right child. The strcmp and string::compare functions seem to be perfect for that. However, does anyone use string comparison to order a tree (integer indices are easier to use)?
So, what is the actual purpose of the three return values (-1, 0, 1)? Why can't it just return 1 for true and 0 for false?
Thanks

The purpose of having three return values is exactly what it seems like: to answer all questions about string comparisons at once.
Everyone has different needs. Some people sometimes need a simple less-than test; strcmp provides this. Some people need equality testing; strcmp provides this. Some people really do need to know the full relationship between two strings; strcmp provides this.
What you absolutely don't want is someone writing this:
if(strless(lhs, rhs))
{
}
else if(strequal(lhs, rhs))
{
}
That's doing two potentially expensive comparison operations. strless also knows if they were equal, because it had to get to the end of both strings to return that it was not less.
Oh, and FYI: the return values aren't necessarily -1 or +1; the result is less than zero, greater than zero, or zero if the strings are equal.
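A single three-way comparison can drive all three branches. The following sketch uses std::string::compare; the Order enum and the classify function are hypothetical names introduced for the example.

```cpp
#include <string>

enum class Order { Less, Equal, Greater };

// One call to compare() answers all three questions at once.
// Only the sign of the result is meaningful, not its magnitude.
Order classify(const std::string& lhs, const std::string& rhs) {
    const int r = lhs.compare(rhs);
    if (r < 0) return Order::Less;
    if (r > 0) return Order::Greater;
    return Order::Equal;
}
```

This walks the strings once, whereas a less-than test followed by an equality test may walk them twice.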

It's useful for certain cases where knowing all three cases is important. Use operator< for string when you just care about a boolean comparison.

It could, but then you would need multiple functions for sorting and comparison. With strcmp() returning smaller, equal or bigger, you can use them easily for comparison and for sorting.
Remember that BSTs are not the only place where you would like to compare strings. You might want to sort a name list or similar. Also, it is not uncommon to have a string as key in a tree too.

As others have stated, there are real purposes for comparison of strings with < > == implications. For example; fixed length numbers assigned to strings will resolve correctly; ie: "312235423" > "312235422". On some occasions this is useful.
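However, the feature you're asking for, a true/false result, still works with the given return values. Note that any nonzero result converts to true and zero converts to false, so the truth value tells you whether the strings differ, not whether they are equal.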
if (-1)
{
// resolves true
}
else if (1)
{
// also resolves true
}
else if (0)
{
// resolves false
}

Two short questions about std::vector

When a vector is created it has a default allocation size (probably this is not the right term to use, maybe step size?). When the number of elements reaches this size, the vector is resized. Is this size compiler specific? Can I control it? Is this a good idea?
Do repeated calls to vector::size() recount the number of elements (O(n) calculation) or is this value stored somewhere (O(1) lookup). For example, in the code below
// Split given string on whitespace
vector<string> split( const string& s )
{
vector<string> tokens;
string::size_type i, j;
i = 0;
while ( i != s.size() ) {
// ignore leading blanks
while ( isspace(s[i]) && i != s.size() ) {
i++;
}
// found a word, now find its end
j = i;
while ( !isspace(s[j]) && j != s.size() ) {
j++;
}
// if we found a word, add it to the vector
if ( i != j ) {
tokens.push_back( s.substr(i, j-i) );
i = j;
}
}
return tokens;
}
assuming s can be very large, should I call s.size() only once and store the result?
Thanks!

In most cases, you should leave the allocation alone unless you know the number of items ahead of time, so you can reserve the correct amount of space.
At least in every case of which I'm aware, std::vector::size() just returns a stored value, so it has constant complexity. In theory, the C++03 standard allows it to do otherwise. There are reasons to allow otherwise for some other containers, primarily std::list, and rather than make a special case for those, it simply recommends constant time for all containers instead of requiring it for any. (C++11 and later require constant-time size() for all standard containers.) I can't quite imagine a vector::size that counted elements though; I'm pretty sure no such thing has ever existed.
P.S., an easier way to do what your code above does, is something like this:
std::vector<string> split(std::string const &input) {
vector<string> ret;
istringstream buffer(input);
copy(istream_iterator<string>(buffer),
istream_iterator<string>(),
back_inserter(ret));
return ret;
}
Edit: IMO, The C++ Standard Library, by Nicolai Josuttis is an excellent reference on such things.

The actual size of the capacity increment is implementation-dependent, but it has to be (roughly) exponential to support the container's complexity requirements. As an example, the Visual C++ standard library will allocate exactly the space required for the first few elements (five, if I recall correctly), then increases the size exponentially after that.
The size has to be stored somehow in the vector, otherwise it doesn't know where the end of the sequence is! However, it may not necessarily be stored as an integer. The Visual C++ implementation (again, as an example) stores three pointers:
(1) a pointer to the beginning of the underlying array,
(2) a pointer to the current end of the sequence, and
(3) a pointer to the end of the underlying array.
The size can be computed from (1) and (2); the capacity can be computed from (1) and (3).
Other implementations might store the information differently.
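The effect of the growth policy, and of reserving space up front, can be observed by watching capacity(). A small sketch follows; count_reallocations is a hypothetical helper, and the exact count without reserve depends on the standard library in use.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times capacity() changes while n ints are pushed
// into an initially empty vector, optionally after reserving room for n.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);
    std::size_t changes = 0;
    std::size_t last = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last) {  // the vector grew its storage
            ++changes;
            last = v.capacity();
        }
    }
    return changes;
}
```

With reserve the count is zero. Without it the count grows only roughly logarithmically in n, which is what exponential capacity growth implies.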

It's library-specific. You might be able to control the incremental allocation, but you might not.
The size is stored, so it is very fast (constant time) to retrieve. How else could it work? C has no way of knowing in general whether a memory location is "real data" or not.

The resizing mechanism is usually fixed. (Most implementations grow the capacity by a constant factor, commonly 1.5 or 2, when the vector reaches its limit.) The C++ standard specifies no way to control this behaviour.
The size is internally updated whenever you insert/remove elements and when you call size(), it's returned immediately. So yes, it's O(1).

Unrelated to your actual questions, but here's a more "STL" way of doing what you're doing:
vector<string> split(const string& s)
{
istringstream stream(s);
istream_iterator<string> iter(stream), eos;
vector<string> tokens;
copy(iter, eos, back_inserter(tokens));
return tokens;
}

When the number of elements reaches this size, the vector is resized. Is this size compiler specific? Can I control it? Is this a good idea?
In general, this is a library-specific behavior, but you may be able to influence this behavior by specifying a custom allocator, which is non-trivial work.
Do repeated calls to vector::size() recount the number of elements (O(n) calculation) or is this value stored somewhere (O(1) lookup).
Most implementations store the size as a member. It's a single memory read.

How to get the next prefix in C++?

Given a sequence (for example a string "Xa"), I want to get the next prefix in lexicographic order (i.e. "Xb"). The next of "aZ" should be "b".
A motivating use case where this function is useful is described here.
As I don't want to reinvent the wheel, I'm wondering if there is any function in C++ STL or boost that can help to define this generic function easily?
If not, do you think that this function can be useful?
Notes
Even if the examples are strings, the function should work for any Sequence.
The lexicographic order should be a template parameter of the function.
From the answers I conclude that there is nothing in C++/Boost that can help to define this generic function easily, and also that this function is too specific to be proposed as a library addition. I will implement a generic next_prefix and then ask whether you find it useful.
I have accepted the single answer that gives some hints on how to do that even if the proposed implementation is not generic.

I'm not sure I understand the semantics by which you wish the string to transform, but maybe something like the following can be a starting point for you. The code will increment the sequence, as if it was a sequence of digits representing a number.
template<typename Bi, typename I>
bool increment(Bi first, Bi last, I minval, I maxval)
{
if( last == first ) return false;
while( --last != first && *last == maxval ) *last = minval;
if( last == first && *last == maxval ) {
*last = minval;
return false;
}
++*last;
return true;
}
Maybe you wish to add an overload with a function object, or an overload or specialization for primitives. A couple of examples:
string s1("aaz");
increment(s1.begin(), s1.end(), 'a', 'z');
cout << s1 << endl; // aba
string s2("95");
do {
cout << s2 << ' '; // 95 96 97 98 99
} while( increment(s2.begin(), s2.end(), '0', '9') );
cout << endl;

That seems so specific that I can't see how it would get into the STL or boost.

When you say the order is a template parameter, what are you envisaging will be passed? A comparator that takes two characters and returns bool?
If so, then that's a bit of a nightmare, because the only way to find "the least char greater than my current char" is to sort all the chars, find your current char in the result, and step forward one (or actually, if some chars might compare equal, use upper_bound with your current char to find the first greater char).
In practice, for any sane string collation you can define a "get the next char, or warn me if I gave you the last char" function more efficiently, and build your "get the next prefix" function on top of that. Hopefully, permitting an arbitrary order is more flexibility than you need.

Orderings are typically specified as a comparator, not as a sequence generator.
Lexicographical orderings in particular tend to be only partial, for example when they are case-insensitive or diacritic-insensitive. Therefore your final product will be nondeterministic, or at best arbitrary. ("Always choose lowest numerical encoding"?)
In any case, if you accept a comparator as input, the only way to translate that to an increment operation would be to compare the current value against every other in the character space. Which could work, 127 values being so few (a comparator-sorted table would make short work of the problem), or could be impossibly slow, if you use any other kind of character.

The best way is likely to define the character ordering somehow, then define the rules for going from one character to two characters to three characters.
Use whatever sort function you wish to use over the complete list of characters that you want to include, then just use that as the ordering. Find the index of the current character, and you can easily find the previous and next characters. Only advance the right-most character, unless it's going to roll over, then advance the next character to the left.
In other words, reinventing the wheel is like 10 lines of Python. Probably less than 500 lines of C++. :)
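For what it's worth, the table-driven idea fits in a few lines of C++ too. The sketch below assumes the semantics of the question's examples ("Xa" becomes "Xb", "aZ" becomes "b"), with the ordering given as a string listing the characters from smallest to largest; the next_prefix signature is hypothetical and works on std::string only, not on arbitrary sequences.

```cpp
#include <string>

// Replaces s with its next prefix under the ordering given by alphabet
// (characters listed from smallest to largest). Returns false and leaves
// s unchanged if there is no next prefix: s is empty, consists only of
// the largest character, or contains a character outside the alphabet
// at the position to advance.
bool next_prefix(std::string& s, const std::string& alphabet) {
    if (alphabet.empty()) return false;
    std::string t = s;
    // Drop trailing characters that cannot be advanced any further.
    while (!t.empty() && t.back() == alphabet.back()) t.pop_back();
    if (t.empty()) return false;
    const std::string::size_type pos = alphabet.find(t.back());
    if (pos == std::string::npos) return false;
    // pos cannot be the last index: t.back() differs from alphabet.back().
    t.back() = alphabet[pos + 1];
    s = t;
    return true;
}
```

With the alphabet "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", calling next_prefix on "Xa" yields "Xb" and on "aZ" yields "b", matching the examples in the question.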
