If-else-if versus map - c++

Suppose I have such an if/else-if chain:
if( x.GetId() == 1 )
{
}
else if( x.GetId() == 2 )
{
}
// ... 50 more else if statements
What I wonder is, if I keep a map, will it be any better in terms of performance? (assuming keys are integers)

Maps are usually implemented as red-black trees, which gives O(log N) lookup because the tree is constantly kept in balance. Your linear chain of if statements will be O(N) in the worst case. So yes, a map would be significantly faster for lookup.
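For illustration, a minimal sketch of the map-based dispatch (dispatch and the handler names are hypothetical; the original branches are empty):
#include <map>

// Stand-ins for the bodies of the original if/else branches.
void handle_one() {}
void handle_two() {}

void dispatch(int id) {
    // Built once; lookup is O(log N) instead of a linear scan of ~50 tests.
    static const std::map<int, void(*)()> handlers = {
        { 1, handle_one },
        { 2, handle_two },
        // ... entries for the remaining IDs
    };
    auto it = handlers.find(id);
    if (it != handlers.end())
        it->second(); // call the handler registered for this ID
}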
Many people are recommending a switch statement, which may or may not be faster for you, depending on your actual if statements. A compiler can sometimes optimize a switch into a jump table, which would be O(1), but it can only do so when the case values satisfy criteria the standard leaves unspecified, so this behavior can be somewhat nondeterministic across compilers. That said, there is a good article with a few tips on optimizing switch statements here: Optimizing C and C++ Code.
You could technically even hand-code a balanced tree; this works best for static data. I happened to recently create a function to quickly find which bit is set in a byte (it was used in an embedded application on an I/O pin interrupt and had to be fast, since 99% of the time only one bit would be set in the byte):
unsigned char single_bit_index(unsigned char bit) {
    // Hard-coded balanced tree lookup
    if (bit > 0x08)
        if (bit > 0x20)
            if (bit == 0x40)
                return 6;
            else
                return 7;
        else
            if (bit == 0x10)
                return 4;
            else
                return 5;
    else
        if (bit > 0x02)
            if (bit == 0x04)
                return 2;
            else
                return 3;
        else
            if (bit == 0x01)
                return 0;
            else
                return 1;
}
This gives a constant lookup in 3 steps for any of the 8 values, which gives me very deterministic performance; a linear search, given random data, would average 4 steps, with a best case of 1 step and a worst case of 8.
This is a good example of a range that a compiler would probably not optimize to a jump table since the 8 values I am searching for are so far apart: 1, 2, 4, 8, 16, 32, 64, and 128. It would have to create a very sparse 128 position table with only 8 elements containing a target, which on a PC with a ton of RAM might not be a big deal, but on a microcontroller it'd be killer.

Why don't you use a switch?
switch(x.GetId())
{
    case 1: /* do work */ break; // from the most used case...
    case 2: /* do work */ break;
    case ...:                    // ...to the least used case
}
EDIT:
Put the most frequently used cases at the top of the switch (this can be a performance issue if x.GetId() is usually equal to 50, i.e. the last case).

I think a switch is the best option here.

The better solution would be a switch statement. This will allow you to check the value of x.GetId() just once, rather than (on average) 25 times as your code is doing now.
If you want to get fancy, you can use a data structure containing pointers to functions that handle whatever it is that's in the braces. If your ID values are consecutive (i.e. numbers between 1 and 50) then an array of function pointers would be best. If they are spread out, then a map would be more appropriate.

The answer, as with most performance related questions, is maybe.
If the IDs are in a fortunate range, a switch might become a jump-table, providing constant time lookups to all IDs. You won't get much better than this, short of redesigning. Alternatively, if the IDs are consecutive but you don't get a jump-table out of the compiler, you can force the issue by filling an array with function pointers.
[from here on out, switch refers to a generic if/else chain]
A map provides worst-case logarithmic lookup for any given ID, while a switch can only guarantee linear. However, if the IDs are not random, sorting the switch cases by usage might ensure the worst-case scenario is sufficiently rare that this doesn't matter.
A map will incur some initial overhead when loading the IDs and associating them with the functions, and then incur the overhead of calling a function pointer every time you access an ID. A switch incurs additional overhead when writing the routine, and possibly significant overhead when debugging it.
Redesigning might allow you to avoid the question all together. No matter how you implement it, this smells like trouble. I can't help but think there's a better way to handle this.

If I really had a potential switch of fifty possibilities, I'd definitely think about a vector of pointers to functions.
#include <cstdio>
#include <cstdlib>
#include <ctime>

const unsigned int Max = 4;

void f1();
void f2();
void f3();
void f4();

void (*vp[Max])();

int main()
{
    vp[ 0 ] = f1;
    vp[ 1 ] = f2;
    vp[ 2 ] = f3;
    vp[ 3 ] = f4;
    std::srand( std::time( NULL ) );
    vp[ std::rand() % Max ]();
}

void f1()
{
    std::printf( "Hello from f1!\n" );
}

void f2()
{
    std::printf( "Hello from f2!\n" );
}

void f3()
{
    std::printf( "Hello from f3!\n" );
}

void f4()
{
    std::printf( "Hello from f4!\n" );
}

There are a lot of suggestions involving switch-case. In terms of efficiency, this might be better, might be the same. Won't be worse.
But if you're just setting/returning a value or name based on the ID, then YES. A map is exactly what you need. STL containers are optimised, and if you think you can optimise better, then you are either incredibly smart or staggeringly dumb.
e.g. a single call using a std::map called mymap,
thisvar = mymap[x.getID()];
is much better than 50 of these
if(x.getID() == ...){thisvar = ...;}
because it's more efficient as the number of IDs increases. If you're interested in why, search for a good primer on data structures.
But what I'd really look at here is maintenance/fixing time. If you need to change the name of the variable, or change from using getID() or getName(), or make any kind of minor change, you've got to do it FIFTY TIMES in your example. And you need a new line every time you add an ID.
The map reduces that to one code change NO MATTER HOW MANY IDs YOU HAVE.
That said, if you're actually carrying out different actions for each ID, a switch-case might be better. With switch-case rather than if statements, you can improve performance and readability. See here: Advantage of switch over if-else statement
I'd avoid pointers to functions unless you're very clear on how they'd improve your code, because if you're not 100% certain what you're doing, the syntax can be messed up, and it's overkill for anything you'd feasibly use a map for.
Basically, I'd be interested in the problem you're trying to solve. You might be better off with a map or a switch-case, but if you think you can use a map, that is ABSOLUTELY what you should be using instead.

Related

CodeSignal: Execution time limit exceeded with c++

I am trying to solve the programming problem firstDuplicate on codesignal. The problem is "Given an array a that contains only numbers in the range 1 to a.length, find the first duplicate number for which the second occurrence has minimal index".
Example: For a = [2, 1, 3, 5, 3, 2] the output should be firstDuplicate(a) = 3
There are 2 duplicates: numbers 2 and 3. The second occurrence of 3 has a smaller index than the second occurrence of 2 does, so the answer is 3.
With this code I pass 21/23 tests, but then it tells me that the program exceeded the execution time limit on test 22. How would I go about making it faster so that it passes the remaining two tests?
#include <algorithm>
#include <vector>
using std::vector;

int firstDuplicate(vector<int> a) {
    vector<int> seen;
    for (size_t i = 0; i < a.size(); ++i) {
        if (std::find(seen.begin(), seen.end(), a[i]) != seen.end()) {
            return a[i];
        } else {
            seen.push_back(a[i]);
        }
    }
    if (seen == a) {
        return -1;
    }
}
Anytime you get asked a question about "find the duplicate", "find the missing element", or "find the thing that should be there", your first instinct should be: use a hash table. In C++, the unordered_map and unordered_set classes are made for exactly this type of coding exercise. An unordered_set is effectively a map of keys to bools.
Also, pass your vector by reference, not by value. Passing by value incurs the overhead of copying the entire vector.
Also, that comparison seems costly and unnecessary at the end.
This is probably closer to what you want:
#include <unordered_set>
#include <vector>
using std::vector;

int firstDuplicate(const vector<int>& a) {
    std::unordered_set<int> seen;
    for (int i : a) {
        auto result_pair = seen.insert(i);
        bool duplicate = (result_pair.second == false);
        if (duplicate) {
            return i;
        }
    }
    return -1;
}
std::find has linear time complexity in the distance between the first and last element (or until the number is found) in the container, i.e. a worst case of O(N), so your algorithm is O(N^2).
Instead of storing your numbers in a vector and searching it every time, you should use an associative container such as std::map to store the numbers encountered, and return a number if, while iterating, it is already present in the map.
std::map<int, int> hash;
for (const auto &i : a) {
    if (hash[i])
        return i;
    else
        hash[i] = 1;
}
Edit: std::unordered_map is even more efficient if the order of keys doesn't matter, since its insertion complexity is constant on average, compared to the logarithmic insertion complexity of std::map.
It's probably an unnecessary optimization, but I think I'd try to take slightly better advantage of the specification. A hash table is intended primarily for cases where you have a fairly sparse conversion from possible keys to actual keys--that is, only a small percentage of possible keys are ever used. For example, if your keys are strings of length up to 20 characters, the theoretical maximum number of keys is 256^20. With that many possible keys, it's clear no practical program is going to store any more than a minuscule percentage, so a hash table makes sense.
In this case, however, we're told that the input is: "an array a that contains only numbers in the range 1 to a.length". So, even if half the numbers are duplicates, we're using 50% of the possible keys.
Under the circumstances, instead of a hash table, even though it's often maligned, I'd use an std::vector<bool>, and expect to get considerably better performance in the vast majority of cases.
int firstDuplicate(std::vector<int> const &input) {
    std::vector<bool> seen(input.size() + 1);
    for (auto i : input) {
        if (seen[i])
            return i;
        seen[i] = true;
    }
    return -1;
}
The advantage here is fairly simple: at least in the typical case, std::vector<bool> uses a specialization that stores the bools in only one bit apiece. This way we're storing only one bit for each possible input value, which increases storage density, so we can expect excellent use of the cache. In particular, as long as the number of bytes in the cache is at least a little more than 1/8th the number of elements in the input array, we can expect all of seen to be in the cache most of the time.
Now make no mistake: if you look around, you'll find quite a few articles pointing out that vector<bool> has problems--and for some cases, that's entirely true. There are places and times that vector<bool> should be avoided. But none of its limitations applies to the way we're using it here--and it really does give an advantage in storage density that can be quite useful, especially for cases like this one.
We could also write some custom code to implement a bitmap that would give still faster code than vector<bool>. But using vector<bool> is easy, and writing our own replacement that's more efficient is quite a bit of extra work...
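As a rough illustration, a hand-rolled bitmap for this problem might look like the sketch below (firstDuplicateBitmap is a made-up name, and this isn't necessarily faster than vector<bool> in practice):
#include <cstdint>
#include <vector>

int firstDuplicateBitmap(std::vector<int> const &input) {
    // One bit per possible value (1..input.size()), packed into 64-bit words by hand.
    std::vector<std::uint64_t> seen(input.size() / 64 + 1, 0);
    for (auto i : input) {
        std::uint64_t &word = seen[i / 64];
        std::uint64_t mask = std::uint64_t{1} << (i % 64);
        if (word & mask)
            return i; // bit already set: this is the first duplicate
        word |= mask;
    }
    return -1;
}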

Why is vector faster than unordered_map?

I am solving a problem on LeetCode, but nobody has yet been able to explain my issue.
The problem is as such:
Given an arbitrary ransom note string and another string containing letters from all the magazines, write a function that will return true if the ransom note can be constructed from the magazines; otherwise, it will return false.
Each letter in the magazine string can only be used once in your ransom note.
Note:
You may assume that both strings contain only lowercase letters.
canConstruct("a", "b") -> false
canConstruct("aa", "ab") -> false
canConstruct("aa", "aab") -> true
My code (which takes 32ms):
class Solution {
public:
    bool canConstruct(string ransomNote, string magazine) {
        if (ransomNote.size() > magazine.size()) return false;
        unordered_map<char, int> m;
        for (int i = 0; i < magazine.size(); i++)
            m[magazine[i]]++;
        for (int i = 0; i < ransomNote.size(); i++)
        {
            if (m[ransomNote[i]] <= 0) return false;
            m[ransomNote[i]]--;
        }
        return true;
    }
};
The other code (which takes 19ms, and I don't know why it is faster):
bool canConstruct(string ransomNote, string magazine) {
    int lettersLeft = ransomNote.size(); // remaining # of letters to be found in magazine
    int arr[26] = {0};
    for (int j = 0; j < ransomNote.size(); j++) {
        arr[ransomNote[j] - 'a']++; // letter - 'a' gives a value of 0-25 for each lowercase letter a-z
    }
    int i = 0;
    while (i < magazine.size() && lettersLeft > 0) {
        if (arr[magazine[i] - 'a'] > 0) {
            arr[magazine[i] - 'a']--;
            lettersLeft--;
        }
        i++;
    }
    if (lettersLeft == 0) {
        return true;
    } else {
        return false;
    }
}
Both of these have the same complexity and use the same structure to solve the problem, but I don't understand why one takes almost twice as much time as the other. The time to query a vector is O(1), and it's the same for an unordered_map. The same goes for adding an entry/key to either of them.
Please, could someone explain why the run time varies so much?
The first thing to note is that although the average time to query an unordered_map is constant, the worst case is not O(1). As you can see here, it can rise to the order of O(N), where N denotes the size of the container.
Secondly, as a vector allocates sequential portions of memory, accessing that memory is highly efficient and actually constant even in the worst case (simple pointer arithmetic, as opposed to computing the result of a more complex hash function). There is also the possibility of various levels of caching of sequential memory coming into play (depending on the platform your code is running on), which may make code using a vector even faster compared to one using unordered_map.
In essence, in terms of complexity, the worst-case performance of a vector is more efficient than that of unordered_map. On top of that, most hardware systems offer features such as caching which give the use of a vector an even bigger edge (i.e. smaller constant factors on its O(1) operations).
Your second approach uses a plain C array, where accessing an element is a simple pointer dereference. That is not the case with unordered_map. There are two points to note:
First, accessing an element is not a simple pointer dereference; the container has to do extra work to maintain its internal structure. An unordered_map is a hash table under the hood, and the C++ standard indirectly mandates that it be implemented with separate chaining, which involves considerably more work than a simple array access.
Second, O(1) access holds on average, but not in the worst case.
For these reasons it is no wonder that the array version performs better than the unordered_map, even though they have the same asymptotic complexity. This is another example of two pieces of code with the same complexity performing differently.
You will see the benefit of unordered_map only when you have a large number of keys (as opposed to the fixed 26 here).
"O(1)" means "constant time" -- that is, an algorithm that is (truly) O(1) will not get slower when there is more data (in this case, when there are more items in the map or array). It does not indicate how fast the algorithm runs -- it only indicates that it won't slow down if there is more data. Seeing different times for one O(1) algorithm vs. another does not mean that they are not O(1). You should not expect that one O(1) algorithm will run exactly as fast as another. But, if there is a difference, you should see the same difference if the maps/arrays have more data in them.

What is the performance of std::bitset?

I recently asked a question on Programmers regarding reasons to use manual bit manipulation of primitive types over std::bitset.
From that discussion I have concluded that the main reason is its comparatively poorer performance, although I'm not aware of any measured basis for this opinion. So next question is:
what is the performance hit, if any, likely to be incurred by using std::bitset over bit-manipulation of a primitive?
The question is intentionally broad, because after looking online I haven't been able to find anything, so I'll take what I can get. Basically I'm after a resource that provides some profiling of std::bitset vs 'pre-bitset' alternatives to the same problems on some common machine architecture using GCC, Clang and/or VC++. There is a very comprehensive paper which attempts to answer this question for bit vectors:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
Unfortunately, it either predates std::bitset or considered it out of scope, so it focuses on vectors/dynamic array implementations instead.
I really just want to know whether std::bitset is better than the alternatives for the use cases it is intended to solve. I already know that it is easier and clearer than bit-fiddling on an integer, but is it as fast?
Update
It's been ages since I posted this one, but:
I already know that it is easier and clearer than bit-fiddling on an integer, but is it as fast?
If you are using bitset in a way that does actually make it clearer and cleaner than bit-fiddling, like checking for one bit at a time instead of using a bit mask, then you inevitably lose all the benefits that bitwise operations provide, like being able to check whether 64 bits are set at once against a mask, or using FFS instructions to quickly determine which bit is set among 64 bits.
I'm not sure that bitset incurs a penalty to use in all ways possible (ex: using its bitwise operator&), but if you use it like a fixed-size boolean array which is pretty much the way I always see people using it, then you generally lose all those benefits described above. We unfortunately can't get that level of expressiveness of just accessing one bit at a time with operator[] and have the optimizer figure out all the bitwise manipulations and FFS and FFZ and so forth going on for us, at least not since the last time I checked (otherwise bitset would be one of my favorite structures).
Now if you are going to use bitset<N> bits interchangeably with, say, uint64_t bits[N/64], accessing both the same way using bitwise operations, it might be on par (I haven't checked since this ancient post). But then you lose many of the benefits of using bitset in the first place.
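To make that distinction concrete, here is a small sketch of the two usage styles (the function names are made up):
#include <bitset>
#include <cstddef>

// Whole-set style: one bitwise test, on par with masking a uint64_t.
bool all_set_mask(const std::bitset<64>& bits, const std::bitset<64>& mask) {
    return (bits & mask) == mask;
}

// Bit-at-a-time style: the "convenient" form, paying the per-bit indexing cost.
bool all_set_loop(const std::bitset<64>& bits, const std::bitset<64>& mask) {
    for (std::size_t i = 0; i < 64; ++i)
        if (mask[i] && !bits[i])
            return false;
    return true;
}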
for_each method
In the past I got into some misunderstandings, I think, when I proposed a for_each method to iterate through things like vector<bool>, deque, and bitset. The point of such a method is to utilize the internal knowledge of the container to iterate through elements more efficiently while invoking a functor, just as some associative containers offer a find method of their own instead of using std::find to do a better than linear-time search.
For example, you can iterate through all set bits of a vector<bool> or bitset if you had internal knowledge of these containers by checking for 64 elements at a time using a 64-bit mask when 64 contiguous indices are occupied, and likewise use FFS instructions when that's not the case.
But an iterator design having to do this type of scalar logic in operator++ would inevitably have to do something considerably more expensive, just by the nature in which iterators are designed in these peculiar cases. bitset lacks iterators outright, and that often pushes people who want to avoid bitwise logic into using operator[] to check each bit individually in a sequential loop whose only goal is to find out which bits are set. That, too, is nowhere near as efficient as what a for_each method implementation could do.
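A sketch of the word-at-a-time scan such a for_each could do internally, written here against a plain uint64_t array (for_each_set_bit is a made-up name; std::countr_zero is C++20, and a compiler builtin would do the same job on older standards):
#include <bit>     // std::countr_zero
#include <cstddef>
#include <cstdint>

// Invoke f(bit_index) for every set bit, testing 64 bits per iteration
// instead of one.
template <typename F>
void for_each_set_bit(const std::uint64_t* words, std::size_t num_words, F f) {
    for (std::size_t w = 0; w < num_words; ++w) {
        std::uint64_t word = words[w];
        while (word) {
            unsigned bit = std::countr_zero(word); // FFS over the whole word
            f(w * 64 + bit);
            word &= word - 1;                      // clear the lowest set bit
        }
    }
}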
Double/Nested Iterators
Another alternative to the for_each container-specific method proposed above would be to use double/nested iterators: that is, an outer iterator which points to a sub-range of a different type of iterator. Client code example:
for (auto outer_it = bitset.nbegin(); outer_it != bitset.nend(); ++outer_it)
{
    for (auto inner_it = outer_it->first; inner_it != outer_it->last; ++inner_it)
        // do something with *inner_it (bit index)
}
While not conforming to the flat type of iterator design available now in standard containers, this can allow some very interesting optimizations. As an example, imagine a case like this:
bitset<64> bits = 0x1fbf; // 0b1111110111111;
In that case, the outer iterator can, with just a few bitwise operations (FFZ/or/complement), deduce that the first range of bits to process is [0, 6), at which point we can iterate through that sub-range very cheaply through the inner/nested iterator (it would just increment an integer, making ++inner_it equivalent to ++int). Then when we increment the outer iterator, it can, again with a few bitwise instructions, determine that the next range is [7, 13). After we iterate through that sub-range, we're done. Take this as another example:
bitset<16> bits = 0xffff;
In such a case, the first and last sub-range would be [0, 16), and the bitset could determine that with a single bitwise instruction at which point we can iterate through all set bits and then we're done.
This type of nested iterator design would map particularly well to vector<bool>, deque, and bitset as well as other data structures people might create like unrolled lists.
I say that as more than armchair speculation, since I have a set of data structures resembling deque which are actually on par with sequential iteration of vector (still noticeably slower for random access, especially if we're just storing a bunch of primitives and doing trivial processing). However, to achieve times comparable to vector for sequential iteration, I had to use these types of techniques (the for_each method and double/nested iterators) to reduce the amount of processing and branching in each iteration. I could not rival those times using just the flat iterator design and/or operator[]. And I'm certainly not smarter than the standard library implementers, yet I came up with a deque-like container which can be sequentially iterated much faster; that strongly suggests to me that the issue is with the standard interface design of iterators, which comes with some overhead in these peculiar cases that the optimizer cannot optimize away.
Old Answer
I'm one of those who would give you a similar performance answer, but I'll try to give you something a bit more in-depth than "just because". It is something I came across through actual profiling and timing, not merely distrust and paranoia.
One of the biggest problems with bitset and vector<bool> is that their interface design is "too convenient" if you want to use them like an array of booleans. Optimizers are great at obliterating all that structure you establish to provide safety, reduce maintenance cost, make changes less intrusive, etc. They do an especially fine job with selecting instructions and allocating the minimal number of registers to make such code run as fast as the not-so-safe, not-so-easy-to-maintain/change alternatives.
The part that makes the bitset interface "too convenient" at the cost of efficiency is the random-access operator[], as well as the iterator design for vector<bool>. When you access one of these at index n, the code first has to figure out which byte the nth bit belongs to, and then the sub-index of the bit within it. That first phase typically involves a division/right-shift along with a modulo/bitwise-and, and that bookkeeping is more costly than the actual bit operation you're trying to perform.
The iterator design for vector<bool> faces a similar awkward dilemma where it either has to branch into different code every 8+ times you iterate through it or pay that kind of indexing cost described above. If the former is done, it makes the logic asymmetrical across iterations, and iterator designs tend to take a performance hit in those rare cases. To exemplify, if vector had a for_each method of its own, you could iterate through, say, a range of 64 elements at once by just masking the bits against a 64-bit mask for vector<bool> if all the bits are set without checking each bit individually. It could even use FFS to figure out the range all at once. An iterator design would tend to inevitably have to do it in a scalar fashion or store more state which has to be redundantly checked every iteration.
For random access, optimizers can't seem to optimize away this indexing overhead to figure out which byte and relative bit to access (perhaps a bit too runtime-dependent) when it's not needed, and you tend to see significant performance gains with more manual code that processes bits sequentially with advance knowledge of which byte/word/dword/qword it's working on. It's somewhat of an unfair comparison, but the difficulty with std::bitset is that there's no way to make a fair comparison in cases where the code knows which byte it wants to access in advance, and more often than not, you do have that info in advance. It's an apples-to-oranges comparison in the random-access case, but you often only need oranges.
Perhaps that wouldn't be the case if the interface design involved a bitset where operator[] returned a proxy, requiring a two-index access pattern to use. For example, in such a case, you would set bits 6 and 7 by writing bitset[0][6] = true; bitset[0][7] = true; with a template parameter to indicate the size of the proxy (64 bits, e.g.). A good optimizer may be able to take such a design and make it rival the manual, old-school way of doing the bit manipulation by hand by translating that into: bitset |= 0xC0;
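A sketch of what that hypothetical two-index proxy design could look like (word_bitset and the proxy types are invented for illustration; this is not std::bitset's interface):
#include <cstddef>
#include <cstdint>

// Outer operator[] selects a 64-bit word; the returned proxy's
// operator[] selects a bit within that word.
template <std::size_t NumWords>
class word_bitset {
public:
    struct bit_proxy {
        std::uint64_t& word;
        unsigned bit;
        bit_proxy& operator=(bool value) {
            if (value) word |=  (std::uint64_t{1} << bit);
            else       word &= ~(std::uint64_t{1} << bit);
            return *this;
        }
        operator bool() const { return (word >> bit) & 1u; }
    };
    struct word_proxy {
        std::uint64_t& word;
        bit_proxy operator[](unsigned bit) { return bit_proxy{word, bit}; }
    };
    word_proxy operator[](std::size_t w) { return word_proxy{words_[w]}; }
private:
    std::uint64_t words_[NumWords] = {};
};

// With the word made explicit, bits[0][6] = true; bits[0][7] = true;
// gives the optimizer a realistic chance to fuse the two stores into
// the equivalent of words_[0] |= 0xC0.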
Another design that might help is if bitsets provided a for_each_bit kind of method, passing a bit proxy to the functor you provide. That might actually be able to rival the manual method.
std::deque has a similar interface problem. Its performance shouldn't be that much slower than std::vector for sequential access, yet unfortunately we access it sequentially either using operator[], which is designed for random access, or through an iterator, and the internal representation of a deque simply doesn't map very efficiently to an iterator-based design. If deque provided a for_each kind of method of its own, then it could potentially get a lot closer to std::vector's sequential-access performance. These are some of the rare cases where the Sequence interface design comes with some efficiency overhead that optimizers often can't obliterate. Often good optimizers can make convenience come free of runtime cost in a production build, but unfortunately not in all cases.
Sorry!
Also sorry, in retrospect I wandered a bit with this post talking about vector<bool> and deque in addition to bitset. It's because we had a codebase where the use of these three, and particularly iterating through them or using them with random-access, were often hotspots.
Apples to Oranges
As emphasized in the old answer, comparing straightforward usage of bitset to primitive types with low-level bitwise logic is comparing apples to oranges. It's not like bitset is implemented very inefficiently for what it does. If you genuinely need to access a bunch of bits with a random access pattern which, for some reason or other, needs to check and set just one bit a time, then it might be ideally implemented for such a purpose. But my point is that almost all use cases I've encountered didn't require that, and when it's not required, the old school way involving bitwise operations tends to be significantly more efficient.
Did a short test profiling std::bitset vs bool arrays for sequential and random access - you can too:
#include <iostream>
#include <bitset>
#include <cstdlib> // rand
#include <ctime>   // timer

inline unsigned long get_time_in_ms()
{
    return (unsigned long)((double(clock()) / CLOCKS_PER_SEC) * 1000);
}

void one_sec_delay()
{
    unsigned long end_time = get_time_in_ms() + 1000;
    while (get_time_in_ms() < end_time)
    {
    }
}

int main(int argc, char **argv)
{
    srand(get_time_in_ms());
    using namespace std;

    bitset<5000000> bits;
    bool *bools = new bool[5000000](); // value-initialized, so the sequential read below never touches an indeterminate bool
    unsigned long current_time, difference1, difference2;
    double total;

    one_sec_delay();

    total = 0;
    current_time = get_time_in_ms();
    for (unsigned int num = 0; num != 200000000; ++num)
    {
        bools[rand() % 5000000] = rand() % 2;
    }
    difference1 = get_time_in_ms() - current_time;

    current_time = get_time_in_ms();
    for (unsigned int num2 = 0; num2 != 100; ++num2)
    {
        for (unsigned int num = 0; num != 5000000; ++num)
        {
            total += bools[num];
        }
    }
    difference2 = get_time_in_ms() - current_time;

    cout << "Bool:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;

    one_sec_delay();

    total = 0;
    current_time = get_time_in_ms();
    for (unsigned int num = 0; num != 200000000; ++num)
    {
        bits[rand() % 5000000] = rand() % 2;
    }
    difference1 = get_time_in_ms() - current_time;

    current_time = get_time_in_ms();
    for (unsigned int num2 = 0; num2 != 100; ++num2)
    {
        for (unsigned int num = 0; num != 5000000; ++num)
        {
            total += bits[num];
        }
    }
    difference2 = get_time_in_ms() - current_time;

    cout << "Bitset:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;

    delete[] bools;

    cin.get();

    return 0;
}
Please note: the outputting of the sum total is necessary so the compiler doesn't optimise out the for loop - which some do if the result of the loop isn't used.
Under GCC x64 with the following flags: -O2;-Wall;-march=native;-fomit-frame-pointer;-std=c++11;
I get the following results:
Bool array:
random access time = 4695, sequential access time = 390
Bitset:
random access time = 5382, sequential access time = 749
Not a great answer here, but rather a related anecdote:
A few years ago I was working on real-time software and we ran into scheduling problems. There was a module which was way over time-budget, and this was very surprising because the module was only responsible for some mapping and packing/unpacking of bits into/from 32-bit words.
It turned out that the module was using std::bitset. We replaced this with manual operations and the execution time decreased from 3 milliseconds to 25 microseconds. That was a significant performance issue and a significant improvement.
The point is, the performance issues caused by this class can be very real.
In addition to what the other answers said about the performance of access, there may also be a significant space overhead: Typical bitset<> implementations simply use the longest integer type to back their bits. Thus, the following code
#include <bitset>
#include <stdio.h>

struct Bitfield {
    unsigned char a : 1, b : 1, c : 1, d : 1, e : 1, f : 1, g : 1, h : 1;
};

struct Bitset {
    std::bitset<8> bits;
};

int main() {
    printf("sizeof(Bitfield) = %zu\n", sizeof(Bitfield));
    printf("sizeof(Bitset) = %zu\n", sizeof(Bitset));
    printf("sizeof(std::bitset<1>) = %zu\n", sizeof(std::bitset<1>));
}
produces the following output on my machine:
sizeof(Bitfield) = 1
sizeof(Bitset) = 8
sizeof(std::bitset<1>) = 8
As you can see, my compiler allocates a whopping 64 bits to store a single bit, while with the bitfield approach I only need to round up to eight bits.
This factor eight in space usage can become important if you have a lot of small bitsets.
Rhetorical question: Why is std::bitset written in such an inefficient way?
Answer: It is not.
Another rhetorical question: What is the difference between:
std::bitset<128> a = src;
a[i] = true;
a = a << 64;
and
std::bitset<129> a = src;
a[i] = true;
a = a << 63;
Answer: a 50x difference in performance http://quick-bench.com/iRokweQ6JqF2Il-T-9JSmR0bdyw
You need to be very careful what you ask for: bitset supports a lot of things, but each has its own cost. With correct handling you will get exactly the same behavior as the raw code:
void f(std::bitset<64>& b, int i)
{
    b |= 1L << i;
    b = b << 15;
}

void f(unsigned long& b, int i)
{
    b |= 1L << i;
    b = b << 15;
}
Both generate same assembly: https://godbolt.org/g/PUUUyd (64 bit GCC)
Another thing is that bitset is more portable, but this has a cost too:
void h(std::bitset<64>& b, unsigned i)
{
    b = b << i;
}

void h(unsigned long& b, unsigned i)
{
    b = b << i;
}
If i >= 64 then the bitset will become zero, whereas for the unsigned long we get UB.
void h(std::bitset<64>& b, unsigned i)
{
    if (i < 64) b = b << i;
}

void h(unsigned long& b, unsigned i)
{
    if (i < 64) b = b << i;
}
With the check preventing UB, both generate the same code.
Another difference is set() versus operator[]. The first is safe, meaning you will never get UB, but it costs you a branch; operator[] gives UB if you use a wrong index, but is as fast as var |= 1L << i;. Of course, that holds only if std::bitset does not need more bits than the biggest integer available on the system, because otherwise you need to split the value to find the correct element in the internal table. This means that for std::bitset<N> the size N is very important for performance: if it is bigger or smaller than the optimal one, you will pay the cost of it.
Overall I find that the best way is to use something like this:
constexpr size_t minBitSet = sizeof(std::bitset<1>)*8;
template<size_t N>
using fasterBitSet = std::bitset<minBitSet * ((N + minBitSet - 1) / minBitSet)>;
This removes the cost of trimming the excess bits: http://quick-bench.com/Di1tE0vyhFNQERvucAHLaOgucAY
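A quick self-contained usage sketch (it assumes a platform where sizeof(std::bitset<1>) is 8, i.e. minBitSet is 64):
#include <bitset>
#include <cstddef>

constexpr std::size_t minBitSet = sizeof(std::bitset<1>) * 8;

template<std::size_t N>
using fasterBitSet = std::bitset<minBitSet * ((N + minBitSet - 1) / minBitSet)>;

int main() {
    // The size is rounded up to a whole number of words, so operations
    // never need to mask off "excess" storage bits.
    static_assert(fasterBitSet<100>{}.size() % minBitSet == 0, "whole words only");
    fasterBitSet<100> b; // std::bitset<128> when minBitSet == 64
    b.set(99);
    b <<= 20;            // this shift needs no trailing-bit trimming
    return 0;
}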

Filling unordered_set is too slow

We have a given 3D mesh and we are trying to eliminate identical vertices. For this we are using a self-defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
    float p1, p2, p3, n1, n2, n3;

    bool operator == (const vertice& vert) const
    {
        return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
    }
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
    size_t operator () (const vertice& vert) const
    {
        return (7*vert.p1 + 13*vert.p2 + 11*vert.p3);
    }
};

std::unordered_set<vertice, hashVertice> verticesSet;

vertice vert;
unsigned int i = 0; // vertex index

while (i < (scene->mMeshes[0]->mNumVertices)) {
    vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
    vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
    vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
    vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
    vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
    vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;
    verticesSet.insert(vert);
    i = i + 1;
}
We discovered that it is too slow for data sets of around 3,000,000 vertices; even after 15 minutes the program hadn't finished. Is there a bottleneck we don't see, or is another data structure better suited for such a task?
What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash table, and the main potential performance problem with hash tables is excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" float to integer) and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
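For example, a sketch of a bit-pattern-based hash (it reuses the vertice struct from the question; the multiplier is arbitrary, and the +0.0f/-0.0f and NaN caveats discussed in the next answer still apply):
#include <cstddef>
#include <cstdint>
#include <cstring>

struct hashVertice
{
    std::size_t operator () (const vertice& vert) const
    {
        static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
        // Hash the bit patterns of the floats, not their (truncated) arithmetic sum.
        std::uint32_t a, b, c;
        std::memcpy(&a, &vert.p1, sizeof a);
        std::memcpy(&b, &vert.p2, sizeof b);
        std::memcpy(&c, &vert.p3, sizeof c);
        std::size_t h = a;
        h = h * 1000003u ^ b; // large odd multiplier spreads bits across the range
        h = h * 1000003u ^ c;
        return h;
    }
};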
Other than that, profile your code, as others have already suggested.
Hashing floating point can be tricky. In particular, your hash routine calculates the hash as a floating point value, then converts it to an unsigned integral type. This has serious problems if the vertices can be small: if all of the vertices are in the range [0...1.0), for example, your hash function will never return anything greater than 30 as an unsigned integer, which means there will be at most 31 different hash codes.
The usual way to hash floating point is to hash the binary image, checking for the special cases first. (0.0 and -0.0 have different binary images, but must hash the same. And it's an open question what you do with NaNs.) For float this is particularly simple, since it usually has the same size as int, and you can reinterpret_cast:
size_t hash( float f )
{
    assert( /* not a NaN */ );
    return f == 0.0 ? 0 : reinterpret_cast<unsigned&>( f );
}
I know that, formally, this is undefined behavior. But if float and int have the same size, and unsigned has no trapping representations (the case on most general purpose machines today), then a compiler which gets this wrong is being intentionally obtuse.
You then use any combining algorithm to merge the three results; the one you use is as good as any other (in this case; it's not a good generic algorithm, though).
I might add that while some of the comments insist on profiling (and this is generally good advice), if you're taking 15 minutes for 3 million values, the problem can really only be a poor hash function, which results in lots of collisions. Nothing else will cause performance that bad. And unless you're familiar with the internal implementation of std::unordered_set, the usual profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions like bucket_count and bucket_size, which allow analysing the quality of the hash function. In your case, if you cannot create an unordered_set with 3 million entries, your first step should be to create a much smaller one and use these functions to evaluate the quality of your hash code.
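For instance, a quick diagnostic along those lines might look like this (a sketch; report_buckets is a made-up helper, usable with the verticesSet from the question):
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Print how the elements are spread over the buckets; a few huge buckets
// means the hash function is producing heavy collisions.
template <typename Set>
void report_buckets(const Set& s)
{
    std::size_t worst = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b)
        worst = std::max(worst, s.bucket_size(b));
    std::printf("size = %zu, buckets = %zu, load factor = %.2f, worst bucket = %zu\n",
                s.size(), s.bucket_count(), s.load_factor(), worst);
}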
If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measures.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck - if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively, StackOverflow users to spot bottlenecks by eye inspection instead of actually measuring time in your program is, from my experience, the most common cause of failed attempts at optimization.

c switch and jump tables

It is my understanding that a switch statement in C/C++ will sometimes compile to a jump table.
My question is: are there any rules of thumb to ensure that?
In my case I'm doing something like this:
enum myenum {
    MY_CASE0 = 0,
    MY_CASE1 = 1,
    .
    .
    .
};
switch (foo)
{
    case MY_CASE0:
        //do stuff
        break;
    case MY_CASE1:
        //do stuff
        break;
    .
    .
    .
}
I cover all the cases from 1 to n, in order. Is it safe to assume it will compile to a jump table?
The original code was a long and messy if else statement, so at the very least I gain some readability.
A good compiler can and will choose between a jump table, a chained if/else, or a combination. A poorly designed compiler may not make such a choice, and may even produce very bad code for switch-blocks, but any decent compiler will produce efficient code for them. The major decision factor here is that the compiler may choose if/else when the numbers are far apart [and not trivially (e.g. by dividing by 2, 4, 8, 16, 256, etc.) changed to a closer value], e.g.
switch (x)
{
    case 1:
        ...
    case 4912:
        ...
    case 11211:
        ...
    case 19102:
        ...
}
would require a jump table of at least 19102 * 2 bytes.
On the other hand, if the numbers are close together, the compiler will typically use a jump table.
Even if it's an if/else type of design, it will typically do a "binary search"; taking the above example:
if (x <= 4912)
{
    if (x == 1)
    {
        ....
    }
    else if (x == 4912)
    {
        ....
    }
}
else
{
    if (x == 11211)
    {
        ....
    }
    else if (x == 19102)
    {
        ...
    }
}
If we have LOTS of cases, this approach will nest quite deep, and humans will probably get lost after three or four levels of depth (bearing in mind that each if starts at some point in the MIDDLE of the range), but it reduces the number of tests to log2(n), where n is the number of choices. It is certainly a lot more efficient than the naive approach of
if (x == first value) ...
else if (x == second value) ...
else if (x == third value) ...
..
else if (x == nth value) ...
else ...
This can be slightly better if certain values are put at the beginning of the if-else chain, but that assumes you can determine what is the most common before running the code.
If performance is CRITICAL to your case, then you need to benchmark the two alternatives. But my guess is that just writing the code as a switch will make the code much clearer, and at the same time run at least as fast, if not faster.
Compilers can certainly convert any C/C++ switch into a jump table, but a compiler will only do this when it is efficient. Ask yourself: what would I do if I were writing a compiler and had just built a parse tree for a switch/case statement? I have studied compiler design and construction, and here are some of the decisions.
How to help a compiler decide to implement a jump table:
case values are small integers (0,1,2,3,...)
case values are in a compact range (few holes, remember default is an option)
there are enough cases to make the optimization worthwhile (> N, examine your compiler source to find the constant)
clever compilers may subtract/add a constant to a jumptable index if the range is compact (example: 1000, 1001, 1002, 1003, 1004, 1005, etc)
avoid fallthrough and transfer of control (goto, continue)
only one break at end of each case
Though the mechanics may differ between compilers, the compiler is essentially creating unnamed functions (well, maybe not true functions, since it may jump into and out of the code block, or may be clever and use jsr and return).
The one certain way to get a jump table is to write it yourself: an array of pointers to functions, indexed by the value you want.
How?
Define a typedef for your function pointer (see Understanding typedefs for function pointers in C):
typedef void (*FunkPtr)(double a1, double a2);

FunkPtr JumpTable[] = {
    function_name_0,
    function_name_1,
    function_name_2,
    ...
    function_name_n
};
Of course, you have already defined function_name_{0..n}, so the compiler can find the addresses of the functions to invoke.
I will leave invocation of the function pointer and boundary checking as an exercise for the reader.
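That said, for completeness, a sketch of that invocation with a bounds check (call_jump_table is a made-up wrapper around the JumpTable above):
void call_jump_table(unsigned index, double a1, double a2)
{
    // Boundary check first: an out-of-range index would read past the table.
    if (index < sizeof(JumpTable) / sizeof(JumpTable[0]))
        JumpTable[index](a1, a2); // dispatch through the function pointer
}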