How to improve the C++ STL bitset efficiency?

I have a program that heavily uses the STL's std::bitset, and gperftools shows that one of the performance bottlenecks is std::_Base_bitset::_S_maskbit (inline).
From here https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a00775_source.html#l00078 it seems the mask for accessing or modifying a bitset is always recomputed. This makes me wonder if a look-up table would help.
I tried to implement my own version of bitset in which a mask look-up table is used. However, since my version does not use gcc built-ins like __builtin_memcpy, it is actually much slower than the STL bitset.
So I wonder: is there a way to replace std::_Base_bitset::_S_maskbit, or should I write my own version of bitset by copying the STL bitset code and adding a look-up table?
Thanks!

If your bitsets are sufficiently small, using a std::vector<char> can be an improvement. Sure, you use 8 times the memory, but you wouldn't need to calculate masks anymore, and your profiling showed that mask computation matters for you.
Accessing arrays is pretty fast on x86 thanks to its good support for addressing modes and its prefetchers; bitsets are more the domain of ARM, where many instructions can include a free bit shift.

From the link it seems that recomputing the mask bit (_S_maskbit) is just a left shift by the bit position taken modulo the word size, and (if _GLIBCXX_BITSET_BITS_PER_WORD is a power of 2) the modulo can be optimized by the compiler into a logical AND. So the cost of recomputing the bit is really low, probably lower than accessing a lookup table.
Given that the function is inline and relatively short, gperftools may not be profiling it accurately. Or perhaps the __pos % _GLIBCXX_BITSET_BITS_PER_WORD could not be optimized into __pos & (_GLIBCXX_BITSET_BITS_PER_WORD - 1), in which case operator% would probably be the most expensive operation here.
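For reference, here is a minimal sketch of what _S_maskbit boils down to; the names are simplified from the libstdc++ source linked in the question, with BITS_PER_WORD standing in for _GLIBCXX_BITSET_BITS_PER_WORD:

    #include <cstddef>

    // Simplified stand-in for _GLIBCXX_BITSET_BITS_PER_WORD (64 on LP64).
    const std::size_t BITS_PER_WORD = 64;

    inline unsigned long maskbit(std::size_t pos)
    {
        // Since 64 is a power of two, the compiler turns pos % 64 into
        // pos & 63, so the whole function is two cheap ALU instructions.
        return 1UL << (pos % BITS_PER_WORD);
    }

    // A table version would trade those two ALU ops for a memory load:
    //     return mask_table[pos % BITS_PER_WORD];
    // which can easily be slower once the table falls out of cache.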

Related

Integer/Floating points values with SSE

I have to multiply a vector of integers with another vector of integers, and then add the result (so, a vector of integers) to a vector of floating-point values.
Should I use MMX or SSE4 for the integers, or can I just use SSE for all of these values (even the integers), putting them in __m128 registers?
Indeed, I often keep integers in __m128 registers, and I don't know if I am wasting time (on implicit conversions) or if it makes no difference.
I am compiling with the -O3 option.
You should probably just use SSE for everything (MMX is just a very outdated precursor to SSE). If you're going to be targeting mainly newer CPUs then you might even consider AVX/AVX2.
Start by implementing everything cleanly and robustly in scalar code, then benchmark it. It's possible that a scalar implementation will be fast enough, and you won't need to do anything else. Furthermore, gcc and other compilers (e.g. clang, ICC, even Visual Studio) are getting reasonably good at auto-vectorization, so you may get SIMD-vectorized code "for free" that meets your performance needs. However if you still need better performance at this point then you can start to convert your scalar code to SSE. Keep the original scalar implementation for validation and benchmarking purposes though - it's very easy to introduce bugs when optimising code, and it's useful to know how much faster your optimised code is than the baseline code (you're probably looking for somewhere between 2x and 4x faster for SSE versus scalar code).
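If you do end up converting it, here is a minimal SSE4.1 sketch of the exact operation from the question: multiply two int arrays and add the products to a float array, four lanes at a time. It assumes n is a multiple of 4 and 16-byte-aligned pointers, must be compiled with -msse4.1, and the function name is illustrative:

    #include <immintrin.h>
    #include <cstddef>

    void mul_int_add_float(const int* a, const int* b, float* acc, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 4) {
            __m128i va   = _mm_load_si128((const __m128i*)(a + i));
            __m128i vb   = _mm_load_si128((const __m128i*)(b + i));
            __m128i prod = _mm_mullo_epi32(va, vb);   // SSE4.1 32-bit multiply
            __m128  fp   = _mm_cvtepi32_ps(prod);     // explicit int -> float
            _mm_store_ps(acc + i, _mm_add_ps(_mm_load_ps(acc + i), fp));
        }
    }

Note that the int-to-float step must be an explicit conversion (_mm_cvtepi32_ps); simply keeping integer bit patterns in a float __m128 register would not convert the values.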
While the previous answer is reasonable, there is one significant difference: data organization. For direct SSE use, data is better organized as Structure-of-Arrays (SoA). Typically, your scalar code will have data laid out as Array-of-Structures (AoS); if that is the case, conversion from scalar to vectorized form will be difficult.
More reading https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions
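As an illustration of the layout difference (a hypothetical particle example, in the spirit of the linked Intel article):

    #include <cstddef>

    const std::size_t N = 1024;  // illustrative element count

    // Array-of-Structures: natural for scalar code, but one particle's
    // fields are interleaved in memory, so SSE loads need shuffling.
    struct ParticleAoS { float x, y, z, w; };
    // ParticleAoS particles[N];

    // Structure-of-Arrays: each field is contiguous, so four consecutive
    // x values load into one __m128 with a single _mm_load_ps.
    struct ParticlesSoA {
        float x[N];
        float y[N];
        float z[N];
        float w[N];
    };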

Which bitset implementation should I use for maximum performance?

I'm currently trying to implement various algorithms in a Just In Time (JIT) compiler. Many of the algorithms operate on bitmaps, more commonly known as bitsets.
In C++ there are various ways of implementing a bitset. As a true C++ developer, I would prefer to use something from the STL. The most important aspect is performance. I don't necessarily need a dynamically resizable bitset.
As I see it, there are three possible options.
I. One option would be to use std::vector<bool>, which has been optimized for space. This also means the data isn't guaranteed to be contiguous in memory, which I guess could decrease performance. On the other hand, having one bit for each bool value could improve speed since it's very cache friendly.
II. Another option would be to instead use a std::vector<char>. It guarantees that the data is contiguous in memory and it's easier to access individual elements. However, it feels strange to use this option since it's not intended to be a bitset.
III. The third option would be to use the actual std::bitset. The fact that it's not dynamically resizable doesn't matter.
Which one should I choose for maximum performance?
Best way is to just benchmark it, because every situation is different.
I wouldn't use std::vector<bool>. I tried it once and the performance was horrible. I could improve the performance of my application by simply using std::vector<char> instead.
I didn't really compare std::bitset with std::vector<char>, but if space is not a problem in your case, I would go for std::vector<char>. It uses 8 times more space than a bitset, but since it doesn't have to do bit-operations to get or set the data, it should be faster.
Of course if you need to store lots of data in the bitset/vector, then it could be beneficial to use bitset, because that would fit in the cache of the processor.
The easiest way is to use a typedef and hide the implementation. Both bitset and vector support operator[], so it should be easy to swap one implementation for the other.
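A minimal sketch of that idea (the size and names are illustrative):

    #include <bitset>
    #include <cstddef>
    #include <vector>

    const std::size_t kBits = 4096;    // illustrative size

    typedef std::bitset<kBits> Bits;   // compact; bit arithmetic on access
    // typedef std::vector<char> Bits; // 8x the space; plain byte access
    // (only construction differs: Bits b; vs. Bits b(kBits, 0);)

    void touch(Bits& b)
    {
        b[7] = 1;                      // same syntax under either typedef
        bool set = b[7] != 0;
        (void)set;
    }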
I answered a similar question recently in this forum. I recommend my BITSCAN library. I have just released version 1.0. BITSCAN is specifically designed for fast bit scanning operations.
A BitBoard class wraps a number of different implementations for typical operations such as bsf, bsr or popcount for 64-bit words (aka bitboards). Classes BitBoardN, BBIntrin and BBSentinel extend bit scanning to bit strings. A bit string in BITSCAN is an array of bitboards. The base wrapper class for a bit string is BitBoardN. BBIntrin extends BitBoardN by using Windows compiler intrinsics over 64 bitboards. BBIntrin is made portable to POSIX by using the appropriate asm equivalent functions.
I have used BITSCAN to implement a number of efficient solvers for NP combinatorial problems in the graph domain. Typically the adjacency matrix of the graph as well as vertex sets are encoded as bit strings and typical computations are performed using bit masks. Code for simple bitencoded graph objects is available in GRAPH. Examples of how to use BITSCAN and GRAPH are also available.
A comparison between BITSCAN and typical implementations in STL (bitset) and BOOST (dynamic_bitset) can be found here:
http://blog.biicode.com/bitscan-efficiency-at-glance/
You might also be interested in this (somewhat dated) paper:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
[Update] The previous link seems to be broken, but I think it was pointing to this article:
https://www.researchgate.net/publication/220803585_Performance_of_C_bit-vector_implementations

Fast implementation of MD5 in C++

First of all, to be clear, I'm aware that a huge number of MD5 implementations exist in C++. What I'm wondering is whether there is a comparison showing which implementation is faster than the others. Since I'm using this MD5 hash function on files larger than 10GB, speed is indeed a major concern here.
I think the point avakar is trying to make is: with modern processing power, the I/O speed of your hard drive is the bottleneck, not the calculation of the hash. Getting a more efficient algorithm will not help you, as that is (likely) not the slowest part.
If you are doing anything special (thousands of rounds, for example) then it may be different, but if you are just calculating the hash of a file, you need to speed up your I/O, not your math.
I don't think it matters much (on the same hardware; GPGPUs are different, and perhaps faster, hardware for this kind of problem). The main part of MD5 is a fairly complex loop of arithmetic operations. What really matters is the quality of the compiler's optimizations.
What also matters is how you read the file. On Linux, mmap, madvise and readahead could be relevant. Disk speed is probably the bottleneck (use an SSD if you can).
And are you sure you want MD5 specifically? There are simpler and faster hashing algorithms (MD4, etc.). Still, your problem is more I/O-bound than CPU-bound.
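To make the mmap suggestion concrete, here is a hedged Linux-only sketch. md5_update is a placeholder for whatever MD5 implementation you pick, not a real API, and for very large files on 32-bit systems you would map the file in chunks rather than all at once:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void hash_file(const char* path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                // Tell the kernel we will stream through the mapping once.
                madvise(p, st.st_size, MADV_SEQUENTIAL);
                // md5_update(ctx, p, st.st_size);  // placeholder hash call
                munmap(p, st.st_size);
            }
        }
        close(fd);
    }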
I'm sure there are plenty of CUDA/OpenCL adaptations of the algorithm out there which should give you a definite speedup. You could also take the basic algorithm and, with a bit of thought, get a CUDA/OpenCL implementation going yourself.
Block ciphers are perfect candidates for this type of implementation.
You could also get a C implementation of it and grab a copy of the Intel C compiler and see how good that is. The vectorization extensions in Intel CPUs are amazing for speed boosts.
There's a table available here:
http://www.golubev.com/gpuest.htm
It looks like your bottleneck will probably be your hard drive I/O.

bitscan (bsf) on std::bitset? Or similar

I'm doing a project that involves solving some NP-hard graph problems. Specifically triangulation of Bayesian networks...
Anyway, I'm using std::bitset for creating an adjacency matrix, and this is very nice... But I would like to scan the bitset using a bsf instruction, i.e. not using a while loop.
Anybody know if this is possible?
Looking at the STL that comes with gcc 4.0.0, the bitset methods _Find_first and _Find_next already do what you want. Specifically, they use __builtin_ctzl() (described here), which should use the appropriate instruction. (I would guess that the same applies for older versions of gcc.)
And the nice thing is that bitset already does the right thing: single instruction if it's a bitset that fits within a single unsigned long; a loop over the longs if it uses several. In case of a loop, it's a loop whose length is known at compile time, with a handful of instructions, so it might be fully unrolled by the optimizer. I.e. it would probably be hard to beat bitset by rolling your own.
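A minimal example of those (nonstandard, libstdc++-only) members:

    #include <bitset>
    #include <cstdio>

    int main()
    {
        std::bitset<128> bs;
        bs[3] = bs[64] = bs[100] = true;

        // _Find_first/_Find_next return bs.size() when no bit is left;
        // internally they use __builtin_ctzl on each word.
        for (std::size_t i = bs._Find_first(); i < bs.size(); i = bs._Find_next(i))
            std::printf("bit %zu is set\n", i);
        return 0;
    }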
Use std::bitset::to_ulong and then BSF it (you'll need a custom container for something bigger than 32 bits though, or some way to get at each 32/64-bit chunk of the bits).
For BSF on MSVC you can use _BitScanForward; otherwise you can use something from here.
You might also want to look into using BitMagic instead
You might also consider looking at the BITSCAN C++ library for bit array manipulation, and Ugraph for managing bit-encoded undirected graphs. Since it is my code, I will not add further comments on it.
Hope it helps! (If it does, I would be interested in any feedback.)

What is more efficient: a switch case or a std::map?

I'm thinking about the tokenizer here.
Each token calls a different function inside the parser.
What is more efficient:
A map of std::functions/boost::functions
A switch case
I would suggest reading switch() vs. lookup table? from Joel on Software. Particularly, this response is interesting:
" Prime example of people wasting time
trying to optimize the least
significant thing."
Yes and no. In a VM, you typically
call tiny functions that each do very
little. It's the not the call/return
that hurts you as much as the preamble
and clean-up routine for each function
often being a significant percentage
of the execution time. This has been
researched to death, especially by
people who've implemented threaded
interpreters.
In virtual machines, lookup tables storing computed addresses are usually preferred to switches (direct threading, using the "labels as values" extension: dispatch jumps directly to the label address stored in the lookup table). That's because, under certain conditions, it reduces branch mispredictions, which are extremely expensive on long-pipelined CPUs (a misprediction forces a pipeline flush). It does, however, make the code less portable.
This issue has been discussed extensively in the VM community; I would suggest you look for scholarly papers in this field if you want to read more about it. Ertl & Gregg wrote a great article on this topic in 2001, The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures.
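For flavor, here is a tiny direct-threaded dispatch sketch using GCC's nonstandard "labels as values" extension (the opcodes are invented for the example):

    #include <cstdio>

    int run(const int* code)
    {
        // Dispatch table: one label address per opcode.
        static void* const dispatch[] = { &&op_inc, &&op_dbl, &&op_halt };
        int acc = 0;
        const int* ip = code;

        goto *dispatch[*ip++];
    op_inc:                          // opcode 0: acc += 1
        acc += 1;
        goto *dispatch[*ip++];       // jump straight to the next handler
    op_dbl:                          // opcode 1: acc *= 2
        acc *= 2;
        goto *dispatch[*ip++];
    op_halt:                         // opcode 2: stop
        return acc;
    }

    int main()
    {
        const int program[] = { 0, 0, 1, 2 };  // inc, inc, dbl, halt
        std::printf("%d\n", run(program));     // prints 4
        return 0;
    }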
But as mentioned, I'm pretty sure these details are not relevant for your code. They are small details, and you should not focus too much on them. The Python interpreter uses switches because they think it makes the code more readable. Why not pick the approach you're most comfortable with? The performance impact will be rather small; you'd better focus on code readability for now ;)
Edit: if it matters, using a hash table will always be slower than a direct lookup table. For a lookup table, you use enum values as your "keys", and the value is retrieved via a single indirect jump: a single assembly operation, O(1). A hash table lookup first requires computing a hash and then retrieving the value, which is far more expensive.
Using an array where the function addresses are stored, indexed by the values of an enum, is good; using a hash table for the same purpose adds significant overhead.
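A minimal sketch of that direct lookup table (token names invented):

    #include <cstdio>

    enum Token { TOK_PLUS, TOK_MINUS, TOK_NUM, TOK_COUNT };

    void on_plus()  { std::puts("plus");  }
    void on_minus() { std::puts("minus"); }
    void on_num()   { std::puts("num");   }

    // Dispatch is one indexed load plus one indirect call: O(1).
    void (* const handlers[TOK_COUNT])() = { on_plus, on_minus, on_num };

    void dispatch(Token t)
    {
        handlers[t]();
    }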
To sum up, we have:
cost(Hash_table) >> cost(direct_lookup_table)
cost(direct_lookup_table) ~= cost(switch) if your compiler translates switches into lookup tables.
cost(switch) >> cost(direct_lookup_table) (O(N) vs. O(1)) if your compiler does not translate the switch and emits a chain of conditionals instead, though I can't think of any compiler doing this.
But inlined direct threading makes the code less readable.
The std::map that comes with Visual Studio 2008 will give you O(log n) for each function call, since it hides a tree structure beneath.
With a modern compiler (depending on the implementation), a switch statement will give you O(1): the compiler translates it into some kind of lookup table.
So in general, switch is faster.
However, consider the following facts:
The difference between map and switch is that a map can be built dynamically while a switch can't, and a map can have any arbitrary type as a key while a switch is limited to C++ integral types (char, int, enum, etc.).
By the way, you can use a hash map to achieve nearly O(1) dispatching (though, depending on the hash table implementation, it can degrade to O(n) in the worst case). Even so, switch will still be faster.
Edit
I am writing the following just for fun and for the sake of the discussion.
I can suggest a nice optimization for you, but it depends on the nature of your language and on whether you can anticipate how your language will be used.
When you write the code:
You divide your tokens into two groups, one of very frequently used tokens and one of rarely used ones. You also sort the frequently used tokens by frequency.
For the frequent tokens you write an if-else series with the most frequent coming first; for the rare ones, you write a switch statement (see the sketch below).
The idea is to use the CPU's branch prediction to avoid another level of indirection (assuming the condition checks in the if statements are nearly costless).
In most cases the CPU will pick the correct branch without any level of indirection. There will be a few cases, however, in which the branch goes to the wrong place.
Depending on the nature of your language, statistically this may give better performance.
Edit: due to some comments below, I changed the sentence claiming that compilers will always translate a switch into a lookup table.
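A sketch of that frequency-split dispatcher (all token names and handlers are invented for illustration):

    enum { TOK_IDENT, TOK_NUMBER, TOK_LBRACE, TOK_RBRACE };

    void handle_ident()  {}
    void handle_number() {}
    void handle_lbrace() {}
    void handle_rbrace() {}
    void handle_error()  {}

    void dispatch(int tok)
    {
        // Hot path: the two most frequent tokens get well-predicted branches.
        if (tok == TOK_IDENT)  { handle_ident();  return; }
        if (tok == TOK_NUMBER) { handle_number(); return; }

        // Cold path: rare tokens go through a switch, which the compiler
        // may or may not turn into a jump table.
        switch (tok) {
        case TOK_LBRACE: handle_lbrace(); break;
        case TOK_RBRACE: handle_rbrace(); break;
        default:         handle_error();  break;
        }
    }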
What is your definition of "efficient"? If you mean faster, then you probably should profile some test code for a definite answer. If you're after flexible and easy-to-extend code though, then do yourself a favor and use the map approach. Everything else is just premature optimization...
Like yossi1981 said, a switch can be optimized into a fast lookup table, but there is no guarantee; every compiler has its own heuristics for deciding whether to implement the switch as consecutive ifs, as a fast lookup table, or as a combination of both.
To get a fast switch, your case values should meet the following rule: they should be consecutive, e.g. 0, 1, 2, 3, 4. You can leave a few values out, but sparse sets like 0, 1, 2, 34, 43 are extremely unlikely to be optimized into a table.
The question really is: is the performance of such significance in your application?
And wouldn't a map which loads its values dynamically from a file be more readable and maintainable instead of a huge statement which spans multiple pages of code?
You don't say what type your tokens are. If they are not integral types, you don't have a choice: switches only work with integral types.
The C++ standard says nothing about the performance of its requirements, only that the functionality should be there.
These sort of questions about which is better or faster or more efficient are meaningless unless you state which implementation you're talking about. For example, the string handling in a certain version of a certain implementation of JavaScript was atrocious, but you can't extrapolate that to being a feature of the relevant standard.
I would even go so far as to say it doesn't matter regardless of the implementation since the functionality provided by switch and std::map is different (although there's overlap).
These sort of micro-optimizations are almost never necessary, in my opinion.