Which bitset implementation should I use for maximum performance? - c++

I'm currently trying to implement various algorithms in a Just In Time (JIT) compiler. Many of the algorithms operate on bitmaps, more commonly known as bitsets.
In C++ there are various ways of implementing a bitset. As a true C++ developer, I would prefer to use something from the STL. The most important aspect is performance. I don't necessarily need a dynamically resizable bitset.
As I see it, there are three possible options.
I. One option would be to use std::vector<bool>, which has been optimized for space. This would also indicate that the data doesn't have to be contiguous in memory. I guess this could decrease performance. On the other hand, having one bit for each bool value could improve speed since it's very cache friendly.
II. Another option would be to instead use a std::vector<char>. It guarantees that the data is contiguous in memory and it's easier to access individual elements. However, it feels strange to use this option since it's not intended to be a bitset.
III. The third option would be to use the actual std::bitset. The fact that it's not dynamically resizable doesn't matter here.
Which one should I choose for maximum performance?

Best way is to just benchmark it, because every situation is different.
I wouldn't use std::vector<bool>. I tried it once and the performance was horrible. I could improve the performance of my application by simply using std::vector<char> instead.
I didn't really compare std::bitset with std::vector<char>, but if space is not a problem in your case, I would go for std::vector<char>. It uses 8 times more space than a bitset, but since it doesn't have to do bit-operations to get or set the data, it should be faster.
Of course, if you need to store lots of data in the bitset/vector, then it could be beneficial to use a bitset, because the packed representation is more likely to fit in the processor's cache.
The easiest way is to use a typedef and hide the implementation. Both bitset and vector support the [] operator, so it should be easy to switch one implementation for the other.
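For example, a minimal sketch of that approach (the 1024-bit capacity is my own assumption; note that only construction differs between the two choices):

#include <bitset>
#include <vector>

typedef std::bitset<1024> Bits;     // construct with: Bits b;
// typedef std::vector<char> Bits;  // construct with: Bits b(1024);

void set_and_test(Bits& b) {
    b[42] = 1;                // operator[] works for both choices
    if (b[42]) {
        // ...
    }
}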

I answered a similar question recently in this forum. I recommend my BITSCAN library. I have just released version 1.0. BITSCAN is specifically designed for fast bit scanning operations.
A BitBoard class wraps a number of different implementations for typical operations such as bsf, bsr or popcount for 64-bit words (aka bitboards). Classes BitBoardN, BBIntrin and BBSentinel extend bit scanning to bit strings. A bit string in BITSCAN is an array of bitboards. The base wrapper class for a bit string is BitBoardN. BBIntrin extends BitBoardN by using Windows compiler intrinsics over 64 bitboards. BBIntrin is made portable to POSIX by using the appropriate asm equivalent functions.
I have used BITSCAN to implement a number of efficient solvers for NP combinatorial problems in the graph domain. Typically the adjacency matrix of the graph as well as vertex sets are encoded as bit strings and typical computations are performed using bit masks. Code for simple bitencoded graph objects is available in GRAPH. Examples of how to use BITSCAN and GRAPH are also available.
A comparison between BITSCAN and typical implementations in STL (bitset) and BOOST (dynamic_bitset) can be found here:
http://blog.biicode.com/bitscan-efficiency-at-glance/

You might also be interested in this (somewhat dated) paper:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
[Update] The previous link seems to be broken, but I think it was pointing to this article:
https://www.researchgate.net/publication/220803585_Performance_of_C_bit-vector_implementations

Related

What are the advantages of a custom data structure?

What's the need to go for defining and implementing data structures (e.g. stack) ourselves if they are already available in C++ STL?
What are the differences between the two implementations?
First, implementing an existing data structure yourself is a useful exercise. You understand better what it does (so you can understand better what the standard containers do). In particular, you understand better why time complexity is so important.
Then, there is a quality of implementation issue. The standard implementation might not be suitable for you.
Let me give an example. Indeed, std::stack implements a stack. It is a general-purpose implementation. Have you measured sizeof(std::stack<char>)? Have you benchmarked it in the case of a million stacks of 3.2 elements on average, with a Poisson distribution?
Perhaps in your case, you happen to know that you have millions of stacks of chars (never NUL), and that 99% of them have fewer than 4 elements. With that additional knowledge, you probably should be able to implement something "better" than what the standard C++ stack provides. So std::stack<char> would work, but given that extra knowledge you'll be able to implement it differently. You still (for readability and maintenance) would use the same methods as in std::stack<char> - so your WeirdSmallStackOfChar would have a push method, etc. If later during the project you realize that bigger stacks might be useful (e.g. in 1% of cases), you'll reimplement your stack differently (e.g. if your code base grows to a million lines of C++ and you realize that you quite often have bigger stacks, you might "remove" your WeirdSmallStackOfChar class and add typedef std::stack<char> WeirdSmallStackOfChar;).
If you happen to know that all your stacks have fewer than 4 chars and that \0 is not valid in them, representing such "stacks" as a char w[4] field is probably the wisest approach. It is fast and easy to code.
So, if performance and memory space matter, you might perhaps code something as weird as
#include <stack>

class MyWeirdStackOfChars {
    bool small;                        // true: smallstack is active
    union {
        std::stack<char>* bigstack;    // rare big case, heap-allocated
        char smallstack[4];            // common small case, in place
    };
    // push, pop, etc. go here
};
Of course, that is very incomplete. When small is true your implementation uses smallstack; for the 1% of cases where it is false, it uses bigstack. The rest of MyWeirdStackOfChars is left as an exercise (not that easy) for the reader. Don't forget to follow the rule of five.
Ok, maybe the above example is not convincing. But what about std::map<int,double>? You might have millions of them, and you might know that 99.5% of them have fewer than 5 entries. You obviously could optimize for that case. It is highly probable that representing small maps by an array of pairs of int and double is more efficient both in terms of memory and in terms of CPU time.
Sometimes, you even know that all your maps have fewer than 16 entries (and std::map<int,double> doesn't know that) and that the key is never 0. Then you might represent them differently. In that case, I guess that I could implement something much more efficient than what std::map<int,double> provides (probably, because of cache effects, an array of 16 entries with an int and a double is the fastest); see the sketch below.
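A sketch of that idea under the stated assumptions (at most 16 entries, key 0 never used, so 0 can mark an empty slot); this is illustrative, not a drop-in std::map replacement:

#include <cstddef>

struct SmallIntDoubleMap {
    int    keys[16]   = {};   // 0 marks an empty slot (key 0 is never used)
    double values[16] = {};

    double* find(int key) {               // linear scan: cache friendly
        for (std::size_t i = 0; i < 16; ++i)
            if (keys[i] == key) return &values[i];
        return nullptr;
    }

    bool insert(int key, double value) {  // returns false when full
        for (std::size_t i = 0; i < 16; ++i)
            if (keys[i] == 0 || keys[i] == key) {
                keys[i] = key;
                values[i] = value;
                return true;
            }
        return false;
    }
};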
That is why any developer should know the classical algorithms (and have read some Introduction to Algorithms), even if in many cases they would use existing containers. Be aware of the as-if rule as well.
The STL implementation of a data structure is not perfect for every possible use case.
I like the example of hash tables. I have been using the STL implementation for a while, but I use it mainly for competitive programming contests.
Imagine that you are Google and you have billions of dollars in resources dedicated to storing and accessing hash tables. You would probably want the best possible implementation for the company's use cases, since it would save resources and make search faster in general.
Oh, and I forgot to mention that you also have some of the best engineers on the planet working for you (:
(This video is a talk by Kulukundis about the new hash table made by his team at Google.)
https://www.youtube.com/watch?v=ncHmEUmJZf4
Some other reasons that justify implementing your own version of Data Structures:
Test your understanding of a specific structure.
Customize part of the structure to some peculiar use case.
Seek better performance than STL for a specific data structure.
Hating STL errors.
Benchmarking STL against some simple implementation.

How to improve the C++ STL bitset efficiency?

I have a program that heavily uses the STL's bitset, and gperftools shows that one of the performance bottlenecks is std::_Base_bitset::_S_maskbit (inline).
From here https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a00775_source.html#l00078 it seems a mask for accessing or modifying a bitset is always recomputed. This makes me wonder if a look-up table would help.
I tried to implement my own version of bitset, where a mask look-up table is used. However since my version does not use gcc built-in instructions like __builtin_memcpy, it is actually much slower than the STL bitset.
So I wonder if there is a way to replace std::_Base_bitset::_S_maskbit, or should I write my own version of bitset by copying the code of STL bitset and adding a look-up table.
Thanks!
If your bitsets are sufficiently small, using a std::vector<char> can be an improvement. Sure, you use 8 times the memory, but you wouldn't need to calculate masks anymore and profiling showed that's relevant for you.
Accessing arrays is pretty fast on x86 due to its good support for addressing modes and prefetchers, but bitsets are more the domain of ARM where many operations can include a free bitshift.
From the link, the recomputed mask bit (_S_maskbit) is just a modulo followed by a left shift, and the modulo (if _GLIBCXX_BITSET_BITS_PER_WORD is a power of 2) can be optimized by the compiler into a bitwise AND. So the cost of recomputing the mask is really low, probably lower than accessing a lookup table.
Given that the function is inlined and relatively short, gperftools may not be accurate in profiling it. Or perhaps the __pos % _GLIBCXX_BITSET_BITS_PER_WORD could not be optimized to __pos & (_GLIBCXX_BITSET_BITS_PER_WORD - 1), in which case operator% would probably be the most expensive operation here.
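For reference, a sketch of what the mask computation boils down to (assuming a 64-bit word; the function name is mine, not libstdc++'s):

#include <cstddef>

// With a power-of-two word size the compiler can turn the modulo into
// a cheap AND, so computing the mask is typically two instructions.
inline unsigned long maskbit(std::size_t pos) {
    return 1UL << (pos % 64);   // compiled as: 1UL << (pos & 63)
}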

choose appropriate data structure

I'm implementing C++ code communicating with hardware which runs a number of hardware-assisted data structures (direct access tables, and search trees). So I need to maintain a local cache which would store data before pushing it down on the hardware.
I think that to replicate the H/W tree structure I could choose std::map, but what about the direct table (basically, it is implemented as a sequential array of results and allows direct-access lookups)?
Are there close enough analogues in STL to implement such structures or simple array would suffice?
Thanks.
If you are working with hardware structures, you are probably best off mimicking the structures as exactly as possible using C structs and C arrays.
This will give you the ability to map the hardware structure as exactly as possible and to move the data around with a simple memcpy.
The STL will probably not be terribly useful here since it does lots of stuff behind the scenes and you have no control over the memory layout. This would mean that each write to hardware involves a complex serialization exercise that you will probably want to avoid.
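A hypothetical sketch of that approach (the entry layout, sizes, and names are assumptions, not from the question):

#include <cstdint>
#include <cstring>

struct Entry {                 // must match the hardware layout exactly
    std::uint32_t key;
    std::uint32_t result;
};

Entry cache[1024];             // local shadow of the direct-access table

void flush_to_hw(void* hw_base) {
    // one contiguous copy; no per-field serialization step needed
    std::memcpy(hw_base, cache, sizeof(cache));
}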
I believe you're looking for std::vector. Or, if the size is known at compile time, std::array (since C++11).
C++11 has std::unordered_map and std::unordered_set, which are analogous to a hash table. Maps are faster for iteration, while sets are faster for lookup.
But first you should run a profiler to see if your data structures are what slows your program down.

bitscan (bsf) on std::bitset ? Or similar

I'm doing a project that involves solving some NP-hard graph problems. Specifically triangulation of Bayesian networks...
Anyway, I'm using std::bitset for creating an adjacency matrix and this is very nice... But I would like to scan the bitset using a bsf instruction, i.e. not using a while loop.
Anybody know if this is possible?
Looking at the STL that comes with gcc 4.0.0, the bitset methods _Find_first and _Find_next already do what you want. Specifically, they use __builtin_ctzl() (described here), which should use the appropriate instruction. (I would guess that the same applies for older versions of gcc.)
And the nice thing is that bitset already does the right thing: a single instruction if the bitset fits within a single unsigned long; a loop over the longs if it uses several. In the case of a loop, its length is known at compile time and the body is a handful of instructions, so it might be fully unrolled by the optimizer. I.e., it would probably be hard to beat bitset by rolling your own.
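For example (this relies on non-standard libstdc++ extensions, so it is not portable to other standard library implementations):

#include <bitset>
#include <cstdio>

int main() {
    std::bitset<128> b;
    b.set(3); b.set(70); b.set(127);
    // _Find_first/_Find_next use __builtin_ctzl internally and
    // return b.size() when no further set bit exists
    for (std::size_t i = b._Find_first(); i < b.size(); i = b._Find_next(i))
        std::printf("bit %zu is set\n", i);
}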
Use std::bitset::to_ulong and then BSF it (you'll need a custom container for something bigger than 32 bits, though, or some way to get at each 32/64-bit chunk of bits).
For BSF on MSVC you can use _BitScanForward; otherwise you can use something from here. A portable sketch follows.
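For instance, a first-set-bit helper sketched for MSVC (x64) and GCC/Clang; the function name is my own:

#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

inline int bsf64(std::uint64_t x) {   // precondition: x != 0
#ifdef _MSC_VER
    unsigned long idx;
    _BitScanForward64(&idx, x);       // MSVC intrinsic (x64)
    return static_cast<int>(idx);
#else
    return __builtin_ctzll(x);        // GCC/Clang builtin
#endif
}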
You might also want to look into using BitMagic instead
You might also consider looking at the BITSCAN C++ library for bit array manipulation and Ugraph for managing bit-encoded undirected graphs. Since it is my code I will not add further comments on it.
Hope it helps! (If it does I would be interested in any feedback).

Super high performance C/C++ hash map (table, dictionary) [closed]

I need to map primitive keys (int, maybe long) to struct values in a high-performance hash map data structure.
My program will have a few hundred of these maps, and each map will generally have at most a few thousand entries. However, the maps will be "refreshing" or "churning" constantly; imagine processing millions of add and delete messages a second.
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
I would recommend trying Google SparseHash (or the C++11 version, sparsehash-c11) and see if it suits your needs. They have a memory-efficient implementation as well as one optimized for speed.
I did a benchmark a long time ago, it was the best hashtable implementation available in terms of speed (however with drawbacks).
What libraries in C or C++ have a data structure that fits this use case? Or, how would you recommend building your own? Thanks!
Check out the LGPL'd Judy arrays. I have never used them myself, but they were advertised to me on a few occasions.
You can also try to benchmark the STL containers (std::unordered_map, etc.). Depending on the platform/implementation and source-code tuning (preallocate as much as you can; dynamic memory management is expensive) they could be performant enough.
Also, if performance of the final solution trumps the cost of the solution, you can try to order the system with sufficient RAM to put everything into plain arrays. Performance of access by index is unbeatable.
The add/delete operations are much (100x) more frequent than the get operation.
That hints that you might want to concentrate on improving algorithms first. If data are only written, not read, then why write them at all?
Just use boost::unordered_map (or tr1, etc.) by default. Then profile your code and see if that code is the bottleneck. Only then would I suggest precisely analyzing your requirements to find a faster substitute.
If you have a multithreaded program, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map, but its main functions are thread safe.
Also have a look at Facebook's folly library; it has a high-performance concurrent hash table and skip list.
khash is very efficient. The author's detailed benchmark is here: https://attractivechaos.wordpress.com/2008/10/07/another-look-at-my-old-benchmark/ and it shows khash beating many other hash libraries.
From the Android sources (thus Apache 2 licensed):
https://github.com/CyanogenMod/android_system_core/tree/ics/libcutils
Look at hashmap.c and pick up include/cutils/hashmap.h. If you don't need thread safety you can remove the mutex code; a sample implementation is in libcutils/str_parms.c.
First check if existing solutions like libmemcache fit your needs.
If not ...
Hash maps seem to be the definite answer to your requirement. They provide O(1) lookup based on the keys. Most STL libraries provide some sort of hash map these days, so use the one provided by your platform.
Once that part is done, you have to test the solution to see if the default hashing algorithm is good enough for your needs performance-wise.
If it is not, you should explore some good fast hashing algorithms found on the net:
good old prime number multiply algo
http://www.azillionmonkeys.com/qed/hash.html
http://burtleburtle.net/bob/
http://code.google.com/p/google-sparsehash/
If this is not good enough, you could roll your own hashing module that fixes the problems you saw with the STL containers you tested, using one of the hashing algorithms above; a sketch of the plumbing follows. Be sure to post the results somewhere.
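For example, a sketch of plugging a multiply-based hash into std::unordered_map (the constant is the classic golden-ratio multiplier; treat this as illustrative, not a recommendation):

#include <cstdint>
#include <unordered_map>

struct IntHash {
    std::size_t operator()(std::uint64_t k) const {
        // Fibonacci hashing: multiply by 2^64 / phi
        return static_cast<std::size_t>(k * 0x9E3779B97F4A7C15ULL);
    }
};

std::unordered_map<std::uint64_t, double, IntHash> fast_map;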
Oh, and it's interesting that you have multiple maps... Perhaps you can simplify by using a 64-bit key whose high bits distinguish which map it belongs to, and adding all key-value pairs to one giant hash (see the sketch below). I have seen hashes with a hundred thousand or so symbols working perfectly well on the basic prime-number hashing algorithm.
You can check how that solution performs compared to hundreds of maps; I think that could be better from a memory-profiling point of view. Please do post the results somewhere if you get to do this exercise.
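A sketch of the combined-key idea (the names and the 32/32 bit split are my own assumptions):

#include <cstdint>
#include <unordered_map>

struct Value { double payload; };   // stand-in for your struct

// pack the map id into the high 32 bits of the key
inline std::uint64_t make_key(std::uint32_t map_id, std::uint32_t key) {
    return (static_cast<std::uint64_t>(map_id) << 32) | key;
}

std::unordered_map<std::uint64_t, Value> giant_map;
// usage: giant_map[make_key(7, 12345)] = Value{3.14};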
I believe that, more than the hashing algorithm, the constant adding/deleting of memory (can it be avoided?) and the CPU cache usage profile might be more crucial for the performance of your application.
good luck
Try hash tables from Miscellaneous Container Templates. Its closed_hash_map is about the same speed as Google's dense_hash_map, but is easier to use (no restriction on contained values) and has some other perks as well.
I would suggest uthash. Just #include "uthash.h", then add a UT_hash_handle to the structure and choose one or more fields in your structure to act as the key. A word about performance here.
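A minimal usage sketch (the field names are illustrative, not prescribed by the library):

#include "uthash.h"

struct item {
    int id;                 // the key field
    double payload;
    UT_hash_handle hh;      // makes this structure hashable
};

struct item* items = NULL;  // the hash table head

void add_item(struct item* it) {
    HASH_ADD_INT(items, id, it);        // "id" names the key field
}

struct item* find_item(int id) {
    struct item* it;
    HASH_FIND_INT(items, &id, it);      // it == NULL if not found
    return it;
}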
http://incise.org/hash-table-benchmarks.html
gcc has a very good implementation. However, mind that it must respect a very bad standard decision:
If a rehash happens, all iterators are invalidated, but references and
pointers to individual elements remain valid. If no actual rehash
happens, no changes.
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
This basically means the standard says that the implementation MUST be based on linked lists.
It prevents open addressing which has better performance.
I think Google's sparse hash uses open addressing, though in these benchmarks only the dense version outperforms the competition.
However, the sparse version outperforms all the competition in memory usage. (Also, it doesn't have any plateau; memory is a pure straight line with respect to the number of elements.)