I'm doing a project that involves solving some NP-hard graph problems. Specifically triangulation of Bayesian networks...
Anyway, I'm using std::bitset to represent an adjacency matrix, and it works very nicely... But I would like to scan the bitset using a bsf instruction, i.e. without a bit-by-bit while loop.
Anybody know if this is possible?
Looking at the STL that comes with gcc 4.0.0, the bitset methods _Find_first and _Find_next already do what you want. Specifically, they use __builtin_ctzl() (described here), which should use the appropriate instruction. (I would guess that the same applies for older versions of gcc.)
And the nice thing is that bitset already does the right thing: single instruction if it's a bitset that fits within a single unsigned long; a loop over the longs if it uses several. In case of a loop, it's a loop whose length is known at compile time, with a handful of instructions, so it might be fully unrolled by the optimizer. I.e. it would probably be hard to beat bitset by rolling your own.
Use std::bitset::to_ulong and then BSF it (you'll need a custom container for anything bigger than 32 bits, though, or some way to get a reference to each 32/64-bit chunk of bits).
For BSF on MSVC you can use _BitScanForward; otherwise you can use something from here.
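A sketch of wrapping the two compilers' intrinsics behind one function (the wrapper name bit_scan_forward is made up; the result is undefined when the input is zero):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the lowest set bit of a 64-bit value (input must be nonzero).
inline unsigned bit_scan_forward(std::uint64_t x) {
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward64(&idx, x);       // MSVC intrinsic, emits BSF/TZCNT
    return static_cast<unsigned>(idx);
#else
    return static_cast<unsigned>(__builtin_ctzll(x));  // GCC/Clang builtin
#endif
}
```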
You might also want to look into using BitMagic instead
You might consider looking at the BITSCAN C++ library for bit array manipulation, and Ugraph for managing bit-encoded undirected graphs. Since it is my code I will not add further comments on it.
Hope it helps! (If it does I would be interested in any feedback).
Related
I'm looking for the fastest way to do simple operations using Eigen. There are so many data structures available, it's hard to tell which is the fastest.
I've tried to predefine my data structures, but even then my code is being outperformed by similar Fortran code. I've guessed Eigen::Vector3d is the fastest for my needs (since it's predefined), but I could easily be wrong. Using -O3 optimization at compile time gave me a big boost, but I'm still running 4x slower than a Fortran implementation of the same code.
I make use of an 'Atom' structure, which is then stored in an 'atoms' vector defined by the following:
struct Atom {
    std::string element;
    //double x, y, z;
    Eigen::Vector3d coordinate;
};

std::vector<Atom> atoms;
The slowest part of my code is the following:
distance = atoms[i].coordinate - atoms[j].coordinate;
distance_norm = distance.norm();
Is there a faster data structure I could use? Or is there a faster way to perform these basic operations?
As you pointed out in your comment, adding the -fno-math-errno compiler flag gives you a huge increase in speed. As to why that happens: your code snippet shows that you're doing a sqrt via distance_norm = distance.norm();.
The flag makes the compiler not set errno after each sqrt (that's a saved write to a thread-local variable), which is faster and enables vectorization of any loop that does this repeatedly. The only disadvantage is that IEEE adherence is lost. See the gcc man page.
Another thing you might want to try is adding -march=native and adding -mfma if -march=native doesn't turn it on for you (I seem to remember that in some cases it wasn't turned on by native and had to be turned on by hand - check here for details). And as always with Eigen, you can disable bounds checking with -DNDEBUG.
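Putting those flags together, a GCC/Clang invocation might look like this (the source file name is made up):

```shell
# -fno-math-errno: skip the errno write after sqrt (enables vectorization)
# -march=native:   use the host CPU's SIMD instructions (add -mfma if native
#                  doesn't turn it on for you)
# -DNDEBUG:        disable Eigen's runtime bounds checking
g++ -O3 -fno-math-errno -march=native -DNDEBUG main.cpp -o main
```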
SoA instead of AoS!!! If performance is actually a real problem, consider using a single 4xN matrix to store the positions (and have Atom keep the column index instead of the Eigen::Vector3d). It shouldn't matter too much in the small code snippet you showed, but depending on the rest of your code, may give you another huge increase in performance.
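To illustrate the layout difference in plain C++ (with Eigen you would use a 3xN or 4xN matrix and .col(i) instead; the Atoms struct below is a hypothetical sketch, not Eigen's API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Structure of Arrays: one contiguous array per coordinate, so a loop
// over many atom pairs streams through memory linearly and vectorizes
// well. (In Eigen, a Matrix3Xd / 4xN matrix plays this role.)
struct Atoms {
    std::vector<double> x, y, z;

    double distance(std::size_t i, std::size_t j) const {
        const double dx = x[i] - x[j];
        const double dy = y[i] - y[j];
        const double dz = z[i] - z[j];
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    }
};
```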
Given you are ~4x off, it might be worth checking that you have enabled vectorization such as AVX or AVX2 at compile time. There are of course also SSE2 (~2x) and AVX512 (~8x) when dealing with doubles.
Either try another compiler like the Intel C++ compiler (free for academic and non-profit usage), or use other libraries like Intel MKL (far faster than your own code) or other BLAS/LAPACK implementations for dense matrices, or PARDISO or SuperLU (not sure if it still exists) for sparse matrices.
So basically, what I need is to have 28,800 values which can be accessed by an index and can all be set to true or false. Using an array of bools or integers isn't an option, because the size needs to be set during runtime using a parameter. Using a vector is way too slow and memory intensive. I'm new to C++ and therefore don't have any idea on how to solve this; can anyone help?
EDIT: Thank you to all who commented! Like I said, I'm new to C++ programming and your answers really helped me to understand the functionality behind vectors.
So, after everyone said that vector isn't slow I checked again and it turns out that my program was running so slow because of another bug I had while filling the vector. But especially midor's and Some programmer dude's answers helped me to make the program run a bit faster than before, so thanks!
Using a vector is way too slow and memory intensive.
C++ specializes std::vector<bool> so it uses only as much memory as it needs. One bit per "flag" (+ bookkeeping overhead of course).
You could only optimize that implementation if you knew its size a priori (which you don't, according to your question), or if you know that the bitmap will contain very few set bits (e.g. 1 bit in 50'000, but you'd need to measure whether a more complex implementation would be worth it). For sparse bitmaps, a std::unordered_set<std::uint32_t> that stores the set bits could be an option.
But 28'800 is a very small number, so don't waste your time on optimizations. You won't get any benefits from it.
I have a program that heavily uses the STL's bitset, and gperftools shows that one of the performance bottlenecks is std::_Base_bitset::_S_maskbit (inline).
From here https://gcc.gnu.org/onlinedocs/gcc-4.6.2/libstdc++/api/a00775_source.html#l00078 it seem a mask for accessing or modifying a bitset is always recomputed. This makes me wonder if a look-up table would help.
I tried to implement my own version of bitset, where a mask look-up table is used. However since my version does not use gcc built-in instructions like __builtin_memcpy, it is actually much slower than the STL bitset.
So I wonder if there is a way to replace std::_Base_bitset::_S_maskbit, or should I write my own version of bitset by copying the code of STL bitset and adding a look-up table.
Thanks!
If your bitsets are sufficiently small, using a std::vector<char> can be an improvement. Sure, you use 8 times the memory, but you wouldn't need to calculate masks anymore and profiling showed that's relevant for you.
Accessing arrays is pretty fast on x86 due to its good support for addressing modes and prefetchers, but bitsets are more the domain of ARM where many operations can include a free bitshift.
From the link it seems the recomputing of the mask bit (_S_maskbit) is just a left shift followed by a modulo operation, which (if _GLIBCXX_BITSET_BITS_PER_WORD is a power of 2) can be optimized by the compiler into a logical AND. So the cost of recomputing the bit is really low, probably lower than accessing a lookup table.
Given that the function is inline and relatively short, gperftools may not be accurate in profiling it. Or perhaps the __pos % _GLIBCXX_BITSET_BITS_PER_WORD could not be optimized to __pos & (_GLIBCXX_BITSET_BITS_PER_WORD - 1), in which case operator% would probably be the most expensive operation here.
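The equivalence this relies on can be sketched as follows (64 stands in for _GLIBCXX_BITSET_BITS_PER_WORD on a typical 64-bit target; the function names are made up):

```cpp
#include <cstddef>

constexpr std::size_t kBitsPerWord = 64;  // must be a power of two

// pos / 64 and pos % 64, written the way an optimizing compiler can
// emit them: a shift and a logical AND, with no division at all.
constexpr std::size_t word_index(std::size_t pos) { return pos >> 6; }
constexpr std::size_t bit_index(std::size_t pos)  { return pos & (kBitsPerWord - 1); }
```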
I'm currently trying to implement various algorithms in a Just In Time (JIT) compiler. Many of the algorithms operate on bitmaps, more commonly known as bitsets.
In C++ there are various ways of implementing a bitset. As a true C++ developer, I would prefer to use something from the STL. The most important aspect is performance. I don't necessarily need a dynamically resizable bitset.
As I see it, there are three possible options.
I. One option would be to use std::vector<bool>, which has been optimized for space. This would also indicate that the data doesn't have to be contiguous in memory. I guess this could decrease performance. On the other hand, having one bit for each bool value could improve speed since it's very cache friendly.
II. Another option would be to instead use a std::vector<char>. It guarantees that the data is contiguous in memory and it's easier to access individual elements. However, it feels strange to use this option since it's not intended to be a bitset.
III. The third option would be to use the actual std::bitset. The fact that it's not dynamically resizable doesn't matter.
Which one should I choose for maximum performance?
Best way is to just benchmark it, because every situation is different.
I wouldn't use std::vector<bool>. I tried it once and the performance was horrible. I could improve the performance of my application by simply using std::vector<char> instead.
I didn't really compare std::bitset with std::vector<char>, but if space is not a problem in your case, I would go for std::vector<char>. It uses 8 times more space than a bitset, but since it doesn't have to do bit-operations to get or set the data, it should be faster.
Of course if you need to store lots of data in the bitset/vector, then it could be beneficial to use bitset, because that would fit in the cache of the processor.
The easiest way is to use a typedef and hide the implementation. Both bitset and vector support the [] operator, so it should be easy to swap one implementation for the other.
I answered a similar question recently in this forum. I recommend my BITSCAN library. I have just released version 1.0. BITSCAN is specifically designed for fast bit scanning operations.
A BitBoard class wraps a number of different implementations for typical operations such as bsf, bsr or popcount for 64-bit words (aka bitboards). Classes BitBoardN, BBIntrin and BBSentinel extend bit scanning to bit strings. A bit string in BITSCAN is an array of bitboards. The base wrapper class for a bit string is BitBoardN. BBIntrin extends BitBoardN by using Windows compiler intrinsics over 64 bitboards. BBIntrin is made portable to POSIX by using the appropriate asm equivalent functions.
I have used BITSCAN to implement a number of efficient solvers for NP combinatorial problems in the graph domain. Typically the adjacency matrix of the graph as well as vertex sets are encoded as bit strings and typical computations are performed using bit masks. Code for simple bitencoded graph objects is available in GRAPH. Examples of how to use BITSCAN and GRAPH are also available.
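The general idea of a bit-encoded adjacency matrix can be sketched with plain std::bitset (this is not BITSCAN's own API, just an illustration of the technique):

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t N = 64;   // number of vertices
using VertexSet = std::bitset<N>;

VertexSet adj[N];               // adj[v]: bit-encoded neighbor set of v

void add_edge(std::size_t u, std::size_t v) {
    adj[u][v] = true;
    adj[v][u] = true;
}

// Common neighbors of u and v in one word-parallel AND plus a popcount,
// instead of a loop over individual vertices.
std::size_t common_neighbors(std::size_t u, std::size_t v) {
    return (adj[u] & adj[v]).count();
}
```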
A comparison between BITSCAN and typical implementations in STL (bitset) and BOOST (dynamic_bitset) can be found here:
http://blog.biicode.com/bitscan-efficiency-at-glance/
You might also be interested in this (somewhat dated) paper:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
[Update] The previous link seems to be broken, but I think it was pointing to this article:
https://www.researchgate.net/publication/220803585_Performance_of_C_bit-vector_implementations
I have to do calculation on array of 1,2,3...9 dimensional vectors, and the number of those vectors varies significantly (say from 100 to up to couple of millions). Of course, it would be great if the data container can be easily decomposed to enable parallel algorithms.
I came across Blitz++ (almost impossible for me to compile), but are there any other fast libs that manipulate arrays of vector data? Is boost::fusion worth a look? Furthermore, vtk's vtkDoubleArray seems nice, but vtk is a lib used only for visualization. I must admit that having an array of tuples is a tempting idea, but I didn't see any benchmarks regarding boost::fusion and/or vtkDoubleArray, and I suspect they were not built with speed in mind. Any thoughts?
best regards,
mightydodol
Eigen supports auto-vectorization on certain compilers (GCC 4, VC++ 2008).
For linear algebra, you probably want to evaluate Boost uBLAS, which is a subset of the full BLAS package. As you mention, Boost Fusion may also be appropriate, depending on the algorithms you are implementing.
I believe you can use the non-GUI parts of VTK such as vtkDoubleArray without linking in the visualisation libraries if you don't need them. Note that VTK is designed for efficiency of rendering, not of calculations. If you don't want to render the results, you might as well use one of the scientific packages that provide optimized algorithms.
There is a Parallel flavour of BLAS called (strangely enough) PBLAS. I don't think this is available through the Boost wrapping, so you would use the C interface directly.
Without knowing what you want to do with your arrays, it's hard to give really firm advice. If high-performance manipulation of the arrays is needed, then Blitz++ is probably your best bet. If you are having trouble compiling it, then perhaps you need to change your compiler or system. They do support g++, so a recent version on just about anything should get you going.
I haven't used Boost::fusion, but a quick read of the manual suggests that its major goal is just to make heterogeneous containers. I don't think that's what you want.
I have tried to use the GSL but find it hopelessly awkward for anything I have wanted to do.
I'm no expert, but you might want to consider using a MATLAB API.
There is the GNU Scientific Library for operations on vectors and matrices.
I would try Blitz++; it will give you really good performance. Armadillo is also quite efficient.