Should I use a bitset or a vector in C++?

I need to be able to store an array of binary numbers in C++ which will be passed through different methods and eventually written to a file and the terminal.
What are the key differences between vectors and bit sets and which would be easiest and/or more efficient to use?
(I do not know how many bits I need to store)

The size of a std::bitset must be known at compile time, so your choice is obvious: use std::vector<bool>.
Unlike std::vector<char>, it is implemented so that a single element takes a bit rather than a full char, so it should be a good solution in terms of memory use.
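As a rough illustration of that advice, here is a minimal sketch that fills a std::vector<bool> whose length is only known at run time and writes it to any output stream. The helper names parse_bits, write_bits and bits_to_string are hypothetical and not part of the standard library.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse a string of '0'/'1' characters into bits.
// The length is only known at run time, which std::bitset cannot handle.
std::vector<bool> parse_bits(const std::string& text) {
    std::vector<bool> bits;
    bits.reserve(text.size());
    for (char c : text) {
        assert(c == '0' || c == '1');  // assumption: input is already validated
        bits.push_back(c == '1');
    }
    return bits;
}

// Hypothetical helper: write the bits to any stream (file or terminal).
void write_bits(std::ostream& out, const std::vector<bool>& bits) {
    for (bool b : bits) {
        out << (b ? '1' : '0');
    }
}

// Convenience wrapper that collects the output in a string.
std::string bits_to_string(const std::vector<bool>& bits) {
    std::ostringstream out;
    write_bits(out, bits);
    return out.str();
}
```

Passing std::cout or a std::ofstream to write_bits covers the terminal and file cases from the question.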

It all depends on what you want to do with the binary data.
You could also use boost::dynamic_bitset, which is like std::bitset but without a fixed number of bits.
The main drawback is the dependency on Boost if you don't already use it.
You could also store your input in a std::vector<char> and use a bitset per char to convert to binary notation.
As others have already said, std::bitset uses a fixed number of bits.
std::vector<bool> is not always advised because it has its quirks: it is not a real container (see GotW).
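The "bitset per char" idea could look roughly like the sketch below. The function names bytes_to_binary and binary_to_byte are made up for illustration, and the digit strings in the usage note assume 8-bit chars and an ASCII-compatible character set.

```cpp
#include <bitset>
#include <cassert>
#include <climits>
#include <string>
#include <vector>

// Hypothetical helper: render each stored byte as binary digits by
// wrapping it in a temporary std::bitset sized to one char.
std::string bytes_to_binary(const std::vector<char>& bytes) {
    std::string out;
    for (char c : bytes) {
        // Cast through unsigned char so negative char values do not sign-extend.
        std::bitset<CHAR_BIT> bits(static_cast<unsigned char>(c));
        out += bits.to_string();
    }
    return out;
}

// Reverse direction: exactly CHAR_BIT binary digits back into one char.
// std::bitset throws std::invalid_argument on characters other than '0'/'1'.
char binary_to_byte(const std::string& digits) {
    assert(digits.size() == CHAR_BIT);
    return static_cast<char>(std::bitset<CHAR_BIT>(digits).to_ulong());
}
```

With those assumptions, the byte 'A' (value 65) is rendered as 01000001.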

If you don't know at compile time how many bits you need to store, you cannot use bitset, as its size is fixed. Because of this, you should use vector<bool>, as it can be resized dynamically. If you want an array of bit sequences stored like this, you can then use vector< vector<bool> >.

If you don't have any specific upper bound for the size of the numbers you need to store, then you need to store your data in two different dimensions:
The first dimension will be the number, whose size varies.
The second dimension will be your array of numbers.
For the latter, using std::vector is fine if you require your values to be contiguous in memory. For the numbers themselves you don't need any data structure at all: just allocate memory using new and an unsigned primitive type such as unsigned char, uint8_t or any other if you have alignment constraints.
If, on the other hand, you know that your numbers won't be larger than let's say 64 bits, then use a data type that you know will hold this amount of data such as uint64_t.
PS: remember that what you store are numbers. The computer will store them in binary whether you use them as such or with any other representation.

I think that in your case you should use std::vector with the value type of std::bitset. Using such an approach you can consider your "binary numbers" either like strings or like objects of some integer type and at the same time it is easy to do binary operations like setting or resetting bits.
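A minimal sketch of that suggestion follows. It assumes an upper bound of 64 bits per number (the width must be a compile-time constant for std::bitset); the names kWidth, BinaryNumber and count_with_bit are illustrative only.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: 64 bits is enough for every number. std::bitset needs the
// width at compile time; only the count of numbers can vary at run time.
constexpr std::size_t kWidth = 64;
using BinaryNumber = std::bitset<kWidth>;

// Hypothetical helper: count how many stored numbers have bit `pos` set.
std::size_t count_with_bit(const std::vector<BinaryNumber>& numbers, std::size_t pos) {
    assert(pos < kWidth);
    std::size_t n = 0;
    for (const BinaryNumber& number : numbers) {
        if (number.test(pos)) ++n;
    }
    return n;
}
```

Each element can be built from an integer or from a string of digits, modified with set() and reset(), and converted back with to_ulong() or to_string().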

Related

C++ bitmap to hold binary flags

the problem is this: I need to create a bitmap (a series of binary flags) to hold true/false information about a bunch of objects; the number of object is not known a priori so I have to allocate enough flags at runtime, possibly during the bitmap creation.
Given num_elements elements, my first idea was to allocate an array of ((num_elements/8)+1)*sizeof(char) bytes: since a char is 8 bits long it can handle 8 binary flags, so I get the minimum number of chars to hold num_elements flags, with a maximum memory waste of 7 bits.
The real problem is checking/setting the flags: I thought of doing some bit shifting on the whole array followed by a bitwise AND to get flag n, like
flag_n = (flag_array>>n)&1
but if I understood correctly, the shifting operation won't affect the whole array, just the first element.
How can I achieve this?
std::vector<bool> is specialised to achieve exactly this.
It is actually a problem in many cases, since access to an element returns a proxy object rather than a bool&, so it doesn't quite behave like all the other containers, but it seems like in your case it suits what you need.
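For comparison, the hand-rolled approach from the question can be made to work by indexing first and shifting second, instead of shifting the whole array. The Bitmap class below is a hypothetical sketch, not how any particular std::vector<bool> is implemented; it rounds the byte count up rather than always adding one.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical hand-rolled bitmap: flag n lives in byte n / CHAR_BIT,
// at bit position n % CHAR_BIT inside that byte.
class Bitmap {
public:
    explicit Bitmap(std::size_t num_flags)
        : size_(num_flags),
          bytes_((num_flags + CHAR_BIT - 1) / CHAR_BIT, 0) {}  // round up

    bool get(std::size_t n) const {
        assert(n < size_);
        return (bytes_[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1u;
    }

    void set(std::size_t n, bool value) {
        assert(n < size_);
        const unsigned char mask = static_cast<unsigned char>(1u << (n % CHAR_BIT));
        if (value) bytes_[n / CHAR_BIT] |= mask;
        else       bytes_[n / CHAR_BIT] &= static_cast<unsigned char>(~mask);
    }

    std::size_t byte_count() const { return bytes_.size(); }

private:
    std::size_t size_;
    std::vector<unsigned char> bytes_;
};
```

With 8-bit chars, Bitmap(20) occupies 3 bytes of flag storage.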

Why use array of char instead of int for bitsets in C++

For a project I'm working on I need to create my own implementation for a bitset. I've taken a look at the STL library to see how they handle this and looked at a few other things online. It seems like it's pretty standard to use a char array. Is there a reason why everyone uses char arrays instead of the integer type?
Simply because a char in C++ is a single byte (sizeof(char) is 1 by definition, and the standard guarantees it is no larger than short or int), whereas an int is usually larger than a byte (typically 32 bits, or 4 bytes, on most machines these days). Since a single byte is the smallest addressable unit of data a computer can process, it's natural to use arrays of chars when working with individual bits. If you used int, for example, you would waste more space whenever the number of bits is not a multiple of the number of bits in an int, whereas with a byte array you waste the least amount of space possible.
Char is (usually) the smallest unit of bits that a microprocessor can manipulate. If you're creating an object that works with arbitrary numbers of bits, it makes sense to use an array of the smallest unit. That way you always use the fewest units possible.
If you need a non-arbitrary-sized bitset and the processor has a native type large enough to contain it, use an N-bit type. It will be more efficient than an array.
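The space argument can be checked with a little arithmetic. The helpers words_needed, wasted_bits and check_fifty_flags are hypothetical names for this sketch, and the concrete numbers assume 8-bit bytes.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Number of storage units of type Word needed to hold `bits` flags.
template <typename Word>
constexpr std::size_t words_needed(std::size_t bits) {
    return (bits + sizeof(Word) * CHAR_BIT - 1) / (sizeof(Word) * CHAR_BIT);
}

// Unused bits at the end of the last storage unit.
template <typename Word>
constexpr std::size_t wasted_bits(std::size_t bits) {
    return words_needed<Word>(bits) * sizeof(Word) * CHAR_BIT - bits;
}

// Worked example for 50 flags, assuming 8-bit bytes.
void check_fifty_flags() {
    assert(wasted_bits<unsigned char>(50) == 6);   // 7 bytes  = 56 bits
    assert(wasted_bits<std::uint32_t>(50) == 14);  // 2 words  = 64 bits
}
```

The difference is at most one word minus one bit, so it matters mainly when many small bitsets are stored.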

C++ 2.5 bytes (20-bit) integer

I know it's ridiculous, but I need it for storage optimization. Is there any good way to implement it in C++?
It has to be flexible enough so that I can use it as a normal data type e.g Vector< int20 >, operator overloading, etc..
If storage is your main concern, I suspect you need quite a few 20-bit variables. How about storing them in pairs? You could create a class representing two such variables and store them in 2.5+2.5 = 5 bytes.
To access the variables conveniently you could overload operator[] so you could write:
int fst = pair[0];
int snd = pair[1];
Since you may want to allow for manipulations such as
pair[1] += 5;
you would not want to return a copy of the backing bytes, but a reference. However, you can't return a direct reference to the backing bytes (since it would mess up its neighboring value), so you'd actually need to return a proxy for the backing bytes (which in turn has a reference to the backing bytes) and let the proxy overload the relevant operators.
As a matter of fact, as @Tony suggests, you could generalize this to have a general container holding N such 20-bit variables.
(I've done this myself in a specialization of a vector for efficient storage of booleans (as single bits).)
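One possible shape for the pair-plus-proxy idea is sketched below. Everything here (the class name Int20Pair, the nested Ref proxy, the choice of unsigned values and the bit layout) is an assumption made for illustration, not the answerer's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical sketch: two 20-bit unsigned values packed into 5 bytes.
// Layout (an arbitrary choice): value 0 in bits 0..19, value 1 in bits 20..39.
class Int20Pair {
public:
    // Proxy returned by operator[]; it reads and writes through to the pair.
    class Ref {
    public:
        Ref(Int20Pair& owner, int slot) : owner_(owner), slot_(slot) {}
        operator std::uint32_t() const { return owner_.get(slot_); }
        Ref& operator=(std::uint32_t v) { owner_.set(slot_, v); return *this; }
        Ref& operator+=(std::uint32_t v) { owner_.set(slot_, owner_.get(slot_) + v); return *this; }
    private:
        Int20Pair& owner_;
        int slot_;
    };

    Ref operator[](int slot) { return Ref(*this, slot); }

    std::uint32_t get(int slot) const {
        assert(slot == 0 || slot == 1);
        return static_cast<std::uint32_t>((load() >> (20 * slot)) & kMask);
    }

    void set(int slot, std::uint32_t value) {
        assert(slot == 0 || slot == 1);
        std::uint64_t all = load();
        all &= ~(kMask << (20 * slot));                                     // clear the old value
        all |= (static_cast<std::uint64_t>(value) & kMask) << (20 * slot);  // wraps modulo 2^20
        store(all);
    }

private:
    static constexpr std::uint64_t kMask = 0xFFFFF;  // low 20 bits

    std::uint64_t load() const {
        std::uint64_t all = 0;
        for (int i = 0; i < 5; ++i) all |= static_cast<std::uint64_t>(bytes_[i]) << (8 * i);
        return all;
    }
    void store(std::uint64_t all) {
        for (int i = 0; i < 5; ++i) bytes_[i] = static_cast<std::uint8_t>(all >> (8 * i));
    }

    std::array<std::uint8_t, 5> bytes_{};
};
```

Writing to one slot leaves the neighbouring value untouched because set() masks out only its own 20 bits before storing the bytes back.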
No... you can't do that as a single value-semantic type... any class data must be a multiple of the 8-bit character size (inviting all the usual quips about CHAR_BIT etc).
That said, let's clutch at straws...
Unfortunately, you're obviously handling very many data items. If this is more than 64k, any proxy object into a custom container of packed values will probably need a >16 bit index/handle too, but still one of the few possibilities I can see worth further consideration. It might be suitable if you're only actively working with and needing value semantic behaviour for a small subset of the values at one point in time.
struct Proxy
{
    Int20_Container& container_; // might not need if a singleton
    Int20_Container::size_type index_;
    ...
};
So, the proxy might be 32, 64 or more bits - the potential benefit is only if you can create them on the fly from indices into the container, have them write directly back into the container, and keep them short-lived with few concurrently. (One simple way - not necessarily the fastest - to implement this model is to use an STL bitset or vector as the Int20_Container, and either store 20 times the logical index in index_, or multiply on the fly.)
It's also vaguely possible that although your values range over a 20-bit space, you've less than say 64k distinct values in actual use. If you have some such insight into your data set, you can create a lookup table where 16-bit array indices map to 20-bit values.
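The lookup-table idea could be sketched as follows. The class name DictionaryEncoded and its members are hypothetical, and the sketch assumes there really are fewer than 65536 distinct values (it asserts otherwise).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary encoding: each distinct 20-bit value is stored
// once in table_, and the big array keeps only 16-bit indices into it.
class DictionaryEncoded {
public:
    void push_back(std::uint32_t value) {
        assert(value < (UINT32_C(1) << 20));
        auto it = index_of_.find(value);
        if (it == index_of_.end()) {
            assert(table_.size() < 65536);  // assumption: fewer than 64k distinct values
            it = index_of_.emplace(value, static_cast<std::uint16_t>(table_.size())).first;
            table_.push_back(value);
        }
        codes_.push_back(it->second);
    }

    std::uint32_t operator[](std::size_t i) const { return table_[codes_[i]]; }
    std::size_t size() const { return codes_.size(); }
    std::size_t distinct() const { return table_.size(); }

private:
    std::vector<std::uint32_t> table_;                           // 16-bit index -> value
    std::vector<std::uint16_t> codes_;                           // one index per element
    std::unordered_map<std::uint32_t, std::uint16_t> index_of_;  // value -> index, used while building
};
```

Each element then costs 16 bits plus a shared table, which only pays off when the number of elements is much larger than the number of distinct values.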
Use a class. As long as you respect the copy/assign/clone/etc... STL semantics, you won't have any problem.
But it will not optimize the memory space on your computer. Especially if you put it in a flat array, the 20 bits will likely be aligned on a 32-bit boundary, so the benefit of a 20-bit type there is useless.
In that case, you will need to define your own optimized array type, that could be compatible with the STL. But don't expect it to be fast. It won't be.
Use a bitfield. (I'm really surprised nobody has suggested this.)
struct int20_and_something_else {
    unsigned less_than_a_million : 20;      // unsigned: a signed 20-bit field tops out at 524287
    unsigned less_than_four_thousand : 12;  // total 32 bits
};
This only works as a mutual optimization of elements in a structure, where you can spackle the gaps with some other data. But it works very well!
If you truly need to optimize a gigantic array of 20-bit numbers and nothing else, there is:
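#include <cstdint>
struct int20_x3 {
    // A 64-bit base type lets all three fields share one 8-byte unit on common
    // compilers; with plain 32-bit int they would typically occupy 12 bytes.
    std::uint64_t one : 20;
    std::uint64_t two : 20;
    std::uint64_t three : 20; // 60 bits is almost 64
    void set( int index, int value );
    int get( int index );
};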
You can add getter/setter functions to make it prettier if you like, but you can't take the address of a bitfield, and they can't participate in an array. (Of course, you can have an array of the struct.)
Use as:
int20_x3 *big_array = new int20_x3[ array_size / 3 + 1 ];
big_array[ index / 3 ].set( index % 3, value );
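The set and get members are only declared in the answer. A self-contained variant with the accessors filled in might look like this; the name Int20x3, the unsigned 64-bit base type and the unsigned value type are assumptions of this sketch, and bit-field layout remains implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of the three-field struct with accessors filled in.
// Assumption: the compiler packs three 20-bit fields of a 64-bit base type
// into one 8-byte unit, which common compilers do.
struct Int20x3 {
    std::uint64_t one : 20;
    std::uint64_t two : 20;
    std::uint64_t three : 20;

    void set(int index, std::uint32_t value) {
        assert(index >= 0 && index < 3);
        switch (index) {
            case 0: one = value; break;     // values above 20 bits are truncated
            case 1: two = value; break;
            default: three = value; break;
        }
    }

    std::uint32_t get(int index) const {
        assert(index >= 0 && index < 3);
        switch (index) {
            case 0: return static_cast<std::uint32_t>(one);
            case 1: return static_cast<std::uint32_t>(two);
            default: return static_cast<std::uint32_t>(three);
        }
    }
};
```

Element i of the logical array is then reached through slot i % 3 of struct i / 3, as in the usage lines above.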
You can use C++ std::bitset. Store everything in a bitset and access your data using the correct index (x20).
You're not going to be able to get exactly 20 bits as a type (even with a bit-packed struct), as it will always be aligned (at the smallest granularity) to a byte. IMO the only way to go, if you must have 20 bits, is to create a bitstream to handle the data (which you can overload to accept indexing etc).
You can use the union keyword to create a bit field. I've used it way back when bit fields were a necessity. Otherwise, you can create a class that holds 3 bytes, but through bitwise operations exposes just the most significant 20.
As far as I know that isn't possible.
The easiest option would be to define a custom type that uses an int32_t as the backing storage and implements the appropriate maths as overloaded operators.
For better storage density, you could store 3 int20 in a single int64_t value.
Just an idea: use optimized storage (5 bytes for two instances), and for operations, convert it into 32-bit int and then back.
It's possible to do this in a number of ways.
One possibility would be to use bit twiddling to store them as the left and right parts of a 5-byte array, with a class to store/retrieve which converts your desired array entry to an entry in a byte5[] array and extracts the left or right half as appropriate.
However, most hardware requires integers to be word aligned, so as well as the bit twiddling to extract the integer you would need some bit shifting to align it properly.
I think it would be more efficient to increase your swap space and let virtual memory take care of your large array (after all 20 vs 32 is not much of a saving!) always assuming you have a 64 bit OS.

How can I manage bits/binary in c++?

What I need to do is open a text file with 0s and 1s to find patterns between the columns in the file.
So my first thought was to parse each column into a big array of bools, and then do the logic between the columns (now in arrays). Then I found that the size of a bool is actually a byte, not a bit, so I would be wasting 7/8 of the memory by assigning each value to a bool.
Is it even relevant in a grid of 800x800 values? What would be the best way to handle this?
I would appreciate a code snippet in case it's a complicated answer.
You could use std::bitset or Boost's dynamic_bitset, which provide various methods that will help you manage your bits.
For example, they support constructors that create bitsets from other built-in types like int or char. You can also export the bitset to an unsigned long or to a string (which could then be turned back into a bitset, etc.).
I once asked about concatenating those, which turned out not to be possible efficiently. But perhaps you could use the info in that question too.
You can use std::vector<bool>, which is a specialization of vector that uses compact storage for booleans: 1 bit each, not 8.
I think it was Knuth who said "premature optimization is the root of all evil." Let's find out a little bit more about the problem. Your array is 800**2 == 640,000 bytes, which is no big deal on anything more powerful than a digital watch.
Storing it as bytes may seem wasteful (as you say, 7/8ths of the memory is redundant), but on the other hand most machines don't do bit operations as efficiently as byte operations; by saving the memory, you might waste so much effort masking and testing that you would have been better off with the bytes model.
On the other hand, if what you want to do with it is look for larger patterns, you might want to use a bitwise representation because you can do things with 8 bits at a time.
The real point here is that there are several possibilities, but no one can tell you the "right" representation without knowing what the problem is.
For that size grid your array of bools would be about 640KB. Whether that is a problem depends on how much memory you have. It would probably be the simplest for the logic analysis code.
By grouping the bits and storing in an array of int you could drop the memory requirement to 80KB, but the logic code would be more complicated as you'd be always isolating the bits you wanted to check.
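Since the question asked for a snippet, here is one hedged way to do the packed variant without writing the masking by hand: keep one std::bitset per column and let its operators compare whole columns. The names kRows, Column, columns_from_rows, both_set and differing are invented for this sketch, and what counts as a "pattern" is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kRows = 800;  // grid height from the question
using Column = std::bitset<kRows>;

// Hypothetical loader: turn text rows of '0'/'1' into one bitset per column.
std::vector<Column> columns_from_rows(const std::vector<std::string>& rows) {
    assert(rows.size() <= kRows);
    const std::size_t width = rows.empty() ? 0 : rows[0].size();
    std::vector<Column> columns(width);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        assert(rows[r].size() == width);  // assumption: rectangular input
        for (std::size_t c = 0; c < width; ++c) {
            if (rows[r][c] == '1') columns[c].set(r);
        }
    }
    return columns;
}

// Number of rows where both columns hold a 1.
std::size_t both_set(const Column& a, const Column& b) { return (a & b).count(); }

// Number of rows where the two columns differ.
std::size_t differing(const Column& a, const Column& b) { return (a ^ b).count(); }
```

The & and ^ operators work on whole machine words internally, which is the "several bits at a time" benefit mentioned above.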

When to use STL bitsets instead of separate variables?

In what situation would it be more appropriate for me to use a bitset (STL container) to manage a set of flags rather than having them declared as a number of separate (bool) variables?
Will I get a significant performance gain if I used a bitset for 50 flags rather than using 50 separate bool variables?
Well, 50 bools as a bitset need at least 7 bytes (implementations typically round std::bitset<50> up to 8), while 50 bools as bools will take 50 bytes. These days that's not really a big deal, so using bools is probably fine.
However, one place a bitset might be useful is if you need to pass those bools around a lot, especially if you need to return the set from a function. Using a bitset you have less data that has to be moved around on the stack for returns. Then again, you could just use refs instead and have even less data to pass around. :)
std::bitset will give you extra points when you need to serialize / deserialize it. You can just write it to a stream or read it from a stream. But certainly, the separate bools are going to be faster. They are optimized for this kind of use after all, while a bitset is optimized for space and still has function calls involved. It will never be faster than separate bools.
Bitset:
- Very space efficient
- Less efficient due to bit fiddling
- Provides serialize / de-serialize with op<< and op>>
- All bits packed together: you will have the flags in one place.
Separate bools:
- Very fast
- Bools are not packed together. They will be members somewhere.
Decide on the facts. Personally, I would use std::bitset for code that is not performance-critical, and would use bools if I either have only a few of them (so they are easy to keep track of) or need the extra performance.
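The stream support mentioned in this answer can be tried with a string stream standing in for a file. The alias Flags and the helpers save_flags and load_flags are hypothetical names for this sketch.

```cpp
#include <bitset>
#include <cassert>
#include <sstream>
#include <string>

using Flags = std::bitset<50>;  // 50 flags, as in the question

// Write the flags as 50 '0'/'1' characters, most significant bit first.
std::string save_flags(const Flags& flags) {
    std::ostringstream out;
    out << flags;
    return out.str();
}

// Read them back; operator>> consumes up to 50 '0'/'1' characters.
Flags load_flags(const std::string& text) {
    std::istringstream in(text);
    Flags flags;
    in >> flags;
    assert(!in.fail());  // assumption: the text came from save_flags
    return flags;
}
```

The same two operators work unchanged on std::ofstream and std::ifstream.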
It depends what you mean by 'performance gain'. If you only need 50 of them, and you're not low on memory, then separate bools are pretty much always a better choice than a bitset. They will take more memory, but the bools will be much faster. A bitset is usually implemented as an array of ints (the bools are packed into those ints). So the first 32 bools (bits) in your bitset will only take up a single 32-bit int, but to read each value you have to do some bitwise operations first to mask out all the values you don't want. E.g. to read the 2nd bit of a bitset, you need to:
Find the int that contains the bit you want (in this case, it's the first int)
Bitwise And that int with '2' (i.e. value & 0x02) to find out if that bit is set
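The two steps above can be sketched as a small function. This assumes flags packed 32 per word in a std::vector, which is one plausible layout and not necessarily what a given std::bitset implementation uses; read_bit is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads bit n from flags packed 32 per word.
// Bits are counted from zero here, so "the 2nd bit" is n == 1 (mask 0x02).
bool read_bit(const std::vector<std::uint32_t>& words, std::size_t n) {
    assert(n / 32 < words.size());
    const std::uint32_t word = words[n / 32];            // step 1: find the int
    const std::uint32_t mask = UINT32_C(1) << (n % 32);  // step 2: build the mask
    return (word & mask) != 0;                            // and test the bit
}
```

A plain bool needs neither the division nor the mask, which is the extra work this answer refers to.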
However, if memory is a bottleneck and you have a lot of bools, using a bitset could make sense (e.g. if your target platform is a mobile phone, or it's some state in a very busy web service).
NOTE: A std::vector of bool usually has a specialisation to use the equivalent of a bitset, thus making it much smaller and also slower for the same reasons. So if speed is an issue, you'll be better off using a vector of char (or even int), or even just use an old school bool array.
RE @Wilka:
Actually, bit fields are supported by C/C++ in a way that doesn't require you to do your own masking. I don't remember the exact syntax, but it's something like this:
struct MyBitset {
    bool firstOption:1;
    bool secondOption:1;
    bool thirdOption:1;
    int fourBitNumber:4;
};
You can reference any value in that struct by just using dot notation, and the right things will happen:
MyBitset bits = {}; // zero-initialize so unset flags read as false
bits.firstOption = true;
bits.fourBitNumber = 2;
if(bits.thirdOption) {
    // Whatever!
}
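You can use arbitrary bit sizes for things. The resulting struct can be larger than the data you define: at minimum it is rounded up to a whole number of bytes (up to 7 bits of padding), and because bit-field layout is implementation-defined, compilers commonly round it up further to the alignment of the declared types (often 4 bytes here because of the int field).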