I'm writing a program that is supposed to manipulate very long strings of boolean values. I was originally storing them as a dynamic array of unsigned long long int variables and running C-style bitwise operations on them.
However, I don't want the overhead of iterating over the array myself, even though the processor has to do it at the machine-code level anyway - i.e., I believe the compiler can probably generate more efficient iteration code than I can write by hand.
So, I'm wondering if there's a way to store them as a bitfield. The only problem is that I've heard the size needs to be a compile-time constant, and that doesn't work for me since I don't know how many bits I need when the program starts. Is there a way to do this?
As per the comments, std::bitset or std::vector<bool> are probably what you need. bitset is fixed-length, vector<bool> is dynamic.
vector<bool> is a specialization of vector that uses only one bit per value, rather than a full sizeof(bool) bytes per value as you might expect... While good for memory use, this exception is actually disliked by the standards body these days, because (among other things) vector<bool> doesn't fulfil the same contract that vector<T> does - it returns proxy objects instead of references, which wreaks havoc in generic code.
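For illustration, a quick sketch of both options (the sizes here are made up):

#include <bitset>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // Fixed length, known at compile time:
    std::bitset<128> fixed;
    fixed.set(3);                  // set bit 3
    fixed[70] = true;              // operator[] works through a proxy

    // Length chosen at run time, and it can grow:
    std::size_t n = 1000;          // made-up size
    std::vector<bool> dynamic(n);  // typically one bit per element
    dynamic[42] = true;
    dynamic.resize(2 * n);         // not possible with bitset

    std::cout << fixed.count() << ' ' << dynamic[42] << '\n';
}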
Related
In a couple of projects of mine I have had an increasing need to deal with contiguous sequences of bits in memory - efficiently (*). So far I've written a bunch of inline-able standalone functions, templated on the choice of a "bit container" type (e.g. uint32_t), for getting and setting bits, applying 'or' and 'and' to their values, locating the container, converting lengths in bits to sizes in bytes or lengths in containers, etc. ... it looks like it's class-writing time.
I know the C++ standard library has a specialization of std::vector<bool>, which is considered by many to be a design flaw - as its iterators do not expose actual bools, but rather proxy objects. Whether that's a good idea or a bad one for a specialization, it's definitely something I'm considering - an explicit bit proxy class, which will hopefully "always" be optimized away (with a nice greasing-up with constexpr, noexcept and inline). So, I was thinking of possibly adapting std::vector code from one of the standard library implementations.
On the other hand, my intended class:
Will never own the data / the bits - it'll receive a starting bit container address (assuming alignment) and a length in bits, and won't allocate or free.
It will not be able to resize the data, dynamically or otherwise - not even while retaining the same amount of space like std::vector::resize(); its length will be fixed during its lifespan/scope.
It shouldn't know anything about the heap (and should work when there is no heap)
In this sense, it's more like a span class for bits. So maybe start out with a span then? I don't know, spans are still not standard; and there are no proxies in spans...
So what would be a good basis (edit: NOT a base class) for my implementation? std::vector<bool>? std::span? Both? None? Or - maybe I'm reinventing the wheel and this is already a solved problem?
Notes:
The bit sequence length is known at run time, not compile time; otherwise, as @SomeProgrammerDude suggests, I could use std::bitset.
My class doesn't need to "be-a" span or "be-a" vector, so I'm not thinking of specializing any of them.
(*) - So far not SIMD-efficiently but that may come later. Also, this may be used in CUDA code where we don't SIMDize but pretend the lanes are proper threads.
Rather than std::vector or std::span I suspect an implementation of your class would share more in common with std::bitset, since it is pretty much the same thing, except with a (fixed) runtime-determined size.
In fact, you could probably take a typical std::bitset implementation and move the <size_t N> template parameter into the class as a size_t size_ member (or whatever name you like), and you'll have your dynamic bitset class with almost no changes. You may want to get rid of anything you consider cruft, like the constructors that take std::string and friends.
The last step is then to remove ownership of the underlying data: basically you'll remove the creation of the underlying array in the constructor and maintain a view of an existing array with some pointers.
If your clients disagree on what the underlying unsigned integer type to use for storage (what you call the "bit container"), then you may also need to make your class a template on this type, although it would be simpler if everyone agreed on say uint64_t.
As far as std::vector<bool> goes, you don't need much from it: everything that vector does that you want, std::bitset probably does too. The main thing vector adds is dynamic growth - but you've said you don't want that. vector<bool> has the proxy-object concept to represent a single bit, but so does std::bitset.
From std::span you take the idea of non-ownership of the underlying data, but I don't think this actually represents a lot of underlying code. You might want to consider the std::span approach of having either a compile-time known size or a runtime provided size (indicated by Extent == std::dynamic_extent) if that would be useful for you (mostly if you sometimes use compile-time sizes and could specialize some methods to be more efficient in that case).
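To make the idea concrete, here is a minimal sketch of such a non-owning view - the name bit_span, the uint64_t container choice, and everything else here are my own illustration, not code from any library:

#include <climits>
#include <cstddef>
#include <cstdint>

// Non-owning view over a caller-provided array of bit containers.
// It never allocates, frees, or resizes; the length is fixed at construction.
class bit_span
{
public:
    using container = std::uint64_t;
    static constexpr std::size_t bits_per_container = sizeof(container) * CHAR_BIT;

    constexpr bit_span(container* data, std::size_t num_bits) noexcept
        : data_(data), size_(num_bits) {}

    constexpr std::size_t size() const noexcept { return size_; }

    // Proxy standing in for a reference-to-bit, in the spirit of
    // std::bitset::reference and std::vector<bool>::reference.
    class reference
    {
    public:
        constexpr reference(container& word, container mask) noexcept
            : word_(word), mask_(mask) {}
        constexpr operator bool() const noexcept { return (word_ & mask_) != 0; }
        constexpr reference& operator=(bool value) noexcept
        {
            if (value) { word_ |= mask_; } else { word_ &= ~mask_; }
            return *this;
        }
    private:
        container& word_;
        container  mask_;
    };

    constexpr reference operator[](std::size_t i) noexcept
    {
        return reference(data_[i / bits_per_container],
                         container(1) << (i % bits_per_container));
    }
    constexpr bool operator[](std::size_t i) const noexcept
    {
        return (data_[i / bits_per_container] >> (i % bits_per_container)) & 1u;
    }

private:
    container*  data_;  // not owned
    std::size_t size_;  // fixed for the lifetime of the view
};

Usage would look like bit_span bits(buffer, 1000); bits[42] = true; and the proxy should optimize away entirely in release builds.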
As a beginner, I'm really confused about size_t. I can use int, float, or other types - why declare a size_t type at all? I don't see its advantages.
I've viewed some pages, but I still can't understand it.
Its main advantage is that it's the right tool for the job.
size_t is literally defined to be big enough to represent the size of any object on your platform. The others are not. So, when you want to store the size of an object, why would you use anything else?
You can use int if you like, but you'll be deliberately choosing the inferior option that leads to bugs. I don't quite understand why you'd want to do so, but hey it's your code.
If you choose to use float, though, please tell us what program you're writing so we can avoid it. :)
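To make it concrete, a tiny example (assuming a typical 64-bit platform where int is 32 bits and size_t is 64):

#include <cstddef>
#include <cstdio>

int main()
{
    // sizeof yields a size_t: it can represent the size of any object.
    std::size_t sz = sizeof(double[1000]);
    std::printf("%zu bytes\n", sz);  // %zu is the printf specifier for size_t

    // A size like 5 GiB fits in a 64-bit size_t but would overflow a 32-bit int:
    std::size_t five_gib = 5ull * 1024 * 1024 * 1024;
    std::printf("%zu\n", five_gib);
}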
Using a float would be horrible, since that would be a misuse of floating-point types; type promotion would also mean that any arithmetic involving the size would take place in floating point!
Using an int would also be horrible, since the specifics of int are intentionally loosely defined by the C++ standard. (It could be as small as 16 bits.)
But a size_t type is guaranteed to adequately represent the size of pretty much anything and certainly the sizes of containers in the C++ standard library. Its specific details are dependent on a particular platform and architecture. The fact that it's an unsigned type is the subject of much debate. (I personally believe it was a mistake to make it unsigned as it can mess up code using relational operators and introduce pernicious bugs that are difficult to spot).
I would advise you to use size_t whenever you want to store the sizes of classes or structures, or when you deal with raw memory (e.g. storing the size of raw memory, or using it as an index into a raw array). However, for indexing/iterating over standard containers (such as std::vector), I recommend using the container's underlying size type (e.g. vector::size_type).
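For example (an illustrative loop, nothing more):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v{10, 20, 30};

    // The container's own size type - std::size_t with the default allocator:
    for (std::vector<int>::size_type i = 0; i < v.size(); ++i)
        std::cout << v[i] << '\n';

    // Comparing v.size() against a plain int would draw a
    // signed/unsigned comparison warning from most compilers.
}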
The vector<bool> class in the C++ STL is optimized for memory to allocate one bit per bool stored, rather than one byte. Every time I output sizeof(x) for a vector<bool> x, the result is 40 bytes for the vector structure itself. sizeof(x.at(0)) always returns 16 bytes, which I assume must be allocated memory for many bool values, not just the one at position zero. How many elements do the 16 bytes cover? Exactly 128? What if my vector has more or fewer elements?
I would like to measure the size of the vector and all of its contents. How would I do that accurately? Is there a C++ library available for viewing allocated memory per variable?
I don't think there's any standard way to do this. The only information a vector<bool> implementation gives you about how it works is the reference member type, but there's no reason to assume that this has any congruence with how the data are actually stored internally; it's just that you get a reference back when you dereference an iterator into the container.
So you've got the size of the container itself, and that's fine, but to get the amount of memory taken up by the data, you're going to have to inspect your implementation's standard library source code and derive a solution from that. Though, honestly, this seems like a strange thing to want in the first place.
Actually, using vector<bool> is kind of a strange thing to want in the first place. All of the above is essentially why its use is frowned upon nowadays: it's almost entirely incompatible with conventions set by other standard containers… or even those set by other vector specialisations.
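If a rough number is all you need, you can estimate from the documented one-bit-per-element behaviour instead of digging through library internals. A sketch - this is an approximation, not the exact allocation, which is rounded to whole words and may include extra bookkeeping:

#include <climits>
#include <iostream>
#include <vector>

// Rough estimate: the container object itself plus capacity() bits,
// rounded up to whole bytes. Implementation details may add more.
std::size_t approx_footprint(const std::vector<bool>& v)
{
    return sizeof(v) + (v.capacity() + CHAR_BIT - 1) / CHAR_BIT;
}

int main()
{
    std::vector<bool> x(1000, true);
    std::cout << approx_footprint(x) << " bytes, approximately\n";
}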
I am currently converting an application to 64 bit.
I have some occurrences of the following pattern:
#include <vector>

class SomeOtherClass;

class SomeClass
{
    std::vector<SomeOtherClass*> mListOfThings;

    void SomeMemberFunction()
    {
        // Needs to know the size of the member list variable
        unsigned int listSize = mListOfThings.size(); // narrows on 64-bit - the point of this question
        // use listSize in further computations
        //...
    }
};
Obviously, in a practical case I will not have more than INT_MAX items in my list. But I wondered if there is consensus about the 'best' way to represent this type.
Each collection defines its own return type for size(), so my first approximation would be:
std::vector<SomeOtherClass*>::size_type listSize = mListOfThings.size();
I would assume this to be correct, but (personally) I don't find this 'easy reading', so -1 for clarity.
For a C++11-aware compiler I could write
auto listSize = mListOfThings.size();
which is clearly more readable.
So my question, is the latter indeed the best way to handle storing container sizes in a variable and using them in computations, regardless of underlying architecture (win32, win64, linux, macosx) ?
What exactly you want to use is a matter of how "purist" you want your code to be.
If you're on C++11, you can just use auto and be done with it.
Otherwise, in extremely generic code (which is designed to work with arbitrary allocators), you can use the container's nested typedef size_type. That is taken verbatim from the container's allocator.
In normal use of standard library containers, you can use std::size_t. That is the size_type used by the default allocators, and is the type guaranteed to be able to store any object size.
I wouldn't recommend using [unsigned] int, as that will likely be smaller than necessary on 64-bit platforms (it's usually left at 32 bits, although this of course depends on compiler and settings). I've actually seen production code fail due to unsigned int not being enough to index a container.
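Side by side, the three options look like this (illustrative only):

#include <cstddef>
#include <vector>

void example(const std::vector<int*>& things)
{
    auto n1 = things.size();                          // C++11: simplest and correct
    std::vector<int*>::size_type n2 = things.size();  // fully generic
    std::size_t n3 = things.size();                   // fine with the default allocator
    (void)n1; (void)n2; (void)n3;                     // silence unused warnings
}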
It depends on why you need the size, and what is going to be in the vector. Internally, vector uses std::size_t. But that's an unsigned type, inappropriate for numerical values. If you just want to display the value, or something, fine, but if you're using it in any way as a numerical value, the unsignedness will end up biting you.

Realistically, there are a lot of times the semantics of the code ensure that the number of values cannot be more than INT_MAX. For example, when evaluating financial instruments, the maximum number of elements is less than 20000, so there's no need to worry about overflowing an int. In other cases, you'll validate your input first, to ensure that there will never be overflow. If you can't do this, the best solution is probably ptrdiff_t (which is the type you get from subtracting two iterators). Or if you're using non-standard allocators, MyVectorType::difference_type.
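A small illustration of why the signed type helps (my own example, not from the answer above):

#include <cstddef>
#include <vector>

void example(const std::vector<int>& v)
{
    // Signed size, safe for arithmetic that may go negative:
    std::ptrdiff_t n = static_cast<std::ptrdiff_t>(v.size());

    // Counting down is fine with a signed index; with an unsigned one,
    // i >= 0 is always true and the loop never terminates.
    for (std::ptrdiff_t i = n - 1; i >= 0; --i)
    {
        // use v[static_cast<std::size_t>(i)] ...
    }
}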
Not sure if you've already considered this, but what is wrong with size_t?
It is what your compiler uses for sizes of built-in containers (i.e. arrays).
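For instance (a trivial example):

#include <cstddef>

int main()
{
    int arr[10] = {};
    // sizeof expressions produce size_t values:
    std::size_t count = sizeof(arr) / sizeof(arr[0]);  // 10
    return static_cast<int>(count);
}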
I got some code that I'd like to improve. It's a simple app for one of the variations of 2DBPP (two-dimensional bin packing), and you can take a look at the source at https://gist.github.com/892951
Here's an outline of things that I use chars for (I'd like to switch to binary values instead). Initialize a block of memory with zero bytes:
...
char* bin;
bin = new (nothrow) char[area];
memset(bin, '\0', area);
sometimes I check particular values:
if (!bin[j*height+k]) {...}
or blocks:
if (memchr(bin+i*height+pos.y, '\1', pos.height)) {...}
or set values to '1's:
memset(bin+i*height+best.y,'\1',best.height);
I don't know of any standard types or methods for working with binary values. How do I get to use bits instead of bytes?
There's a related question that you might be interested in -
C++ performance: checking a block of memory for having specific values in specific cells
Thank you!
Edit: There's still a bigger question - would it be an improvement? I'm only concerned with time.
For starters, you can refer to this post:
How do you set, clear, and toggle a single bit?
Also, try looking into std::bitset, or bit-fields.
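For reference, the classic idioms from that question look like this (assuming pos is less than the width of the word):

#include <cstdint>

// Single-bit operations on an integer word:
std::uint32_t set_bit(std::uint32_t word, unsigned pos)    { return word |  (1u << pos); }
std::uint32_t clear_bit(std::uint32_t word, unsigned pos)  { return word & ~(1u << pos); }
std::uint32_t toggle_bit(std::uint32_t word, unsigned pos) { return word ^  (1u << pos); }
bool          test_bit(std::uint32_t word, unsigned pos)   { return (word >> pos) & 1u; }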
I recommend reading up on boost.dynamic_bitset, which is a runtime-sized version of std::bitset.
Alternatively, if you don't want to use boost for some reason, consider using a std::vector<bool>. Quoting cppreference.com:
Note that a boolean vector (std::vector<bool>) is a specialization of the vector template that is designed to use less memory. A normal boolean variable usually uses 1-4 bytes of memory, but a boolean vector uses only one bit per boolean value.
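As a rough sketch of how the question's char-array pattern might translate to std::vector<bool> - the parameter names echo the question's variables (pos.y, best.height and so on), so treat this as an outline rather than a drop-in replacement:

#include <algorithm>
#include <cstddef>
#include <vector>

void sketch(std::size_t area, std::size_t height,
            std::size_t i, std::size_t j, std::size_t k,
            std::size_t y, std::size_t h)
{
    std::vector<bool> bin(area, false);          // replaces new[] + memset('\0')

    if (!bin[j * height + k]) { /* ... */ }      // single-bit check, as before

    // Block check - there is no memchr for bits, so use an algorithm:
    auto first = bin.begin() + (i * height + y);
    auto last  = first + h;
    if (std::find(first, last, true) != last) { /* ... */ }

    // Block set - replaces memset(..., '\1', ...):
    std::fill(first, last, true);
}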
Unless memory space is an issue, I would stay away from bit twiddling. You may save some memory, but you'll pay in execution time: packing and unpacking bits takes time, and extra code.
Get the code more robust and correct before attempting bit twiddling. Play with different (high level) designs that can improve performance and memory usage.
If you are going to the bit level, study up on boolean arithmetic and logic. Redesign your data to be easier to manipulate at the bit level.