C++ 2.5-byte (20-bit) integer

I know it's ridiculous, but I need it for storage optimization. Is there any good way to implement it in C++?
It has to be flexible enough that I can use it as a normal data type, e.g. Vector< int20 >, operator overloading, etc.

If storage is your main concern, I suspect you need quite a few 20-bit variables. How about storing them in pairs? You could create a class representing two such variables and store them in 2.5+2.5 = 5 bytes.
To access the variables conveniently you could overload the [] operator so you could write:
int fst = pair[0];
int snd = pair[1];
Since you may want to allow for manipulations such as
pair[1] += 5;
you would not want to return a copy of the backing bytes, but a reference. However, you can't return a direct reference to the backing bytes (since it would mess up its neighboring value), so you'd actually need to return a proxy for the backing bytes (which in turn holds a reference to the backing bytes) and let the proxy overload the relevant operators.
As a matter of fact, as @Tony suggests, you could generalize this to a general container holding N such 20-bit variables.
(I've done this myself in a specialization of a vector for efficient storage of booleans (as single bits).)
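A minimal sketch of what such a pair-plus-proxy could look like (the names Int20Pair and Proxy are my own invention, and there is no bounds checking or sign handling):

#include <cstdint>

// Two 20-bit values packed into 5 bytes (40 bits).
class Int20Pair
{
    std::uint8_t bytes_[5];

    // Assemble / scatter the 40 bits into / from a 64-bit temporary.
    std::uint64_t as_bits() const
    {
        std::uint64_t bits = 0;
        for (int i = 0; i < 5; ++i)
            bits |= std::uint64_t(bytes_[i]) << (8 * i);
        return bits;
    }
    void store_bits(std::uint64_t bits)
    {
        for (int i = 0; i < 5; ++i)
            bytes_[i] = std::uint8_t(bits >> (8 * i));
    }

public:
    Int20Pair() : bytes_{} {}

    std::uint32_t get(int i) const
    {
        return std::uint32_t(as_bits() >> (20 * i)) & 0xFFFFF;
    }
    void set(int i, std::uint32_t value)
    {
        const std::uint64_t mask = std::uint64_t(0xFFFFF) << (20 * i);
        store_bits((as_bits() & ~mask) | (std::uint64_t(value & 0xFFFFF) << (20 * i)));
    }

    // Proxy returned by operator[], so that pair[1] += 5; writes back into the packed bytes.
    class Proxy
    {
        Int20Pair& pair_;
        int index_;
    public:
        Proxy(Int20Pair& p, int i) : pair_(p), index_(i) {}
        operator std::uint32_t() const { return pair_.get(index_); }
        Proxy& operator=(std::uint32_t v) { pair_.set(index_, v); return *this; }
        Proxy& operator+=(std::uint32_t v) { pair_.set(index_, pair_.get(index_) + v); return *this; }
    };

    Proxy operator[](int i) { return Proxy(*this, i); }
};

sizeof(Int20Pair) is 5, so an array of these really does use 2.5 bytes per stored value.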

No... you can't do that as a single value-semantic type... any class's size must be a multiple of the 8-bit character size (inviting all the usual quips about CHAR_BIT etc.).
That said, let's clutch at straws...
Unfortunately, you're obviously handling very many data items. If this is more than 64k, any proxy object into a custom container of packed values will probably need a more-than-16-bit index/handle too, but this is still one of the few possibilities I can see that's worth further consideration. It might be suitable if you're only actively working with, and needing value-semantic behaviour for, a small subset of the values at one point in time.
struct Proxy
{
Int20_Container& container_; // might not need if a singleton
Int20_Container::size_type index_;
...
};
So, the proxy might be 32, 64 or more bits - the potential benefit is only if you can create them on the fly from indices into the container, have them write directly back into the container, and keep them short-lived with few concurrently. (One simple way - not necessarily the fastest - to implement this model is to use an STL bitset or vector as the Int20_Container, and either store 20 times the logical index in index_, or multiply on the fly.)
It's also vaguely possible that although your values range over a 20-bit space, you have fewer than, say, 64k distinct values in actual use. If you have some such insight into your data set, you can create a lookup table where 16-bit array indices map to 20-bit values.
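A sketch of that lookup-table variant, assuming you really do have fewer than 64k distinct values (the names are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

// Each stored item costs 2 bytes instead of 4: it is a 16-bit index into
// the table of distinct 20-bit values.
std::vector<std::uint32_t> distinct_values;  // at most 65536 entries
std::vector<std::uint16_t> occurrences;      // one index per logical element

std::uint32_t value_at(std::size_t i)
{
    return distinct_values[occurrences[i]];
}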

Use a class. As long as you respect the copy/assign/clone/etc... STL semantics, you won't have any problem.
But it will not optimize the memory usage on your computer. In particular, if you put it in a flat array, each 20-bit value will likely be aligned on a 32-bit boundary, so the benefit of a 20-bit type there is lost.
In that case, you will need to define your own optimized array type that could be made compatible with the STL. But don't expect it to be fast. It won't be.

Use a bitfield. (I'm really surprised nobody has suggested this.)
struct int20_and_something_else {
int less_than_a_million : 20;
int less_than_four_thousand : 12; // total 32 bits
};
This only works as a mutual optimization of elements in a structure, where you can spackle the gaps with some other data. But it works very well!
If you truly need to optimize a gigantic array of 20-bit numbers and nothing else, there is:
struct int20_x3 {
int one : 20;
int two : 20;
int three : 20; // 60 bits is almost 64
void set( int index, int value );
int get( int index );
};
You can add getter/setter functions to make it prettier if you like, but you can't take the address of a bitfield, and they can't participate in an array. (Of course, you can have an array of the struct.)
Use as:
int20_x3 *big_array = new int20_x3[ array_size / 3 + 1 ];
big_array[ index / 3 ].set( index % 3, value );
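A possible implementation of those set/get members (just a sketch of my own; it assumes the struct definition above is in scope):

void int20_x3::set(int index, int value)
{
    switch (index)
    {
        case 0: one = value; break;
        case 1: two = value; break;
        case 2: three = value; break;
    }
}

int int20_x3::get(int index)
{
    switch (index)
    {
        case 0: return one;
        case 1: return two;
        default: return three;
    }
}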

You can use C++ std::bitset. Store everything in one big bitset and access your data at the correct bit offset (index × 20).
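For example, something along these lines (kCount and the helper names are mine, and the loop-per-bit access is written for clarity rather than speed):

#include <bitset>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCount = 1000;     // number of 20-bit values to store
std::bitset<kCount * 20> storage;        // one big bitset holding all of them

std::uint32_t get20(std::size_t index)
{
    std::uint32_t value = 0;
    for (std::size_t bit = 0; bit < 20; ++bit)
        if (storage[index * 20 + bit])
            value |= std::uint32_t(1) << bit;
    return value;
}

void set20(std::size_t index, std::uint32_t value)
{
    for (std::size_t bit = 0; bit < 20; ++bit)
        storage[index * 20 + bit] = ((value >> bit) & 1u) != 0;
}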

You're not going to be able to get exactly 20 bits as a type (even with a bit-packed struct), as it will always be aligned (at the smallest granularity) to a byte. IMO the only way to go, if you must have exactly 20 bits, is to create a bit stream to handle the data (which you can overload to accept indexing, etc.).

You can use the union keyword together with bit fields. I've used that way back when bit fields were a necessity. Otherwise, you can create a class that holds 3 bytes but, through bitwise operations, exposes just the most significant 20 bits.
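The classic idiom this alludes to is a union of a bit-field struct and a raw integer. Note that reading the member you didn't last write is type punning: mainstream compilers accept it, but the C++ standard doesn't strictly guarantee it. Names here are illustrative:

#include <cstdint>

union Packed20
{
    struct
    {
        std::uint32_t value : 20;   // the 20-bit payload
        std::uint32_t unused : 12;  // padding up to 32 bits
    } bits;
    std::uint32_t raw;              // raw 32-bit view, e.g. for serialization
};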

As far as I know that isn't possible.
The easiest option would be to define a custom type that uses an int32_t as the backing storage and implements the appropriate maths via overloaded operators.
For better storage density, you could store 3 int20 in a single int64_t value.
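A minimal sketch of the int32_t-backed idea (the name int20 and the single operator shown are my own choices; a real version would overload the full set of arithmetic and comparison operators):

#include <cstdint>

class int20
{
    std::int32_t value_;  // backing storage; only the low 20 bits are meaningful
public:
    static constexpr std::int32_t mask = (1 << 20) - 1;

    int20(std::int32_t v = 0) : value_(v & mask) {}

    operator std::int32_t() const { return value_; }  // read back as a normal int

    int20& operator+=(int20 rhs)
    {
        value_ = (value_ + rhs.value_) & mask;  // wrap around at 2^20
        return *this;
    }
};

// Usage:
// int20 a = 5, b = 7;
// a += b;                // a is now 12
// std::int32_t n = a;    // converts back to a plain integer

Note that sizeof(int20) is still 4, so the storage win only comes from the second step of packing three such values into an int64_t or similar.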

Just an idea: use optimized storage (5 bytes for two instances), and for operations, convert it into 32-bit int and then back.

While it's possible to do this in a number of ways, one possibility would be to use bit twiddling to store them as the left and right halves of a 5-byte array, with a class for storing/retrieving that converts your desired array index into an entry in the byte array and extracts the left or right half as appropriate.
However, most hardware requires integers to be word-aligned, so as well as the bit twiddling to extract the integer you would need some bit shifting to align it properly.
I think it would be more efficient to increase your swap space and let virtual memory take care of your large array (after all, 20 vs 32 bits is not much of a saving!), always assuming you have a 64-bit OS.

Related

Is bitset the right container to manipulate big data then move the results into memory?

I am trying to generate a 512-bit pattern where the word 0xdeadbeef keeps rotating (shifted left by one) across the 512 bits; each time, I want to write the data to memory.
Basically, 0xffffffff.......deadbeefffffffff (512 bits total). Keep shifting the deadbeef part by one, and after each shift write the whole pattern to memory.
Is bitset the correct container in this case? I was able to use all the needed operations (<< ^ ...etc) but I can't find a way to translate the 512bit data into 64bit long long variables to write to memory.
Bitset is not really a container; it's a representation class. Internally it may be represented by an array or list of bools. The only way to export its content to an array is to do it manually. It's arguably more effective to do the aforementioned shifting without a bitset, provided that all you need is the actual offset, which gives you the address of the pattern; the rest of the bits would default to 1.
The question is how endianness should be handled in your application: does the position of the pattern represent an lsb-to-msb sequence within each word, or should it be byte-wise aligned?
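If you do want to pull 64-bit words out of a std::bitset<512>, one manual way looks like this (a simple sketch, not tuned for speed; the function name is mine):

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split a 512-bit bitset into eight 64-bit words, least-significant word first.
std::vector<std::uint64_t> to_words(const std::bitset<512>& bits)
{
    std::vector<std::uint64_t> words(512 / 64, 0);
    for (std::size_t i = 0; i < 512; ++i)
        if (bits[i])
            words[i / 64] |= std::uint64_t(1) << (i % 64);
    return words;
}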

C++: Group integers by common radix to save memory

Consider a (sorted) vector of unsigned int values.
std::vector<unsigned int> data = {1234,1254,1264,1265,1267,1268,1271,1819,1832,1856,
1867,1892,3210,3214,3256,3289};
Assuming each unsigned int is 4 bytes, this vector of 16 elements would take at least 64 bytes of RAM.
I am thinking it would be possible to reduce the memory usage by grouping those values by common radix. Consider for example a data representation of the kind
data =
{
{12..
..34, ..54, ..64, ..65, ..67, ..68, ..71
},
{18..
..19, ..32, ..56, ..67, ..92
}
{32..
..10, ..14, ..56, ..89
}
};
In the above example, I grouped the values by blocks of 100. It would be more logical to group the data by blocks of 2^8 = 256 or 2^16 = 65536.
Is there a data type (in std:: or boost:: or other) that can do this kind of trick for me or do I have to code my own type of container for that? Does it sound like a potentially good idea?
It is a workable idea for particular inputs in very constrained scenarios. Alas, std:: does not provide that structure. Your particular suggestion could reasonably be implemented in a tree-like fashion.
Beware of general radix-aware orderings, as they generally use big lists and hence use more memory than a regular vector would.
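If you want to experiment with the idea anyway, a rough sketch of grouping by a 16-bit prefix could look like this (names are illustrative, and it ignores the per-group overhead of the map and vectors):

#include <cstdint>
#include <map>
#include <vector>

// Split each 32-bit value into a 16-bit prefix (the shared "radix") and a
// 16-bit suffix, and store only the suffixes per prefix.
std::map<std::uint16_t, std::vector<std::uint16_t>>
group_by_prefix(const std::vector<unsigned int>& data)
{
    std::map<std::uint16_t, std::vector<std::uint16_t>> groups;
    for (unsigned int v : data)
        groups[static_cast<std::uint16_t>(v >> 16)]
            .push_back(static_cast<std::uint16_t>(v & 0xFFFF));
    return groups;
}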

should I use a bit set or a vector? C++

I need to be able to store an array of binary numbers in C++ which will be passed through different methods and eventually output to file and terminal.
What are the key differences between vectors and bit sets and which would be easiest and/or more efficient to use?
(I do not know how many bits I need to store)
std::bitset's size must be known at compile time, so your choice is obvious - use std::vector<bool>.
Since it is not implemented the same way as std::vector<char> (a single element takes a bit, not a full char), it should be a good solution in terms of memory use.
It all depends on what you want to do with the binary data.
You could also use boost::dynamic_bitset, which is like std::bitset but without a compile-time fixed size.
The main drawback is dependency on boost if you don't already use it.
You could also store your input in a std::vector<char> and use a bitset per char to convert to and from binary notation.
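A tiny example of the boost::dynamic_bitset suggestion, in case it's unfamiliar:

#include <boost/dynamic_bitset.hpp>
#include <iostream>

int main()
{
    boost::dynamic_bitset<> bits(10);   // 10 bits, all zero; size chosen at run time
    bits[2] = true;                     // set a bit via operator[]
    bits.push_back(true);               // the size can also grow dynamically
    std::cout << bits << '\n';          // streams the bits, most significant first
}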
As others have already said: std::bitset uses a fixed number of bits.
std::vector<bool> is not always advised because it has its quirks, as it is not a real container (gotw).
If you don't know at compile time how many bits you need to store, you cannot use bitset, as its size is fixed. Because of this, you should use vector<bool>, as it can be resized dynamically. If you want an array of such bit sequences, you can then use vector< vector<bool> >.
If you don't have any specific upper bound for the size of the numbers you need to store, then you need to store your data in two different dimensions:
The first dimension will be the number, whose size varies.
The second dimension will be your array of numbers.
For the latter, using std::vector is fine if you require your values to be contiguous in memory. For the numbers themselves you don't need any data structure at all: just allocate memory using new and an unsigned primitive type such as unsigned char, uint8_t or any other if you have alignment constraints.
If, on the other hand, you know that your numbers won't be larger than let's say 64 bits, then use a data type that you know will hold this amount of data such as uint64_t.
PS: remember that what you store are numbers. The computer will store them in binary whether you use them as such or with any other representation.
I think that in your case you should use std::vector with the value type of std::bitset. Using such an approach you can consider your "binary numbers" either like strings or like objects of some integer type and at the same time it is easy to do binary operations like setting or resetting bits.

Bitset Reference

From http://www.cplusplus.com/reference/stl/bitset/:
Because no such small elemental type exists in most C++ environments, the individual elements are accessed as special references which mimic bool elements.
How, exactly, does this bit reference work?
The only way I could think of would be to use a static array of chars, but then each instance would need to store its index in the array. Since each reference instance would have at least the size of a size_t, that would destroy the compactness of the bitset. Additionally, resizing may be slow, and bit manipulation is expected to be fast.
I think you are confusing two things.
The bitset class stores the bits in a compact representations, e.g. in a char array, typically 8 bits per char (but YMMV on "exotic" platforms).
The bitset::reference class is provided to allow users of the bitset class to have reference-like objects to the bits stored in a bitset.
Because regular pointers and references don't have enough granularity to point to the single bits stored in the bitset (their minimum granularity is the char), such a class mimics the semantics of a reference to fake reference-like lvalue operations on the bits. This is needed, in particular, to allow the value returned by operator[] to work "normally" as an lvalue (and it probably constitutes 99% of its "normal" use). In this case it can be seen as a "proxy object".
This behavior is achieved by overloading the assignment operator and the conversion-to-bool operator; the bitset::reference class will probably encapsulate a reference to the parent bitset object and the offset (bytes+bit) of the referenced bit, that are used by such operators to retrieve and store the value of the bit.
---EDIT---
Actually, the g++ implementation makes the bitset::reference store directly a pointer to the memory word in which the bit is stored, and the bit number within that word. This however is just an implementation detail to boost its performance.
By the way, in the library sources I found a very compact but clear explanation of what bitset::reference is and what it does:
/**
* This encapsulates the concept of a single bit. An instance of this
* class is a proxy for an actual bit; this way the individual bit
* operations are done as faster word-size bitwise instructions.
*
* Most users will never need to use this class directly; conversions
* to and from bool are automatic and should be transparent. Overloaded
* operators help to preserve the illusion.
*
* (On a typical system, this <em>bit %reference</em> is 64
* times the size of an actual bit. Ha.)
*/
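To make the proxy idea concrete, here is a stripped-down sketch in the same spirit (byte-based for simplicity; the real libstdc++ class works on whole words, as the comment above says, and the names here are mine):

#include <cstddef>

// A proxy "reference to a bit": it remembers which byte holds the bit and a
// mask selecting the bit within that byte.
class BitReference
{
    unsigned char* byte_;  // the byte holding the referenced bit
    unsigned char  mask_;  // single-bit mask within that byte
public:
    BitReference(unsigned char* byte, std::size_t bit)
        : byte_(byte), mask_(static_cast<unsigned char>(1u << bit)) {}

    // Reading through the "reference": conversion to bool.
    operator bool() const { return (*byte_ & mask_) != 0; }

    // Writing through the "reference": assignment from bool.
    BitReference& operator=(bool value)
    {
        if (value) *byte_ |= mask_;
        else       *byte_ &= static_cast<unsigned char>(~mask_);
        return *this;
    }
};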
I haven't looked at the STL source, but I would expect a Bitset reference to contain a pointer to the actual bitset, and a bit number of size size_t. The references are only created when you attempt to get a reference to a bitset element.
Normal use of bitsets is most unlikely to use references extensively (if at all), so there shouldn't be much of a performance issue. And, it's conceptually similar to char types. A char is normally 8 bits, but to store a 'reference' to a char requires a pointer, so typically 32 or 64 bits.
I've never looked at the reference implementation, but obviously it must know the bitset it is referring to, via a reference, and the index of the bit it is responsible for changing. It can then use the rest of the bitset's interface to make the required changes. This can be quite efficient. Note that bitsets cannot be resized.
I am not quite sure what you are asking, but I can tell you a way to access individual bits in a byte, which is perhaps what bitsets do. Mind you that the following code is not my own and is Microsoft spec (!).
Create a struct as such:
struct Byte
{
bool bit1:1;
bool bit2:1;
bool bit3:1;
bool bit4:1;
bool bit5:1;
bool bit6:1;
bool bit7:1;
bool bit8:1;
};
The ':1' parts of this code are bit fields. http://msdn.microsoft.com/en-us/library/ewwyfdbe(v=vs.80).aspx
They define how many bits a variable is desired to occupy, so in this struct, there are 8 bools that occupy 1 bit each. In total, the 'Byte' struct is therefore 1 byte in size.
Now if you have a byte of data, such as a char, you can store this data in a Byte object as follows:
char a = 'a';
Byte oneByte;
oneByte = *(Byte*)(&a); // Get the address of a (a pointer, basically), cast this
// char* pointer to a Byte*,
// then use the dereference operator to store the data that
// it points to in the variable oneByte.
Now you can access (and alter) the individual bits by accessing the bool member variables of oneByte. In order to store the altered data in a char again, you can do as follows:
char b;
b = *(char*)(&oneByte); // Basically, this is the reverse of what you do to
// store the char in a Byte.
I'll try to find the source of this technique, to give credit where credit is due.
Also, again I am not entirely sure whether this answer is of any use to you. I interpreted your question as being 'how would access to individual bits be handled internally?'.

When to use STL bitsets instead of separate variables?

In what situation would it be more appropriate for me to use a bitset (STL container) to manage a set of flags rather than having them declared as a number of separate (bool) variables?
Will I get a significant performance gain if I used a bitset for 50 flags rather than using 50 separate bool variables?
Well, 50 bools as a bitset will take 7 bytes, while 50 bools as bools will take 50 bytes. These days that's not really a big deal, so using bools is probably fine.
However, one place a bitset might be useful is if you need to pass those bools around a lot, especially if you need to return the set from a function. Using a bitset you have less data that has to be moved around on the stack for returns. Then again, you could just use refs instead and have even less data to pass around. :)
std::bitset will give you extra points when you need to serialize / deserialize it. You can just write it to a stream or read it from a stream. But certainly, the separate bools are going to be faster. They are optimized for this kind of use, after all, while a bitset is optimized for space and still has function calls involved. It will never be faster than separate bools.
Bitset:
- Very space efficient.
- Less efficient due to bit fiddling.
- Provides serialize / de-serialize with op<< and op>>.
- All bits packed together: you will have the flags in one place.
Separate bools:
- Very fast.
- Bools are not packed together. They will be members somewhere.
Decide on the facts. I, personally, would use std::bitset for something that is not performance-critical, and would use bools if I either have only a few bools (and thus it's easy to keep an overview) or if I need the extra performance.
It depends what you mean by 'performance gain'. If you only need 50 of them, and you're not low on memory, then separate bools are pretty much always a better choice than a bitset. They will take more memory, but the bools will be much faster. A bitset is usually implemented as an array of ints (the bools are packed into those ints). So the first 32 bools (bits) in your bitset will only take up a single 32-bit int, but to read each value you have to do some bitwise operations first to mask out all the values you don't want. E.g. to read the 2nd bit of a bitset, you need to:
1. Find the int that contains the bit you want (in this case, it's the first int).
2. Bitwise AND that int with '2' (i.e. value & 0x02) to find out if that bit is set (see the sketch below).
However, if memory is a bottleneck and you have a lot of bools, using a bitset could make sense (e.g. if your target platform is a mobile phone, or it's some state in a very busy web service).
NOTE: A std::vector of bool usually has a specialisation to use the equivalent of a bitset, thus making it much smaller and also slower for the same reasons. So if speed is an issue, you'll be better off using a vector of char (or even int), or just an old-school bool array.
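A sketch of those two masking steps, assuming the bits are packed into an array of 32-bit words (illustrative only, not std::bitset's actual internals):

#include <cstddef>
#include <cstdint>

bool read_bit(const std::uint32_t* words, std::size_t bit_index)
{
    const std::uint32_t word = words[bit_index / 32];          // step 1: find the word holding the bit
    const std::uint32_t mask = std::uint32_t(1) << (bit_index % 32);
    return (word & mask) != 0;                                 // step 2: mask out the bit
}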
Re @Wilka:
Actually, bit fields are supported by C/C++ in a way that doesn't require you to do your own masking. I don't remember the exact syntax, but it's something like this:
struct MyBitset {
bool firstOption:1;
bool secondOption:1;
bool thirdOption:1;
int fourBitNumber:4;
};
You can reference any value in that struct by just using dot notation, and the right things will happen:
MyBitset bits;
bits.firstOption = true;
bits.fourBitNumber = 2;
if(bits.thirdOption) {
// Whatever!
}
You can use arbitrary bit sizes for things. The resulting struct is rounded up to whole bytes (and typically to the alignment of the underlying type), so it can be somewhat larger than the bits you define.