A memcpy()-like function for bit vectors? - c++

I have a vector of bits, and I want to copy a slice of it to another vector (say, for simplicity, to the beginning of another vector). Note that all the bits may need to be shifted (or rather, rotated) in some direction, not just the first element, since the alignment of bits within each byte changes.
Suppose, for clarity, that the signature is:
void* memcpy_bits(
    char*  destination,
    char*  source,
    size_t offset_into_source_in_bits,
    size_t num_bits_to_copy);
And that the data is stored in bytes, so there are no endianness issues, and that the lower bits come first in the vector. We could make the signature more complex to accommodate other assumptions, but never mind that for now.
So,
Is there some hardware support for doing this (on x86 or x86_64 CPUs I mean)?
Is there some standard/idiomatic/widely-used implementation of this function (or something similar enough)?

First you have to define how the data is stored. Is it stored in an array of uint8_t, uint16_t, uint32_t or uint64_t? Is bit #0 stored as a value 1u << 0? You should probably not use void* but the underlying type that is used for storing the data.
Second, you can obviously assume that offset_into_source_in_bits is less than the number of bits in the underlying data type (if it's not, what would you do?).
Third, if that offset is 0 then you can just call memcpy. That's an important thing to do, because the shift-based approach below won't work if the offset is 0 (shifting by the full width of the storage type is undefined behaviour).
Fourth, as long as num_bits_to_copy >= number of bits in the underlying type, you can calculate the next unit to store into destination using two shifts.
Fifth, if 0 < num_bits_to_copy < number of bits in the underlying type, then you need to be careful not to read any source bits that don't actually exist.
You'd probably want to be careful not to overwrite any bits that you shouldn't overwrite, and personally I would have an offset into the destination bits as well, so you can copy arbitrary ranges of bits. I might implement a memmove_bits function as well.
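For illustration only, here is one way those steps could look in code: a minimal sketch assuming the storage unit is plain bytes (unsigned char rather than char, to keep the shifts well defined), LSB-first bit order within each byte, and a destination bit offset of 0. The names follow the question's signature; this is not tuned or production-ready.
#include <cstddef>
#include <cstring>

void* memcpy_bits(unsigned char* destination,
                  const unsigned char* source,
                  std::size_t offset_into_source_in_bits,
                  std::size_t num_bits_to_copy)
{
    unsigned char* const original_destination = destination;
    source += offset_into_source_in_bits / 8;      // skip whole source bytes
    const unsigned shift = offset_into_source_in_bits % 8;

    if (shift == 0) {
        // Aligned case: a plain memcpy handles the whole bytes.
        std::memcpy(destination, source, num_bits_to_copy / 8);
        destination += num_bits_to_copy / 8;
        source      += num_bits_to_copy / 8;
        num_bits_to_copy %= 8;
    } else {
        // Unaligned case: each full destination byte is assembled from
        // two neighbouring source bytes with two shifts.
        while (num_bits_to_copy >= 8) {
            *destination++ = (unsigned char)((source[0] >> shift) |
                                             (source[1] << (8 - shift)));
            ++source;
            num_bits_to_copy -= 8;
        }
    }

    if (num_bits_to_copy > 0) {
        // Partial last byte: read a second source byte only if the remaining
        // bits spill into it, and keep the destination bits outside the range.
        unsigned char bits = (unsigned char)(source[0] >> shift);
        if (shift + num_bits_to_copy > 8)
            bits |= (unsigned char)(source[1] << (8 - shift));
        const unsigned char mask =
            (unsigned char)((1u << num_bits_to_copy) - 1);
        *destination = (unsigned char)((*destination & ~mask) | (bits & mask));
    }
    return original_destination;
}
The aligned fast path plus the masked tail is what keeps the function from reading or writing bits outside the requested range, which is the fifth point above.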

Related

Is the byte size of int fixed, or does it occupy bytes as needed, in C/C++?

I have seen some programs that use int instead of other types like int16_t or uint8_t, even though there is no need to use int.
Let me give an example: when you assign 9 to an int, I know that 9 takes only 1 byte to store, so are the other 3 bytes free to use, or are they occupied?
All I'm asking is, does an int always take 4 bytes in memory, or does it take bytes as needed, with 4 bytes as the maximum size?
I hope you understand what I'm saying.
The size of all types is constant. The value that you store in an integer has no effect on the size of the type. If you store a positive value smaller than the maximum value representable by a single byte, then the more significant bytes (if any) will contain a zero value.
The size of int is not necessarily 4 bytes. The byte size of integer types is implementation defined.
The size of types is fixed at compile time. There is no "dynamic resizing". If you tell the compiler to use int, it will use an integer type that is guaranteed to have at least 16-bit width. However, it may be (and most of the time is) wider, depending on the platform and compiler you are using. You can query the byte width on your platform by using sizeof(int).
There is a neat overview of the widths of the different integer types at cppreference.
int16_t and uint8_t are not core language types but library-defined convenience types that can be used when an exact bit width is required (e.g. for bitwise arithmetic).
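As a quick illustration of the difference (just a sketch; the exact numbers depend on your platform):
#include <climits>
#include <cstdint>
#include <iostream>

int main() {
    std::cout << sizeof(int) << '\n';              // implementation-defined, often 4
    std::cout << sizeof(std::int16_t) << '\n';     // exactly 16 bits, 2 bytes on typical platforms
    std::cout << sizeof(std::uint8_t) << '\n';     // exactly 8 bits, 1 byte
    std::cout << sizeof(int) * CHAR_BIT << '\n';   // bit width of int on this platform
}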
An int has no "free bytes". An int is at least 16 bits wide and the exact size depends on the target platform (see https://en.cppreference.com/w/cpp/language/types). sizeof(int) is a compile-time constant, though. An int always occupies the same number of bytes, no matter what value it holds.
The fixed-width integer types (https://en.cppreference.com/w/cpp/types/integer) are useful for code that assumes a certain size of integers, because assuming a certain size of int is usually a bug. int16_t is exactly 16 bits wide and uint8_t is exactly 8 bits wide, independent of the target platform.
I have seen some programs that use int instead of other types like int16_t or uint8_t, even though there is no need to use int
This is sometimes called "sloppy typing". int has the drawback that its size is implementation-defined, so it isn't portable. It can in theory even use an exotic signedness format (at least until the C23 standard).
when you assign 9 to an int, I know that 9 takes only 1 byte to store
That is not correct and there are no free bytes anywhere. Given some code int x = 9; the integer constant 9 is of type int and takes up as much space as any other int, unless the compiler decides to optimize it into a smaller type. The 9 is stored in read-only memory, typically together with the executable code in the .text segment.
The variable x takes exactly sizeof(int) bytes (4 bytes on typical 32-bit systems) no matter what value is stored. The compiler cannot do any sensible optimization regarding the size, other than when it is possible to remove the variable completely.

Unpacking a stream of values of bit size not divisible by 8

I've spent too many hours on this, and at this point I think I need some help from the experts.
I have a const uint8_t* buffer, an integer data type (say, uint16_t), and I know that the buffer contains packed samples of m bits each, where m is not divisible by 8 (say, m = 12 bits). Knowing that the buffer holds N samples, I need to return an std::vector<uint16_t> containing the values of these N samples expanded to uint16_t.
So, every three bytes (24 bits) of the buffer contain two 12-bit samples I need to process. I want to implement a generalized function
template <typename OutputType, int BitsPerSample>
std::vector<OutputType> unpack(const uint8_t* data, const size_t numSamplesToUnpack);
Assume the data is big endian and OutputType is some integer type that can hold the sample value without truncating it.
I understand bit manipulation. I understand how this can be implemented, in principle. But I don't understand how to implement it elegantly and concisely. Got any ideas?
Also, is there a special name or term for this problem?
Maybe you can try reading single bits at a time, and keep a running counter of how many bits you have processed. When you consume 8 bits, you can increment your buffer pointer.
This doesn't mean you have finished unpacking that sample, so you'll need to also keep a "bits_left" counter in case you need to shift the buffer pointer before you are done unpacking a sample.
Use a 32-bit word as a buffer. If it has less than 12 bits, read another byte. Otherwise, output a 12-bit word.
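A sketch of that accumulator idea, assuming the samples are packed MSB-first (the question's "big endian" bit order) and that BitsPerSample is small enough for a 32-bit accumulator (up to roughly 24 bits):
#include <cstddef>
#include <cstdint>
#include <vector>

template <typename OutputType, int BitsPerSample>
std::vector<OutputType> unpack(const uint8_t* data, const size_t numSamplesToUnpack)
{
    std::vector<OutputType> out;
    out.reserve(numSamplesToUnpack);

    uint32_t accumulator = 0;    // bits read from the buffer but not yet consumed
    int bitsInAccumulator = 0;

    while (out.size() < numSamplesToUnpack) {
        if (bitsInAccumulator < BitsPerSample) {
            // Not enough bits for a sample yet: pull in the next byte.
            accumulator = (accumulator << 8) | *data++;
            bitsInAccumulator += 8;
        } else {
            // The next sample sits in the top BitsPerSample bits we hold.
            bitsInAccumulator -= BitsPerSample;
            out.push_back(static_cast<OutputType>(
                (accumulator >> bitsInAccumulator) &
                ((uint32_t(1) << BitsPerSample) - 1)));
        }
    }
    return out;
}
For the question's case this would be called as unpack<uint16_t, 12>(buffer, N); the same loop works for any sample width that fits the accumulator.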

Is this the correct way of writing bits to big endian?

Currently, it's for a Huffman compression algorithm that assigns binary codes to characters used in a text file: fewer bits for more frequent characters, and more bits for less frequent ones.
Currently, I'm trying to save the binary code big-endian in a byte.
So let's say I'm using an unsigned char to hold it.
00000000
And I want to store some binary code that's 1101.
In advance, I want to apologize if this seems trivial or is a dupe but I've browsed dozens of other posts and can't seem to find what I need. If anyone could link or quickly explain, it'd be greatly appreciated.
Would this be the correct syntax?
I'll have some external method like
int length = 0;
unsigned char byte = (some default value);
void pushBit(unsigned int bit){
    if (bit == 1){
        byte |= 1;
    }
    byte <<= 1;
    length++;
    if (length == 8) {
        //Output the byte
        length = 0;
    }
}
I've seen some videos explaining endianness, and my understanding is that the most significant bit (the first one) is placed in the lowest memory address.
Some videos showed the byte from left to right, which makes me think I need to left-shift everything over, but whenever I set, toggle, or erase a bit, it's from the rightmost, is it not? I'm sorry once again if this is trivial.
So after pushing 1101 into this method, byte would be something like 00001101. Is this big endian? My knowledge of address locations is very weak and I'm not sure which end,
-->00001101 or 00001101<--
is considered the most significant.
Would I need to left shift the remaining amount?
So since I used 4 bits, I would left shift 4 bits to make 11010000. Is this big endian?
First off, as the Killzone Kid noted, endianness and the bit ordering of a binary code are two entirely different things. Endianness refers to the order in which a multi-byte integer is stored in the bytes of memory. For little endian, the least significant byte is stored first. For big endian, the most significant byte is stored first. The bits in the bytes don't change order. Endianness has nothing to do with what you're asking.
As for accumulating bits until you have a byte's worth to write, you have the basic idea, but your code is incorrect. You need to shift first, and then OR in the bit. The way you're doing it, you are losing the first bit you put in off the top, and the low bit of what you write is always zero. Just put the byte <<= 1; before the if.
You also need to deal with ending the stream somehow, writing out the last bits if there are fewer than eight left. So you'll need a flushBits() to write out your bit buffer if it still has bits in it. Your bit stream would need to be self-terminating, or you need to first send the number of bits, so that you don't misinterpret the filler bits in the last byte as a code or codes.
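For concreteness, one way the fix and the flush could look (a sketch only; outputByte is a hypothetical stand-in for whatever actually writes a finished byte to your output stream):
void outputByte(unsigned char b);   // assumed to exist elsewhere

int length = 0;
unsigned char byte = 0;

void pushBit(unsigned int bit) {
    byte <<= 1;               // shift first ...
    if (bit == 1) {
        byte |= 1;            // ... then OR the new bit into the bottom
    }
    length++;
    if (length == 8) {
        outputByte(byte);
        byte = 0;
        length = 0;
    }
}

void flushBits() {
    if (length > 0) {
        byte <<= (8 - length);    // pad the unused low bits with zeros
        outputByte(byte);
        byte = 0;
        length = 0;
    }
}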
There are two types of endianness, Big-endian and Little-endian (technically there are more, like middle-endian, but big and little are the most common). If you want to have the big-endian format, (as it seems like you do), then the most significant byte comes first, with little-endian the least significant byte comes first.
Wikipedia has some good examples
It looks like what you are trying to do is store the bits themselves within the byte to be in reverse order, which is not what you want. A byte is endian agnostic and does not need to be flipped. Multi-byte types such as uint32_t may need their byte order changed, depending on what endianness you want to achieve.
Maybe what you are referring to is bit numbering, in which case the code you have should largely work (although you should compare length to 7, not 8). The order you place the bits in pushBit would end up with the first bit you pass being the most significant bit.
Bits aren't addressable in C++ (unlike, say, C51 and its C++ successor), so from the point of view of the high-level language, and even of assembler pseudo-code, no matter which physical direction LSB -> MSB runs in, the bit-wise << always shifts from LSB towards MSB. Bit order is referred to as bit numbering and is a separate feature from endianness, related to the hardware implementation.
Bit fields in C++ may appear to change order because in the most common use cases, e.g. network communication, the bits do have the opposite order, but in fact the way bit fields are packed into a byte is implementation-dependent: there is no guarantee that there are no gaps or that the order is preserved.
The minimal addressable unit of memory in C++ is a char, and that's where your concern with endianness ends. In the rare case where you actually have to change the bit order (when? working with some incompatible hardware?), you have to do so explicitly.
Note that when working with Ethernet or another network protocol you should not do so; the order change is done by the hardware (the first bit sent over the wire is the least significant one on the platform).
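To see the implementation-defined packing in action, a tiny test like the following can be used (purely illustrative; which bit of the byte ends up set depends on your compiler and ABI):
#include <cstdio>
#include <cstring>

struct Flags {
    unsigned char a : 1;   // whether 'a' lands in the low bit or the high bit
    unsigned char b : 3;   // of the byte is up to the implementation
    unsigned char c : 4;
};

int main() {
    Flags f{};
    f.a = 1;
    unsigned char raw;
    std::memcpy(&raw, &f, 1);
    std::printf("%02x\n", raw);  // e.g. 01 on common x86 ABIs, but not guaranteed
}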

Naming convention used in `<cstdint>`

The <cstdint> (<stdint.h>) header defines several integral types and their names follow this pattern: intN_t, where N is the number of bits, not bytes.
Given that a byte is not strictly defined as being 8 bits in length, why aren't these types defined as, for example, int1_t instead of int8_t? Wouldn't that be more appropriate since it takes into account machines that have bytes of unusual lengths?
On machines that don't have exactly those sizes, the types are not defined. That is, if your machine doesn't have an 8-bit byte then int8_t would not be available. You would, however, still have the least versions available, such as int_least16_t.
The reason, one suspects, is that if you want a precise size you usually want a bit size and not an abstract byte size. For example, all internet protocols deal with an 8-bit byte, so you'd want to have 8 bits whether that is the native byte size or not.
This answer is also quite informative in this regard.
int32_t could be a 4-byte 8-bits-per-byte type, or it could be a 2-byte 16-bits-per-byte type, or it could be a 1-byte 32-bits-per-byte type. It doesn't matter for the values you can store in it.
The idea of using those types is to make explicit the number of bits you can store in the variable. As you pointed out, different architectures may have different byte sizes, so having the number of bytes doesn't guarantee the number of bits your variable can handle.
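A couple of compile-time checks illustrate the distinction (a sketch: the exact-width type is optional, the least-width type is not):
#include <climits>
#include <cstdint>

static_assert(CHAR_BIT >= 8, "a byte is at least 8 bits");
static_assert(sizeof(std::int_least16_t) * CHAR_BIT >= 16,
              "int_least16_t always provides at least 16 bits");

#ifdef INT8_MAX                         // defined only when int8_t is available,
std::int8_t exactly_eight_bits = 0;     // i.e. on platforms with an 8-bit byte
#endif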

C++: a scalable number class - bitset<1>* or unsigned char*

I'm planning on creating a number class. The purpose is to hold numbers of any size without worrying about running out of room (like with int or long), but at the same time without using too much memory. For example:
If I have data that only really needs 1-10, I don't need an int (4 bytes), a short (2 bytes) or even a char (1 byte). So why allocate so much?
If I want to hold data that requires an extremely large value (only integers in this scenario), like past the billions, I cannot.
My goal is to create a number class that can handle this problem like strings do, sizing to fit the number. But before I begin, I was wondering..
bitset<1>: bitset is a template class that allows me to manipulate bits in C++, quite useful, but is it efficient? bitset<1> would define 1 bit, but do I want to make an array of them? C++ can allocate a byte at minimum, so does bitset<1> allocate a byte and provide 1 bit of that byte? If that's the case, I'd rather create my number class with unsigned char*'s.
unsigned char, or BYTE, holds 8 bits; anything from 0 - 255 would need only one, more would require two, then three; it would simply keep expanding when needed in byte intervals rather than bit intervals.
Which do you think is MORE efficient? The bits would be, if bitset actually allocated 1 bit, but I have a feeling that isn't even possible. In fact, it may actually be more efficient to allocate in bytes up to 4 bytes (32 bits); on a 32-bit processor, 32-bit allocation is most efficient, so I would use 4 bytes at a time from then on out.
Basically, my question is: what are your thoughts? How should I go about this implementation: bitset<1>, or unsigned char (or BYTE)?
Optimizing for bits is silly unless your target architecture is a DigiComp-1. Reading individual bits is always slower than reading ints; 4 bits isn't more efficient than 8.
Use unsigned char if you want to store it as a decimal number. This will be the most efficient.
Or, you could just use GMP.
The bitset template requires a compile-time constant integer for its template argument. This could be a drawback when you have to determine the maximum bit size at run time. Another thing is that most compilers/libraries use unsigned int or unsigned long long to store the bits, for faster memory access. If your application would run in an environment with limited memory, you should create a new class like bitset or use a different library.
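The overhead is easy to check (the printed values are typical, not guaranteed):
#include <bitset>
#include <iostream>

int main() {
    // A bitset<1> still occupies whole bytes, typically the size of an
    // unsigned long or unsigned long long word, depending on the library.
    std::cout << sizeof(std::bitset<1>) << '\n';   // often 4 or 8, never less than 1
    std::cout << sizeof(unsigned char)  << '\n';   // always 1
}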
While it won't directly help you with arithmetic on giant numbers, if this kind of space-saving is your goal then you might find my Nstate library useful (boost license):
http://hostilefork.com/nstate/
For instance: if you have a value that can be between 0 and 2... then, so long as you are going to be storing a bunch of these in an array, you can exploit the "wasted" space of the unused 4th state (3) to pack more values. In that particular case, you can get 20 tristates in a 32-bit word instead of the 16 that you would get with 2 bits per tristate.