How can I change the order of a packed structure in C or C++?
struct myStruct {
uint32_t A;
uint16_t B1;
uint16_t B2;
} __attribute__((packed));
The address 0x0 of the structure (or the LSB) is A.
My app communicates with hardware and the structure in hardware is defined like this:
struct packed {
logic [31:0] A;
logic [15:0] B1;
logic [15:0] B2;
} myStruct;
But in SystemVerilog the "address 0x0" or more accurately the LSB of the structure is the LSB of B2 = B2[0].
The order is reversed.
To stay consistent and to avoid changing the hardware part, I'd like to inverse the "endianness" of the whole C/C++ structure.
I could just inverse all the fields:
struct myStruct {
uint16_t B2;
uint16_t B1;
uint32_t A;
} __attribute__((packed));
but it's error-prone and not so convenient.
For datatype, both SystemVerilog and Intel CPUs are little-endian, that's not an issue.
How can I do it?
How can I change the byte orders of a struct?
You cannot change the order of bytes within members. And you cannot change the memory order of the members in relation to other members to be different from the order of their declaration.
But, you can change the declaration order of members which is what determines their memory order. The first member is always in lowest memory position, second is after that and so on.
If correct order of members can be known based on the verilog source, then ideally the C struct definition should be generated with meta-programming to ensure matching order.
it's error-prone
Relying on particular memory order is error-prone indeed.
It is possible to rely only on the known memory order of the source data (presumably an array of bytes) without relying on the memory order of the members at all:
unsigned char* data = read_hardware();
myStruct s;
s.B2 = data[0] << 0u
| data[1] << 8u;
s.B1 = data[2] << 0u
| data[3] << 8u;
s.A = data[4] << 0u
| data[5] << 8u
| data[6] << 16u
| data[7] << 24u;
This relies neither memory layout of the members, nor on the endianness of CPU. It relies only on order of the source data (assumed to be little endian in this case).
If possible, this function should also ideally be generated based on the verilog source.
How can I change the order of a packed structure in C or C++?
C specifies that the members of a struct are laid out in memory in the order in which they are declared, with the address of the first-declared, when converted to the appropriate pointer type, being equal to the address of the overall struct. At least for struct types expressible in C, such as yours, conforming C++ implementations will follow the same member-order rule. Those implementations that support packed structure layout as an extension are pretty consistent in what they mean by that: packed structure layouts will have no padding between members, and the overall size is the sum of the sizes of the members. And no other effects.
I am not aware of any implementation that provides an extention allowing members to be ordered differently than declaration order, and who would bother to implement that? The order of members is well-defined. If you want a different order, then the solution is to change the declaration order of the members.
If VeriLog indeed orders the members differently (to which I cannot speak) then I think you're just going to need to make peace with that. Implement it as you need to do or as otherwise makes the most sense, document on both sides, and move on. I'm inclined to think that the number of people who ever notice that the declaration order differs in the two languages will be very small. As long as appropriate the documentation is present, those that do notice won't be inclined to think there's an error.
You know I just looked AMD does in it's open source drivers to handle endianness.
First of all they detect if the system is big endian/little endian using cmake.
#if !defined (__GFX10_GB_REG_H__)
#define __GFX10_GB_REG_H__
/*
* gfx10_gb_reg.h
*
* Register Spec Release: 1.0
*
*/
//
// Make sure the necessary endian defines are there.
//
#if defined(LITTLEENDIAN_CPU)
#elif defined(BIGENDIAN_CPU)
#else
#error "BIGENDIAN_CPU or LITTLEENDIAN_CPU must be defined"
#endif
union GB_ADDR_CONFIG
{
struct
{
#if defined(LITTLEENDIAN_CPU)
unsigned int NUM_PIPES : 3;
unsigned int PIPE_INTERLEAVE_SIZE : 3;
unsigned int MAX_COMPRESSED_FRAGS : 2;
unsigned int NUM_PKRS : 3;
unsigned int : 21;
#elif defined(BIGENDIAN_CPU)
unsigned int : 21;
unsigned int NUM_PKRS : 3;
unsigned int MAX_COMPRESSED_FRAGS : 2;
unsigned int PIPE_INTERLEAVE_SIZE : 3;
unsigned int NUM_PIPES : 3;
#endif
} bitfields, bits;
unsigned int u32All;
int i32All;
float f32All;
};
#endif
Yes there is some code duplication as mentioned above. But I'm not aware of a universally better solution either.
Independent of the endian issue, I wouldn't recommend C++ bit fields for this kind of purpose, or any purpose in which you actually need explicit control of bit alignment. A long time ago, the decision to put performance over portability ruined this possibility. Alignment of bit fields (and structs in general for that matter) is not well defined in C++, making bit fields useless for many purposes. IMO would be better to let programmers make such decisions for optimization, or implement a strictly portable (non-machine dependent) packed keyword. If this means the compiler has to emit code that combines multiple shift-and operations once in a while, so be it.
As far as I know, the only general solution for this kind of thing is to add a layer that implements bit fields explicitly using shift-and logic. Of course this will likely ruin performance because you really want the conditionals to be handled at compile time, which is ironic because performance is what motivated this situation in the first place.
The code base I work in is quite old. While we compile nearly everything with c++11. Much of the code was written in c many years ago. When developing new classes in old areas I always find myself in a situation where I have to choose between matching old methodologies, or going with a more modern approach.
In most cases, I prefer sticking with more modern techniques when possible. However, one common old practice I often see, which I have a hard time arguing the use of, is bitfields. We pass a lot of messages, here, many times, they are full of single bit values. Take the example below:
class NewStructure
{
public:
const bool getValue1() const
{
return value1;
}
void setValue1(const bool input)
{
value1 = input;
}
private:
bool value1;
bool value2;
bool value3;
bool value4;
bool value5;
bool value6;
bool value7;
bool value8;
};
struct OldStructure
{
const bool getValue1() const
{
return value1;
}
void setValue1(const bool input)
{
value1 = input;
}
unsigned char value1 : 1;
unsigned char value2 : 1;
unsigned char value3 : 1;
unsigned char value4 : 1;
unsigned char value5 : 1;
unsigned char value6 : 1;
unsigned char value7 : 1;
unsigned char value8 : 1;
};
In this case the sizes are 8 bytes for the New Structure and 1 for the old.
I added a "getter" and "setter" to illustrate the point that from a user perspective, they can be identical. I realize that perhaps you could make the case of readability for the next developer, but other than that, is there a reason to avoid bit fields? I know that packed fields take a performance hit, but because these are all characters, padding rules are still in place.
There are several things to consider when using bitfields. Those are (order of importance would depend on situation)
Performance
Bitfields operation incur performance penalty when set or read (as compared to direct types). A simple example of codegen shows the extra instructions emitted: https://gcc.godbolt.org/z/DpcErN However, bitfields provide for more compact data, which becomes more cache-friendly, and that could completely outweigh any drawbacks of additional operations. The only way to understand the real performance impact is by benchmarking actual application in the real use case.
ABI Interoperability
Endiannes of bit fields is implementation defined, so layout of the same struct produced by two compiler can differ.
Usability
There is no reference binding to a bitfield, nor you can take it's address. This could affect code and make it less clear.
For you, as the programmer, there's not much difference. But the machine code to access a whole byte is much simpler/shorter than to access an individual bit, so using bitfields bulks up the generated code.
In a pseudo-assembly-language, your setter might turn into something like:
ldb input1,b ; get the new value into accumulator b
movb b,value1 ; put it into the variable
rts ; return from subroutine
But it's not so easy for bitfields:
ldb input1,b ; get the new value into accumulator b
movb bitfields,a ; get current bitfield values into accumulator a
cmpb b,#0 ; See what to do.
brz clearvalue1: ; If it's zero, go to clearing the bit
orb #$80,a ; set the bit representing value1.
bra resume: ; skip the clearing code.
clearvalue1:
andb #$7f,a ; clear the bit representing value1
resume:
movb a,bitfields ; put the value back
rts ; return
And it has to do that for each of your 8 members' setters, and something similar for getters. It adds up. Additionally, even today's dumbest compilers would probably inline the full-byte setter code, rather than actually make a subroutine call. For the bitfield setter, it may depend whether you're compiling optimizing for speed vs. space.
And you've only asked about booleans. If they were integer bitfields, then the compiler has to deal with loading, masking out prior values, shifting the value to align into its field, masking out unused bits, and/or the value into place, then writing it back to memory.
So why would you use one vs. the other?
Bitfields are slower, but pack data more efficiently.
Non-bitfields are faster, and require less machine code to access.
As the developer, it's your judgement call. If you will be keeping many instances of Structure in memory at once, the memory savings may be worth it. If you aren't going to have many instances of that structure in memory at once, the compiled code bloat offsets memory savings and you're sacrificing speed.
template<typename enum_type,size_t n_bits>
class bit_flags{
std::bitset<n_bits> bits;
auto operator[](enum_type bit){return bits[bit];};
auto& set(enum_type bit)){return set(bit);};
auto& reset(enum_type bit)){return set(bit);};
//go on with flip et al...
static_assert(std::is_enum<enum_type>{});
};
enum class v_flags{v1,v2,/*...*/vN};
bit_flags<v_flags,v_flags::vN+1> my_flags;
my_flags.set(v_flags::v1);
my_flags.[v_flags::v2]=true;
std::bitset is as efficient as bool bit fields. You can wrap it in a class to force using every bit by names defined in an enum. Now you have a small but scalable utility to use for multiple defferent sets of bool flags. C++17 makes it even more convenient:
template<auto last_flag, typename enum_type=decltype(last_flag)>
class bit_flags{
std::bitset<last_flag+1> bits;
//...
};
bit_flags<v_flags::vN+1> my_flags;
I'm learning about bit flags and creating bit fields manually using bit-wise operators. I then came across bitsets, seemingly an easier and cleaner way of storing a field of bits. I understand the value of using bit fields as far as minimizing memory usage. After testing the sizeof(bitset), though, I'm having a hard time understanding how this is a better approach.
Consider:
#include <bitset>
#include <iostream>
int main ()
{
// Using Bit Set, Size = 8 Bytes
const unsigned int i1 = 0;
const unsigned int i2 = 1;
std::bitset<8> mySet(0);
mySet.set(i1);
mySet.set(i2);
std::cout << sizeof(mySet) << std::endl;
// Manually managing bit flags
const unsigned char t1 = 1 << 0;
const unsigned char t2 = 1 << 1;
unsigned char bitField = 0;
bitField |= t1 | t2;
std::cout << sizeof(bitField) << std::endl;
return 0;
}
The output is:
The mySet is 8 bytes. The bitField is 1 byte.
Should I not use std::bitset if minimal memory usage is desired?
For the lowest possible memory footprint you shouldn't use std::bitset. It will likely require more memory than a plain built in type like char or int of equivalent effective size. Thus, it probably has memory overhead, but how much will depend on the implementation.
One major advantage of std::bitset is that it frees you from hardware-dependent implementations of various types. In theory, the hardware can use any representation for any type, as long as it fulfills some requirements in the C++ standard. Thus when you rely on unsigned char t1 = 1 to be 00000001 in memory, this is not actually guaranteed. But if you create a bitset and initialize it properly, it won't give you any nasty surprises.
A sidenote on bitfiddling: considering the pitfalls to fiddling with bits in this way, can you really justify this error-prone method instead of using std::bitset or even types like int and bool? Unless you're extremely resource constrained (e.g. MCU / DSP programming), I don't think you can.
Those who play with bits will be bitten, and those who play with bytes will be bytten.
By the way, the char bitField you declare and manipulate using bitwise operators is a bit field, but it's not the C++ language notion of a bit field, which looks like this:
struct BitField{
unsigned char flag1 : 1, flag2 : 1, flag3 : 1;
}
Loosely speaking, it is a data structure whose data member(s) are subdivided into separate variables. In this case the unsigned char of (presumably) 8 bits is used to create three 1-bit variables (flag1, flag2 and flag3). It is explicitly subdivided, but at the end of the day this is just compiler-/language-assisted bit fiddling similar to what you did above. You can read more about bit fields here.
If, say, a 32-bit integer is overflowing, instead of upgrading int to long, can we make use of some 40-bit type if we need a range only within 240, so that we save 24 (64-40) bits for every integer?
If so, how?
I have to deal with billions and space is a bigger constraint.
Yes, but...
It is certainly possible, but it is usually nonsensical (for any program that doesn't use billions of these numbers):
#include <stdint.h> // don't want to rely on something like long long
struct bad_idea
{
uint64_t var : 40;
};
Here, var will indeed have a width of 40 bits at the expense of much less efficient code generated (it turns out that "much" is very much wrong -- the measured overhead is a mere 1-2%, see timings below), and usually to no avail. Unless you have need for another 24-bit value (or an 8 and 16 bit value) which you wish to pack into the same structure, alignment will forfeit anything that you may gain.
In any case, unless you have billions of these, the effective difference in memory consumption will not be noticeable (but the extra code needed to manage the bit field will be noticeable!).
Note:
The question has in the mean time been updated to reflect that indeed billions of numbers are needed, so this may be a viable thing to do, presumed that you take measures not to lose the gains due to structure alignment and padding, i.e. either by storing something else in the remaining 24 bits or by storing your 40-bit values in structures of 8 each or multiples thereof).
Saving three bytes a billion times is worthwhile as it will require noticeably fewer memory pages and thus cause fewer cache and TLB misses, and above all page faults (a single page fault weighting tens of millions instructions).
While the above snippet does not make use of the remaining 24 bits (it merely demonstrates the "use 40 bits" part), something akin to the following will be necessary to really make the approach useful in a sense of preserving memory -- presumed that you indeed have other "useful" data to put in the holes:
struct using_gaps
{
uint64_t var : 40;
uint64_t useful_uint16 : 16;
uint64_t char_or_bool : 8;
};
Structure size and alignment will be equal to a 64 bit integer, so nothing is wasted if you make e.g. an array of a billion such structures (even without using compiler-specific extensions). If you don't have use for an 8-bit value, you could also use an 48-bit and a 16-bit value (giving a bigger overflow margin).
Alternatively you could, at the expense of usability, put 8 40-bit values into a structure (least common multiple of 40 and 64 being 320 = 8*40). Of course then your code which accesses elements in the array of structures will become much more complicated (though one could probably implement an operator[] that restores the linear array functionality and hides the structure complexity).
Update:
Wrote a quick test suite, just to see what overhead the bitfields (and operator overloading with bitfield refs) would have. Posted code (due to length) at gcc.godbolt.org, test output from my Win7-64 machine is:
Running test for array size = 1048576
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 2 1 35 35 1
uint64_t 0 3 3 35 35 1
bad40_t 0 5 3 35 35 1
packed40_t 0 7 4 48 49 1
Running test for array size = 16777216
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 38 14 560 555 8
uint64_t 0 81 22 565 554 17
bad40_t 0 85 25 565 561 16
packed40_t 0 151 75 765 774 16
Running test for array size = 134217728
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 312 100 4480 4441 65
uint64_t 0 648 172 4482 4490 130
bad40_t 0 682 193 4573 4492 130
packed40_t 0 1164 552 6181 6176 130
What one can see is that the extra overhead of bitfields is neglegible, but the operator overloading with bitfield reference as a convenience thing is rather drastic (about 3x increase) when accessing data linearly in a cache-friendly manner. On the other hand, on random access it barely even matters.
These timings suggest that simply using 64-bit integers would be better since they are still faster overall than bitfields (despite touching more memory), but of course they do not take into account the cost of page faults with much bigger datasets. It might look very different once you run out of physical RAM (I didn't test that).
You can quite effectively pack 4*40bits integers into a 160-bit struct like this:
struct Val4 {
char hi[4];
unsigned int low[4];
}
long getLong( const Val4 &pack, int ix ) {
int hi= pack.hi[ix]; // preserve sign into 32 bit
return long( (((unsigned long)hi) << 32) + (unsigned long)pack.low[i]);
}
void setLong( Val4 &pack, int ix, long val ) {
pack.low[ix]= (unsigned)val;
pack.hi[ix]= (char)(val>>32);
}
These again can be used like this:
Val4[SIZE] vals;
long getLong( int ix ) {
return getLong( vals[ix>>2], ix&0x3 )
}
void setLong( int ix, long val ) {
setLong( vals[ix>>2], ix&0x3, val )
}
You might want to consider Variable-Lenght Encoding (VLE)
Presumably, you have store a lot of those numbers somewhere (in RAM, on disk, send them over the network, etc), and then take them one by one and do some processing.
One approach would be to encode them using VLE.
From Google's protobuf documentation (CreativeCommons licence)
Varints are a method of serializing integers using
one or more bytes. Smaller numbers take a smaller number of bytes.
Each byte in a varint, except the last byte, has the most significant
bit (msb) set – this indicates that there are further bytes to come.
The lower 7 bits of each byte are used to store the two's complement
representation of the number in groups of 7 bits, least significant
group first.
So, for example, here is the number 1 – it's a single byte, so the msb
is not set:
0000 0001
And here is 300 – this is a bit more complicated:
1010 1100 0000 0010
How do you figure out that this is 300? First you drop the msb from
each byte, as this is just there to tell us whether we've reached the
end of the number (as you can see, it's set in the first byte as there
is more than one byte in the varint)
Pros
If you have lots of small numbers, you'll probably use less than 40 bytes per integer, in average. Possibly much less.
You are able to store bigger numbers (with more than 40 bits) in the future, without having to pay a penalty for the small ones
Cons
You pay an extra bit for each 7 significant bits of your numbers. That means a number with 40 significant bits will need 6 bytes. If most of your numbers have 40 significant bits, you are better of with a bit field approach.
You will lose the ability to easily jump to a number given its index (you have to at least partially parse all previous elements in an array in order to access the current one.
You will need some form of decoding before doing anything useful with the numbers (although that is true for other approaches as well, like bit fields)
(Edit: First of all - what you want is possible, and makes sense in some cases; I have had to do similar things when I tried to do something for the Netflix challenge and only had 1GB of memory; Second - it is probably best to use a char array for the 40-bit storage to avoid any alignment issues and the need to mess with struct packing pragmas; Third - this design assumes that you're OK with 64-bit arithmetic for intermediate results, it is only for large array storage that you would use Int40; Fourth: I don't get all the suggestions that this is a bad idea, just read up on what people go through to pack mesh data structures and this looks like child's play by comparison).
What you want is a struct that is only used for storing data as 40-bit ints but implicitly converts to int64_t for arithmetic. The only trick is doing the sign extension from 40 to 64 bits right. If you're fine with unsigned ints, the code can be even simpler. This should be able to get you started.
#include <cstdint>
#include <iostream>
// Only intended for storage, automatically promotes to 64-bit for evaluation
struct Int40
{
Int40(int64_t x) { set(static_cast<uint64_t>(x)); } // implicit constructor
operator int64_t() const { return get(); } // implicit conversion to 64-bit
private:
void set(uint64_t x)
{
setb<0>(x); setb<1>(x); setb<2>(x); setb<3>(x); setb<4>(x);
};
int64_t get() const
{
return static_cast<int64_t>(getb<0>() | getb<1>() | getb<2>() | getb<3>() | getb<4>() | signx());
};
uint64_t signx() const
{
return (data[4] >> 7) * (uint64_t(((1 << 25) - 1)) << 39);
};
template <int idx> uint64_t getb() const
{
return static_cast<uint64_t>(data[idx]) << (8 * idx);
}
template <int idx> void setb(uint64_t x)
{
data[idx] = (x >> (8 * idx)) & 0xFF;
}
unsigned char data[5];
};
int main()
{
Int40 a = -1;
Int40 b = -2;
Int40 c = 1 << 16;
std::cout << "sizeof(Int40) = " << sizeof(Int40) << std::endl;
std::cout << a << "+" << b << "=" << (a+b) << std::endl;
std::cout << c << "*" << c << "=" << (c*c) << std::endl;
}
Here is the link to try it live: http://rextester.com/QWKQU25252
You can use a bit-field structure, but it's not going to save you any memory:
struct my_struct
{
unsigned long long a : 40;
unsigned long long b : 24;
};
You can squeeze any multiple of 8 such 40-bit variables into one structure:
struct bits_16_16_8
{
unsigned short x : 16;
unsigned short y : 16;
unsigned short z : 8;
};
struct bits_8_16_16
{
unsigned short x : 8;
unsigned short y : 16;
unsigned short z : 16;
};
struct my_struct
{
struct bits_16_16_8 a1;
struct bits_8_16_16 a2;
struct bits_16_16_8 a3;
struct bits_8_16_16 a4;
struct bits_16_16_8 a5;
struct bits_8_16_16 a6;
struct bits_16_16_8 a7;
struct bits_8_16_16 a8;
};
This will save you some memory (in comparison with using 8 "standard" 64-bit variables), but you will have to split every operation (and in particular arithmetic ones) on each of these variables into several operations.
So the memory-optimization will be "traded" for runtime-performance.
As the comments suggest, this is quite a task.
Probably an unnecessary hassle unless you want to save alot of RAM - then it makes much more sense. (RAM saving would be the sum total of bits saved across millions of long values stored in RAM)
I would consider using an array of 5 bytes/char (5 * 8 bits = 40 bits). Then you will need to shift bits from your (overflowed int - hence a long) value into the array of bytes to store them.
To use the values, then shift the bits back out into a long and you can use the value.
Then your RAM and file storage of the value will be 40 bits (5 bytes), BUT you must consider data alignment if you plan to use a struct to hold the 5 bytes. Let me know if you need elaboration on this bit shifting and data alignment implications.
Similarly, you could use the 64 bit long, and hide other values (3 chars perhaps) in the residual 24 bits that you do not want to use. Again - using bit shifting to add and remove the 24 bit values.
Another variation that may be helpful would be to use a structure:
typedef struct TRIPLE_40 {
uint32_t low[3];
uint8_t hi[3];
uint8_t padding;
};
Such a structure would take 16 bytes and, if 16-byte aligned, would fit entirely within a single cache line. While identifying which of the parts of the structure to use may be more expensive than it would be if the structure held four elements instead of three, accessing one cache line may be much cheaper than accessing two. If performance is important, one should use some benchmarks since some machines may perform a divmod-3 operation cheaply and have a high cost per cache-line fetch, while others might have have cheaper memory access and more expensive divmod-3.
If you have to deal with billions of integers, I'd try to encapuslate arrays of 40-bit numbers instead of single 40-bit numbers. That way, you can test different array implementations (e.g. an implementation that compresses data on the fly, or maybe one that stores less-used data to disk.) without changing the rest of your code.
Here's a sample implementation (http://rextester.com/SVITH57679):
class Int64Array
{
char* buffer;
public:
static const int BYTE_PER_ITEM = 5;
Int64Array(size_t s)
{
buffer=(char*)malloc(s*BYTE_PER_ITEM);
}
~Int64Array()
{
free(buffer);
}
class Item
{
char* dataPtr;
public:
Item(char* dataPtr) : dataPtr(dataPtr){}
inline operator int64_t()
{
int64_t value=0;
memcpy(&value, dataPtr, BYTE_PER_ITEM); // Assumes little endian byte order!
return value;
}
inline Item& operator = (int64_t value)
{
memcpy(dataPtr, &value, BYTE_PER_ITEM); // Assumes little endian byte order!
return *this;
}
};
inline Item operator[](size_t index)
{
return Item(buffer+index*BYTE_PER_ITEM);
}
};
Note: The memcpy-conversion from 40-bit to 64-bit is basically undefined behavior, as it assumes litte-endianness. It should work on x86-platforms, though.
Note 2: Obviously, this is proof-of-concept code, not production-ready code. To use it in real projects, you'd have to add (among other things):
error handling (malloc can fail!)
copy constructor (e.g. by copying data, add reference counting or by making the copy constructor private)
move constructor
const overloads
STL-compatible iterators
bounds checks for indices (in debug build)
range checks for values (in debug build)
asserts for the implicit assumptions (little-endianness)
As it is, Item has reference semantics, not value semantics, which is unusual for operator[]; You could probably work around that with some clever C++ type conversion tricks
All of those should be straightforward for a C++ programmer, but they would make the sample code much longer without making it clearer, so I've decided to omit them.
I'll assume that
this is C, and
you need a single, large array of 40 bit numbers, and
you are on a machine that is little-endian, and
your machine is smart enough to handle alignment
you have defined size to be the number of 40-bit numbers you need
unsigned char hugearray[5*size+3]; // +3 avoids overfetch of last element
__int64 get_huge(unsigned index)
{
__int64 t;
t = *(__int64 *)(&hugearray[index*5]);
if (t & 0x0000008000000000LL)
t |= 0xffffff0000000000LL;
else
t &= 0x000000ffffffffffLL;
return t;
}
void set_huge(unsigned index, __int64 value)
{
unsigned char *p = &hugearray[index*5];
*(long *)p = value;
p[4] = (value >> 32);
}
It may be faster to handle the get with two shifts.
__int64 get_huge(unsigned index)
{
return (((*(__int64 *)(&hugearray[index*5])) << 24) >> 24);
}
For the case of storing some billions of 40-bit signed integers, and assuming 8-bit bytes, you can pack 8 40-bit signed integers in a struct (in the code below using an array of bytes to do that), and, since this struct is ordinarily aligned, you can then create a logical array of such packed groups, and provide ordinary sequential indexing of that:
#include <limits.h> // CHAR_BIT
#include <stdint.h> // int64_t
#include <stdlib.h> // div, div_t, ptrdiff_t
#include <vector> // std::vector
#define STATIC_ASSERT( e ) static_assert( e, #e )
namespace cppx {
using Byte = unsigned char;
using Index = ptrdiff_t;
using Size = Index;
// For non-negative values:
auto roundup_div( const int64_t a, const int64_t b )
-> int64_t
{ return (a + b - 1)/b; }
} // namespace cppx
namespace int40 {
using cppx::Byte;
using cppx::Index;
using cppx::Size;
using cppx::roundup_div;
using std::vector;
STATIC_ASSERT( CHAR_BIT == 8 );
STATIC_ASSERT( sizeof( int64_t ) == 8 );
const int bits_per_value = 40;
const int bytes_per_value = bits_per_value/8;
struct Packed_values
{
enum{ n = sizeof( int64_t ) };
Byte bytes[n*bytes_per_value];
auto value( const int i ) const
-> int64_t
{
int64_t result = 0;
for( int j = bytes_per_value - 1; j >= 0; --j )
{
result = (result << 8) | bytes[i*bytes_per_value + j];
}
const int64_t first_negative = int64_t( 1 ) << (bits_per_value - 1);
if( result >= first_negative )
{
result = (int64_t( -1 ) << bits_per_value) | result;
}
return result;
}
void set_value( const int i, int64_t value )
{
for( int j = 0; j < bytes_per_value; ++j )
{
bytes[i*bytes_per_value + j] = value & 0xFF;
value >>= 8;
}
}
};
STATIC_ASSERT( sizeof( Packed_values ) == bytes_per_value*Packed_values::n );
class Packed_vector
{
private:
Size size_;
vector<Packed_values> data_;
public:
auto size() const -> Size { return size_; }
auto value( const Index i ) const
-> int64_t
{
const auto where = div( i, Packed_values::n );
return data_[where.quot].value( where.rem );
}
void set_value( const Index i, const int64_t value )
{
const auto where = div( i, Packed_values::n );
data_[where.quot].set_value( where.rem, value );
}
Packed_vector( const Size size )
: size_( size )
, data_( roundup_div( size, Packed_values::n ) )
{}
};
} // namespace int40
#include <iostream>
auto main() -> int
{
using namespace std;
cout << "Size of struct is " << sizeof( int40::Packed_values ) << endl;
int40::Packed_vector values( 25 );
for( int i = 0; i < values.size(); ++i )
{
values.set_value( i, i - 10 );
}
for( int i = 0; i < values.size(); ++i )
{
cout << values.value( i ) << " ";
}
cout << endl;
}
Yes, you can do that, and it will save some space for large quantities of numbers
You need a class that contains a std::vector of an unsigned integer type.
You will need member functions to store and to retrieve an integer. For example, if you want do store 64 integers of 40 bit each, use a vector of 40 integers of 64 bits each. Then you need a method that stores an integer with index in [0,64] and a method to retrieve such an integer.
These methods will execute some shift operations, and also some binary | and & .
I am not adding any more details here yet because your question is not very specific. Do you know how many integers you want to store? Do you know it during compile time? Do you know it when the program starts? How should the integers be organized? Like an array? Like a map? You should know all this before trying to squeeze the integers into less storage.
There are quite a few answers here covering implementation, so I'd like to talk about architecture.
We usually expand 32-bit values to 64-bit values to avoid overflowing because our architectures are designed to handle 64-bit values.
Most architectures are designed to work with integers whose size is a power of 2 because this makes the hardware vastly simpler. Tasks such as caching are much simpler this way: there are a large number of divisions and modulus operations which can be replaced with bit masking and shifts if you stick to powers of 2.
As an example of just how much this matters, The C++11 specification defines multithreading race-cases based on "memory locations." A memory location is defined in 1.7.3:
A memory location is either an object of scalar type or a maximal
sequence of adjacent bit-fields all having non-zero width.
In other words, if you use C++'s bitfields, you have to do all of your multithreading carefully. Two adjacent bitfields must be treated as the same memory location, even if you wish computations across them could be spread across multiple threads. This is very unusual for C++, so likely to cause developer frustration if you have to worry about it.
Most processors have a memory architecture which fetches 32-bit or 64-bit blocks of memory at a time. Thus use of 40-bit values will have a surprising number of extra memory accesses, dramatically affecting run-time. Consider the alignment issues:
40-bit word to access: 32-bit accesses 64bit-accesses
word 0: [0,40) 2 1
word 1: [40,80) 2 2
word 2: [80,120) 2 2
word 3: [120,160) 2 2
word 4: [160,200) 2 2
word 5: [200,240) 2 2
word 6: [240,280) 2 2
word 7: [280,320) 2 1
On a 64 bit architecture, one out of every 4 words will be "normal speed." The rest will require fetching twice as much data. If you get a lot of cache misses, this could destroy performance. Even if you get cache hits, you are going to have to unpack the data and repack it into a 64-bit register to use it (which might even involve a difficult to predict branch).
It is entirely possible this is worth the cost
There are situations where these penalties are acceptable. If you have a large amount of memory-resident data which is well indexed, you may find the memory savings worth the performance penalty. If you do a large amount of computation on each value, you may find the costs are minimal. If so, feel free to implement one of the above solutions. However, here are a few recommendations.
Do not use bitfields unless you are ready to pay their cost. For example, if you have an array of bitfields, and wish to divide it up for processing across multiple threads, you're stuck. By the rules of C++11, the bitfields all form one memory location, so may only be accessed by one thread at a time (this is because the method of packing the bitfields is implementation defined, so C++11 can't help you distribute them in a non-implementation defined manner)
Do not use a structure containing a 32-bit integer and a char to make 40 bytes. Most processors will enforce alignment and you wont save a single byte.
Do use homogenous data structures, such as an array of chars or array of 64-bit integers. It is far easier to get the alignment correct. (And you also retain control of the packing, which means you can divide an array up amongst several threads for computation if you are careful)
Do design separate solutions for 32-bit and 64-bit processors, if you have to support both platforms. Because you are doing something very low level and very ill-supported, you'll need to custom tailor each algorithm to its memory architecture.
Do remember that multiplication of 40-bit numbers is different from multiplication of 64-bit expansions of 40-bit numbers reduced back to 40-bits. Just like when dealing with the x87 FPU, you have to remember that marshalling your data between bit-sizes changes your result.
This begs for streaming in-memory lossless compression. If this is for a Big Data application, dense packing tricks are tactical solutions at best for what seems to require fairly decent middleware or system-level support. They'd need thorough testing to make sure one is able to recover all the bits unharmed. And the performance implications are highly non-trivial and very hardware-dependent because of interference with the CPU caching architecture (e.g. cache lines vs packing structure). Someone mentioned complex meshing structures : these are often fine-tuned to cooperate with particular caching architectures.
It's not clear from the requirements whether the OP needs random access. Given the size of the data it's more likely one would only need local random access on relatively small chunks, organised hierarchically for retrieval. Even the hardware does this at large memory sizes (NUMA). Like lossless movie formats show, it should be possible to get random access in chunks ('frames') without having to load the whole dataset into hot memory (from the compressed in-memory backing store).
I know of one fast database system (kdb from KX Systems to name one but I know there are others) that can handle extremely large datasets by seemlessly memory-mapping large datasets from backing store. It has the option to transparently compress and expand the data on-the-fly.
If what you really want is an array of 40 bit integers (which obviously you can't have), I'd just combine one array of 32 bit and one array of 8 bit integers.
To read a value x at index i:
uint64_t x = (((uint64_t) array8 [i]) << 32) + array32 [i];
To write a value x to index i:
array8 [i] = x >> 32; array32 [i] = x;
Obviously nicely encapsulated into a class using inline functions for maximum speed.
There is one situation where this is suboptimal, and that is when you do truly random access to many items, so that each access to an int array would be a cache miss - here you would get two cache misses every time. To avoid this, define a 32 byte struct containing an array of six uint32_t, an array of six uint8_t, and two unused bytes (41 2/3rd bits per number); the code to access an item is slightly more complicated, but both components of the item are in the same cache line.
I want to store bits in an array (like structure). So I can follow either of the following two approaches
Approach number 1 (AN 1)
struct BIT
{
int data : 1
};
int main()
{
BIT a[100];
return 0;
}
Approach number 2 (AN 2)
int main()
{
std::bitset<100> BITS;
return 0;
}
Why would someone prefer AN 2 over AN 1?
Because approach nr. 2 actually uses 100 bits of storage, plus some very minor (constant) overhead, while nr. 1 typically uses four bytes of storage per Bit structure. In general, a struct is at least one byte large per the C++ standard.
#include <bitset>
#include <iostream>
struct Bit { int data : 1; };
int main()
{
Bit a[100];
std::bitset<100> b;
std::cout << sizeof(a) << "\n";
std::cout << sizeof(b) << "\n";
}
prints
400
16
Apart from this, bitset wraps your bit array in a nice object representation with many useful operations.
A good choice depends on how you're going to use the bits.
std::bitset<N> is of fixed size. Visual C++ 10.0 is non-conforming wrt. to constructors; in general you have to provide a workaround. This was, ironically, due to what Microsoft thought was a bug-fix -- they introduced a constructor taking int argument, as I recall.
std::vector<bool> is optimized in much the same way as std::bitset. Cost: indexing doesn't directly provide a reference (there are no references to individual bits in C++), but instead returns a proxy object -- which isn't something you notice until you try to use it as a reference. Advantage: minimal storage, and the vector can be resized as required.
Simply using e.g. unsigned is also an option, if you're going to deal with a small number of bits (in practice, 32 or less, although the formal guarantee is just 16 bits).
Finally, ALL UPPERCASE identifiers are by convention (except Microsoft) reserved for macros, in order to reduce the probability of name collisions. It's therefore a good idea to not use ALL UPPERCASE identifiers for anything else than macros. And to always use ALL UPPERCASE identifiers for macros (this also makes it easier to recognize them).
Cheers & hth.,
bitset has more operations
Approach number 1 will most likely be compiled as an array of 4-byte integers, and one bit of each will be used to store your data. Theoretically a smart compiler could optimize this, but I wouldn't count on it.
Is there a reason you don't want to use std::bitset?
To quote cplusplus.com's page on bitset, "The class is very similar to a regular array, but optimizing for space allocation". If your ints are 4 bytes, a bitset uses 32 times less space.
Even doing bool bits[100], as sbi suggested, is still worse than bitset, because most implementations have >= 1-byte bools.
If, for reasons of intellectual curiosity only, you wanted to implement your own bitset, you could do so using bit masks:
typedef struct {
unsigned char bytes[100];
} MyBitset;
bool getBit(MyBitset *bitset, int index)
{
int whichByte = index / 8;
return bitset->bytes[whichByte] && (1 << (index = % 8));
}
bool setBit(MyBitset *bitset, int index, bool newVal)
{
int whichByte = index / 8;
if (newVal)
{
bitset->bytes[whichByte] |= (1 << (index = % 8));
}
else
{
bitset->bytes[whichByte] &= ~(1 << (index = % 8));
}
}
(Sorry for using a struct instead of a class by the way. I'm thinking in straight C because I'm in the middle of a low-level assignment for school. Obviously two huge benefits of using a class are operator overloading and the ability to have a variable-sized array.)