String container for large N

I am looking for a string container suited for large number of strings (> 10^9). Strings have variable length. It must be fast for insertions and lookup and have frugal memory use. Strings are unordered when container is filled. Average string length is about 10 bytes. Lookup done on exact string value. Erasability - optional. N is unknown in advance. For 64bit architecture. Use case - think about associative array for AWK.
map<string> has about 20-40 bytes of overhead per string, and each insertion calls malloc once (or twice). So it is neither fast nor frugal.
Can someone point me to a C/C++ library, a data structure or a paper?
Relevant -- Comparison of Hash Table Libraries
EDIT
I've removed "big data", bumped N to larger value, clarified requirements.

There is no silver bullet, but a radix tree gives the advantages of a trie (fast look up and insert, at least asymptotically) - with a better space consumption.
However - both are considered not "cache efficient" - which might be significant especially if iteration over the data is required at some point.

For your problem, a pointer on a 64bit machine nearly matches the length of your data. So using multiple pointers per string in your problem (average length less than 10 bytes) would make the size of the data structure dominate the size of your input.
One general way to deal with that is to not use pointers to represent your strings. A specialized representation using a 32bit offset into a large page where all your strings get stored would halve the pointer memory requirements for you, at the cost of needing to do an addition to a pointer to retrieve your string.
Edit: Below is a sample (untested) implementation of such a representation (using struct for the sake of simplicity, actual implementation would of course only make the user interface public). The representation assumes a hash table insertion, hence leaving room for a next_. Note that the offsets are scaled by size of hash_node to allow for representation in a 32bit offset.
struct hash_node {
uint32_t next_;
char * str () { return (char *)(&next_+1); }
const char * str () const { return (const char *)(&next_+1); }
};
struct hash_node_store {
std::vector<hash_node> page_; /* holds the allocated memory for nodes */
uint32_t free_;
hash_node * ptr (uint32_t offset) {
if (offset == 0) return 0;
return &page_[offset-1];
}
uint32_t allocate (const char *str) {
hash_node *hn = ptr(free_);
uint32_t len = strlen(str) + 1;
uint32_t node_size =
1 + (len / sizeof(hash_node)) + !!(len % sizeof(hash_node));
strcpy(hn->str(), str);
free_ += node_size;
return 1 + (hn - &page_[0]);
}
};
A hash table would contain a node store, and a hash bucket vector.
struct hash_table {
hash_node_store store_;
std::vector<uint32_t> table_; /* holds allocated memory for buckets */
uint32_t hash_func (const char *s) { /* ... */ }
uint32_t find_at (uint32_t node_offset, const char *str);
bool insert_at (uint32_t &node_offset, const char *str);
bool insert (const char *str) {
uint32_t bucket = hash_func(str) % table_.size();
return insert_at(table_[bucket], str);
}
bool find (const char *str) {
uint32_t bucket = hash_func(str) % table_.size();
return find_at(table_[bucket], str);
}
};
Where find_at and insert_at are just simple functions implemented in the expected way.
uint32_t hash_table::find_at (uint32_t node_offset, const char *str) {
hash_node *hn = store_.ptr(node_offset);
while (hn) {
if (strcmp(hn->str(), str) == 0) break;
node_offset = hn->next_;
hn = store_.ptr(node_offset);
}
return node_offset;
}
bool hash_table::insert_at (uint32_t &node_offset, const char *str) {
if (! find_at(node_offset, str)) {
uint32_t new_node = store_.allocate(str);
store_.ptr(new_node)->next_ = node_offset;
node_offset = new_node;
return true;
}
return false;
}

As you're only inserting values, the string data itself can be concatenated as it's inserted - each with a delimiter character such as NUL. The character offset into that single buffer uniquely identifies the string. This means that strings sharing a common substring are each stored in full, with no deduplication; on the other hand, no effort is spent finding or encoding such shared parts, which could backfire for highly unrelated string values (e.g. random text).
To find the strings, a hash table could be used. Given your aim of avoiding frequent dynamic memory allocations, to handle collisions efficiently you'd need to use displacement lists: the idea is that when inserting a string that hashes to an already used bucket, you add an offset (wrapping around the table if necessary) and try that other bucket, continuing until an empty bucket is found. This means you need a list of displacements to try: you can hand-code a finite list to get you started, or even nest loops over a "big displacement" list whose values are added to those from a "small displacement" list until an empty bucket is found, e.g. two hand-coded lists of 10 displacements yield 100 combinations. (Alternative hashing algorithms can be used instead of or combined with displacement lists.) You do need to have a reasonable ratio of total to used buckets though... I'd expect something around 1.2 to work ok typically, with larger values prioritising speed over space - you could populate your system with sample data and tune to taste.
So, the space requirement is:
total_bytes_of_string_data + N delimiters + total_buckets * sizeof(string_offset)
Where sizeof(string_offset) probably needs 8 bytes as 10^9 * 10 is already more than 2^32.
For 10^9 strings of ~10 characters and 1.2*10^9 buckets, this is around 10^9 * (10+1) + 1.2*10^9 * 8 bytes = 20.6*10^9 bytes, or about 19.2 GiB.
It's worth noting that 64 bit virtual address space means you can safely allocate much more space for the concatenated string data and hash table than you actually need, and only those pages actually accessed will require virtual memory (initially physical memory, but it could later be swapped to disk through normal virtual memory mechanisms).
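As a rough illustration of the scheme described above, here is a minimal sketch of a string set that stores all strings in one concatenated buffer and keeps 8-byte offsets in an open-addressed table. The names (StringSet, kEmptyBucket) are made up for this example, the hash is a plain FNV-1a stand-in, the displacement list is reduced to the single displacement 1 (linear probing), and the table has a fixed size with no growth, so treat it as a starting point rather than a finished design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Marker for an unused bucket (hypothetical name, not from any library).
constexpr std::uint64_t kEmptyBucket = UINT64_MAX;

class StringSet {
public:
    explicit StringSet(std::size_t buckets) : table_(buckets, kEmptyBucket) {
        assert(buckets > 1);
    }

    // Returns true if the string was newly inserted, false if already present.
    bool insert(const char* s) {
        const std::size_t b = find_bucket(s);
        if (table_[b] != kEmptyBucket) return false;
        assert(count_ + 1 < table_.size());  // always keep one bucket empty
        table_[b] = data_.size();            // offset of the new string
        data_.insert(data_.end(), s, s + std::strlen(s) + 1);  // copy with NUL
        ++count_;
        return true;
    }

    bool contains(const char* s) const {
        return table_[find_bucket(s)] != kEmptyBucket;
    }

private:
    // FNV-1a, used only as a simple stand-in hash function.
    static std::uint64_t hash(const char* s) {
        std::uint64_t h = 14695981039346656037ull;
        for (; *s; ++s) {
            h ^= static_cast<unsigned char>(*s);
            h *= 1099511628211ull;
        }
        return h;
    }

    // Probes with displacement 1 until it finds s or an empty bucket.
    std::size_t find_bucket(const char* s) const {
        std::size_t b = hash(s) % table_.size();
        while (table_[b] != kEmptyBucket &&
               std::strcmp(&data_[table_[b]], s) != 0) {
            b = (b + 1) % table_.size();
        }
        return b;
    }

    std::vector<char> data_;            // concatenated NUL-terminated strings
    std::vector<std::uint64_t> table_;  // bucket -> offset into data_
    std::size_t count_ = 0;
};
```

The per-string cost here is the string bytes, one delimiter, and 8 bytes per bucket, which matches the space formula above; no allocation happens per insertion apart from occasional growth of the data buffer.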
Discussion
There's no way to guarantee reduction in the string memory usage without assumptions / insights about repetitions in the string data or character set used.
If all the insertions were followed by a huge number of searches, sorting the string data and using binary searches would be an ideal solution. But, for quick insertions interspersed with searches the above is reasonable.
You could also have an index based around balanced binary trees, but to avoid memory allocations for each insertion you'd need to group a lot of nodes into one memory page and manually manage the ordering and splitting thereof on a less granular level: painful to implement. There might be a library doing it already but I haven't heard of one.
You've added "associative arrays in AWK" as an example of what this could be used for. You could simply embed each mapped-to value immediately after its string key in the concatenated data.

Is a (low) false positive rate acceptable? If so, then Bloom filters would be an option. If you'd be content with a false positive rate of one in a million, or 2^(-20), you'd want to use a buffer size in bits of around 30 times the number of strings you expect, or 3*10^10 bits. That's less than 4GB. You'd also need around 20 independent hash functions.
If you can't accept false positives, you should consider putting a Bloom filter in front of whatever other solution you build, to weed out most negatives really quickly.
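The figures above follow from the usual Bloom filter sizing formulas, bits per item = -ln(p) / (ln 2)^2 and number of hash functions = bits per item * ln 2. A small sketch that evaluates them (BloomParams and bloom_params are names invented for this example):

```cpp
#include <cassert>
#include <cmath>

struct BloomParams {
    double bits_per_item;  // m / n
    int hash_count;        // k
};

// Standard sizing: m/n = -ln(p) / (ln 2)^2 and k = (m/n) * ln 2.
BloomParams bloom_params(double false_positive_rate) {
    assert(false_positive_rate > 0.0 && false_positive_rate < 1.0);
    const double ln2 = std::log(2.0);
    const double bits = -std::log(false_positive_rate) / (ln2 * ln2);
    return { bits, static_cast<int>(std::lround(bits * ln2)) };
}
```

For p = 2^(-20) this gives roughly 28.9 bits per string and 20 hash functions, so 10^9 strings need about 2.9*10^10 bits (around 3.6 GB), in line with the estimate of "around 30 times the number of strings".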

Related

Why reserve memory in the structure?

I often see structures in the code, at the end of which there is a memory reserve.
struct STAT_10K4
{
int32_t npos; // position number
...
float Plts;
Pxts;
float Plto [NUM];
uint32_t reserv [(NUM * 3)% 2 + 1];
};
Why do they do this?
Why are some of the reserve values dependent on constants?
What can happen if you do not make such reserves? Or make a mistake in their size?

This is a form of manual padding of a class to make its size a multiple of some number. In your case:
uint32_t reserv [(NUM * 3)% 2 + 1];
NUM * 3 % 2 is actually nonsensical, as it is equivalent to NUM % 2 (not considering overflow). So if NUM is odd, the struct is padded with one additional uint32_t, on top of the one that is always there. This padding means that STAT_10K4's size is always a multiple of 8 bytes.
You will have to consult the documentation of your software to see why exactly this is done. Perhaps padding this struct with up to 8 bytes makes some algorithm easier to implement. Or maybe it has some perceived performance benefit. But this is pure speculation.
Typically, the compiler will pad your structs to 64-bit boundaries if you use any 64-bit types, so you don't need to do this manually.
Note: This answer is specific to mainstream compilers and x86. Obviously this does not apply to compiling for TI-calculators with 20-bit char & co.
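To see the effect of the expression, here is a small sketch using a simplified stand-in for the struct. Stat and reserve_len are hypothetical names, only npos, Plto and reserv are kept from the question, and 4-byte int32_t and float are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Same expression as in the question, wrapped in a function for testing.
constexpr unsigned reserve_len(unsigned num) { return (num * 3) % 2 + 1; }

static_assert(reserve_len(4) == 1, "even NUM: one reserve word");
static_assert(reserve_len(5) == 2, "odd NUM: two reserve words");

// Simplified stand-in for STAT_10K4 (the elided members are left out).
template <unsigned NUM>
struct Stat {
    std::int32_t  npos;
    float         Plto[NUM];
    std::uint32_t reserv[reserve_len(NUM)];
};

// With 4-byte members, the reserve rounds the size up to a multiple of 8.
static_assert(sizeof(Stat<4>) % 8 == 0, "even NUM");
static_assert(sizeof(Stat<5>) % 8 == 0, "odd NUM");
```

Whether the real struct ends up as a multiple of 8 also depends on the members hidden behind the "..." in the question, which this sketch cannot know.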

This would typically be to support variable-length records. A couple of ways this could be used will be:
1 If the maximum number of records is known then a simple structure definition can accommodate all cases.
2 In many protocols there is a "header-data" idiom. The header will be a fixed size but the data variable. The data will be received as a "blob". Thus the structure of the header can be declared and accessed by a pointer to the blob, and the data will follow on from that. For example:
typedef struct
{
uint32_t messageId;
uint32_t dataType;
uint32_t dataLenBytes;
uint8_t data[MAX_PAYLOAD];
}
tsMessageFormat;
The data is received in a blob, so a void* ptr, size_t len.
The buffer pointer is then cast so the message can be read as follows:
tsMessageFormat* pMessage = (tsMessageFormat*) ptr;
for (uint32_t i = 0; i < pMessage->dataLenBytes; i++)
{
//do something with pMessage->data[i];
}
In some languages the "data" could be specified as being an empty record, but C++ does not allow this. Sometimes you will see the "data" omitted and you have to perform pointer arithmetic to access the data.
The alternative to this would be to use a builder pattern and/or streams.
Windows uses this pattern a lot; many structures have a cbSize field which allows additional data to be conveyed beyond the structure. The structure accommodates most cases, but having cbSize allows additional data to be provided if necessary.

Bool Flags vs. Unsigned Char Flags

Disclaimer: Please correct me in the event that I make any false claims in this post.
Consider a struct that contains eight bool member variables.
/*
* Struct uses one byte for each flag.
*/
struct WithBools
{
bool f0 = true;
bool f1 = true;
bool f2 = true;
bool f3 = true;
bool f4 = true;
bool f5 = true;
bool f6 = true;
bool f7 = true;
};
The space allocated to each variable is a byte in length, which seems like a waste if the variables are used solely as flags. One solution to reduce this wasted space, as far as the variables are concerned, is to encapsulate the eight flags into a single member variable of unsigned char.
/*
* Struct uses a single byte for eight flags; retrieval and
* manipulation of data is achieved through accessor functions.
*/
struct WithoutBools
{
unsigned char getFlag(unsigned index)
{
return flags & (1 << (index % 8));
}
void toggleFlag(unsigned index)
{
flags ^= (1 << (index % 8));
}
private:
unsigned char flags = 0xFF;
};
The flags are retrieved and manipulated via bitwise operators, and the struct provides an interface for the user to retrieve and manipulate the flags. While flag sizes have been reduced, we now have the two additional methods that add to the size of the struct. I do not know how to benchmark this difference, therefore I could not be certain of any fluctuation between the above structs.
My questions are:
1) Would the difference in space between these two structs be negligible?
2) Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
3) Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.

we now have the two additional methods that add to the size of the
struct
Methods are code and do not increase the size of the struct. Only data members contribute to the size of the structure.
3) Would a C++ compiler make such an optimisation that compacts a
collection of bools wherever possible and appropriate.
That is a resounding no. The compiler is not allowed to change data types.
1) Would the difference in space between these two structs be
negligible?
No, there definitely is a size difference between the two approaches.
2) Generally, is this approach of "optimising" a collection of bools
by compacting them into a single byte a good idea? Either in an
embedded systems context or otherwise.
Generally yes, the idiomatic way to model flags is with bit-wise manipulation inside an unsigned integer. Depending on the number of flags needed you can use std::uint8_t, std::uint16_t and so on.
However the most common way to model this is not via index as you've done, but via masks.
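A minimal sketch of the mask-based style, assuming three made-up flag names (FlagVisible, FlagEnabled, FlagSelected) and a hypothetical Widget struct:

```cpp
#include <cassert>
#include <cstdint>

// Each flag is a distinct bit; the names are for illustration only.
enum Flags : std::uint8_t {
    FlagVisible  = 1u << 0,
    FlagEnabled  = 1u << 1,
    FlagSelected = 1u << 2,
};

struct Widget {
    std::uint8_t flags = 0;

    // True only if every bit in mask is set.
    bool test(std::uint8_t mask) const {
        assert(mask != 0);  // an empty mask would trivially match
        return (flags & mask) == mask;
    }
    void set(std::uint8_t mask)    { flags |= mask; }
    void clear(std::uint8_t mask)  { flags &= static_cast<std::uint8_t>(~mask); }
    void toggle(std::uint8_t mask) { flags ^= mask; }
};
```

Masks can be combined, so a call such as w.set(FlagVisible | FlagSelected) changes several flags in one operation, which an index-based interface cannot do.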

Would the difference in space between these two structs be negligible?
That depends on how many values you are storing and how much space you have to store them in. The size difference is a factor of 8: one byte versus eight.
Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
Again, it depends on how many values and how much space. Also note that dealing with bits instead of bytes increases code size and execution time.
Many embedded systems have relatively little RAM and plenty of Flash. Code is stored in Flash, so the increased code size can be ignored, and the saved memory could be important on small RAM systems.
Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.
Hypothetically it could. I would consider that an aggressive space optimization, at the expense of execution time.
STL has a specialization for vector<bool> that I frequently avoid for performance reasons - vector<char> is much faster.

If a 32-bit integer overflows, can we use a 40-bit structure instead of a 64-bit long one?

If, say, a 32-bit integer is overflowing, instead of upgrading int to long, can we make use of some 40-bit type if we need a range only within 2^40, so that we save 24 (64-40) bits for every integer?
If so, how?
I have to deal with billions and space is a bigger constraint.

Yes, but...
It is certainly possible, but it is usually nonsensical (for any program that doesn't use billions of these numbers):
#include <stdint.h> // don't want to rely on something like long long
struct bad_idea
{
uint64_t var : 40;
};
Here, var will indeed have a width of 40 bits at the expense of much less efficient code generated (it turns out that "much" is very much wrong -- the measured overhead is a mere 1-2%, see timings below), and usually to no avail. Unless you have need for another 24-bit value (or an 8 and 16 bit value) which you wish to pack into the same structure, alignment will forfeit anything that you may gain.
In any case, unless you have billions of these, the effective difference in memory consumption will not be noticeable (but the extra code needed to manage the bit field will be noticeable!).
Note:
The question has in the meantime been updated to reflect that billions of numbers are indeed needed, so this may be a viable thing to do, provided that you take measures not to lose the gains to structure alignment and padding, i.e. either by storing something else in the remaining 24 bits or by storing your 40-bit values in structures of 8 each (or multiples thereof).
Saving three bytes a billion times is worthwhile as it will require noticeably fewer memory pages and thus cause fewer cache and TLB misses, and above all page faults (a single page fault costing tens of millions of instructions).
While the above snippet does not make use of the remaining 24 bits (it merely demonstrates the "use 40 bits" part), something akin to the following will be necessary to really make the approach useful in a sense of preserving memory -- presumed that you indeed have other "useful" data to put in the holes:
struct using_gaps
{
uint64_t var : 40;
uint64_t useful_uint16 : 16;
uint64_t char_or_bool : 8;
};
Structure size and alignment will be equal to a 64 bit integer, so nothing is wasted if you make e.g. an array of a billion such structures (even without using compiler-specific extensions). If you don't have use for an 8-bit value, you could also use a 48-bit and a 16-bit value (giving a bigger overflow margin).
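Since bit-field layout is implementation-defined, it may be worth checking the size claim at compile time rather than assuming it. A small sketch (set_var is a helper invented for this example):

```cpp
#include <cassert>
#include <cstdint>

struct using_gaps {
    std::uint64_t var           : 40;
    std::uint64_t useful_uint16 : 16;
    std::uint64_t char_or_bool  : 8;
};

// Fails to compile if the compiler does not pack the fields into 8 bytes.
static_assert(sizeof(using_gaps) == sizeof(std::uint64_t),
              "bit-fields were not packed into a single 64-bit unit");

// Stores a value in the 40-bit field, checking the range in debug builds.
inline void set_var(using_gaps& g, std::uint64_t value) {
    assert(value < (std::uint64_t(1) << 40));  // larger values would be truncated
    g.var = value;
}
```

The three fields can then be read and written independently, e.g. g.useful_uint16 = 0xFFFF leaves g.var untouched.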
Alternatively you could, at the expense of usability, put 8 40-bit values into a structure (least common multiple of 40 and 64 being 320 = 8*40). Of course then your code which accesses elements in the array of structures will become much more complicated (though one could probably implement an operator[] that restores the linear array functionality and hides the structure complexity).
Update:
Wrote a quick test suite, just to see what overhead the bitfields (and operator overloading with bitfield refs) would have. Posted code (due to length) at gcc.godbolt.org, test output from my Win7-64 machine is:
Running test for array size = 1048576
what           alloc  seq(w)  seq(r) rand(w) rand(r)    free
-----------------------------------------------------------
uint32_t           0       2       1      35      35       1
uint64_t           0       3       3      35      35       1
bad40_t            0       5       3      35      35       1
packed40_t         0       7       4      48      49       1
Running test for array size = 16777216
what           alloc  seq(w)  seq(r) rand(w) rand(r)    free
-----------------------------------------------------------
uint32_t           0      38      14     560     555       8
uint64_t           0      81      22     565     554      17
bad40_t            0      85      25     565     561      16
packed40_t         0     151      75     765     774      16
Running test for array size = 134217728
what           alloc  seq(w)  seq(r) rand(w) rand(r)    free
-----------------------------------------------------------
uint32_t           0     312     100    4480    4441      65
uint64_t           0     648     172    4482    4490     130
bad40_t            0     682     193    4573    4492     130
packed40_t         0    1164     552    6181    6176     130
What one can see is that the extra overhead of bitfields is negligible, but the operator overloading with bitfield reference as a convenience thing is rather drastic (about 3x increase) when accessing data linearly in a cache-friendly manner. On the other hand, on random access it barely even matters.
These timings suggest that simply using 64-bit integers would be better since they are still faster overall than bitfields (despite touching more memory), but of course they do not take into account the cost of page faults with much bigger datasets. It might look very different once you run out of physical RAM (I didn't test that).

You can quite effectively pack 4*40bits integers into a 160-bit struct like this:
struct Val4 {
signed char hi[4]; // top 8 bits of each value; signed so the sign is kept
unsigned int low[4]; // low 32 bits of each value
};
// Note: assumes a 64-bit long (e.g. Linux x86-64).
long getLong( const Val4 &pack, int ix ) {
int hi= pack.hi[ix]; // preserve sign into 32 bit
return long( (((unsigned long)hi) << 32) + (unsigned long)pack.low[ix]);
}
void setLong( Val4 &pack, int ix, long val ) {
pack.low[ix]= (unsigned)val;
pack.hi[ix]= (signed char)(val>>32);
}
These again can be used like this:
Val4 vals[SIZE];
long getLong( int ix ) {
return getLong( vals[ix>>2], ix&0x3 );
}
void setLong( int ix, long val ) {
setLong( vals[ix>>2], ix&0x3, val );
}

You might want to consider Variable-Length Encoding (VLE)
Presumably, you have to store a lot of those numbers somewhere (in RAM, on disk, send them over the network, etc), and then take them one by one and do some processing.
One approach would be to encode them using VLE.
From Google's protobuf documentation (CreativeCommons licence)
Varints are a method of serializing integers using
one or more bytes. Smaller numbers take a smaller number of bytes.
Each byte in a varint, except the last byte, has the most significant
bit (msb) set – this indicates that there are further bytes to come.
The lower 7 bits of each byte are used to store the two's complement
representation of the number in groups of 7 bits, least significant
group first.
So, for example, here is the number 1 – it's a single byte, so the msb
is not set:
0000 0001
And here is 300 – this is a bit more complicated:
1010 1100 0000 0010
How do you figure out that this is 300? First you drop the msb from
each byte, as this is just there to tell us whether we've reached the
end of the number (as you can see, it's set in the first byte as there
is more than one byte in the varint)
Pros
If you have lots of small numbers, you'll probably use less than 40 bits per integer, on average. Possibly much less.
You are able to store bigger numbers (with more than 40 bits) in the future, without having to pay a penalty for the small ones
Cons
You pay an extra bit for each 7 significant bits of your numbers. That means a number with 40 significant bits will need 6 bytes. If most of your numbers have 40 significant bits, you are better off with a bit field approach.
You will lose the ability to easily jump to a number given its index (you have to at least partially parse all previous elements in an array in order to access the current one).
You will need some form of decoding before doing anything useful with the numbers (although that is true for other approaches as well, like bit fields)
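For reference, a minimal sketch of the varint scheme quoted above, for unsigned values only. The function names put_varint and get_varint are made up for this example and are not the protobuf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends value to out as a base-128 varint, least significant group first.
void put_varint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value | 0x80));  // low 7 bits, msb set
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));  // last byte, msb clear
}

// Decodes one varint starting at pos and advances pos past it.
std::uint64_t get_varint(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t value = 0;
    for (int shift = 0; ; shift += 7) {
        assert(pos < in.size() && shift < 64);  // truncated or overlong input
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) return value;
    }
}
```

Encoding 300 this way yields the two bytes 0xAC 0x02 from the quoted example, and a value with 40 significant bits takes 6 bytes, as noted in the cons above.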

(Edit: First of all - what you want is possible, and makes sense in some cases; I have had to do similar things when I tried to do something for the Netflix challenge and only had 1GB of memory; Second - it is probably best to use a char array for the 40-bit storage to avoid any alignment issues and the need to mess with struct packing pragmas; Third - this design assumes that you're OK with 64-bit arithmetic for intermediate results, it is only for large array storage that you would use Int40; Fourth: I don't get all the suggestions that this is a bad idea, just read up on what people go through to pack mesh data structures and this looks like child's play by comparison).
What you want is a struct that is only used for storing data as 40-bit ints but implicitly converts to int64_t for arithmetic. The only trick is doing the sign extension from 40 to 64 bits right. If you're fine with unsigned ints, the code can be even simpler. This should be able to get you started.
#include <cstdint>
#include <iostream>
// Only intended for storage, automatically promotes to 64-bit for evaluation
struct Int40
{
Int40(int64_t x) { set(static_cast<uint64_t>(x)); } // implicit constructor
operator int64_t() const { return get(); } // implicit conversion to 64-bit
private:
void set(uint64_t x)
{
setb<0>(x); setb<1>(x); setb<2>(x); setb<3>(x); setb<4>(x);
};
int64_t get() const
{
return static_cast<int64_t>(getb<0>() | getb<1>() | getb<2>() | getb<3>() | getb<4>() | signx());
};
uint64_t signx() const
{
return (data[4] >> 7) * (uint64_t(((1 << 25) - 1)) << 39);
};
template <int idx> uint64_t getb() const
{
return static_cast<uint64_t>(data[idx]) << (8 * idx);
}
template <int idx> void setb(uint64_t x)
{
data[idx] = (x >> (8 * idx)) & 0xFF;
}
unsigned char data[5];
};
int main()
{
Int40 a = -1;
Int40 b = -2;
Int40 c = 1 << 16;
std::cout << "sizeof(Int40) = " << sizeof(Int40) << std::endl;
std::cout << a << "+" << b << "=" << (a+b) << std::endl;
std::cout << c << "*" << c << "=" << (c*c) << std::endl;
}
Here is the link to try it live: http://rextester.com/QWKQU25252

You can use a bit-field structure, but it's not going to save you any memory:
struct my_struct
{
unsigned long long a : 40;
unsigned long long b : 24;
};
You can squeeze any multiple of 8 such 40-bit variables into one structure:
struct bits_16_16_8
{
unsigned short x : 16;
unsigned short y : 16;
unsigned short z : 8;
};
struct bits_8_16_16
{
unsigned short x : 8;
unsigned short y : 16;
unsigned short z : 16;
};
struct my_struct
{
struct bits_16_16_8 a1;
struct bits_8_16_16 a2;
struct bits_16_16_8 a3;
struct bits_8_16_16 a4;
struct bits_16_16_8 a5;
struct bits_8_16_16 a6;
struct bits_16_16_8 a7;
struct bits_8_16_16 a8;
};
This will save you some memory (in comparison with using 8 "standard" 64-bit variables), but you will have to split every operation (and in particular arithmetic ones) on each of these variables into several operations.
So the memory-optimization will be "traded" for runtime-performance.

As the comments suggest, this is quite a task.
Probably an unnecessary hassle unless you want to save a lot of RAM - then it makes much more sense. (RAM saving would be the sum total of bits saved across millions of long values stored in RAM)
I would consider using an array of 5 bytes/char (5 * 8 bits = 40 bits). Then you will need to shift bits from your (overflowed int - hence a long) value into the array of bytes to store them.
To use the values, then shift the bits back out into a long and you can use the value.
Then your RAM and file storage of the value will be 40 bits (5 bytes), BUT you must consider data alignment if you plan to use a struct to hold the 5 bytes. Let me know if you need elaboration on this bit shifting and data alignment implications.
Similarly, you could use the 64 bit long, and hide other values (3 chars perhaps) in the residual 24 bits that you do not want to use. Again - using bit shifting to add and remove the 24 bit values.

Another variation that may be helpful would be to use a structure:
typedef struct TRIPLE_40 {
uint32_t low[3];
uint8_t hi[3];
uint8_t padding;
} TRIPLE_40;
Such a structure would take 16 bytes and, if 16-byte aligned, would fit entirely within a single cache line. While identifying which of the parts of the structure to use may be more expensive than it would be if the structure held four elements instead of three, accessing one cache line may be much cheaper than accessing two. If performance is important, one should use some benchmarks since some machines may perform a divmod-3 operation cheaply and have a high cost per cache-line fetch, while others might have cheaper memory access and more expensive divmod-3.
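A sketch of how indexing into an array of such triples might look, for unsigned 40-bit values. Triple40, get40 and set40 are names chosen for this example; the division and modulo by 3 are the "divmod-3" cost mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Triple40 {
    std::uint32_t low[3];   // low 32 bits of each value
    std::uint8_t  hi[3];    // high 8 bits of each value
    std::uint8_t  padding;
};
static_assert(sizeof(Triple40) == 16, "expected three values per 16 bytes");

// Reads unsigned 40-bit value number i from an array of triples.
inline std::uint64_t get40(const Triple40* a, std::size_t i) {
    const Triple40& t = a[i / 3];
    const std::size_t k = i % 3;
    return (static_cast<std::uint64_t>(t.hi[k]) << 32) | t.low[k];
}

// Writes unsigned 40-bit value number i into an array of triples.
inline void set40(Triple40* a, std::size_t i, std::uint64_t v) {
    assert(v < (std::uint64_t(1) << 40));
    Triple40& t = a[i / 3];
    const std::size_t k = i % 3;
    t.low[k] = static_cast<std::uint32_t>(v);
    t.hi[k]  = static_cast<std::uint8_t>(v >> 32);
}
```

This layout spends 16 bytes on three values (about 5.33 bytes each) instead of 5, trading a little space for accesses that never cross a 16-byte boundary.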

If you have to deal with billions of integers, I'd try to encapsulate arrays of 40-bit numbers instead of single 40-bit numbers. That way, you can test different array implementations (e.g. an implementation that compresses data on the fly, or maybe one that stores less-used data to disk) without changing the rest of your code.
Here's a sample implementation (http://rextester.com/SVITH57679):
#include <cstdint> // int64_t
#include <cstdlib> // malloc, free
#include <cstring> // memcpy
class Int64Array
{
char* buffer;
public:
static const int BYTE_PER_ITEM = 5;
Int64Array(size_t s)
{
buffer=(char*)malloc(s*BYTE_PER_ITEM);
}
~Int64Array()
{
free(buffer);
}
class Item
{
char* dataPtr;
public:
Item(char* dataPtr) : dataPtr(dataPtr){}
inline operator int64_t()
{
int64_t value=0;
memcpy(&value, dataPtr, BYTE_PER_ITEM); // Assumes little endian byte order!
return value;
}
inline Item& operator = (int64_t value)
{
memcpy(dataPtr, &value, BYTE_PER_ITEM); // Assumes little endian byte order!
return *this;
}
};
inline Item operator[](size_t index)
{
return Item(buffer+index*BYTE_PER_ITEM);
}
};
Note: The memcpy-conversion from 40-bit to 64-bit is not portable, as it assumes little-endianness. It should work on x86 platforms, though.
Note 2: Obviously, this is proof-of-concept code, not production-ready code. To use it in real projects, you'd have to add (among other things):
error handling (malloc can fail!)
copy constructor (e.g. by copying data, add reference counting or by making the copy constructor private)
move constructor
const overloads
STL-compatible iterators
bounds checks for indices (in debug build)
range checks for values (in debug build)
asserts for the implicit assumptions (little-endianness)
As it is, Item has reference semantics, not value semantics, which is unusual for operator[]; You could probably work around that with some clever C++ type conversion tricks
All of those should be straightforward for a C++ programmer, but they would make the sample code much longer without making it clearer, so I've decided to omit them.

I'll assume that
this is C, and
you need a single, large array of 40 bit numbers, and
you are on a machine that is little-endian, and
your machine is smart enough to handle alignment
you have defined size to be the number of 40-bit numbers you need
unsigned char hugearray[5*size+3]; // +3 avoids overfetch of last element
__int64 get_huge(unsigned index)
{
__int64 t;
t = *(__int64 *)(&hugearray[index*5]);
if (t & 0x0000008000000000LL)
t |= 0xffffff0000000000LL;
else
t &= 0x000000ffffffffffLL;
return t;
}
void set_huge(unsigned index, __int64 value)
{
unsigned char *p = &hugearray[index*5];
*(long *)p = value;
p[4] = (value >> 32);
}
It may be faster to handle the get with two shifts.
__int64 get_huge(unsigned index)
{
return (((*(__int64 *)(&hugearray[index*5])) << 24) >> 24);
}

For the case of storing some billions of 40-bit signed integers, and assuming 8-bit bytes, you can pack 8 40-bit signed integers in a struct (in the code below using an array of bytes to do that), and, since this struct is ordinarily aligned, you can then create a logical array of such packed groups, and provide ordinary sequential indexing of that:
#include <limits.h> // CHAR_BIT
#include <stdint.h> // int64_t
#include <stdlib.h> // div, div_t, ptrdiff_t
#include <vector> // std::vector
#define STATIC_ASSERT( e ) static_assert( e, #e )
namespace cppx {
using Byte = unsigned char;
using Index = ptrdiff_t;
using Size = Index;
// For non-negative values:
auto roundup_div( const int64_t a, const int64_t b )
-> int64_t
{ return (a + b - 1)/b; }
} // namespace cppx
namespace int40 {
using cppx::Byte;
using cppx::Index;
using cppx::Size;
using cppx::roundup_div;
using std::vector;
STATIC_ASSERT( CHAR_BIT == 8 );
STATIC_ASSERT( sizeof( int64_t ) == 8 );
const int bits_per_value = 40;
const int bytes_per_value = bits_per_value/8;
struct Packed_values
{
enum{ n = sizeof( int64_t ) };
Byte bytes[n*bytes_per_value];
auto value( const int i ) const
-> int64_t
{
int64_t result = 0;
for( int j = bytes_per_value - 1; j >= 0; --j )
{
result = (result << 8) | bytes[i*bytes_per_value + j];
}
const int64_t first_negative = int64_t( 1 ) << (bits_per_value - 1);
if( result >= first_negative )
{
result = (int64_t( -1 ) << bits_per_value) | result;
}
return result;
}
void set_value( const int i, int64_t value )
{
for( int j = 0; j < bytes_per_value; ++j )
{
bytes[i*bytes_per_value + j] = value & 0xFF;
value >>= 8;
}
}
};
STATIC_ASSERT( sizeof( Packed_values ) == bytes_per_value*Packed_values::n );
class Packed_vector
{
private:
Size size_;
vector<Packed_values> data_;
public:
auto size() const -> Size { return size_; }
auto value( const Index i ) const
-> int64_t
{
const auto where = div( i, Packed_values::n );
return data_[where.quot].value( where.rem );
}
void set_value( const Index i, const int64_t value )
{
const auto where = div( i, Packed_values::n );
data_[where.quot].set_value( where.rem, value );
}
Packed_vector( const Size size )
: size_( size )
, data_( roundup_div( size, Packed_values::n ) )
{}
};
} // namespace int40
#include <iostream>
auto main() -> int
{
using namespace std;
cout << "Size of struct is " << sizeof( int40::Packed_values ) << endl;
int40::Packed_vector values( 25 );
for( int i = 0; i < values.size(); ++i )
{
values.set_value( i, i - 10 );
}
for( int i = 0; i < values.size(); ++i )
{
cout << values.value( i ) << " ";
}
cout << endl;
}

Yes, you can do that, and it will save some space for large quantities of numbers.
You need a class that contains a std::vector of an unsigned integer type.
You will need member functions to store and to retrieve an integer. For example, if you want to store 64 integers of 40 bits each, use a vector of 40 integers of 64 bits each. Then you need a method that stores an integer with index in [0,64) and a method to retrieve such an integer.
These methods will execute some shift operations, and also some binary | and & .
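A minimal sketch of such a class, assuming unsigned 40-bit values and a size fixed at construction. Packed40 and kMask40 are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kMask40 = (std::uint64_t(1) << 40) - 1;

// Stores n unsigned 40-bit values back to back in 64-bit words.
class Packed40 {
public:
    explicit Packed40(std::size_t n) : size_(n), words_((n * 40 + 63) / 64, 0) {}

    std::size_t word_count() const { return words_.size(); }

    std::uint64_t get(std::size_t i) const {
        assert(i < size_);
        const std::size_t bit = i * 40, w = bit / 64, off = bit % 64;
        std::uint64_t v = words_[w] >> off;
        if (off > 24) v |= words_[w + 1] << (64 - off);  // value straddles two words
        return v & kMask40;
    }

    void set(std::size_t i, std::uint64_t v) {
        assert(i < size_ && v <= kMask40);
        const std::size_t bit = i * 40, w = bit / 64, off = bit % 64;
        words_[w] = (words_[w] & ~(kMask40 << off)) | (v << off);
        if (off > 24) {  // the top (off - 24) bits go into the next word
            const std::size_t in_first = 64 - off;
            words_[w + 1] = (words_[w + 1] & ~(kMask40 >> in_first)) | (v >> in_first);
        }
    }

private:
    std::size_t size_;
    std::vector<std::uint64_t> words_;
};
```

With n = 64 the vector holds exactly 40 words, matching the example above; signed values would additionally need sign extension on retrieval.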
I am not adding any more details here yet because your question is not very specific. Do you know how many integers you want to store? Do you know it during compile time? Do you know it when the program starts? How should the integers be organized? Like an array? Like a map? You should know all this before trying to squeeze the integers into less storage.

There are quite a few answers here covering implementation, so I'd like to talk about architecture.
We usually expand 32-bit values to 64-bit values to avoid overflowing because our architectures are designed to handle 64-bit values.
Most architectures are designed to work with integers whose size is a power of 2 because this makes the hardware vastly simpler. Tasks such as caching are much simpler this way: there are a large number of divisions and modulus operations which can be replaced with bit masking and shifts if you stick to powers of 2.
As an example of just how much this matters, the C++11 specification defines data races for multithreading in terms of "memory locations." A memory location is defined in §1.7/3:
A memory location is either an object of scalar type or a maximal
sequence of adjacent bit-fields all having non-zero width.
In other words, if you use C++'s bitfields, you have to do all of your multithreading carefully. Two adjacent bitfields must be treated as the same memory location, even if you wish computations across them could be spread across multiple threads. This is very unusual for C++, so likely to cause developer frustration if you have to worry about it.
Most processors have a memory architecture which fetches 32-bit or 64-bit blocks of memory at a time. Thus use of 40-bit values will have a surprising number of extra memory accesses, dramatically affecting run-time. Consider the alignment issues:
40-bit word to access    32-bit accesses    64-bit accesses
word 0: [0,40)           2                  1
word 1: [40,80)          2                  2
word 2: [80,120)         2                  1
word 3: [120,160)        2                  2
word 4: [160,200)        2                  2
word 5: [200,240)        2                  1
word 6: [240,280)        2                  2
word 7: [280,320)        2                  1
On a 64-bit architecture, only every other word will be "normal speed." The rest will require fetching twice as much data. If you get a lot of cache misses, this could destroy performance. Even if you get cache hits, you are going to have to unpack the data and repack it into a 64-bit register to use it (which might even involve a difficult-to-predict branch).
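The access counts follow mechanically from the bit ranges, so they can be computed rather than tallied by hand. The sketch below assumes values packed back to back from bit 0 and aligned fetches of a fixed block size; the function name blocks_touched is made up for this example.

```cpp
#include <cstddef>

// Number of aligned blocks of block_bits that overlap the bit range
// [index * value_bits, (index + 1) * value_bits).
constexpr std::size_t blocks_touched(std::size_t index,
                                     std::size_t value_bits,
                                     std::size_t block_bits) {
    return ((index + 1) * value_bits - 1) / block_bits   // block of the last bit
         - (index * value_bits) / block_bits             // block of the first bit
         + 1;
}

// Counts how many of the first `count` values fit in a single block.
inline std::size_t single_access_values(std::size_t count,
                                        std::size_t value_bits,
                                        std::size_t block_bits) {
    std::size_t singles = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (blocks_touched(i, value_bits, block_bits) == 1) {
            ++singles;
        }
    }
    return singles;
}
```

Calling blocks_touched(i, 40, 64) for i from 0 to 7 gives the 64-bit column, and blocks_touched(i, 40, 32) gives the 32-bit column.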
It is entirely possible this is worth the cost
There are situations where these penalties are acceptable. If you have a large amount of memory-resident data which is well indexed, you may find the memory savings worth the performance penalty. If you do a large amount of computation on each value, you may find the costs are minimal. If so, feel free to implement one of the above solutions. However, here are a few recommendations.
Do not use bitfields unless you are ready to pay their cost. For example, if you have an array of bitfields, and wish to divide it up for processing across multiple threads, you're stuck. By the rules of C++11, the bitfields all form one memory location, so may only be accessed by one thread at a time (this is because the method of packing the bitfields is implementation defined, so C++11 can't help you distribute them in a non-implementation defined manner).
Do not use a structure containing a 32-bit integer and a char to make 40 bits. Most processors will enforce alignment and you won't save a single byte.
Do use homogeneous data structures, such as an array of chars or array of 64-bit integers. It is far easier to get the alignment correct. (And you also retain control of the packing, which means you can divide an array up amongst several threads for computation if you are careful)
Do design separate solutions for 32-bit and 64-bit processors, if you have to support both platforms. Because you are doing something very low level and very ill-supported, you'll need to custom tailor each algorithm to its memory architecture.
Do remember that multiplication of 40-bit numbers is different from multiplication of 64-bit expansions of 40-bit numbers reduced back to 40-bits. Just like when dealing with the x87 FPU, you have to remember that marshalling your data between bit-sizes changes your result.
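The padding point in the recommendations above is easy to check on your own compiler. The two struct names below are made up for this sketch, and the exact size of the mixed struct is implementation-defined (8 bytes on the usual x86-64 ABIs), so the code only asserts what the language guarantees.

```cpp
#include <cstddef>
#include <cstdint>

// 5 bytes of payload, but the struct inherits the alignment of its
// 32-bit member, so the compiler pads it (typically to 8 bytes).
struct IntPlusChar {
    std::uint32_t low;
    std::uint8_t  high;
};

// Homogeneous alternative: alignment 1, so no padding is required.
struct FiveBytes {
    unsigned char bytes[5];
};

// Bytes occupied by an array of `count` elements of type T.
template <typename T>
constexpr std::size_t array_bytes(std::size_t count) {
    return sizeof(T) * count;
}

// The size of any type is a multiple of its alignment; this is what
// forces the padding in IntPlusChar.
static_assert(sizeof(IntPlusChar) % alignof(IntPlusChar) == 0,
              "size is always a multiple of alignment");
```

Comparing array_bytes<IntPlusChar>(n) with array_bytes<FiveBytes>(n) shows how much, if anything, the mixed struct saves over plain 64-bit integers on a given platform.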

This begs for streaming in-memory lossless compression. If this is for a Big Data application, dense packing tricks are tactical solutions at best for what seems to require fairly decent middleware or system-level support. They'd need thorough testing to make sure one is able to recover all the bits unharmed. And the performance implications are highly non-trivial and very hardware-dependent because of interference with the CPU caching architecture (e.g. cache lines vs packing structure). Someone mentioned complex meshing structures: these are often fine-tuned to cooperate with particular caching architectures.
It's not clear from the requirements whether the OP needs random access. Given the size of the data it's more likely one would only need local random access on relatively small chunks, organised hierarchically for retrieval. Even the hardware does this at large memory sizes (NUMA). As lossless movie formats show, it should be possible to get random access in chunks ('frames') without having to load the whole dataset into hot memory (from the compressed in-memory backing store).
I know of one fast database system (kdb from KX Systems to name one but I know there are others) that can handle extremely large datasets by seamlessly memory-mapping them from backing store. It has the option to transparently compress and expand the data on-the-fly.

If what you really want is an array of 40 bit integers (which obviously you can't have), I'd just combine one array of 32 bit and one array of 8 bit integers.
To read a value x at index i:
uint64_t x = (((uint64_t) array8 [i]) << 32) + array32 [i];
To write a value x to index i:
array8 [i] = x >> 32; array32 [i] = x;
Obviously nicely encapsulated into a class using inline functions for maximum speed.
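A minimal sketch of such a class might look like the following. The name SplitUint40Array and its interface are assumptions for illustration; it stores unsigned values only and silently drops any bits above bit 39.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper around the two parallel arrays: the low 32 bits
// of each value go into one vector, the high 8 bits into another.
class SplitUint40Array {
public:
    explicit SplitUint40Array(std::size_t count)
        : low_(count), high_(count) {}

    std::size_t size() const { return low_.size(); }

    std::uint64_t get(std::size_t i) const {
        assert(i < size());
        return (static_cast<std::uint64_t>(high_[i]) << 32) | low_[i];
    }

    void set(std::size_t i, std::uint64_t x) {
        assert(i < size());
        low_[i]  = static_cast<std::uint32_t>(x);        // bits 0..31
        high_[i] = static_cast<std::uint8_t>(x >> 32);   // bits 32..39
    }

private:
    std::vector<std::uint32_t> low_;
    std::vector<std::uint8_t>  high_;
};
```

Both accessors are defined inside the class, so they are implicitly inline and a compiler can reduce each call to a couple of loads or stores.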
There is one situation where this is suboptimal, and that is when you do truly random access to many items, so that each access to an int array would be a cache miss - here you would get two cache misses every time. To avoid this, define a 32 byte struct containing an array of six uint32_t, an array of six uint8_t, and two unused bytes (42 2/3 bits per number); the code to access an item is slightly more complicated, but both components of the item are in the same cache line.
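The 32-byte block layout could be sketched as below. The names Block6 and BlockedUint40Array are made up for this example, and the static_assert records the assumption that the compiler adds no padding beyond the two explicit spare bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Six 40-bit values per block: low 32 bits and high 8 bits stored side
// by side, plus two spare bytes to round the block up to 32 bytes.
struct Block6 {
    std::uint32_t low[6];
    std::uint8_t  high[6];
    std::uint8_t  unused[2];
};
static_assert(sizeof(Block6) == 32, "Block6 is expected to occupy 32 bytes");

class BlockedUint40Array {
public:
    explicit BlockedUint40Array(std::size_t count)
        : size_(count), blocks_((count + 5) / 6) {}   // blocks start zeroed

    std::size_t size() const { return size_; }

    std::uint64_t get(std::size_t i) const {
        assert(i < size_);
        const Block6& b = blocks_[i / 6];
        const std::size_t k = i % 6;
        return (static_cast<std::uint64_t>(b.high[k]) << 32) | b.low[k];
    }

    void set(std::size_t i, std::uint64_t x) {
        assert(i < size_);
        Block6& b = blocks_[i / 6];
        const std::size_t k = i % 6;
        b.low[k]  = static_cast<std::uint32_t>(x);
        b.high[k] = static_cast<std::uint8_t>(x >> 32);
    }

private:
    std::size_t size_;
    std::vector<Block6> blocks_;
};
```

For a block never to straddle a cache line, the vector's storage would also need to be 32-byte aligned, which the default allocator does not promise; that part is left out of the sketch.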

Constructing a large multi-dimensional vector array

Right now I have a vector std::vector<char> myVector(4) containing any combination of a set of chars, let's say {#,#,O,*,%,$,!}. The set may be larger or smaller, but not by much, and the vector might not always have 4 members either, but its size will be constant for any one instance.
Now I am stuck trying to create a data structure that can use an arbitrary number of those combinations as an index to another vector.
In pseudo-code I am trying to accomplish:
SomeDataStructure['*']['#']['#']['O'] = someData
(someData is going to be a small class, but that shouldn't matter)
This is an operation critical piece that needs to run quickly, and will be run very often.
Some approaches I've tried to reason about were:
A 4-dimensional array, but I can't access those without numeric indices. Maybe some form of enumeration could solve this. Edit: would maps be a way to do this?
Edit:
I resolved this using a map:
std::map<std::vector<char>, someData> myMap;

Since the number of possible characters is limited to 8, you can use an enumeration instead. You'd therefore only need 3-bits to represent each "character". You can pack several of these 3-bit "characters" into a short integer using bitfields. The resulting packed integer becomes the index into your vector<SomeData>.
The space occupied by this vector would be space_of_SomeData * 2^(3*number_of_spaces). If, for example, number_of_spaces is 4, this results in 4096*space_of_SomeData. This might result in some wasted memory space, but lookups and insertions should be very fast.
Here's some sample code:
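#include <vector>
enum CharSet
{
    ampersand,
    pound,
    letterOh,
    asterisk,
    percent,
    dollar,
    exclamation
};
struct CompositeIndex
{
    union
    {
        struct // Bitfield
        {
            unsigned c0 : 3; // 3 bits
            unsigned c1 : 3; // 3 bits
            unsigned c2 : 3; // 3 bits
            unsigned c3 : 3; // 3 bits
        } chars;
        unsigned int index;
    };
};
unsigned int lookup(CharSet c0, CharSet c1, CharSet c2, CharSet c3)
{
    CompositeIndex ci;
    ci.index = 0; // clear the bits that the four bitfields do not cover
    ci.chars.c0 = c0;
    ci.chars.c1 = c1;
    ci.chars.c2 = c2;
    ci.chars.c3 = c3;
    // Reading index after writing chars relies on union type punning and on
    // bitfields being allocated from the low bits; common compilers on
    // little-endian targets behave this way, but the standard does not promise it.
    return ci.index;
}
typedef int SomeClass;
int main()
{
    std::vector<SomeClass> vec(1u << 12); // one slot per possible 12-bit index
    vec[lookup(ampersand, percent, dollar, pound)] = 42;
}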
If you must absolutely work with char characters, you can easily create a 256-element lookup table that quickly converts 'char' characters into CharSet values.
As already discussed by others, you can use a std::map<std::string, SomeData> or even the possibly faster std::map<std::array<char, 4>, SomeData> (a raw char[4] cannot serve as a map key, because built-in arrays cannot be copied or assigned). If the approximate frequency distribution of different character sequences is known, try inserting the most frequent patterns first in the map. Depending on the internal implementation of the map, this may speed up lookups for the most frequent patterns (they are near the top of the underlying binary search tree).
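The 256-element lookup table mentioned above could be combined with plain shifts instead of bitfields, roughly as follows. The alphabet string is an assumption (the question's set with '&' as the first symbol), and the names kAlphabet, make_code_table and pack_index are made up for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Assumed alphabet; at most 8 symbols fit in a 3-bit code.
const std::string kAlphabet = "&#O*%$!";

// Builds a 256-entry table mapping each char to its 3-bit code,
// or -1 for characters outside the alphabet.
std::array<int, 256> make_code_table() {
    std::array<int, 256> table;
    table.fill(-1);
    for (std::size_t code = 0; code < kAlphabet.size(); ++code) {
        table[static_cast<unsigned char>(kAlphabet[code])] =
            static_cast<int>(code);
    }
    return table;
}

// Packs the characters of key into one integer, 3 bits per character,
// with the first character ending up in the highest bits.
std::size_t pack_index(const std::string& key) {
    static const std::array<int, 256> table = make_code_table();
    std::size_t index = 0;
    for (char c : key) {
        const int code = table[static_cast<unsigned char>(c)];
        assert(code >= 0 && "character is not in the alphabet");
        index = (index << 3) | static_cast<std::size_t>(code);
    }
    return index;
}
```

For 4-character keys the result is always below 4096, so it can index a std::vector<SomeData> of that size directly.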

Does the order of the characters itself have any bearing on what someData might be? If not (and I suspect that to be the case), then it sounds like what you really want is a hashtable matching strings to a small class. Hash functions are quick (O(1)) operations, so performance ought not to be a problem.
Take a look at the std::unordered_map class (std::map is the tree-based, ordered alternative); it should meet your needs.

In C++, a char is a number (typically an 8-bit number). As such, you can theoretically do a 4-D array with those as the indices. The obvious problem with doing that will be that with a total of 4 bytes for indexing, your array ends up with 2^32 entries. If, for example, that someData occupies 32 bits, the array would occupy around 16 gigabytes (of which, apparently, only a minuscule percentage would really be used).
The obvious alternative would be to concatenate the individual characters together into a string, and use that as the key for a map:
std::map<std::string, SomeData_t> mymap;
mymap["*##O"] = someData;
Depending on how often you insert versus look up items, you could consider using an unordered_map instead. This typically gives slightly faster lookup in exchange for slightly slower insertion.
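For comparison, the unordered_map variant might be sketched like this. SomeData here is a stand-in struct and lookup_or_null is a hypothetical helper, neither taken from the question.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for the small class from the question.
struct SomeData {
    int value;
};

using Table = std::unordered_map<std::string, SomeData>;

// Returns a pointer to the entry for key, or nullptr when key is absent.
// Unlike operator[], find() never inserts a default-constructed entry.
const SomeData* lookup_or_null(const Table& table, const std::string& key) {
    const auto it = table.find(key);
    return it == table.end() ? nullptr : &it->second;
}
```

Insertion looks the same as with std::map, for example table["*##O"] = SomeData{42};.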

Copying array of ints vs pointers to bools

I'm working on a program that requires an array to be copied many thousands/millions of times. Right now I have two ways of representing the data in the array:
An array of ints:
int someArray[8][8];
where someArray[a][b] can have a value of 0, 1, or 2, or
An array of pointers to booleans:
bool * someArray[8][8];
where someArray[a][b] can be 0 (null pointer), otherwise *someArray[a][b] can be true (corresponding to 1), or false (corresponding to 2).
Which array would be copied faster (and yes, if I made the pointers to booleans array, I would have to declare new bools every time I copy the array)?

Which would copy faster is beside the point. The overhead of allocating and freeing entries, and dereferencing the pointer to retrieve each value, for your bool* approach will swamp the cost of copying.
If you just have 3 possible values, use an array of char and that will copy 4 times faster than int. OK, that's not a scientifically proven statement but the array will be 4 times smaller.

Actually, both look more or less the same in terms of copying - an array of 32-bit ints vs an array of 32-bit pointers. If you compile as 64-bit, then the pointer would probably be bigger.
BTW, if you store pointers, you probably don't want to have a SEPARATE instance of "bool" for every field of that array, do you? That would be certainly much slower.
If you want a fast copy, reduce the size as much as possible. Either:
use char instead of int, or
devise a custom class with bit manipulations for this array. If you represent one value as two bits - a "null" bit and "value-if-not-null" bit, then you'd need 128 bits = 4 ints for this whole array of 64 values. This would certainly be copied very fast! But the access to any individual bit would be a bit more complex - just a few cycles more.
OK, you made me curious :) I rolled up something like this:
struct BitArray {
public:
    static const int DIMENSION = 8;
    static const int WORDS = DIMENSION*DIMENSION/16; // 16 two-bit cells per 32-bit word
    enum BitValue {
        BitNull = -1,
        BitTrue = 1,
        BitFalse = 0
    };
    // Zero every word, so all cells start out as BitFalse.
    BitArray() {for (int i=0; i<WORDS; ++i) data[i] = 0;}
    BitValue get(int x, int y) const {
        int k = x+y*DIMENSION; // [0 .. 64)
        int n = k/16; // [0 .. 4)
        unsigned bit1 = 1u << ((k%16)*2);   // "null" flag of the cell
        unsigned bit2 = 1u << ((k%16)*2+1); // "value" flag of the cell
        if (data[n] & bit1) return BitNull;
        return (data[n] & bit2) ? BitTrue : BitFalse;
    }
    void set(int x, int y, BitValue value) {
        int k = x+y*DIMENSION; // [0 .. 64)
        int n = k/16; // [0 .. 4)
        unsigned bit1 = 1u << ((k%16)*2);
        unsigned bit2 = 1u << ((k%16)*2+1);
        // set nullbit to 1 if value is BitNull, else 0
        if (value == BitNull) {
            data[n] |= bit1;
        } else {
            data[n] &= ~bit1;
        }
        // set valuebit to 1 if value is BitTrue, else 0
        if (value == BitTrue) {
            data[n] |= bit2;
        } else {
            data[n] &= ~bit2;
        }
    }
private:
    unsigned data[WORDS]; // assumes unsigned is at least 32 bits wide
};
The size of this object for an 8x8 array is 16 bytes, which is a nice improvement compared to 64 bytes with the solution of char array[8][8] and 256 bytes of int array[8][8].
This is probably as low as one can go here without delving into greater magic.

I would say you need to redesign your program. Converting between int x[8][8] and bool *b[8][8] "millions" of times cannot be "right", however lax your definition of "right" is.

The answer to your question will be linked to the size of the data types. Typically bool is one byte while int is not. A pointer varies in length depending on the architecture, but these days is usually 32- or 64-bits.
Not taking caching or other processor-specific optimizations into consideration, the data type that is larger will take longer to copy.
Given that you have three possible states (0, 1, 2) and 64 entries you can represent your entire structure in 128 bits. Using some utility routines and two unsigned 64-bit integers you can efficiently copy your array around very quickly.
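One way the utility routines could look, as a sketch only: the struct name Board and its get/set interface are assumptions, with each cell stored as a 2-bit field inside two 64-bit words.

```cpp
#include <cassert>
#include <cstdint>

// 64 cells x 2 bits = 128 bits, held in two 64-bit words.
struct Board {
    std::uint64_t words[2] = {0, 0};

    // Returns the state (0, 1 or 2) stored at (row, col).
    int get(int row, int col) const {
        const int cell = row * 8 + col;          // [0 .. 64)
        const int shift = (cell % 32) * 2;       // 32 cells per word
        return static_cast<int>((words[cell / 32] >> shift) & 3u);
    }

    // Stores value (0, 1 or 2) at (row, col).
    void set(int row, int col, int value) {
        assert(value >= 0 && value <= 2);
        const int cell = row * 8 + col;
        const int shift = (cell % 32) * 2;
        std::uint64_t& w = words[cell / 32];
        w = (w & ~(std::uint64_t{3} << shift))
          | (static_cast<std::uint64_t>(value) << shift);
    }
};
```

Copying a Board is then an ordinary assignment of 16 bytes, which a compiler can typically turn into one or two register moves.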

I am not 100% sure, but I think they will take roughly the same time, though I prefer using stack allocation (since dynamic allocation might take some time looking for free space).
Consider using the short type instead of int, since you do not need a wide range of numbers.
I think it might be better to use a one-dimensional array if you really want maximum speed, since running the for loops in an order that does not match how the compiler stores multidimensional arrays (row major or column major) could cause a performance penalty!

Without knowing too much about how you use the arrays, this is a possible solution:
#include <cstring>
typedef char Array[8][8];
Array someArray, otherArray;
std::memcpy(someArray, otherArray, sizeof(Array)); // copies otherArray into someArray
These arrays are only 64 bytes and should copy fairly fast. You can change the data type to int but that means copying at least 256 bytes.

"copying" this array with the pointers would require a deep copy, since otherwise changing the copy will affect the original, which is probably not what you want. This is going to slow things down immensely due to the memory allocation overhead.
You can get around this by using boost::optional to represent "optional" quantities - which is the only reason you're adding the level of indirection here. There are very few situations in modern C++ where a raw pointer is really the best thing to be using :) However, since you only need a char to store the values {0, 1, 2} anyway, that will probably be better in terms of space. I am pretty sure that sizeof(boost::optional<bool>) > 1, though I haven't tested it. I would be impressed if they specialized for this :)
You could even bit-pack an array of 2-bit quantities, or use two bit-packed boolean arrays (one "mask" and then another set of actual true-false values) - using std::bitset for example. That will certainly save space and reduce copying time, although it would probably increase access time (assuming you really do need to access one value at a time).
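The two-bitset idea could be sketched as follows. The struct name OptionalBoolGrid and its member functions are made up for illustration; one bitset records whether a cell holds a value at all, the other holds the value itself.

```cpp
#include <bitset>
#include <cstddef>

// An 8x8 grid of "optional bool" cells stored as two 64-bit bitsets.
struct OptionalBoolGrid {
    std::bitset<64> present;  // mask: 1 where the cell holds a value
    std::bitset<64> value;    // the true/false value where present

    static std::size_t index(std::size_t row, std::size_t col) {
        return row * 8 + col;
    }

    void set(std::size_t row, std::size_t col, bool v) {
        present.set(index(row, col));
        value.set(index(row, col), v);
    }

    void clear(std::size_t row, std::size_t col) {
        present.reset(index(row, col));
        value.reset(index(row, col));
    }

    bool has(std::size_t row, std::size_t col) const {
        return present.test(index(row, col));
    }

    // Only meaningful when has(row, col) is true.
    bool get(std::size_t row, std::size_t col) const {
        return value.test(index(row, col));
    }
};
```

Copying the grid is a plain assignment with no allocation, so the deep-copy problem of the pointer version does not arise.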
