Any way to "factor out" common fields to save space?

I have a large array (> millions) of Items, where each Item has the form:
struct Item { void *a; size_t b; };
There are a handful of distinct a fields—meaning there are many items with the same a field.
I would like to "factor" this information out to save about 50% memory usage.
However, the trouble is that these Items have a significant ordering, and that may change over time. Therefore, I can't just go ahead and make a separate Item[] for each distinct a, because that will lose the relative ordering of the items with respect to each other.
On the other hand, if I store the orderings of all the items in a size_t index; field, then I lose any memory savings from the removal of the void *a; field.
So is there a way for me to actually save memory here, or no?
(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way. That one will require me to either use unaligned memory or to split every Item[] into two, which isn't great for memory locality, so I'd prefer something else.)

(Note: I can already think of e.g. using an unsigned char for a to index into a small array, but I'm wondering if there's a better way.)
This thinking is on the right track, but it's not that simple, since you will run into some nasty alignment/padding issues that will negate your memory gains.
At that point, when you start trying to scratch the last few bytes of a structure like this, you will probably want to use bit fields.
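#include <climits> // CHAR_BIT
#include <cstddef> // size_t
#define A_INDEX_BITS 3
struct Item {
    size_t a_index : A_INDEX_BITS; // index into a small table of the distinct a values
    size_t b : (sizeof(size_t) * CHAR_BIT) - A_INDEX_BITS;
};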
Note that this will limit how many bits are available for b, but on modern platforms, where sizeof(size_t) is 8, stripping 3-4 bits from it is rarely an issue.
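To make the lookup side of this concrete, here is a minimal sketch of how the index could be resolved back to the pointer. The names PackedItem, ATable, pack and a_of are made up for illustration and are not part of the question; the sketch also assumes the compiler packs both bit-fields into a single size_t, which is implementation-defined but is what the common compilers do.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical layout: 3 bits select one of up to 8 distinct `a` pointers.
const unsigned A_INDEX_BITS = 3;
const std::size_t MAX_DISTINCT_A = std::size_t(1) << A_INDEX_BITS;

struct PackedItem {
    std::size_t a_index : A_INDEX_BITS;
    std::size_t b : sizeof(std::size_t) * CHAR_BIT - A_INDEX_BITS;
};

// Side table holding the distinct pointers (the name is an assumption).
struct ATable {
    void* values[MAX_DISTINCT_A];
    std::size_t count;
    ATable() : values(), count(0) {}

    // Returns the index of `a`, adding it to the table if it is new.
    std::size_t intern(void* a) {
        for (std::size_t i = 0; i < count; ++i)
            if (values[i] == a) return i;
        assert(count < MAX_DISTINCT_A && "too many distinct a values");
        values[count] = a;
        return count++;
    }
};

inline PackedItem pack(ATable& table, void* a, std::size_t b) {
    PackedItem item;
    item.a_index = table.intern(a);
    item.b = b;  // truncated if b needs more than the remaining bits
    return item;
}

inline void* a_of(const ATable& table, PackedItem item) {
    return table.values[item.a_index];
}
```

The linear scan in intern is only reasonable because there are a handful of distinct a values; reading an item costs one extra indexed load through the table.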

Use a combination of lightweight compression schemes (see this for examples and some references) to represent the a* values. @Frank's answer employs DICT followed by NS, for example. If you have long runs of the same pointer, you could consider RLE (Run-Length Encoding) on top of that.
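As a rough illustration of the RLE step, here is a sketch that run-length encodes the dictionary indices of the a fields while keeping the item order intact. Run, rle_encode and rle_lookup are hypothetical names, and the lookup is a plain linear scan; for fast random access you would presumably keep prefix sums of the run lengths and binary-search them instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One run: `length` consecutive items that share the same dictionary index.
struct Run {
    unsigned char a_index;
    std::size_t length;
};

// Collapse consecutive equal indices into runs.
std::vector<Run> rle_encode(const std::vector<unsigned char>& a_indices) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < a_indices.size(); ++i) {
        if (!runs.empty() && runs.back().a_index == a_indices[i]) {
            ++runs.back().length;
        } else {
            Run r = { a_indices[i], 1 };
            runs.push_back(r);
        }
    }
    return runs;
}

// Dictionary index of item i (linear in the number of runs).
unsigned char rle_lookup(const std::vector<Run>& runs, std::size_t i) {
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (i < runs[r].length) return runs[r].a_index;
        i -= runs[r].length;
    }
    assert(false && "index out of range");
    return 0;
}
```

This only saves space if the runs are long; with items of different a values interleaved, each run costs more than the single byte it replaces.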

This is a bit of a hack, but I've used it in the past with some success. The extra overhead for object access was compensated for by the significant memory reduction.
A typical use case is an environment where (a) values are actually discriminated unions (that is, they include a type indicator) with a limited number of different types and (b) values are mostly kept in large contiguous vectors.
With that environment, it is quite likely that the payload part of (some kinds of) values uses up all the bits allocated for it. It is also possible that the datatype requires (or benefits from) being stored in aligned memory.
In practice, now that aligned access is not required by most mainstream CPUs, I would just use a packed struct instead of the following hack. If you don't pay for unaligned access, then storing a { one-byte type + eight-byte value } as nine contiguous bytes is probably optimal; the only cost is that you need to multiply by 9 instead of 8 for indexed access, and that is trivial since the 9 is a compile-time constant.
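A sketch of the nine-contiguous-bytes layout, for illustration only. Instead of a compiler-specific packed attribute or pragma, this version keeps the records in a plain byte buffer and uses memcpy for the payload, which is a swapped-in portable way to express unaligned access; PackedItems and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Each element is 1 kind byte followed by sizeof(size_t) payload bytes,
// stored back to back with no padding.
class PackedItems {
public:
    static const std::size_t kStride = 1 + sizeof(std::size_t);

    void push_back(unsigned char kind, std::size_t value) {
        std::size_t old = bytes_.size();
        bytes_.resize(old + kStride);
        bytes_[old] = kind;
        std::memcpy(&bytes_[old + 1], &value, sizeof value);
    }

    unsigned char kind(std::size_t i) const {
        assert(i < size());
        return bytes_[i * kStride];
    }

    std::size_t value(std::size_t i) const {
        assert(i < size());
        std::size_t v;
        // memcpy lets the compiler emit an unaligned load where that is legal.
        std::memcpy(&v, &bytes_[i * kStride + 1], sizeof v);
        return v;
    }

    std::size_t size() const { return bytes_.size() / kStride; }

private:
    std::vector<unsigned char> bytes_;
};
```

With an 8-byte size_t the stride is the constant 9, so the index computation is a multiplication by a compile-time constant, as described above.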
If you do have to pay for unaligned access, then the following is possible. Vectors of "augmented" values have the type:
// Assume that Payload has already been typedef'd. In my application,
// it would be a union of, eg., uint64_t, int64_t, double, pointer, etc.
// In your application, it would be b.
// Eight-byte payload version:
typedef struct Chunk8 { uint8_t kind[8]; Payload value[8]; } Chunk8;
// Four-byte payload version:
typedef struct Chunk4 { uint8_t kind[4]; Payload value[4]; } Chunk4;
Vectors are then vectors of Chunks. For the hack to work, they must be allocated on 8- (or 4-)byte aligned memory addresses, but we've already assumed that alignment is required for the Payload types.
The key to the hack is how we represent a pointer to an individual value, because the value is not contiguous in memory. We use a pointer to its kind member as a proxy:
typedef uint8_t* ValuePointer;
And then use the following low-but-not-zero-overhead functions:
#define P_SIZE 8U
#define P_MASK (P_SIZE - 1U)
// Internal function used to get the low-order bits of a ValuePointer.
static inline size_t vpMask(ValuePointer vp) {
return (uintptr_t)vp & P_MASK;
}
// Getters / setters. This version returns the address so it can be
// used both as a getter and a setter
static inline uint8_t* kindOf(ValuePointer vp) { return vp; }
static inline Payload* valueOf(ValuePointer vp) {
return (Payload*)(vp + 1 + (vpMask(vp) + 1) * (P_SIZE - 1));
}
// Increment / Decrement
static inline ValuePointer inc(ValuePointer vp) {
return vpMask(++vp) ? vp : vp + P_SIZE * P_SIZE;
}
static inline ValuePointer dec(ValuePointer vp) {
// If vp was at the start of a chunk, also step back over the previous chunk's values.
return vpMask(vp--) ? vp : vp - P_SIZE * P_SIZE;
}
// Simple indexed access from a Chunk pointer (Chunk stands for Chunk8 or Chunk4, matching P_SIZE)
static inline ValuePointer eltk(Chunk* ch, size_t k) {
return &ch[k / P_SIZE].kind[k % P_SIZE];
}
// Increment a value pointer by an arbitrary (non-negative) amount
static inline ValuePointer inck(ValuePointer vp, size_t k) {
size_t off = vpMask(vp);
return eltk((Chunk*)(vp - off), k + off);
}
I left out a bunch of the other hacks but I'm sure you can figure them out.
One cool thing about interleaving the pieces of the value is that it has moderately good locality of reference. For the 8-byte version, almost half of the time a random access to a kind and a value will only hit one 64-byte cacheline; the rest of the time two consecutive cachelines are hit, with the result that walking forwards (or backwards) through a vector is just as cache-friendly as walking through an ordinary vector, except that it uses fewer cachelines because the objects are half the size. The four byte version is even cache-friendlier.

I think I figured out the information-theoretically-optimal way to do this myself... it's not quite worth the gains in my case, but I'll explain it here in case it helps someone else.
However, it requires unaligned memory (in some sense).
And perhaps more importantly, you lose the ability to easily add new values of a dynamically.
What really matters here is the number of distinct Items, i.e. the number of distinct (a,b) pairs. After all, it could be that for one a there are a billion different bs, but for the other ones there are only a handful, so you want to take advantage of that.
If we assume that there are N distinct items to choose from, then we need n = ceil(log2(N)) bits to represent each Item. So what we really want is an array of n-bit integers, with n computed at run time. Then, once you get the n-bit integer, you can do a binary search over the distinct as (logarithmic in their number) to figure out which a it corresponds to, based on your knowledge of the count of bs for each a. (This may be a bit of a performance hit, but it depends on the number of distinct as.)
You can't do this in a nice memory-aligned fashion, but that isn't too bad. What you would do is make a uint_vector data structure with the number of bits per element being a dynamically-specifiable quantity. Then, to randomly access into it, you'd do a few divisions or mod operations along with bit-shifts to extract the required integer.
The caveat here is that the dividing by a variable will probably severely damage your random-access performance (although it'll still be O(1)). The way to mitigate that would probably be to write a few different procedures for common values of n (C++ templates help here!) and then branch into them with various if (n == 33) { handle_case<33>(i); } or switch (n) { case 33: handle_case<33>(i); }, etc. so that the compiler sees the divisor as a constant and generates shifts/adds/multiplies as needed, rather than division.
This is information-theoretically optimal as long as you require a constant number of bits per element, which is what you would want for random-accessing. However, you could do better if you relax that constraint: you could pack multiple integers into k * n bits, then extract them with more math. This will probably kill performance too.
(Or, long story short: C and C++ really need a high-performance uint_vector data structure...)
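For what it's worth, here is a minimal sketch of what such a uint_vector could look like, assuming 64-bit storage words and a bit width between 1 and 64 chosen at construction time. The class name UintVector and its interface are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Fixed-length array of `bits`-wide unsigned integers packed into 64-bit words.
class UintVector {
public:
    UintVector(std::size_t count, unsigned bits)
        : bits_(bits), count_(count), words_((count * bits + 63) / 64, 0) {
        assert(bits >= 1 && bits <= 64);
    }

    uint64_t get(std::size_t i) const {
        assert(i < count_);
        std::size_t bit = i * bits_;
        std::size_t word = bit / 64;
        unsigned shift = static_cast<unsigned>(bit % 64);
        uint64_t v = words_[word] >> shift;
        if (shift + bits_ > 64)  // element straddles two words
            v |= words_[word + 1] << (64 - shift);
        return v & mask();
    }

    void set(std::size_t i, uint64_t value) {
        assert(i < count_ && (value & ~mask()) == 0);
        std::size_t bit = i * bits_;
        std::size_t word = bit / 64;
        unsigned shift = static_cast<unsigned>(bit % 64);
        words_[word] = (words_[word] & ~(mask() << shift)) | (value << shift);
        if (shift + bits_ > 64) {
            unsigned fit = 64 - shift;  // bits that fit into the first word
            words_[word + 1] =
                (words_[word + 1] & ~(mask() >> fit)) | (value >> fit);
        }
    }

    std::size_t size() const { return count_; }

private:
    uint64_t mask() const {
        return bits_ == 64 ? ~uint64_t(0) : (uint64_t(1) << bits_) - 1;
    }

    unsigned bits_;
    std::size_t count_;
    std::vector<uint64_t> words_;
};
```

Note that this particular layout multiplies the index by the bit width and then divides by the constant 64, so it avoids dividing by a run-time value; whether that is fast enough for a given workload would have to be measured.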

A Structure-of-Arrays approach may be helpful. That is, have three vectors...
vector<A> vec_a;
vector<B> vec_b;
SomeType b_to_a_map;
You access your data as...
Item Get(int index)
{
Item retval;
retval.a = vec_a[b_to_a_map[index]];
retval.b = vec_b[index];
return retval;
}
Now all you need to do is choose something sensible for SomeType. For example, if vec_a.size() were 2, you could use vector<bool> or boost::dynamic_bitset. For more complex cases you could try bit-packing; for example, to support 4 values of A, we simply change our function to use...
int a_index = b_to_a_map[index*2]*2 + b_to_a_map[index*2+1];
retval.a = vec_a[a_index];
You can always beat bit-packing by using range-packing, using div/mod to store a fractional bit length per item, but the complexity grows quickly.
A good guide can be found here http://number-none.com/product/Packing%20Integers/index.html
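To give a feel for range-packing, here is a small sketch for the case of three distinct values of A. Since 3^5 = 243 fits in a byte, five indices can share one byte (1.6 bits each instead of 2 with plain bit-packing). The names and the choice of radix 3 are assumptions for the example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const unsigned kRadix = 3;    // number of distinct A values (assumed)
const unsigned kPerByte = 5;  // 3^5 = 243 <= 256, so 5 digits fit in a byte

// Read the base-3 digit stored for item `index`.
unsigned get_digit(const std::vector<unsigned char>& packed, std::size_t index) {
    unsigned v = packed[index / kPerByte];
    for (std::size_t k = index % kPerByte; k > 0; --k) v /= kRadix;
    return v % kRadix;
}

// Overwrite the base-3 digit stored for item `index`.
void set_digit(std::vector<unsigned char>& packed, std::size_t index, unsigned digit) {
    assert(digit < kRadix);
    unsigned place = 1;  // kRadix raised to the digit's position in the byte
    for (std::size_t k = index % kPerByte; k > 0; --k) place *= kRadix;
    unsigned char& byte = packed[index / kPerByte];
    unsigned old = (byte / place) % kRadix;
    byte = static_cast<unsigned char>(byte - old * place + digit * place);
}
```

In Get above, b_to_a_map[index] would then become get_digit(b_to_a_map, index). The divisions are by constants, but the access is clearly more work than a shift and a mask.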

Related

Dealing with a contiguous vector of fixed-size matrices for both storage layouts in Eigen

An external library gives me a raw pointer of doubles that I want to map to an Eigen type. The raw array is logically a big ordered collection of small dense fixed-size matrices, all of the same size. The main issue is that the small dense matrices may be in row-major or column-major ordering and I want to accommodate them both.
My current approach is as follows. Note that all the entries of a small fixed-size block (in the array of blocks) need to be contiguous in memory.
template<int bs, class Mattype>
void block_operation(double *const vals, const int numblocks)
{
Eigen::Map<Mattype> mappedvals(vals,
Mattype::IsRowMajor ? numblocks*bs : bs,
Mattype::IsRowMajor ? bs : numblocks*bs
);
for(int i = 0; i < numblocks; i++)
if(Mattype::IsRowMajor)
mappedvals.template block<bs,bs>(i*bs,0) = block_operation_rowmajor(mappedvals);
else
mappedvals.template block<bs,bs>(0,i*bs) = block_operation_colmajor(mappedvals);
}
The calling function first figures out the Mattype (out of 2 options) and then calls the above function with the correct template parameter.
Thus all my algorithms need to be written twice and my code is interspersed with these layout checks. Is there a way to do this in a layout-agnostic way? Keep in mind that this code needs to be as fast as possible.
Ideally, I would Map the data just once and use it for all the operations needed. However, the only solution I could come up with was invoking the Map constructor once for every small block, whenever I need to access the block.
template<int bs, StorageOptions layout>
inline Map<Matrix<double,bs,bs,layout>> extractBlock(double *const vals,
const int bindex)
{
return Map<Matrix<double,bs,bs,layout>>(vals+bindex*bs*bs);
}
Would this function be optimized away to nothing (by a modern compiler like GCC 7.3 or Intel 2017 under -std=c++14 -O3), or would I be paying a small penalty every time I invoke this function (once for each block, and there are a LOT of small blocks)? Is there a better way to do this?

Your extractBlock is fine. A simpler but somewhat uglier solution is to use a reinterpret cast at the start of block_operation:
using BlockType = Matrix<double,bs,bs,layout|DontAlign>;
BlockType* blocks = reinterpret_cast<BlockType*>(vals);
for(int i...)
blocks[i] = ...;
This will work for fixed-size matrices only. Also note the DontAlign, which is important unless you can guarantee that vals is aligned on 16 or even 32 bytes, depending on the presence of AVX and on bs... so just use DontAlign!

What is the performance of std::bitset?

I recently asked a question on Programmers regarding reasons to use manual bit manipulation of primitive types over std::bitset.
From that discussion I have concluded that the main reason is its comparatively poorer performance, although I'm not aware of any measured basis for this opinion. So next question is:
what is the performance hit, if any, likely to be incurred by using std::bitset over bit-manipulation of a primitive?
The question is intentionally broad, because after looking online I haven't been able to find anything, so I'll take what I can get. Basically I'm after a resource that provides some profiling of std::bitset vs 'pre-bitset' alternatives to the same problems on some common machine architecture using GCC, Clang and/or VC++. There is a very comprehensive paper which attempts to answer this question for bit vectors:
http://www.cs.up.ac.za/cs/vpieterse/pub/PieterseEtAl_SAICSIT2010.pdf
Unfortunately, it either predates std::bitset or considered it out of scope, so it focuses on vectors/dynamic array implementations instead.
I really just want to know whether std::bitset is better than the alternatives for the use cases it is intended to solve. I already know that it is easier and clearer than bit-fiddling on an integer, but is it as fast?

Update
It's been ages since I posted this one, but:
I already know that it is easier and clearer than bit-fiddling on an
integer, but is it as fast?
If you are using bitset in a way that does actually make it clearer and cleaner than bit-fiddling, like checking for one bit at a time instead of using a bit mask, then inevitably you lose all those benefits that bitwise operations provide, like being able to check to see if 64 bits are set at one time against a mask, or using FFS instructions to quickly determine which bit is set among 64-bits.
I'm not sure that bitset incurs a penalty to use in all ways possible (ex: using its bitwise operator&), but if you use it like a fixed-size boolean array which is pretty much the way I always see people using it, then you generally lose all those benefits described above. We unfortunately can't get that level of expressiveness of just accessing one bit at a time with operator[] and have the optimizer figure out all the bitwise manipulations and FFS and FFZ and so forth going on for us, at least not since the last time I checked (otherwise bitset would be one of my favorite structures).
Now if you are going to use bitset<N> bits interchangeably with like, say, uint64_t bits[N/64] as in accessing both the same way using bitwise operations, it might be on par (haven't checked since this ancient post). But then you lose many of the benefits of using bitset in the first place.
for_each method
In the past I got into some misunderstandings, I think, when I proposed a for_each method to iterate through things like vector<bool>, deque, and bitset. The point of such a method is to utilize the internal knowledge of the container to iterate through elements more efficiently while invoking a functor, just as some associative containers offer a find method of their own instead of using std::find to do a better than linear-time search.
For example, you can iterate through all set bits of a vector<bool> or bitset if you had internal knowledge of these containers by checking for 64 elements at a time using a 64-bit mask when 64 contiguous indices are occupied, and likewise use FFS instructions when that's not the case.
But an iterator design having to do this type of scalar logic in operator++ would inevitably have to do something considerably more expensive, just by the nature in which iterators are designed in these peculiar cases. bitset lacks iterators outright, and that often leads people who use it to avoid bitwise logic to check each bit individually with operator[] in a sequential loop that just wants to find out which bits are set. That too is not nearly as efficient as what a for_each method implementation could do.
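Since neither bitset nor vector<bool> exposes its words, here is a sketch of the idea on a plain vector of 64-bit words. for_each_set_bit, lowest_set_bit and set_bit_indices are hypothetical names; the GCC/Clang builtin is used where available, with a slow portable fallback otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

// Index of the lowest set bit of a non-zero word (an FFS-style operation).
inline unsigned lowest_set_bit(uint64_t w) {
    assert(w != 0);
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_ctzll(w));
#else
    unsigned n = 0;
    while ((w & 1) == 0) { w >>= 1; ++n; }
    return n;
#endif
}

// Calls f(index) for every set bit, skipping whole zero words at once.
template <class Func>
void for_each_set_bit(const std::vector<uint64_t>& words, Func f) {
    for (std::size_t wi = 0; wi < words.size(); ++wi) {
        uint64_t w = words[wi];
        while (w != 0) {
            f(wi * 64 + lowest_set_bit(w));
            w &= w - 1;  // clear the lowest set bit
        }
    }
}

// Example functor: collects the indices into a vector.
struct CollectIndices {
    std::vector<std::size_t>* out;
    void operator()(std::size_t i) const { out->push_back(i); }
};

inline std::vector<std::size_t> set_bit_indices(const std::vector<uint64_t>& words) {
    std::vector<std::size_t> result;
    CollectIndices collect = { &result };
    for_each_set_bit(words, collect);
    return result;
}
```

The loop body runs once per set bit rather than once per bit, which is the kind of saving a flat operator[] loop cannot express.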
Double/Nested Iterators
Another alternative to the for_each container-specific method proposed above would be to use double/nested iterators: that is, an outer iterator which points to a sub-range of a different type of iterator. Client code example:
for (auto outer_it = bitset.nbegin(); outer_it != bitset.nend(); ++outer_it)
{
for (auto inner_it = outer_it->first; inner_it != outer_it->last; ++inner_it)
// do something with *inner_it (bit index)
}
While not conforming to the flat type of iterator design available now in standard containers, this can allow some very interesting optimizations. As an example, imagine a case like this:
bitset<64> bits = 0x1fbf; // 0b1111110111111;
In that case, the outer iterator can, with just a few bitwise operations (FFZ/or/complement), deduce that the first range of bits to process would be bits [0, 6), at which point we can iterate through that sub-range very cheaply through the inner/nested iterator (it would just increment an integer, making ++inner_it equivalent to just ++int). Then when we increment the outer iterator, it can then very quickly, and again with a few bitwise instructions, determine that the next range would be [7, 13). After we iterate through that sub-range, we're done. Take this as another example:
bitset<16> bits = 0xffff;
In such a case, the first and last sub-range would be [0, 16), and the bitset could determine that with a single bitwise instruction at which point we can iterate through all set bits and then we're done.
This type of nested iterator design would map particularly well to vector<bool>, deque, and bitset as well as other data structures people might create like unrolled lists.
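As a sketch of what the outer iterator would compute, the following function extracts the half-open ranges of consecutive set bits from a single 64-bit word. It returns them in a vector rather than through a real iterator, purely to keep the example short; BitRange, set_bit_ranges and trailing_zeros are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <vector>

struct BitRange { unsigned first, last; };  // half-open range [first, last)

// Number of trailing zero bits of a non-zero word.
inline unsigned trailing_zeros(uint64_t w) {
    assert(w != 0);
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_ctzll(w));
#else
    unsigned n = 0;
    while ((w & 1) == 0) { w >>= 1; ++n; }
    return n;
#endif
}

// Splits a word into its runs of consecutive set bits.
std::vector<BitRange> set_bit_ranges(uint64_t w) {
    std::vector<BitRange> ranges;
    unsigned base = 0;  // bit position that bit 0 of w currently corresponds to
    while (w != 0) {
        unsigned zeros = trailing_zeros(w);  // skip the gap before the run
        w >>= zeros;
        base += zeros;
        // Find-first-zero: the run length is the trailing-zero count of ~w.
        unsigned ones = (~w == 0) ? 64u : trailing_zeros(~w);
        BitRange r = { base, base + ones };
        ranges.push_back(r);
        if (ones == 64) break;  // the whole word was set
        w >>= ones;
        base += ones;
    }
    return ranges;
}
```

For 0x1fbf this yields [0, 6) and [7, 13), and for 0xffff the single range [0, 16), matching the two examples above. An inner loop over each range is then just an integer increment.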
I say that in a way that goes beyond just armchair speculation, since I have a set of data structures which resemble the likes of deque which are actually on par with sequential iteration of vector (still noticeably slower for random-access, especially if we're just storing a bunch of primitives and doing trivial processing). However, to achieve the comparable times to vector for sequential iteration, I had to use these types of techniques (for_each method and double/nested iterators) to reduce the amount of processing and branching going on in each iteration. I could not rival the times otherwise using just the flat iterator design and/or operator[]. And I'm certainly not smarter than the standard library implementers but came up with a deque-like container which can be sequentially iterated much faster, and that strongly suggests to me that it's an issue with the standard interface design of iterators in this case which come with some overhead in these peculiar cases that the optimizer cannot optimize away.
Old Answer
I'm one of those who would give you a similar performance answer, but I'll try to give you something a bit more in-depth than "just because". It is something I came across through actual profiling and timing, not merely distrust and paranoia.
One of the biggest problems with bitset and vector<bool> is that their interface design is "too convenient" if you want to use them like an array of booleans. Optimizers are great at obliterating all that structure you establish to provide safety, reduce maintenance cost, make changes less intrusive, etc. They do an especially fine job with selecting instructions and allocating the minimal number of registers to make such code run as fast as the not-so-safe, not-so-easy-to-maintain/change alternatives.
The part that makes the bitset interface "too convenient" at the cost of efficiency is the random-access operator[] as well as the iterator design for vector<bool>. When you access one of these at index n, the code has to first figure out which byte the nth bit belongs to, and then the sub-index to the bit within that. That first phase typically involves a division/rshifts against an lvalue along with modulo/bitwise and which is more costly than the actual bit operation you're trying to perform.
The iterator design for vector<bool> faces a similar awkward dilemma where it either has to branch into different code every 8+ times you iterate through it or pay that kind of indexing cost described above. If the former is done, it makes the logic asymmetrical across iterations, and iterator designs tend to take a performance hit in those rare cases. To exemplify, if vector had a for_each method of its own, you could iterate through, say, a range of 64 elements at once by just masking the bits against a 64-bit mask for vector<bool> if all the bits are set without checking each bit individually. It could even use FFS to figure out the range all at once. An iterator design would tend to inevitably have to do it in a scalar fashion or store more state which has to be redundantly checked every iteration.
For random access, optimizers can't seem to optimize away this indexing overhead to figure out which byte and relative bit to access (perhaps a bit too runtime-dependent) when it's not needed, and you tend to see significant performance gains with that more manual code processing bits sequentially with advanced knowledge of which byte/word/dword/qword it's working on. It's somewhat of an unfair comparison, but the difficulty with std::bitset is that there's no way to make a fair comparison in such cases where the code knows what byte it wants to access in advance, and more often than not, you tend to have this info in advance. It's an apples to orange comparison in the random-access case, but you often only need oranges.
Perhaps that wouldn't be the case if the interface design involved a bitset where operator[] returned a proxy, requiring a two-index access pattern to use. For example, in such a case, you would set bits 6 and 7 by writing bitset[0][6] = true; bitset[0][7] = true; with a template parameter to indicate the size of the proxy (64-bits, e.g.). A good optimizer may be able to take such a design and make it rival the manual, old school kind of way of doing the bit manipulation by hand by translating that into: bitset |= 0xC0;
Another design that might help is if bitsets provided a for_each_bit kind of method, passing a bit proxy to the functor you provide. That might actually be able to rival the manual method.
std::deque has a similar interface problem. Its performance shouldn't be that much slower than std::vector for sequential access. Yet unfortunately we access it sequentially using operator[], which is designed for random access, or through an iterator, and the internal rep of deques simply doesn't map very efficiently to an iterator-based design. If deque provided a for_each kind of method of its own, then it could potentially start to get a lot closer to std::vector's sequential access performance. These are some of the rare cases where that Sequence interface design comes with some efficiency overhead that optimizers often can't obliterate. Often good optimizers can make convenience come free of runtime cost in a production build, but unfortunately not in all cases.
Sorry!
Also sorry, in retrospect I wandered a bit with this post talking about vector<bool> and deque in addition to bitset. It's because we had a codebase where the use of these three, and particularly iterating through them or using them with random-access, were often hotspots.
Apples to Oranges
As emphasized in the old answer, comparing straightforward usage of bitset to primitive types with low-level bitwise logic is comparing apples to oranges. It's not like bitset is implemented very inefficiently for what it does. If you genuinely need to access a bunch of bits with a random access pattern which, for some reason or other, needs to check and set just one bit a time, then it might be ideally implemented for such a purpose. But my point is that almost all use cases I've encountered didn't require that, and when it's not required, the old school way involving bitwise operations tends to be significantly more efficient.

Did a short test profiling std::bitset vs bool arrays for sequential and random access - you can too:
#include <iostream>
#include <bitset>
#include <cstdlib> // rand
#include <ctime> // timer
inline unsigned long get_time_in_ms()
{
return (unsigned long)((double(clock()) / CLOCKS_PER_SEC) * 1000);
}
void one_sec_delay()
{
unsigned long end_time = get_time_in_ms() + 1000;
while(get_time_in_ms() < end_time)
{
}
}
int main(int argc, char **argv)
{
srand(get_time_in_ms());
using namespace std;
bitset<5000000> bits;
bool *bools = new bool[5000000];
unsigned long current_time, difference1, difference2;
double total;
one_sec_delay();
total = 0;
current_time = get_time_in_ms();
for (unsigned int num = 0; num != 200000000; ++num)
{
bools[rand() % 5000000] = rand() % 2;
}
difference1 = get_time_in_ms() - current_time;
current_time = get_time_in_ms();
for (unsigned int num2 = 0; num2 != 100; ++num2)
{
for (unsigned int num = 0; num != 5000000; ++num)
{
total += bools[num];
}
}
difference2 = get_time_in_ms() - current_time;
cout << "Bool:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;
one_sec_delay();
total = 0;
current_time = get_time_in_ms();
for (unsigned int num = 0; num != 200000000; ++num)
{
bits[rand() % 5000000] = rand() % 2;
}
difference1 = get_time_in_ms() - current_time;
current_time = get_time_in_ms();
for (unsigned int num2 = 0; num2 != 100; ++num2)
{
for (unsigned int num = 0; num != 5000000; ++num)
{
total += bits[num];
}
}
difference2 = get_time_in_ms() - current_time;
cout << "Bitset:" << endl << "sum total = " << total << ", random access time = " << difference1 << ", sequential access time = " << difference2 << endl << endl;
delete [] bools;
cin.get();
return 0;
}
Please note: the outputting of the sum total is necessary so the compiler doesn't optimise out the for loop - which some do if the result of the loop isn't used.
Under GCC x64 with the following flags: -O2;-Wall;-march=native;-fomit-frame-pointer;-std=c++11;
I get the following results:
Bool array:
random access time = 4695, sequential access time = 390
Bitset:
random access time = 5382, sequential access time = 749

Not a great answer here, but rather a related anecdote:
A few years ago I was working on real-time software and we ran into scheduling problems. There was a module which was way over time-budget, and this was very surprising because the module was only responsible for some mapping and packing/unpacking of bits into/from 32-bit words.
It turned out that the module was using std::bitset. We replaced this with manual operations and the execution time decreased from 3 milliseconds to 25 microseconds. That was a significant performance issue and a significant improvement.
The point is, the performance issues caused by this class can be very real.

In addition to what the other answers said about the performance of access, there may also be a significant space overhead: Typical bitset<> implementations simply use the longest integer type to back their bits. Thus, the following code
#include <bitset>
#include <stdio.h>
struct Bitfield {
unsigned char a:1, b:1, c:1, d:1, e:1, f:1, g:1, h:1;
};
struct Bitset {
std::bitset<8> bits;
};
int main() {
printf("sizeof(Bitfield) = %zu\n", sizeof(Bitfield));
printf("sizeof(Bitset) = %zu\n", sizeof(Bitset));
printf("sizeof(std::bitset<1>) = %zu\n", sizeof(std::bitset<1>));
}
produces the following output on my machine:
sizeof(Bitfield) = 1
sizeof(Bitset) = 8
sizeof(std::bitset<1>) = 8
As you see, my compiler allocates a whopping 64 bits to store a single one. With the bitfield approach, I only need to round up to eight bits.
This factor eight in space usage can become important if you have a lot of small bitsets.

Rhetorical question: Why is std::bitset written in such an inefficient way?
Answer: It is not.
Another rhetorical question: What is the difference between:
std::bitset<128> a = src;
a[i] = true;
a = a << 64;
and
std::bitset<129> a = src;
a[i] = true;
a = a << 63;
Answer: a 50-times difference in performance: http://quick-bench.com/iRokweQ6JqF2Il-T-9JSmR0bdyw
You need to be very careful what you ask for: bitset supports a lot of things, but each has its own cost. With correct handling you will get exactly the same behavior as raw code:
void f(std::bitset<64>& b, int i)
{
b |= 1L << i;
b = b << 15;
}
void f(unsigned long& b, int i)
{
b |= 1L << i;
b = b << 15;
}
Both generate same assembly: https://godbolt.org/g/PUUUyd (64 bit GCC)
Another thing is that bitset is more portable, but this has a cost too:
void h(std::bitset<64>& b, unsigned i)
{
b = b << i;
}
void h(unsigned long& b, unsigned i)
{
b = b << i;
}
If i >= 64, the bitset will be all zeros, whereas in the unsigned case we have UB.
void h(std::bitset<64>& b, unsigned i)
{
if (i < 64) b = b << i;
}
void h(unsigned long& b, unsigned i)
{
if (i < 64) b = b << i;
}
With the check preventing UB, both generate the same code.
Another such place is set versus []. The first is safe, meaning you will never get UB, but it costs you a branch. [] has UB if you pass an out-of-range index, but it is as fast as var |= 1L << i;. That holds, of course, only if the std::bitset does not need more bits than the biggest integer available on the system, because otherwise the index has to be split to find the correct element in the internal table. This means that for std::bitset<N> the size N is very important for performance: if it is bigger or smaller than the optimal one, you will pay for it.
Overall, I find that the best way is to use something like this:
constexpr size_t minBitSet = sizeof(std::bitset<1>)*8;
template<size_t N>
using fasterBitSet = std::bitset<minBitSet * ((N + minBitSet - 1) / minBitSet)>;
This removes the cost of trimming the excess bits: http://quick-bench.com/Di1tE0vyhFNQERvucAHLaOgucAY
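To spell out the arithmetic of that alias: the requested size is rounded up to the next multiple of the bitset's storage unit. The sketch below repeats the alias with the rounding pulled out into a helper, roundedBits, which is a name introduced here for illustration; it assumes C++11 and 8-bit bytes, as the original snippet does.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Number of bits in the smallest storage unit std::bitset uses on this platform.
constexpr std::size_t minBitSet = sizeof(std::bitset<1>) * 8;

// Round n up to the next multiple of minBitSet (hypothetical helper).
constexpr std::size_t roundedBits(std::size_t n) {
    return minBitSet * ((n + minBitSet - 1) / minBitSet);
}

template <std::size_t N>
using fasterBitSet = std::bitset<roundedBits(N)>;

// With a 64-bit storage unit, fasterBitSet<129> would be std::bitset<192>.
static_assert(roundedBits(1) == minBitSet, "1 bit rounds up to one unit");
static_assert(roundedBits(minBitSet + 1) == 2 * minBitSet, "one bit over needs two units");
```

The price is that size() and count() now refer to the padded width, so code that relies on exactly N bits has to keep track of N itself.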

Is there a better implementation for keeping a count for unique integer pairs?

This is in C++. I need to keep a count for every pair of numbers. The two numbers are of type "int". I sort the two numbers, so (n1 n2) pair is the same as (n2 n1) pair. I'm using the std::unordered_map as the container.
I have been using the elegant pairing function by Matthew Szudzik, Wolfram Research, Inc.. In my implementation, the function gives me a unique number of type "long" (64 bits on my machine) for every pair of two numbers of type "int". I use this long as my key for the unordered_map (std::unordered_map). Is there a better way to keep count of such pairs? By better I mean, faster and if possible with lesser memory usage.
Also, I don't need all the bits of long. Even though you can assume that the two numbers can range up to max value for 32 bits, I anticipate the max possible value of my pairing function to require at most 36 bits. If nothing else, at least is there a way to have just 36 bits as key for the unordered_map? (some other data type)
I thought of using bitset, but I'm not exactly sure if the std::hash will generate a unique key for any given bitset of 36 bits, which can be used as key for unordered_map.
I would greatly appreciate any thoughts, suggestions etc.

First of all, I think you started from a wrong assumption. For std::unordered_map and std::unordered_set the hash does not have to be unique (and in principle it cannot be for data types like std::string, for example); there should just be a low probability that two different keys generate the same hash value. If there is a collision it is not the end of the world; access will just be slower. I would generate a 32-bit hash from the two numbers and, if you have an idea of the typical values, test the probability of hash collisions and choose the hash function accordingly.
For that to work you should use a pair of 32-bit numbers as the key in std::unordered_map and provide a proper hash function. Calculating a unique 64-bit key and using it with a hash map is questionable, as the hash map will then calculate another hash of this key, so it is possible you are making it slower.
About the 36-bit key: this is not a good idea unless you have a special CPU that handles 36-bit data. Your data will either be aligned on a 64-bit boundary, in which case you get no memory savings, or you will pay the penalty of unaligned data access. In the first case you would just have extra code to extract 36 bits from 64-bit data (if the processor supports it). In the second, your code will be slower than with a 32-bit hash, even if there are some collisions.
If that hash map is a bottleneck, you may consider a different hash map implementation, such as goog-sparsehash.sourceforge.net
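A minimal sketch of the pair-as-key approach, assuming C++11. IntPairHash and PairCounter are hypothetical names, and the mixing step (the 64-bit finalizer from MurmurHash3) is only one possible choice; any hash that spreads your typical values well would do.

```cpp
#include <cassert>
#include <cstddef>
#include <stdint.h>
#include <unordered_map>
#include <utility>

typedef std::pair<int, int> IntPair;

// Hash for a pair of ints: concatenate both halves, then mix the bits.
struct IntPairHash {
    std::size_t operator()(const IntPair& p) const {
        uint64_t key = (uint64_t(uint32_t(p.first)) << 32) | uint32_t(p.second);
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;
        key ^= key >> 33;
        key *= 0xc4ceb9fe1a85ec53ULL;
        key ^= key >> 33;
        return static_cast<std::size_t>(key);
    }
};

// Counts unordered pairs: (n1, n2) and (n2, n1) share one counter.
class PairCounter {
public:
    void add(int n1, int n2) { ++counts_[ordered(n1, n2)]; }

    std::size_t count(int n1, int n2) const {
        std::unordered_map<IntPair, std::size_t, IntPairHash>::const_iterator it =
            counts_.find(ordered(n1, n2));
        return it == counts_.end() ? 0 : it->second;
    }

private:
    static IntPair ordered(int a, int b) {
        return a < b ? IntPair(a, b) : IntPair(b, a);
    }

    std::unordered_map<IntPair, std::size_t, IntPairHash> counts_;
};
```

The map compares the full pair on lookup, so hash collisions only cost time, never correctness.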

Just my two cents: the pairing functions in the article are WAY more complicated than you actually need. Mapping two 32-bit UNSIGNED values to a unique 64-bit value is easy. The following does that, and even handles the non-pair states, without hitting the math peripheral too heavily (if at all).
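uint64_t map(uint32_t a, uint32_t b)
{
uint64_t x = a+b; // the sum is computed in 32 bits and must not overflow (see the note below)
uint64_t y = a>b ? a-b : b-a; // |a-b| without a signed cast, so large differences work too
uint64_t ans = (x<<32)|(y);
return ans;
}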
void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
uint64_t x = map>>32;
uint64_t y = map&0xFFFFFFFFL;
*a = (x+y)>>1;
*b = (x-*a);
}
Another alternative:
uint64_t map(uint32_t a, uint32_t b)
{
bool bb = a>b;
uint64_t x = ((uint64_t)a)<<(32*(bb));
uint64_t y = ((uint64_t)b)<<(32*!(bb));
uint64_t ans = x|y;
return ans;
}
void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
*a = map>>32;
*b = map&0xFFFFFFFF;
}
That works as a unique key (note that both versions ignore the order of a and b). You can easily adapt it into a hash function provider for std::unordered_map, though whether or not that will be faster than std::map depends on the number of values you've got.
NOTE: the first version will fail if a+b does not fit in 32 bits.
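For what it's worth, here is a rough sketch of how a packed key like the second version could be used directly with std::unordered_map to count pairs. pack_unordered, PackedCounter and count_packed are illustrative names, not part of the answer above, and the default std::hash for 64-bit integers is assumed to be good enough.

```cpp
#include <cstdint>
#include <unordered_map>

// Larger value in the high half, smaller value in the low half,
// so (a, b) and (b, a) map to the same 64-bit key.
inline std::uint64_t pack_unordered(std::uint32_t a, std::uint32_t b) {
    std::uint32_t hi = a > b ? a : b;
    std::uint32_t lo = a > b ? b : a;
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

typedef std::unordered_map<std::uint64_t, unsigned> PackedCounter;

// Increment the count stored for the unordered pair {a, b}; returns the new count.
inline unsigned count_packed(PackedCounter& counts, std::uint32_t a, std::uint32_t b) {
    return ++counts[pack_unordered(a, b)];
}
```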

Hashing an unordered sequence of small integers

Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
1. it is of length 12;
2. the order of the sequence elements does not matter;
3. no element appears twice in the same sequence;
4. all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/

Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
std::size_t array_hash(int (&array)[12])
{
int a[12];
std::copy(array, array + 12, a);
std::sort(a, a + 12);
std::size_t result = 0;
for (int * p = a; p != a + 12; ++p)
{
std::size_t const h = *p; // the "identity hash"
result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
}
return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
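The arithmetic behind that estimate, written out (9 bits per element suffice because 2^9 = 512 > 300):

```latex
12 \times 9 = 108 \ \text{bits}, \qquad
\log_2(12!) = \log_2(479\,001\,600) \approx 28.8 \ \text{bits}, \qquad
108 - 28.8 \approx 79 > 64 .
```

So even after discounting the ordering, roughly 79 bits remain, which is still more than a single 64-bit word can hold.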
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
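A possible sketch of the two-word variant, under the assumption that every element lies in the range 0..511. The Packed struct and the function names are invented for this example; six 9-bit values go into each 64-bit half.

```cpp
#include <algorithm>
#include <stdint.h>

// Hypothetical container for the 108-bit packed set (54 bits used in each half).
struct Packed {
    uint64_t lower;
    uint64_t upper;
};

// Sort a copy, then pack the six smallest values into 'lower'
// and the six largest into 'upper', 9 bits per value.
inline Packed pack_sorted(const int (&array)[12]) {
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);
    Packed p = {0, 0};
    for (int i = 0; i < 6; ++i) {
        p.lower = (p.lower << 9) | static_cast<uint64_t>(a[i]);
        p.upper = (p.upper << 9) | static_cast<uint64_t>(a[i + 6]);
    }
    return p;
}

// Combine both halves with the hash_combine-style formula and the 64-bit constant.
inline uint64_t packed_hash(const Packed& p) {
    return p.lower + 0x9E3779B97F4A7C15ULL + (p.upper << 6) + (p.upper >> 2);
}
```

Because the packing is lossless for values below 512, two sequences are the same set exactly when both halves compare equal, so the Packed value can also be used for the final equality check after a hash hit.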

Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).

I would just use the sum function as the hash and see how far you get with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
#include <numeric>
std::size_t hash(int (&arr)[12]) {
return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea what a sorting network for 12 values would look like, or even if it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
struct pow2 {
int operator ()(int n) const { return n * n; }
};
std::size_t hash(int (&arr)[12]) {
int raised[12];
std::transform(arr, arr + 12, raised, pow2());
return std::accumulate(raised, raised + 12, 0);
}

You could toggle the bits corresponding to each of the 12 integers in a bitset of size 300, then use the formula from boost::hash_combine to combine the ten 32-bit integers that implement this bitset.
This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
This approach may be generalized if we choose an arbitrary bitset size and set or toggle an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or by a pre-computed lookup table). The result is a Bloom filter or a related structure.
We can choose a Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of a large bit vector into a single hash value. For a classical Bloom filter of size 32, the optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of the "or" operation of the classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get the solution mentioned by Jim Balter.
If, instead of the "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution similar to the one suggested by Konrad Rudolph.
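The first variant (a 300-bit set folded with the hash_combine formula) might look roughly like this. bitset_hash is a made-up name, and the sketch assumes every element is in the range 0..319 so that it fits in ten 32-bit words.

```cpp
#include <cstddef>
#include <stdint.h>

// Order-independent hash: toggle one bit per element, then fold the words.
inline std::size_t bitset_hash(const int (&array)[12]) {
    uint32_t words[10] = {0};                        // 10 * 32 = 320 bits >= 300
    for (int i = 0; i < 12; ++i) {
        unsigned v = static_cast<unsigned>(array[i]);  // assumed 0 <= v < 320
        words[v / 32] ^= uint32_t(1) << (v % 32);      // toggle the bit for v
    }
    std::size_t result = 0;
    for (int w = 0; w < 10; ++w) {
        // Same combining step as boost::hash_combine with an identity hash.
        result ^= words[w] + 0x9e3779b9 + (result << 6) + (result >> 2);
    }
    return result;
}
```

The bit pattern depends only on which values are present, so any permutation of the same 12 elements gives the same result without sorting.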

I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import random

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size // 2  # integer division (Python 3)
nmasks = 300
suffix = 'UL'

print('HashType mask[' + str(nmasks) + '] = {')
for i in range(nmasks):
    # pick the positions of the 32 bits that are set in this mask
    combo = random_combination(range(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks - 1:
        print('\t' + str(mask) + suffix + ',')
    else:
        print('\t' + str(mask) + suffix + ' };')
The C++ array generated by the script is used as follows:
typedef int_least64_t HashType;
const int maxTableSize = 300;
HashType mask[maxTableSize] = {
// generated array goes here
};
inline HashType xorrer(HashType const &l, HashType const &r) {
return l^mask[r];
}
HashType hashConfig(HashType *sequence, int n) {
return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.

How to zero out array in O(1)?

Is there a way to zero out an array with time complexity O(1)? It's obvious that this can be done with a for loop or memset, but their time complexity is not O(1).

Yes
However, not just any array: it takes an array that has been crafted for this to work.
template <typename T, size_t N>
class Array {
public:
Array(): generation(0) {}
void clear() {
// FIXME: deal with overflow
++generation;
}
T get(std::size_t i) const {
if (i >= N) { throw std::runtime_error("out of range"); }
TimedT const& t = data[i];
return t.second == generation ? t.first : T{};
}
void set(std::size_t i, T t) {
if (i >= N) { throw std::runtime_error("out of range"); }
data[i] = std::make_pair(t, generation);
}
private:
typedef std::pair<T, unsigned> TimedT;
TimedT data[N];
unsigned generation;
};
The principle is simple:
we define an epoch using the generation attribute
when an item is set, the epoch in which it has been set is recorded
only items of the current epoch can be seen
clearing is thus equivalent to incrementing the epoch counter
The method has two issues:
storage increase: for each item we store an epoch
generation counter overflow: there is such a thing as a maximum number of epochs
The latter can be thwarted by using a really big integer (uint64_t, at the cost of more storage).
The former is a natural consequence; one possible solution is to use buckets to downplay the issue, for example by associating up to 64 items with a single counter plus a bitmask identifying which items are valid within this counter.
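As an alternative to a wider counter, the overflow could also be handled by doing one real reset whenever the counter wraps. The following is only a sketch of that idea, not the code above: EpochArray is a hypothetical name, stamp 0 is reserved to mean "never set", and the Counter template parameter exists mainly so that the wrap-around is easy to exercise with a small type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>

template <typename T, std::size_t N, typename Counter = unsigned>
class EpochArray {
public:
    // std::pair's default constructor zeroes every stamp, so nothing is "set" yet.
    EpochArray() : generation(1) {}

    void clear() {
        ++generation;
        if (generation == 0) {  // the counter wrapped around
            // Rare O(N) pass: invalidate every stamp, then restart at epoch 1.
            for (std::size_t i = 0; i < N; ++i) { data[i].second = 0; }
            generation = 1;
        }
    }

    T get(std::size_t i) const {
        if (i >= N) { throw std::out_of_range("EpochArray::get"); }
        return data[i].second == generation ? data[i].first : T();
    }

    void set(std::size_t i, const T& value) {
        if (i >= N) { throw std::out_of_range("EpochArray::set"); }
        data[i] = std::make_pair(value, generation);
    }

private:
    std::pair<T, Counter> data[N];
    Counter generation;
};
```

With this variant a single clear() is no longer O(1) in the worst case; the linear pass happens once per full cycle of the counter, so the cost is only constant when amortized over many clears.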
EDIT: just wanted to get back on the buckets idea.
The original solution has an overhead of 8 bytes (64 bits) per element (if already 8-bytes aligned). Depending on the elements stored it might or might not be a big deal.
If it is a big deal, the idea is to use buckets; of course, like all trade-offs, it slows down access even more.
template <typename T>
class BucketArray {
public:
BucketArray(): generation(0), mask(0) {}
T get(std::size_t index, std::size_t gen) const {
assert(index < 64);
return gen == generation and (mask & (std::uint64_t(1) << index)) ?
data[index] : T{};
}
void set(std::size_t index, T t, std::size_t gen) {
assert(index < 64);
if (generation < gen) { mask = 0; generation = gen; }
mask |= (std::uint64_t(1) << index);
data[index] = t;
}
private:
std::uint64_t generation;
std::uint64_t mask;
T data[64];
};
Note that this small array of a fixed number of elements (we could actually template this and statically check that it is less than or equal to 64) only has 16 bytes of overhead. This means we have an overhead of 2 bits per element.
template <typename T, size_t N>
class Array {
typedef BucketArray<T> Bucket;
public:
Array(): generation(0) {}
void clear() { ++generation; }
T get(std::size_t i) const {
if (i >= N) { throw std::runtime_error("out of range"); }
Bucket const& bucket = data[i / 64];
return bucket.get(i % 64, generation);
}
void set(std::size_t i, T t) {
if (i >= N) { throw std::runtime_error("out of range"); }
Bucket& bucket = data[i / 64];
bucket.set(i % 64, t, generation);
}
private:
std::uint64_t generation;
Bucket data[N / 64 + 1];
};
We got the space overhead down by a factor of... 32. Now the array can even be used to store char for example, whereas before it would have been prohibitive. The cost is that access got slower, as we get a division and a modulo (when will we get a standardized operation that returns both results in one shot?).

You can't modify n locations in memory in less than O(n) (even if your hardware, for sufficiently small n, maybe allows a constant-time operation to zero certain nicely-aligned blocks of memory, like for example flash memory does).
However, if the object of the exercise is a bit of lateral thinking, then you can write a class representing a "sparse" array. The general idea of a sparse array is that you keep a collection (perhaps a map, although depending on usage that might not be all there is to it), and when you look up an index, if it's not in the underlying collection then you return 0.
If you can clear the underlying collection in O(1), then you can zero out your sparse array in O(1). Clearing a std::map isn't usually constant-time in the size of the map, because all those nodes need to be freed. But you could design a collection that can be cleared in O(1) by moving the whole tree over from "the contents of my map", to "a tree of nodes that I have reserved for future use". The disadvantage would just be that this "reserved" space is still allocated, a bit like what happens when a vector gets smaller.
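To make the sparse-array idea concrete, here is a bare-bones sketch built on std::map. SparseArray is an illustrative name, and this version does not implement the O(1) clear described above: std::map::clear() is linear in the number of stored entries, so it only shows the lookup-with-default behaviour.

```cpp
#include <cstddef>
#include <map>

template <typename T>
class SparseArray {
public:
    // Indices that were never set read back as a value-initialized T (0 for integers).
    T get(std::size_t i) const {
        typename std::map<std::size_t, T>::const_iterator it = cells.find(i);
        return it == cells.end() ? T() : it->second;
    }

    void set(std::size_t i, const T& value) { cells[i] = value; }

    // Linear in the number of stored entries, not in the nominal array size.
    void clear() { cells.clear(); }

private:
    std::map<std::size_t, T> cells;
};
```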

It's certainly possible to zero out an array in O(1) as long as you accept a very large constant factor:
void zero_out_array_in_constant_time(void* a, size_t n)
{
char* p = (char*) a;
for (size_t i = 0; i < std::numeric_limits<size_t>::max(); ++i)
{
p[i % n] = 0;
}
}
This will always take the same number of steps, regardless of the size of the array, hence it's O(1).

No.
You can't visit every member of an N-element collection in anything less than O(N) time.
You might, as Mike Kwan has observed, shift the cost from run- to compile-time, but that doesn't alter the computational complexity of the operation.

It's clearly not possible to initialize an arbitrarily sized array in a fixed length of time. However, it is entirely possible to create an array-like ADT which amortizes the cost of initializing the array across its use. The usual construction for this takes upwards of 3x the storage, however. To wit:
template <typename T, size_t arr_size>
class NoInitArray
{
std::vector<T> storage;
// Note that 'lookup' and 'check' are not initialized, and may contain
// arbitrary garbage.
size_t lookup[arr_size];
size_t check[arr_size];
public:
T& operator[](size_t pos)
{
// find out where we actually stored the entry for (*this)[pos].
// This could be garbage.
size_t storage_loc=lookup[pos];
// Check to see that the storage_loc we found is valid
if (storage_loc < storage.size() && check[storage_loc] == pos)
{
// everything checks, return the reference.
return storage[storage_loc];
}
else
{
// storage hasn't yet been allocated/initialized for (*this)[pos].
// allocate storage:
storage_loc=storage.size();
storage.push_back(T());
// put entries in lookup and check so we can find
// the proper spot later:
lookup[pos]=storage_loc;
check[storage_loc]=pos;
// everything's set up, return appropriate reference:
return storage.back();
}
}
};
One could add a clear() member to empty the contents of such an array fairly easily if T is some type that doesn't require destruction, at least in concept.

I like Eli Bendersky's webpage http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time, with a solution that he attributes to the famous book Design and Analysis of Computer Algorithms by Aho, Hopcroft and Ullman. This is genuinely O(1) time complexity for initialization, rather than O(N). The space demands are O(N) additional storage, but allocating this space is also O(1), since the space is full of garbage. I enjoyed this for theoretical reasons, but I think it might also be of practical value for implementing some algorithms, if you need to repeatedly initialize a very large array, and each time you access only a relatively small number of positions in the array. Bendersky provides a C++ implementation of the algorithm.
A very pure theorist might start worrying about the fact that N needs O(log(N)) digits, but I have ignored that detail, which would presumably require looking carefully at the mathematical model of the computer. Volume 1 of The Art of Computer Programming probably gives Knuth's view of this issue.

It is impossible at runtime to zero out an array in O(1). This is intuitive given that there is no language mechanism which allows the value setting of arbitrary size blocks of memory in fixed time. The closest you can do is:
int blah[100] = {0};
That will allow the initialisation to happen at compile time. At runtime, memset is generally the fastest but would be O(N). However, there are problems associated with using memset on particular array types.

While still O(N), implementations that map to hardware-assisted operations like clearing whole cache lines or memory pages can run at <1 cycle per word.
Actually, riffing on Steve Jessop's idea...
You can do it if you have hardware support to clear an arbitrarily large amount of memory all at the same time. If you posit an arbitrarily large array, then you can also posit an arbitrarily large memory with hardware parallelism so that a single reset pin will simultaneously clear every register at once. That line will have to be driven by an arbitrarily large logic gate (that dissipates arbitrarily large power), and the circuit traces will have to be arbitrarily short (to overcome R/C delay) (or superconducting), but these things are quite common in extradimensional spaces.

It is possible to do it with O(1) time, and even with O(1) extra space.
I explained the O(1) time solution David mentioned in another answer, but it uses 2n extra memory.
A better algorithm exists, which requires only 1 bit of extra memory.
See the Article I just wrote about that subject.
It also explains the algorithm David mentioned, some more, and the state-of-the-art algorithm today. It also features an implementation of the latter.
In a short explanation (as I'll just be repeating the article) it cleverly takes the algorithm presented in David's answer and does it all in-place, while using only a (very) little extra memory.
