Performance detoriation for certain array sizes - c++

I'm having an issue with the following code, and I fail to understand where is the problem. The problem is occurring however only with V2 intel processor and not V3.
Consider the following code in C++:
struct Tuple{
size_t _a;
size_t _b;
size_t _c;
size_t _d;
size_t _e;
size_t _f;
size_t _g;
size_t _h;
};
void
deref_A(Tuple& aTuple, const size_t& aIdx) {
aTuple._a = A[aIdx];
}
void
deref_AB(Tuple& aTuple, const size_t& aIdx) {
aTuple._a = A[aIdx];
aTuple._b = B[aIdx];
}
void
deref_ABC(Tuple& aTuple, const size_t& aIdx) {
aTuple._a = A[aIdx];
aTuple._b = B[aIdx];
aTuple._c = C[aIdx];
}
....
void
deref_ABCDEFG(Tuple& aTuple, const size_t& aIdx) {
aTuple._a = A[aIdx];
aTuple._b = B[aIdx];
aTuple._c = C[aIdx];
aTuple._d = D[aIdx];
aTuple._e = E[aIdx];
aTuple._f = F[aIdx];
aTuple._g = G[aIdx];
}
Note that A, B, C, ..., G are simple arrays (declared globally). Arrays are filled with integers.
The methods "deref_*", simply assign some values from arrays (accessed via index - aIdx) to the given struct parameter "aTuple". I first start by assigning to a single field of the given struct as parameter, and continue all the way to all fields. That is, each method assigns one more field than the previous one. The methods "deref_*" are called with index (aIdx) starting from 0, to MAX size of the arrays (arrays have the same size by the way). The index is used to access array elements, as shown in the code -- pretty simple.
Now, consider the graph (http://docdro.id/AUSil1f), which depicts the performance for array sizes starting with 20 million (size_t = 8 bytes) integers, up to 24 m (x-axes denote the arrays size).
For arrays with 21 million integers (size_t), the performance degrades for the methods touching at least 5 different arrays (i.e., deref_ACDE...G), therefore you will see peaks in the graph. The performance then improves again for arrays with 22 m integers and onwards. I'm wondering why this is happening for array size of 21 m only? This is happening only when I'm testing on a server with CPU: Intel(R) Xeon(R) CPU E5-2690 v2 # 3.00GHz, but not with Haswell, i.e., v3. Clearly this is a known issue to Intel and has been resolved, but I don't know what it is, and how to improve the code for v2.
I would highly appreciate any hint from your side.

I suspect you might be seeing cache-bank conflicts. Sandybridge/Ivybridge (Xeon Exxxx v1/v2) have them, Haswell (v3) doesn't.
Update from the OP: it was DTLB misses. Cache bank conflicts will usually only be an issue when your working set fits in cache. Being limited to one 8B read per clock instead of 2 shouldn't stop a a CPU from keeping up with main memory, even single-threaded. (8B * 3GHz = 24GB/s, which is about equal to main memory sequential-read bandwidth.)
I think there's a perf counter for that, which you can check with perf or other tools.
Quoting Agner Fog's microarchitecture doc (section 9.13):
Cache bank conflicts
Each consecutive 128 bytes, or two cache lines, in the data cache is
divided into 8 banks of 16 bytes each. It is not possible to do two
memory reads in the same clock cycle if the two memory addresses have
the same bank number, i.e. if bit 4 - 6 in the two addresses are the
same.
; Example 9.5. Sandy bridge cache
mov eax, [rsi] ; Use bank 0, assuming rsi is divisible by 40H
mov ebx, [rsi+100H] ; Use bank 0. Cache bank conflict
mov ecx, [rsi+110H] ; Use bank 1. No cache bank conflict
Changing the total size of your arrays changes the distance between two elements with the same index, if they're laid out more or less head to tail.
If you have each array aligned to a different 16B offset (modulo 128), this will help some for SnB/IvB. Access to the same index in each array will be in a different cache bank, and thus can happen in parallel. Achieving this can be as simple as allocating 128B-aligned arrays, with 16*n extra bytes at the start of each one. (Keeping track of the pointer to eventually free separately from the pointer to dereference will be an annoyance.)
If the tuple where you're writing the results has the same address as a read, modulo 4096, you also get a false dependence. (i.e. a read from one of the arrays might have to wait for a store to the tuple.) See Agner Fog's doc for the details on that. I didn't quote that part because I think cache-bank conflicts are the more likely explanation. Haswell still has the false-dependence issue, but the cache-bank conflict issue is completely gone.

Related

Why does this piece of code written using uint8_t run faster than analogous code written with uint32_t or uint64_t on a 64bit machine?

Isn't the common knowledge that math operations on 64bit systems run faster on 32/64 bit datatypes than the smaller datatypes like short due to implicit promotion? Yet while testing my bitset implementation(where the majority of the time depends on bitwise operations), I found I got a ~40% improvement using uint8_t over uint32_t. I'm especially surprised because there is hardly any copying going on that would justify the difference. The same thing occurred regardless of the clang optimisation level.
8bit:
#define mod8(x) x&7
#define div8(x) x>>3
template<unsigned long bits>
struct bitset{
private:
uint8_t fill[8] = {};
uint8_t clear[8];
uint8_t band[(bits/8)+1] = {};
public:
template<typename T>
inline bool operator[](const T ind) const{
return band[div8(ind)]&fill[mod8(ind)];
}
template<typename T>
inline void store_high(const T ind){
band[div8(ind)] |= fill[mod8(ind)];
}
template<typename T>
inline void store_low(const T ind){
band[div8(ind)] &= clear[mod8(ind)];
}
bitset(){
for(uint8_t ii = 0, val = 1; ii < 8; ++ii){
fill[ii] = val;
clear[ii] = ~fill[ii];
val*=2;
}
}
};
32bit:
#define mod32(x) x&31
#define div32(x) x>>5
template<unsigned long bits>
struct bitset{
private:
uint32_t fill[32] = {};
uint32_t clear[32];
uint32_t band[(bits/32)+1] = {};
public:
template<typename T>
inline bool operator[](const T ind) const{
return band[div32(ind)]&fill[mod32(ind)];
}
template<typename T>
inline void store_high(const T ind){
band[div32(ind)] |= fill[mod32(ind)];
}
template<typename T>
inline void store_low(const T ind){
band[div32(ind)] &= clear[mod32(ind)];
}
bitset(){
for(uint32_t ii = 0, val = 1; ii < 32; ++ii){
fill[ii] = val;
clear[ii] = ~fill[ii];
val*=2;
}
}
};
And here is the benchmark I used(just moves a single 1 from position 0 till the end iteratively):
const int len = 1000000;
bitset<len> bs;
{
auto start = std::chrono::high_resolution_clock::now();
bs.store_high(0);
for (int ii = 1; ii < len; ++ii) {
bs.store_high(ii);
bs.store_low(ii-1);
}
auto stop = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>((stop-start)).count()<<std::endl;
}
TL:DR: large "buckets" for a bitset mean you access the same one repeatedly when you iterate linearly, creating longer dependency chains that out-of-order exec can't overlap as effectively.
Smaller buckets give instruction-level parallelism, making operations on bits in separate bytes independent of each other.
On possible reason is that you iterate linearly over bits, so all the operations within the same band[] element form one long dependency chain of &= and |= operations, plus store and reload (if the compiler doesn't manage to optimize that away with loop unrolling).
For uint32_t band[], that's a chain of 2x 32 operations, since ii>>5 will give the same index for that long.
Out-of-order exec can only partially overlap execution of these long chains if their latency and instruction-count is too large for the ROB (ReOrder Buffer) and RS (Reservation Station, aka Scheduler). With 64 operations probably including store/reload latency (4 or 5 cycles on modern x86), that's a dep chain length of probably 6 x 64 = 384 cycles, composed of probably at least 128 uops, with some parallelism for loading (or better calculating) 1U<<(n&31) or rotl(-1U, n&31) masks that can "use up" some of the wasted execution slots in the pipeline.
But for uint8_t band[], you've moving to a new element 4x as frequently, after only 2x 8 = 16 operations, so the dep chains are 1/4 the length.
See also Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another case of a modern x86 CPU overlapping two long dependency chains (a simple chain of imul with no other instruction-level parallelism), especially the part about a single dep chain becoming longer than the RS (scheduler for un-executed uops) being the point at which we start to lose some of the overlap of execution of the independent work. (For the case without lfence to artificially block overlap.)
See also Modern Microprocessors
A 90-Minute Guide! and https://www.realworldtech.com/sandy-bridge/ for some background on how modern OoO exec CPUs decode and look at instructions.
Small vs. large buckets
Large buckets are only useful when scanning through for the first non-zero bit, or filling the whole thing or something. Of course, really you'd want to vectorize that with SIMD, checking 16 or 32 bytes at once to see if there's a non-zero element anywhere in that. Current compilers will vectorize for you in loops that fill the whole array, but not search loops (or anything with a trip-count that can't be calculated ahead of the first iteration), except for ICC which can handle that. Re: using fast operations over bit-vectors, see Howard Hinnant's article (in the context of vector<bool>, which is an unfortunate name for a sometimes-useful data structure.)
C++ unfortunately doesn't make it easy in general to use different sized accesses to the same data, unless you compile with g++ -O3 -fno-strict-aliasing or something like that.
Although unsigned char can always alias anything else, so you could use that for your single-bit accesses, only using uintptr_t (which is likely to be as wide as a register, except on ILP32-on-64bit ISAs) for init or whatever. Or in this case, uint_fast32_t being a 64-bit type on many x86-64 C++ implementations would make it useful for this, unlike usual when that sucks, wasting cache footprint when you're only using the value-range of a 32-bit number and being slower for non-constant division on some CPUs.
On x86 CPU, a byte store is naturally fully efficient, but even on an ARM or something, coalescing in the store buffer could still make adjacent byte RMWs fully efficient. (Are there any modern CPUs where a cached byte store is actually slower than a word store?). And you'd still gain ILP; a slower commit to cache is still not as bad as coupling loads to stores that could have been independent if narrower. Especially important on lower-end CPUs with smaller out-of-order schedulers buffers.
(x86 byte loads need to use movzx to zero-extend to avoid false dependencies, but most compilers know that. Clang is reckless about it which can occasionally hurt.)
(Different sized accesses close to each other can lead to store-forwarding stalls, e.g. a byte store and an unsigned long reload that overlaps that byte will have extra latency: What are the costs of failed store-to-load forwarding on x86?)
Code review:
Storing an array of masks is probably worse than just computing 1u32<<(n&31)) as needed, on most CPUs. If you're really lucky, a smart compiler might manage constant propagation from the constructor into the benchmark loop, and realize that it can rotate or shift inside the loop to generate the bitmask instead of indexing memory in a loop that already does other memory operations.
(Some non-x86 ISAs have better bit-manipulation instructions and can materialize 1<<n cheaply, although x86 can do that in 2 instructions as well if compilers are smart. xor eax,eax / bts eax, esi, with the BTS implicitly masking the shift count by the operand-size. But that only works so well for 32-bit operand-size, not 8-bit. Without BMI2 shlx, x86 variable-count shifts run as 3-uops on Intel CPUs, vs. 1 on AMD.)
Almost certainly not worth it to store both fill[] and clear[] constants. Some ISAs even have an andn instruction that can NOT one of the operands on the fly, i.e. implements (~x) & y in one instruction. For example, x86 with BMI1 extensions has andn. (gcc -march=haswell).
Also, your macros are unsafe: wrap the expression in () so operator-precedence doesn't bits you if you use foo[div8(x) - 1].
As in #define div8(x) (x>>3)
But really, you shouldn't be using CPP macros for stuff like this anyway. Even in modern C, just define static const shift = 3; shift counts and masks. In C++, do that inside the struct/class scope, and use band[idx >> shift] or something. (When I was typing ind, my fingers wanted to type int; idx is probably a better name.)
Isn't the common knowledge that math operations on 64bit systems run faster on 32/64 bit datatypes than the smaller datatypes like short due to implicit promotion?
This isn't a universal truth. As always, fit depends on details.
Why does this piece of code written using uint_8 run faster than analogous code written with uint_32 or uint_64 on a 64bit machine?
The title doesn't match the question. There are no such types as uint_X and you aren't using uintX_t. You are using uint_fastX_t. uint_fastX_t is an alias for an integer type that is at least X bytes, that is deemed by the language implementers to provide fastest operations.
If we were to take your earlier mentioned assumption for granted, then it should logically follow that the language implementers would have chosen to use 32/64 bit type as uint_fast8_t. That said, you cannot assume that they have done so and whatever generic measurement (if any) has been used to make that choice doesn't necessarily apply to your case.
That said, regardless of which type uint_fast8_t is an alias of, your test isn't fair for comparing the relative speeds of calculation of potentially different integer types:
uint_fast8_t fill[8] = {};
uint_fast8_t clear[8];
uint_fast8_t band[(bits/8)+1] = {};
uint_fast32_t fill[32] = {};
uint_fast32_t clear[32];
uint_fast32_t band[(bits/32)+1] = {};
Not only are the types (potentially) different, but the sizes of the arrays are too. This can certainly have an effect on the efficiency.

Align struct while minimizing cache line waste

There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).
Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.
For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 caches lines as we do definitely not cross CL boundaries.
However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.
My question is two-fold:
In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:
where the upper box is a cache line and the other boxes are instances of our small struct.
Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?
Thank you so much for your input here. It is much appreciated.
To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:
#include <array>
#include <iostream>
using namespace std;
template <typename T, size_t BlockAlignment>
struct __attribute__((packed)) Packer
{
static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
T &operator[]( size_t index ) { return packed[index]; }
T packed[ NUM_ELEMS ];
uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
};
template <typename T, size_t NumElements, size_t BlockAlignment>
struct alignas(BlockAlignment) PackedAlignedArray
{
typedef Packer<T, BlockAlignment> PackerType;
std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;
T &operator[]( size_t index ) {
return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
}
};
struct __attribute__((packed)) Foo
{
uint64_t a;
uint64_t b;
uint8_t c;
};
int main()
{
static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
constexpr size_t NUM_ELEMENTS = 10;
constexpr size_t BLOCK_ALIGNMENT = 64;
PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;
for ( size_t i=0; i<NUM_ELEMENTS; ++i )
{
// Display the memory offset between the current
// element and the start of the array
cout << reinterpret_cast<std::ptrdiff_t>(&theArray[i]) -
reinterpret_cast<std::ptrdiff_t>(&theArray[0]) << std::endl;
}
return 0;
}
The output of the program shows the byte offsets of the addresses in memory of the the 17-byte elements, automatically resetting to a multiple of 64 every four elements:
0
17
34
64
81
98
128
145
162
192
You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed)) or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).
Then put it in an alignas(256) T array[], so only the start of the array is aligned.
Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary; that would lead to nonsensical situations like T *arr being the start of a valid array, but arr + 1 no being a valid array, even though it has the same type.
Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct) = max (alignof(members)).
See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.
Depending on how your data is accessed, one option would be to store pairs of uint64_t in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 member that go with each other could cost 2 cache misses instead of 1.
But for iterating over your data sequentially, that's excellent.

Unaligned load versus unaligned store

The short question is that if I have a function that takes two vectors. One is input and the other is output (no alias). I can only align one of them, which one should I choose?
The longer version is that, consider a function,
void func(size_t n, void *in, void *out)
{
__m256i *in256 = reinterpret_cast<__m256i *>(in);
__m256i *out256 = reinterpret_cast<__m256i *>(out);
while (n >= 32) {
__m256i data = _mm256_loadu_si256(in256++);
// process data
_mm256_storeu_si256(out256++, data);
n -= 32;
}
// process the remaining n % 32 bytes;
}
If in and out are both 32-bytes aligned, then there's no penalty of using vmovdqu instead of vmovdqa. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
Risking to state the obvious here: There is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster strongly depends on the CPU you are using, the amount of calculations you are doing on each package and many other things.
As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for(...){
__m256i data = next; // usually 0 cost
next = _mm256_loadu_si256(in256++);
// do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).

If a 32-bit integer overflows, can we use a 40-bit structure instead of a 64-bit long one?

If, say, a 32-bit integer is overflowing, instead of upgrading int to long, can we make use of some 40-bit type if we need a range only within 240, so that we save 24 (64-40) bits for every integer?
If so, how?
I have to deal with billions and space is a bigger constraint.
Yes, but...
It is certainly possible, but it is usually nonsensical (for any program that doesn't use billions of these numbers):
#include <stdint.h> // don't want to rely on something like long long
struct bad_idea
{
uint64_t var : 40;
};
Here, var will indeed have a width of 40 bits at the expense of much less efficient code generated (it turns out that "much" is very much wrong -- the measured overhead is a mere 1-2%, see timings below), and usually to no avail. Unless you have need for another 24-bit value (or an 8 and 16 bit value) which you wish to pack into the same structure, alignment will forfeit anything that you may gain.
In any case, unless you have billions of these, the effective difference in memory consumption will not be noticeable (but the extra code needed to manage the bit field will be noticeable!).
Note:
The question has in the mean time been updated to reflect that indeed billions of numbers are needed, so this may be a viable thing to do, presumed that you take measures not to lose the gains due to structure alignment and padding, i.e. either by storing something else in the remaining 24 bits or by storing your 40-bit values in structures of 8 each or multiples thereof).
Saving three bytes a billion times is worthwhile as it will require noticeably fewer memory pages and thus cause fewer cache and TLB misses, and above all page faults (a single page fault weighting tens of millions instructions).
While the above snippet does not make use of the remaining 24 bits (it merely demonstrates the "use 40 bits" part), something akin to the following will be necessary to really make the approach useful in a sense of preserving memory -- presumed that you indeed have other "useful" data to put in the holes:
struct using_gaps
{
uint64_t var : 40;
uint64_t useful_uint16 : 16;
uint64_t char_or_bool : 8;
};
Structure size and alignment will be equal to a 64 bit integer, so nothing is wasted if you make e.g. an array of a billion such structures (even without using compiler-specific extensions). If you don't have use for an 8-bit value, you could also use an 48-bit and a 16-bit value (giving a bigger overflow margin).
Alternatively you could, at the expense of usability, put 8 40-bit values into a structure (least common multiple of 40 and 64 being 320 = 8*40). Of course then your code which accesses elements in the array of structures will become much more complicated (though one could probably implement an operator[] that restores the linear array functionality and hides the structure complexity).
Update:
Wrote a quick test suite, just to see what overhead the bitfields (and operator overloading with bitfield refs) would have. Posted code (due to length) at gcc.godbolt.org, test output from my Win7-64 machine is:
Running test for array size = 1048576
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 2 1 35 35 1
uint64_t 0 3 3 35 35 1
bad40_t 0 5 3 35 35 1
packed40_t 0 7 4 48 49 1
Running test for array size = 16777216
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 38 14 560 555 8
uint64_t 0 81 22 565 554 17
bad40_t 0 85 25 565 561 16
packed40_t 0 151 75 765 774 16
Running test for array size = 134217728
what alloc seq(w) seq(r) rand(w) rand(r) free
-----------------------------------------------------------
uint32_t 0 312 100 4480 4441 65
uint64_t 0 648 172 4482 4490 130
bad40_t 0 682 193 4573 4492 130
packed40_t 0 1164 552 6181 6176 130
What one can see is that the extra overhead of bitfields is neglegible, but the operator overloading with bitfield reference as a convenience thing is rather drastic (about 3x increase) when accessing data linearly in a cache-friendly manner. On the other hand, on random access it barely even matters.
These timings suggest that simply using 64-bit integers would be better since they are still faster overall than bitfields (despite touching more memory), but of course they do not take into account the cost of page faults with much bigger datasets. It might look very different once you run out of physical RAM (I didn't test that).
You can quite effectively pack 4*40bits integers into a 160-bit struct like this:
struct Val4 {
char hi[4];
unsigned int low[4];
}
long getLong( const Val4 &pack, int ix ) {
int hi= pack.hi[ix]; // preserve sign into 32 bit
return long( (((unsigned long)hi) << 32) + (unsigned long)pack.low[i]);
}
void setLong( Val4 &pack, int ix, long val ) {
pack.low[ix]= (unsigned)val;
pack.hi[ix]= (char)(val>>32);
}
These again can be used like this:
Val4[SIZE] vals;
long getLong( int ix ) {
return getLong( vals[ix>>2], ix&0x3 )
}
void setLong( int ix, long val ) {
setLong( vals[ix>>2], ix&0x3, val )
}
You might want to consider Variable-Lenght Encoding (VLE)
Presumably, you have store a lot of those numbers somewhere (in RAM, on disk, send them over the network, etc), and then take them one by one and do some processing.
One approach would be to encode them using VLE.
From Google's protobuf documentation (CreativeCommons licence)
Varints are a method of serializing integers using
one or more bytes. Smaller numbers take a smaller number of bytes.
Each byte in a varint, except the last byte, has the most significant
bit (msb) set – this indicates that there are further bytes to come.
The lower 7 bits of each byte are used to store the two's complement
representation of the number in groups of 7 bits, least significant
group first.
So, for example, here is the number 1 – it's a single byte, so the msb
is not set:
0000 0001
And here is 300 – this is a bit more complicated:
1010 1100 0000 0010
How do you figure out that this is 300? First you drop the msb from
each byte, as this is just there to tell us whether we've reached the
end of the number (as you can see, it's set in the first byte as there
is more than one byte in the varint)
Pros
If you have lots of small numbers, you'll probably use less than 40 bytes per integer, in average. Possibly much less.
You are able to store bigger numbers (with more than 40 bits) in the future, without having to pay a penalty for the small ones
Cons
You pay an extra bit for each 7 significant bits of your numbers. That means a number with 40 significant bits will need 6 bytes. If most of your numbers have 40 significant bits, you are better of with a bit field approach.
You will lose the ability to easily jump to a number given its index (you have to at least partially parse all previous elements in an array in order to access the current one.
You will need some form of decoding before doing anything useful with the numbers (although that is true for other approaches as well, like bit fields)
(Edit: First of all - what you want is possible, and makes sense in some cases; I have had to do similar things when I tried to do something for the Netflix challenge and only had 1GB of memory; Second - it is probably best to use a char array for the 40-bit storage to avoid any alignment issues and the need to mess with struct packing pragmas; Third - this design assumes that you're OK with 64-bit arithmetic for intermediate results, it is only for large array storage that you would use Int40; Fourth: I don't get all the suggestions that this is a bad idea, just read up on what people go through to pack mesh data structures and this looks like child's play by comparison).
What you want is a struct that is only used for storing data as 40-bit ints but implicitly converts to int64_t for arithmetic. The only trick is doing the sign extension from 40 to 64 bits right. If you're fine with unsigned ints, the code can be even simpler. This should be able to get you started.
#include <cstdint>
#include <iostream>
// Only intended for storage, automatically promotes to 64-bit for evaluation
struct Int40
{
Int40(int64_t x) { set(static_cast<uint64_t>(x)); } // implicit constructor
operator int64_t() const { return get(); } // implicit conversion to 64-bit
private:
void set(uint64_t x)
{
setb<0>(x); setb<1>(x); setb<2>(x); setb<3>(x); setb<4>(x);
};
int64_t get() const
{
return static_cast<int64_t>(getb<0>() | getb<1>() | getb<2>() | getb<3>() | getb<4>() | signx());
};
uint64_t signx() const
{
return (data[4] >> 7) * (uint64_t(((1 << 25) - 1)) << 39);
};
template <int idx> uint64_t getb() const
{
return static_cast<uint64_t>(data[idx]) << (8 * idx);
}
template <int idx> void setb(uint64_t x)
{
data[idx] = (x >> (8 * idx)) & 0xFF;
}
unsigned char data[5];
};
int main()
{
Int40 a = -1;
Int40 b = -2;
Int40 c = 1 << 16;
std::cout << "sizeof(Int40) = " << sizeof(Int40) << std::endl;
std::cout << a << "+" << b << "=" << (a+b) << std::endl;
std::cout << c << "*" << c << "=" << (c*c) << std::endl;
}
Here is the link to try it live: http://rextester.com/QWKQU25252
You can use a bit-field structure, but it's not going to save you any memory:
struct my_struct
{
unsigned long long a : 40;
unsigned long long b : 24;
};
You can squeeze any multiple of 8 such 40-bit variables into one structure:
struct bits_16_16_8
{
unsigned short x : 16;
unsigned short y : 16;
unsigned short z : 8;
};
struct bits_8_16_16
{
unsigned short x : 8;
unsigned short y : 16;
unsigned short z : 16;
};
struct my_struct
{
struct bits_16_16_8 a1;
struct bits_8_16_16 a2;
struct bits_16_16_8 a3;
struct bits_8_16_16 a4;
struct bits_16_16_8 a5;
struct bits_8_16_16 a6;
struct bits_16_16_8 a7;
struct bits_8_16_16 a8;
};
This will save you some memory (in comparison with using 8 "standard" 64-bit variables), but you will have to split every operation (and in particular arithmetic ones) on each of these variables into several operations.
So the memory-optimization will be "traded" for runtime-performance.
As the comments suggest, this is quite a task.
Probably an unnecessary hassle unless you want to save alot of RAM - then it makes much more sense. (RAM saving would be the sum total of bits saved across millions of long values stored in RAM)
I would consider using an array of 5 bytes/char (5 * 8 bits = 40 bits). Then you will need to shift bits from your (overflowed int - hence a long) value into the array of bytes to store them.
To use the values, then shift the bits back out into a long and you can use the value.
Then your RAM and file storage of the value will be 40 bits (5 bytes), BUT you must consider data alignment if you plan to use a struct to hold the 5 bytes. Let me know if you need elaboration on this bit shifting and data alignment implications.
Similarly, you could use the 64 bit long, and hide other values (3 chars perhaps) in the residual 24 bits that you do not want to use. Again - using bit shifting to add and remove the 24 bit values.
Another variation that may be helpful would be to use a structure:
typedef struct TRIPLE_40 {
uint32_t low[3];
uint8_t hi[3];
uint8_t padding;
};
Such a structure would take 16 bytes and, if 16-byte aligned, would fit entirely within a single cache line. While identifying which of the parts of the structure to use may be more expensive than it would be if the structure held four elements instead of three, accessing one cache line may be much cheaper than accessing two. If performance is important, one should use some benchmarks since some machines may perform a divmod-3 operation cheaply and have a high cost per cache-line fetch, while others might have have cheaper memory access and more expensive divmod-3.
If you have to deal with billions of integers, I'd try to encapuslate arrays of 40-bit numbers instead of single 40-bit numbers. That way, you can test different array implementations (e.g. an implementation that compresses data on the fly, or maybe one that stores less-used data to disk.) without changing the rest of your code.
Here's a sample implementation (http://rextester.com/SVITH57679):
class Int64Array
{
char* buffer;
public:
static const int BYTE_PER_ITEM = 5;
Int64Array(size_t s)
{
buffer=(char*)malloc(s*BYTE_PER_ITEM);
}
~Int64Array()
{
free(buffer);
}
class Item
{
char* dataPtr;
public:
Item(char* dataPtr) : dataPtr(dataPtr){}
inline operator int64_t()
{
int64_t value=0;
memcpy(&value, dataPtr, BYTE_PER_ITEM); // Assumes little endian byte order!
return value;
}
inline Item& operator = (int64_t value)
{
memcpy(dataPtr, &value, BYTE_PER_ITEM); // Assumes little endian byte order!
return *this;
}
};
inline Item operator[](size_t index)
{
return Item(buffer+index*BYTE_PER_ITEM);
}
};
Note: The memcpy-conversion from 40-bit to 64-bit is basically undefined behavior, as it assumes litte-endianness. It should work on x86-platforms, though.
Note 2: Obviously, this is proof-of-concept code, not production-ready code. To use it in real projects, you'd have to add (among other things):
error handling (malloc can fail!)
copy constructor (e.g. by copying data, add reference counting or by making the copy constructor private)
move constructor
const overloads
STL-compatible iterators
bounds checks for indices (in debug build)
range checks for values (in debug build)
asserts for the implicit assumptions (little-endianness)
As it is, Item has reference semantics, not value semantics, which is unusual for operator[]; You could probably work around that with some clever C++ type conversion tricks
All of those should be straightforward for a C++ programmer, but they would make the sample code much longer without making it clearer, so I've decided to omit them.
I'll assume that
this is C, and
you need a single, large array of 40 bit numbers, and
you are on a machine that is little-endian, and
your machine is smart enough to handle alignment
you have defined size to be the number of 40-bit numbers you need
unsigned char hugearray[5*size+3]; // +3 avoids overfetch of last element
__int64 get_huge(unsigned index)
{
__int64 t;
t = *(__int64 *)(&hugearray[index*5]);
if (t & 0x0000008000000000LL)
t |= 0xffffff0000000000LL;
else
t &= 0x000000ffffffffffLL;
return t;
}
void set_huge(unsigned index, __int64 value)
{
unsigned char *p = &hugearray[index*5];
*(long *)p = value;
p[4] = (value >> 32);
}
It may be faster to handle the get with two shifts.
__int64 get_huge(unsigned index)
{
return (((*(__int64 *)(&hugearray[index*5])) << 24) >> 24);
}
For the case of storing some billions of 40-bit signed integers, and assuming 8-bit bytes, you can pack 8 40-bit signed integers in a struct (in the code below using an array of bytes to do that), and, since this struct is ordinarily aligned, you can then create a logical array of such packed groups, and provide ordinary sequential indexing of that:
#include <limits.h> // CHAR_BIT
#include <stdint.h> // int64_t
#include <stdlib.h> // div, div_t, ptrdiff_t
#include <vector> // std::vector
#define STATIC_ASSERT( e ) static_assert( e, #e )
namespace cppx {
using Byte = unsigned char;
using Index = ptrdiff_t;
using Size = Index;
// For non-negative values:
auto roundup_div( const int64_t a, const int64_t b )
-> int64_t
{ return (a + b - 1)/b; }
} // namespace cppx
namespace int40 {
using cppx::Byte;
using cppx::Index;
using cppx::Size;
using cppx::roundup_div;
using std::vector;
STATIC_ASSERT( CHAR_BIT == 8 );
STATIC_ASSERT( sizeof( int64_t ) == 8 );
const int bits_per_value = 40;
const int bytes_per_value = bits_per_value/8;
struct Packed_values
{
enum{ n = sizeof( int64_t ) };
Byte bytes[n*bytes_per_value];
auto value( const int i ) const
-> int64_t
{
int64_t result = 0;
for( int j = bytes_per_value - 1; j >= 0; --j )
{
result = (result << 8) | bytes[i*bytes_per_value + j];
}
const int64_t first_negative = int64_t( 1 ) << (bits_per_value - 1);
if( result >= first_negative )
{
result = (int64_t( -1 ) << bits_per_value) | result;
}
return result;
}
void set_value( const int i, int64_t value )
{
for( int j = 0; j < bytes_per_value; ++j )
{
bytes[i*bytes_per_value + j] = value & 0xFF;
value >>= 8;
}
}
};
STATIC_ASSERT( sizeof( Packed_values ) == bytes_per_value*Packed_values::n );
class Packed_vector
{
private:
Size size_;
vector<Packed_values> data_;
public:
auto size() const -> Size { return size_; }
auto value( const Index i ) const
-> int64_t
{
const auto where = div( i, Packed_values::n );
return data_[where.quot].value( where.rem );
}
void set_value( const Index i, const int64_t value )
{
const auto where = div( i, Packed_values::n );
data_[where.quot].set_value( where.rem, value );
}
Packed_vector( const Size size )
: size_( size )
, data_( roundup_div( size, Packed_values::n ) )
{}
};
} // namespace int40
#include <iostream>
auto main() -> int
{
using namespace std;
cout << "Size of struct is " << sizeof( int40::Packed_values ) << endl;
int40::Packed_vector values( 25 );
for( int i = 0; i < values.size(); ++i )
{
values.set_value( i, i - 10 );
}
for( int i = 0; i < values.size(); ++i )
{
cout << values.value( i ) << " ";
}
cout << endl;
}
Yes, you can do that, and it will save some space for large quantities of numbers
You need a class that contains a std::vector of an unsigned integer type.
You will need member functions to store and to retrieve an integer. For example, if you want do store 64 integers of 40 bit each, use a vector of 40 integers of 64 bits each. Then you need a method that stores an integer with index in [0,64] and a method to retrieve such an integer.
These methods will execute some shift operations, and also some binary | and & .
I am not adding any more details here yet because your question is not very specific. Do you know how many integers you want to store? Do you know it during compile time? Do you know it when the program starts? How should the integers be organized? Like an array? Like a map? You should know all this before trying to squeeze the integers into less storage.
There are quite a few answers here covering implementation, so I'd like to talk about architecture.
We usually expand 32-bit values to 64-bit values to avoid overflowing because our architectures are designed to handle 64-bit values.
Most architectures are designed to work with integers whose size is a power of 2 because this makes the hardware vastly simpler. Tasks such as caching are much simpler this way: there are a large number of divisions and modulus operations which can be replaced with bit masking and shifts if you stick to powers of 2.
As an example of just how much this matters, The C++11 specification defines multithreading race-cases based on "memory locations." A memory location is defined in 1.7.3:
A memory location is either an object of scalar type or a maximal
sequence of adjacent bit-fields all having non-zero width.
In other words, if you use C++'s bitfields, you have to do all of your multithreading carefully. Two adjacent bitfields must be treated as the same memory location, even if you wish computations across them could be spread across multiple threads. This is very unusual for C++, so likely to cause developer frustration if you have to worry about it.
Most processors have a memory architecture which fetches 32-bit or 64-bit blocks of memory at a time. Thus use of 40-bit values will have a surprising number of extra memory accesses, dramatically affecting run-time. Consider the alignment issues:
40-bit word to access: 32-bit accesses 64bit-accesses
word 0: [0,40) 2 1
word 1: [40,80) 2 2
word 2: [80,120) 2 2
word 3: [120,160) 2 2
word 4: [160,200) 2 2
word 5: [200,240) 2 2
word 6: [240,280) 2 2
word 7: [280,320) 2 1
On a 64 bit architecture, one out of every 4 words will be "normal speed." The rest will require fetching twice as much data. If you get a lot of cache misses, this could destroy performance. Even if you get cache hits, you are going to have to unpack the data and repack it into a 64-bit register to use it (which might even involve a difficult to predict branch).
It is entirely possible this is worth the cost
There are situations where these penalties are acceptable. If you have a large amount of memory-resident data which is well indexed, you may find the memory savings worth the performance penalty. If you do a large amount of computation on each value, you may find the costs are minimal. If so, feel free to implement one of the above solutions. However, here are a few recommendations.
Do not use bitfields unless you are ready to pay their cost. For example, if you have an array of bitfields, and wish to divide it up for processing across multiple threads, you're stuck. By the rules of C++11, the bitfields all form one memory location, so may only be accessed by one thread at a time (this is because the method of packing the bitfields is implementation defined, so C++11 can't help you distribute them in a non-implementation defined manner)
Do not use a structure containing a 32-bit integer and a char to make 40 bytes. Most processors will enforce alignment and you wont save a single byte.
Do use homogenous data structures, such as an array of chars or array of 64-bit integers. It is far easier to get the alignment correct. (And you also retain control of the packing, which means you can divide an array up amongst several threads for computation if you are careful)
Do design separate solutions for 32-bit and 64-bit processors, if you have to support both platforms. Because you are doing something very low level and very ill-supported, you'll need to custom tailor each algorithm to its memory architecture.
Do remember that multiplication of 40-bit numbers is different from multiplication of 64-bit expansions of 40-bit numbers reduced back to 40-bits. Just like when dealing with the x87 FPU, you have to remember that marshalling your data between bit-sizes changes your result.
This begs for streaming in-memory lossless compression. If this is for a Big Data application, dense packing tricks are tactical solutions at best for what seems to require fairly decent middleware or system-level support. They'd need thorough testing to make sure one is able to recover all the bits unharmed. And the performance implications are highly non-trivial and very hardware-dependent because of interference with the CPU caching architecture (e.g. cache lines vs packing structure). Someone mentioned complex meshing structures : these are often fine-tuned to cooperate with particular caching architectures.
It's not clear from the requirements whether the OP needs random access. Given the size of the data it's more likely one would only need local random access on relatively small chunks, organised hierarchically for retrieval. Even the hardware does this at large memory sizes (NUMA). Like lossless movie formats show, it should be possible to get random access in chunks ('frames') without having to load the whole dataset into hot memory (from the compressed in-memory backing store).
I know of one fast database system (kdb from KX Systems to name one but I know there are others) that can handle extremely large datasets by seemlessly memory-mapping large datasets from backing store. It has the option to transparently compress and expand the data on-the-fly.
If what you really want is an array of 40 bit integers (which obviously you can't have), I'd just combine one array of 32 bit and one array of 8 bit integers.
To read a value x at index i:
uint64_t x = (((uint64_t) array8 [i]) << 32) + array32 [i];
To write a value x to index i:
array8 [i] = x >> 32; array32 [i] = x;
Obviously nicely encapsulated into a class using inline functions for maximum speed.
There is one situation where this is suboptimal, and that is when you do truly random access to many items, so that each access to an int array would be a cache miss - here you would get two cache misses every time. To avoid this, define a 32 byte struct containing an array of six uint32_t, an array of six uint8_t, and two unused bytes (41 2/3rd bits per number); the code to access an item is slightly more complicated, but both components of the item are in the same cache line.

Using SSE for vector initialzation

I am relative new to C++ (moved from Java for performance for my scientific app) and I know nothing about SSE. Still, I need to improve the very simple following code:
int myMax=INT_MAX;
int size=18000003;
vector<int> nodeCost(size);
/* init part */
for (int k=0;k<size;k++){
nodeCost[k]=myMax;
}
I have measured the time for the initialization part and it takes 13ms which is way too big for my scientific app (the entire algorithm runs in 22ms which means that the initialization takes 1/2 of the total time). Keep in mind that the initialization part will be repeated multiple times for the same vector.
As you see the size of the vector is not divided by 4. Is there a way to accelerate the initialization with SSE? Can you suggest how? Do I need to use arrays or SSE can be used with vectors as well?
Please, since I need your help let's all avoid a) "how did you measure the time" or b) "premature optimization is the root of all evil" which are both reasonable for you to ask but a) the measured time is correct b) I agree with it but I have no other choice. I do not want to parallelize the code with OpenMP, so SSE is the only fallback.
Thanks for your help
Use the vector's constructor:
std::vector<int> nodeCost(size, myMax);
This will most likely use an optimized "memset"-type of implementation to fill the vector.
Also tell your compiler to generate architecture-specific code (e.g. -march=native -O3 on GCC). On my x86_64 machine, this produces the following code for filling the vector:
L5:
add r8, 1 ;; increment counter
vmovdqa YMMWORD PTR [rax], ymm0 ;; magic, ymm contains the data, and eax...
add rax, 32 ;; ... the "end" pointer for the vector
cmp r8, rdi ;; loop condition, rdi holds the total size
jb .L5
The movdqa instruction, size-prefixed for 256-bit operations, copies 32 bytes to memory at once; it is part of the AVX instruction set.
Try std::fill first as already suggested, and then if that's still not fast enough you can go to SIMD if you really need to. Note that, depending on your CPU and memory sub-system, for large vectors such as this you may well hit your DRAM's maximum bandwidth and that could be the limiting factor. Anyway, here's a fairly simple SSE implementation:
#include <emmintrin.h>
const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
for (k = 0; k < size - 3; k += 4)
{
_mm_storeu_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k)
{
pNodeCost[k] = myMax;
}
This should work well on modern CPUs - for older CPUs you might need to handle the potential data misalignment better, i.e. use _mm_store_si128 rather than _mm_storeu_si128. E.g.
#include <emmintrin.h>
const __m128i vMyMax = _mm_set1_epi32(myMax);
int * const pNodeCost = &nodeCost[0];
for (k = 0; k < size && (((intptr_t)&pNodeCost[k] & 15ULL) != 0); ++k)
{ // initial scalar loop until we
pNodeCost[k] = myMax; // hit 16 byte alignment
}
for ( ; k < size - 3; k += 4) // 16 byte aligned SIMD loop
{
_mm_store_si128((__m128i *)&pNodeCost[k], vMyMax);
}
for ( ; k < size; ++k) // scalar loop to take care of any
{ // remaining elements at end of vector
pNodeCost[k] = myMax;
}
This is an extension of the ideas in Mats Petersson's comment.
If you really care about this, you need to improve your referential locality. Plowing through 72 megabytes of initialization, only to come back later to overwrite it, is extremely unfriendly to the memory hierarchy.
I do not know how to do this in straight C++, since std::vector always initializes itself. But you might try (1) using calloc and free to allocate the memory; and (2) interpreting the elements of the array as "0 means myMax and n means n-1". (I am assuming "cost" is non-negative. Otherwise you need to adjust this scheme a bit. The point is to avoid the explicit initialization.)
On a Linux system, this can help because calloc of a sufficiently large block does not need to explicitly zero the memory, since pages acquired directly from the kernel are already zeroed. Better yet, they only get mapped and zeroed the first time you touch them, which is very cache-friendly.
(On my Ubuntu 13.04 system, Linux calloc is smart enough not to explicitly initialize. If yours is not, you might have to do an mmap of /dev/zero to use this approach...)
Yes, this does mean every access to the array will involve adding/subtracting 1. (Although not for operations like "min" or "max".) Main memory is pretty darn slow by comparison, and simple arithmetic like this can often happen in parallel with whatever else you are doing, so there is a decent chance this could give you a big performance win.
Of course whether this helps will be platform dependent.