Efficient equality test for bitstrings with arbitrary offsets

Efficient equality test for bitstrings with arbitrary offsets - c++

I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which allows to store 4 tokens in a byte instead of just one (when using a char for each token / std::string for a sequence). I store each sequence in a char array.
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
Copy up to 32 tokens (each having 2 bit) from each subsequences into an uint64_t. This is realized in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
Compare the two uint64_t. If they are not equal, return.
Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>
using Block = char;
constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;
Block getTokenFromBlock(Block b, int nt) noexcept
{
return (b >> (nt * BitsPerToken)) & ((1UL << (BitsPerToken)) - 1);
}
bool seqEqual(Block const* seqA, int startA, int endA, Block const* seqB, int startB, int endB) noexcept
{
using CompareBlock = uint64_t;
constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;
const int len = endA - startA;
int posA = startA;
int posB = startB;
CompareBlock curA = 0;
CompareBlock curB = 0;
for (int i = 0; i < len; ++i, ++posA, ++posB)
{
const int cmpIdx = i % TokenPerBlock;
const int blockA = posA / TokenPerBlock;
const int idxA = posA % TokenPerBlock;
const int blockB = posB / TokenPerBlock;
const int idxB = posB % TokenPerBlock;
if ((i % TokenPerCompareBlock) == 0)
{
if (curA != curB)
return false;
curA = 0;
curB = 0;
}
curA += getTokenFromBlock(seqA[blockA], idxA) << (BitsPerToken * cmpIdx);
curB += getTokenFromBlock(seqB[blockB], idxB) << (BitsPerToken * cmpIdx);
}
if (curA != curB)
return false;
return true;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than two times slower than using an std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, it does not include equality tests.
How can I achieve fast equality tests using this compressed format?

First of all, this is the kind of computation where the target processor, RAM, compiler and compiler flags can drastically change the results. Unfortunately these critical information are not provided. Let's assume you use a quite recent mainstream x86-64 processor, a common DDR4-SDRAM, a compiler like Clang/GCC relatively up-to-date, and optimizations are enabled (ie. -O3 and possibly -march=native).
Clang and GCC use a fast comparison functions for comparing strings : respectively memcmp for GCC 12 and bcmp for Clang 15. The two functions are highly optimized on most platforms : they typically compare short strings by blocks of 8 bytes (uint64_t) and large strings by using SIMD instructions.
Your optimization is good to reduce the memory footprint but it introduces more computation and there is a high chance for the operation to be already compute-bound if the input buffer is already in the CPU cache. In addition, the computation is not SIMD-friendly due to the inner loop : the compiler will certainly not generate an efficient code due toe the bit-wise operations. The thing is scalar codes are slow. In fact, scalar byte-per-byte computations are generally so slow that they are usually far from being able to saturate the RAM bandwidth (at least the one achievable using only 1 core) as opposed to to memcmp. For example, a Skylake/Coffeelake processor at 4 GHz can only read 8 GiB/s from the L1 cache using a scalar byte-per-byte code while an AVX-2 SIMD code can read 256 GiB/s. For the write it is twice smaller : 4 GiB/s VS 128 GiB/s. A 1-channel DDR4-SDRAM # 3200MHz can theoretically reach ~24 GiB/s, that is, far more than a byte-per-byte scalar sequential code. The L3 cache have a much bigger bandwidth.
If you want a fast code for large sequences, then you need to either help your compiler so it can use SIMD instruction (not so easy in this case), to use non-portable SIMD intrinsics or possibly to use a relatively-portable SIMD library to generate quite-good SIMD code (though low-level platform-dependent intrinsics are more flexible/featureful).
I expect the main bottleneck to come from the "loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t". Indeed, this loop will likely generate a dependency chain of instructions (operating on the same uint64_t variable) that cannot be executed efficiently by the processor nor easily optimized by the compiler.
A typical solution would be to read blocks of 8 bytes (using memcpy to do it correctly, and hope the compiler optimize it properly). The bits can be reordered using a bswap instruction on x86-64 processors and it is not needed on big-endian processors. A shift+mask can be applied so to compare only the useful part. Here is an (untested) example to show the idea:
if(length >= 16)
{
uint64_t block1, block2;
uint64_t prev_block1 = 0, prev_block2 = 0;
unsigned int shift1 = (start1 % 4) * 2;
unsigned int shift2 = (start2 % 4) * 2;
uint64_t mask = 0xFFFFFFFFFFFFFF00ull;
// Read blocks 7 byte per 7 byte for sake of simplicity
for(size_t i=0; i<length-7 ; i+=7)
{
// Safe and cheap and GCC/Clang
memcpy(&block1, charArray1[i], 8);
memcpy(&block2, charArray2[i], 8);
// Architecture-dependent: reorder bytes on little-endian processors.
// There is a fast instruction for that on x86-64 processors: bswap.
// See: https://stackoverflow.com/questions/36497605
block1 = reorder_bytes(block1);
block2 = reorder_bytes(block2);
block1 = (block1 << shift1) & mask;
block2 = (block2 << shift2) & mask;
if(block1 != block2)
return false;
}
}
// TODO: compute the reminder part for the last block
This operation can be done using the SSE/AVX instruction set so to be faster for large sequences. Note you can perform a special optimization when shift1 == shift2 (especially when the both are equal to 0).
One should keep in mind that the bit-packing computation is pretty expensive, even using a SIMD code. It will certainly not be faster than a memcpy unless the operation is memory bound which is unlikely to be the case. For example, a Skylake/Coffeelake processor can load and compare 2 blocks of 32 bytes (ie. 32 tokens per block) in only 1 cycle (reciprocal throughput) using the AVX-2 SIMD instruction set, while there is no chance each iteration of the above bit-packing loop can take less than 2 cycles to compute 7 bytes (ie. 28 tokens). Using AVX-2 to optimize the above code is possible but the AVX lanes and the byte reordering results in several additional instructions being required so it will certainly be still slightly slower than just a basic very-fast comparisons (few cycles to compute ~120 tokens).
The only use-case where packing can help is when multiple core are used to do the computation. Indeed, in that case, the bit-packing code can scale well because it is likely compute-bound while the string-based version will quickly be limited by the speed of the RAM since it is likely memory-bound.

If there are only 10million tokens total, its 20Mbit or 2-3MB. If you keep their shifted versions in different arrays such as from 2 bit shifted to 30 bit shifted (assuming 4byte comparison at once, ignore 32 bit shift as it means just a different starting position), you can do a direct comparison (std::memcmp) with no shifting involved (fast) after selecting the right array with modulo of the arbitrary offset. But this requires the token sequence to be constant through many function calls (if not lifetime of program).
If these tokens are part of a much bigger data, you can put a caching layer (that caches fixed length chunks and joins them to get requested sub-sequence for A and B) just before the shifted initialization. Maybe LRU/LFU works fast enough if its token access pattern is cache-friendly. If its not cache friendly, then perhaps just reaching the arrays could be the bottleneck with or without shifting.
If you do checking per byte instead of per 4 bytes, it requires only 4 arrays instead of 16 and it shouldn't add too big requirement with caching.
You can also add an XOR result of fixed-length (like 50-100) sub-sequences for every offset as a way of quicker exiting. Again, this requires 4x more memory space. If XOR results of first tokens (+fixed length) are not equal, then they are not equal. This would reduce number of comparisons at least.
Another way is directly caching f(x,y)->bool like Python language does with its own caching. But this would be much worse than "fixed-length-chunked-caching & joining them" due to non-reusable parts & a lot of duplication.

Related

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
0 or 1 for 95% of all cases,
2 in 4% of cases,
between 3 and 255 in
the other 1% of cases?
So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth, so when having more than a few threads doing that at the same time for different arrays, currently the whole RAM bandwidth is quickly saturated.
I'm asking since it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. So when 95% of all values in the array would only actually need 1 bit instead of 8 bit, this would reduce memory usage by almost an order of magnitude.
It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separated 4 byte per value (24 bit for original element index, 8 bit for actual value, so (idx << 8) | value)) sorted array for the other ones.
When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log(n) with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).
If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
Quick benchmark
#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>
using namespace std::chrono;
/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fail some test suites
struct XorShift32 {
/// This stuff allows to use this class wherever a library function
/// requires a UniformRandomBitGenerator (e.g. std::shuffle)
typedef uint32_t result_type;
static uint32_t min() { return 1; }
static uint32_t max() { return uint32_t(-1); }
/// PRNG state
uint32_t y;
/// Initializes with seed
XorShift32(uint32_t seed = 0) : y(seed) {
if(y == 0) y = 2463534242UL;
}
/// Returns a value in the range [1, 1<<32)
uint32_t operator()() {
y ^= (y<<13);
y ^= (y>>17);
y ^= (y<<15);
return y;
}
/// Returns a value in the range [0, limit); this conforms to the RandomFunc
/// requirements for std::random_shuffle
uint32_t operator()(uint32_t limit) {
return (*this)()%limit;
}
};
struct mean_variance {
double rmean = 0.;
double rvariance = 0.;
int count = 0;
void operator()(double x) {
++count;
double ormean = rmean;
rmean += (x-rmean)/count;
rvariance += (x-ormean)*(x-rmean);
}
double mean() const { return rmean; }
double variance() const { return rvariance/(count-1); }
double stddev() const { return std::sqrt(variance()); }
};
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;
uint8_t lookup(unsigned idx) {
// extract the 2 bits of our interest from the main array
uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
// usual (likely) case: value between 0 and 2
if(v != 3) return v;
// bad case: lookup the index<<8 in the secondary array
// lower_bound finds the first >=, so we don't need to mask out the value
auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
// some coherency checks
if(ptr == sec_arr.end()) std::abort();
if((*ptr >> 8) != idx) std::abort();
#endif
// extract our 8-bit value from the 32 bit (index, value) thingie
return (*ptr) & 0xff;
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
volatile unsigned out;
int main() {
XorShift32 xs;
std::vector<uint8_t> vec;
int size = 10000000;
for(int i = 0; i<size; ++i) {
uint32_t v = xs();
if(v < 1825361101) v = 0; // 42.5%
else if(v < 4080218931) v = 1; // 95.0%
else if(v < 4252017623) v = 2; // 99.0%
else {
while((v & 0xff) < 3) v = xs();
}
vec.push_back(v);
}
populate(vec.data(), vec.size());
mean_variance lk_t, arr_t;
for(int i = 0; i<50; ++i) {
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += lookup(xs() % size);
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "lookup: %10d µs\n", dur);
lk_t(dur);
}
{
unsigned o = 0;
auto beg = high_resolution_clock::now();
for(int i = 0; i < size; ++i) {
o += vec[xs() % size];
}
out += o;
int dur = (high_resolution_clock::now()-beg)/microseconds(1);
fprintf(stderr, "array: %10d µs\n", dur);
arr_t(dur);
}
}
fprintf(stderr, " lookup | ± | array | ± | speedup\n");
printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
lk_t.mean(), lk_t.stddev(),
arr_t.mean(), arr_t.stddev(),
arr_t.mean()/lk_t.mean());
return 0;
}
(code and data always updated in my Bitbucket)
The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:
performs a random lookup of 10M elements with my data structure
does the same through the original array.
(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)
These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).
I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on some machines; most of them are running Ubuntu 16.04, some some older Linux, some some newer Linux. I don't think the OS should be relevant at all in this case.
CPU | cache | lookup (µs) | array (µs) | speedup (x)
Xeon E5-1650 v3 # 3.50GHz | 15360 KB | 60011 ± 3667 | 29313 ± 2137 | 0.49
Xeon E5-2697 v3 # 2.60GHz | 35840 KB | 66571 ± 7477 | 33197 ± 3619 | 0.50
Celeron G1610T # 2.30GHz | 2048 KB | 172090 ± 629 | 162328 ± 326 | 0.94
Core i3-3220T # 2.80GHz | 3072 KB | 111025 ± 5507 | 114415 ± 2528 | 1.03
Core i5-7200U # 2.50GHz | 3072 KB | 92447 ± 1494 | 95249 ± 1134 | 1.03
Xeon X3430 # 2.40GHz | 8192 KB | 111303 ± 936 | 127647 ± 1503 | 1.15
Core i7 920 # 2.67GHz | 8192 KB | 123161 ± 35113 | 156068 ± 45355 | 1.27
Xeon X5650 # 2.67GHz | 12288 KB | 106015 ± 5364 | 140335 ± 6739 | 1.32
Core i7 870 # 2.93GHz | 8192 KB | 77986 ± 429 | 106040 ± 1043 | 1.36
Core i7-6700 # 3.40GHz | 8192 KB | 47854 ± 573 | 66893 ± 1367 | 1.40
Core i3-4150 # 3.50GHz | 3072 KB | 76162 ± 983 | 113265 ± 239 | 1.49
Xeon X5650 # 2.67GHz | 12288 KB | 101384 ± 796 | 152720 ± 2440 | 1.51
Core i7-3770T # 2.50GHz | 8192 KB | 69551 ± 1961 | 128929 ± 2631 | 1.85
The results are... mixed!
In general, on most of these machines there is some kind of speedup, or at least they are on a par.
The two cases where the array truly trumps the "smart structure" lookup are on a machines with lots of cache and not particularly busy: the Xeon E5-1650 above (15 MB cache) is a night build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high performance calculations, in an idle moment as well. It does make sense, the original array fits completely in their huge cache, so the compact data structure only adds complexity.
At the opposite side of the "performance spectrum" - but where again the array is slightly faster, there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with cache small enough perform similarly.
The Xeon X5650 must be taken with some caution - they are virtual machines on a quite busy dual-socket virtual machine server; it may well be that, although nominally it has a decent amount of cache, during the time of the test it gets preempted by completely unrelated virtual machines several times.

Another option could be
check if the result is 0, 1 or 2
if not do a regular lookup
In other words something like:
unsigned char lookup(int index) {
int code = (bmap[index>>2]>>(2*(index&3)))&3;
if (code != 3) return code;
return full_array[index];
}
where bmap uses 2 bits per element with the value 3 meaning "other".
This structure is trivial to update, uses 25% more memory but the big part is looked up only in 5% of the cases. Of course, as usual, if it's a good idea or not depends on a lot of other conditions so the only answer is experimenting with real usage.

This is more of a "long comment" than a concrete answer
Unless your data is something that is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".
For not well known patterns like what I think yours is, nobody will KNOW directly which is better, and it depends on the details: how random is the random access - is the system accessing clusters of data items, or is it completely random like from a uniform random number generator. Is the table data completely random, or are there sequences of 0 then sequences of 1, with a scattering of other values? Run length encoding would work well if you have reasonably long sequences of 0 and 1, but won't work if you have "checkerboard of 0/1". Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly.
I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor is pretty useless. The caller is so rarely the same as one recently calling someone, that there is no pre-loaded data of any kind, it's just purely random. Big page-tables is the best optimisation for that type of access.
In a lot of cases, compromising between "speed and small size" is one of those things you have to pick between in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have "better" use for the RAM, then optimising for size of the table would give you sufficient performance and a good improvement on size. There are lots of different ways you could achieve this - as suggested in a comment, a 2 bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash-table would be my first approach, but a list or binary tree may work too - again, it depends on the patterns of where your "not 0, 1 or 2" are. Again, it depends on how the values are "scattered" in the table - are they in clusters or are they more of an evenly distributed pattern?
But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".
The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.
In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.

What I've done in the past is use a hashmap in front of a bitset.
This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).
Often, however, "cache is king".

Unless there is pattern to your data it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targetting a normal computer - 10 MB isn't that big a deal anyway.
There are two assumptions in your questions:
The data is being poorly stored because you aren't using all the bits
Storing it better would make things faster.
I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error prone. To need to divert from this general principle you need a stronger reason than potentially six "wasted" bits on 95% of your data.
For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array in either case, the processor will struggle to know which bits of data to cache and keep in either case.

If the data and accesses are uniformly randomly distributed, performance is probably going to depend upon what fraction of accesses avoid an outer-level cache miss. Optimizing that will require knowing what size array can be reliably accommodated in cache. If your cache is large enough to accommodate one byte for every five cells, the simplest approach may be to have one byte hold the five base-three encoded values in the range 0-2 (there are 243 combinations of 5 values, so that will fit in a byte), along with a 10,000,000 byte array that would be queried whenever an the base-3 value indicates "2".
If the cache isn't that big, but could accommodate one byte per 8 cells, then it would not be possible to use one byte value to select from among all 6,561 possible combinations of eight base-3 values, but since the only effect of changing a 0 or 1 to a 2 would be to cause an otherwise-unnecessary lookup, correctness wouldn't require supporting all 6,561. Instead, one could focus on the 256 most "useful" values.
Especially if 0 is more common than 1, or vice versa, a good approach might be to use 217 values to encode the combinations of 0 and 1 that contain 5 or fewer 1's, 16 values to encode xxxx0000 through xxxx1111, 16 to encode 0000xxxx through 1111xxxx, and one for xxxxxxxx. Four values would remain for whatever other use one might find. If the data are randomly distributed as described, a slight majority of all queries would hit bytes which contained just zeroes and ones (in about 2/3 of all groups of eight, all bits would be zeroes and ones, and about 7/8 of those would have six or fewer 1 bits); the vast majority of those that didn't would land in a byte which contained four x's, and would have a 50% chance of landing on a zero or a one. Thus, only about one in four queries would necessitate a large-array lookup.
If the data are randomly distributed but the cache isn't big enough to handle one byte per eight elements, one could try to use this approach with each byte handling more than eight items, but unless there is a strong bias toward 0 or toward 1, the fraction of values that can be handled without having to do a lookup in the big array will shrink as the number handled by each byte increases.

I'll add to #o11c's answer, as his wording might be a bit confusing.
If I need to squeeze the last bit and CPU cycle I'd do the following.
We will start by constructing a balanced binary search tree that holds the 5% "something else" cases. For every lookup, you walk the tree quickly: you have 10000000 elements: 5% of which is in the tree: hence the tree data structure holds 500000 elements. Walking this in O(log(n)) time, gives you 19 iterations. I'm no expert at this, but I guess there are some memory-efficient implementations out there. Let's guesstimate:
Balanced tree, so subtree position can be calculated (indices do not need to be stored in the nodes of the tree). The same way a heap (data structure) is stored in linear memory.
1 byte value (2 to 255)
3 bytes for the index (10000000 takes 23 bits, which fits 3 bytes)
Totalling, 4 bytes: 500000*4 = 1953 kB. Fits the cache!
For all the other cases (0 or 1), you can use a bitvector. Note that you cannot leave out the 5% other cases for random access: 1.19 MB.
The combination of these two use approximately 3,099 MB. Using this technique, you will save a factor 3.08 of memory.
However, this doesn't beat the answer of #Matteo Italia (which uses 2.76 MB), a pity. Is there anything we can do extra? The most memory consuming part is the 3 bytes of index in the tree. If we can get this down to 2, we would save 488 kB and the total memory usage would be: 2.622 MB, which is smaller!
How do we do this? We have to reduce the indexing to 2 bytes. Again, 10000000 takes 23 bits. We need to be able to drop 7 bits. We can simply do this by partitioning the range of 10000000 elements into 2^7 (=128) regions of 78125 elements. Now we can build a balanced tree for each of these regions, with 3906 elements on average. Picking the right tree is done by a simple division of the target index by 2^7 (or a bitshift >> 7). Now the required index to store can be represented by the remaining 16 bits. Note that there is some overhead for the length of the tree that needs to be stored, but this is negligible. Also note that this splitting mechanism reduces the required number of iterations to walk the tree, this now reduces to 7 iterations less, because we dropped 7 bits: only 12 iterations are left.
Note that you could theoretically repeat the process to cut off the next 8 bits, but this would require you to create 2^15 balanced trees, with ~305 elements on average. This would result in 2.143 MB, with only 4 iterations to walk the tree, which is a considerable speedup, compared to the 19 iterations we started with.
As a final conclusion: this beats the 2-bit vector strategy by a tiny bit of memory usage, but is a whole struggle to implement. But if it can make the difference between fitting the cache or not, it might be worth the try.

If you only perform read operations it would be better to not assign a value to an single index but to an interval of indices.
For example:
[0, 15000] = 0
[15001, 15002] = 153
[15003, 26876] = 2
[25677, 31578] = 0
...
This can be done with a struct. You also might want to define a class similar to this if you like an OO approach.
class Interval{
private:
uint32_t start; // First element of interval
uint32_t end; // Last element of interval
uint8_t value; // Assigned value
public:
Interval(uint32_t start, uint32_t end, uint8_t value);
bool isInInterval(uint32_t item); // Checks if item lies within interval
uint8_t getValue(); // Returns the assigned value
}
Now you just have to iterate trough a list of intervals and check if your index lies within one of them which can be much less memory intensive in average but costs more CPU resources.
Interval intervals[INTERVAL_COUNT];
intervals[0] = Interval(0, 15000, 0);
intervals[1] = Interval(15001, 15002, 153);
intervals[2] = Interval(15003, 26876, 2);
intervals[3] = Interval(25677, 31578, 0);
...
uint8_t checkIntervals(uint32_t item)
for(int i=0; i<INTERVAL_COUNT-1; i++)
{
if(intervals[i].isInInterval(item) == true)
{
return intervals[i].getValue();
}
}
return DEFAULT_VALUE;
}
If you order the intervals by descending size you increase the probability that the item you are looking for is found early which further decreases your average memory and CPU resource usage.
You could also remove all intervals with a size of 1. Put the corresponding values into a map and check them only if the item you are looking for wasn't found in the intervals. This should also raise the average performance a bit.

Long long time ago, I can just remember...
In university we got a task to accelerate a ray tracer program, that has to read by algorithm over and over again from buffer arrays. A friend told me to always use RAM-reads that are multiples of 4Bytes. So I changed the array from a pattern of [x1,y1,z1,x2,y2,z2,...,xn,yn,zn] to a pattern of [x1,y1,z1,0,x2,y2,z2,0,...,xn,yn,zn,0]. Means I add a empty field after each 3D coordinate. After some performance testing: It was faster.
So long story short: Read multiple of 4 Bytes from your array from RAM, and maybe also from the right starting position, so you read a little cluster where the searched index is in it and read the searched index from this little cluster in cpu. (In your case you will not need to inserting fill-fields, but the concept should be clear)
Maybe also other multiples could be the key in newer systems.
I don't know if this will work in your case, so if it doesn't work: Sorry. If it work I would be happy to hear about some test results.
PS: Oh and if there is any access pattern or nearby accessed indices, you can reuse the cached cluster.
PPS: It could be, that the multiple factor was more like 16Bytes or something like that, it's too long ago, that I can remember exactly.

Looking at this, you could split your data, for example:
a bitset which gets indexed and represents the value 0 (std::vector would be useful here)
a bitset which gets indexed and represents the value 1
a std::vector for the values of 2, containing the indexes which refer to this value
a map for the other values (or std::vector>)
In this case, all values appear till a given index, so you could even remove one of bitsets and represents the value as it being missing in the other ones.
This will save you some memory for this case, though would make the worst case worse.
You'll also need more CPU power to do the lookups.
Make sure to measure!

Like Mats mentions in his comment-answer, it is hard to say what is actually the best solution without knowing specifically what kind of data you have (e.g., are there long runs of 0's, and so on), and what your access pattern looks like (does "random" mean "all over the place" or just "not strictly in completely linear fashion" or "every value exactly once, just randomized" or ...).
That said, there are two mechanisms coming to mind:
Bit arrays; i.e., if you only had two values, you could trivially compress your array by a factor of 8; if you have 4 values (or "3 values + everything else") you can compress by a factor of two. Which might just not be worth the trouble and would need benchmarks, especially if you have really random access patterns which escape your caches and hence do not change the access time at all.
(index,value) or (value,index) tables. I.e., have one very small table for the 1% case, maybe one table for the 5% case (which only needs to store the indexes as all have the same value), and a big compressed bit array for the final two cases. And with "table" I mean something which allows relatively quick lookup; i.e., maybe a hash, a binary tree, and so on, depending on what you have available and your actual needs. If these subtables fit into your 1st/2nd level caches, you might get lucky.

I am not very familiar with C, but in C++ you can use unsigned char to represent an integer in the range 0 - 255.
Compared to normal int (again, I am coming from Java and C++ world) in which 4 byte (32 bit) are required, an unsigned char requires 1 byte (8 bits).
so it might reduce the total size of the array by 75%.

You have succinctly described all the distribution characteristics of your array; toss the array.
You can easily replace the array with a randomized method that produces the same probabilistic output as the array.
If consistency matters (producing the same value for the same random index), consider using a bloom filter and/or hash map to track repeat hits. If your array accesses really are random, though, this is totally unnecessary.

How to optimize strtok + atoll

Is there any good way to optimize this function in terms of execution time? My final goal is to parse a long string composed of several integers (thousands of integer per line, and thousands of lines). This was my initial solution.
int64_t get_next_int(char *newLine) {
char *token=strtok(newLine, " ");
if( token == NULL ) {
exit(0);
}
return atoll(token);
}
More details: I need the "state" based implementation of strtok, so the padding implemented by strtok should exist in the final string. Atoll does not need of any kind of verification.
Target system: Intel x86_64 (Xeon series)
Related topics:
atoi optimization: C++ most efficient way to convert string to int (faster than atoi)

First off: I find optimizing string conversion routines in signal processing chains most of the time to be totally in vain. The speed at which your system loads data in string form (which will probably happen from some mass storage, where it was put by something that didn't care about performance, since it wouldn't have chosen a string format in the first place, otherwise), and if you compare read speeds of all but clusters of SSDs attached via PCIe with how fast atoll is, you'll notice that you're losing a negligible amount of time on inefficient conversion. If you pipeline loading parts of that string with conversion, the time spent waiting for storage will not even be remotely filled up with converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.
I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about std::iostream performance.
Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.
So you'd have roughly something like, on a modern, SSE4.2 capable x86 processor
Outer loop, jumps in steps of 16:
load 128 bit of input string into 128 bit SIMD register
run something like __mm_cmpestri to find indices of delimiters and \0 terminator in all these 16 bytes at once
inner loop over the found indices
Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with 0
prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
subtract 0 from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8)
convert the eight 16bit subwords to eight 128 bit words containing two packed 64bit integers each (one instruction per 16bit, _mm_cvtepi8_epi64, I think)
initialize a __mm128 register with [10^15 10^14], let's call it powers
loop over pairs dual-64bit words: (each step should be one SSE instruction)
multiply first with powers
divide powers by [100 100]
multiply second with powers
add results to dual-64bit accumulator
sum the two values in accumulator
store the result to integer array

I'd rather use something along the lines of a std::istringstream:
int64_t get_next_int(std::istringstream& line) {
int64_t token;
if(!(line >> token))
exit(0);
return token;
}
std::istringstream line(newLine);
int64_t i = get_next_int(line);
strtok() has well known drawbacks, and you don't want to use it at all.

What about
int n= 0;
// Find the token
for ( ; *newline == ' '; newline++)
;
if (*newline == 0)
// Not found
exit(0);
// Scan and convert the token
for ( ; unsigned(*newline - '0') < 10; newline++)
n= 10 * n + *newline - '0';
return n;

AFA I get from your code at first splitting it will return. It seems at first parsing(before space character) it will returun 0 if it is non-number entry or combined alphabetic and number in such a way that alphabetic at beginning . If combined and number at beginning, it will return the number merely. Namely, you just need a string for the conversion. So you don't need tokenizing just check the string is null or not. You can change return type as well. Because, if you need a type with _exactly_ 64 bits, use (u)int64_t, if you need _at least_ 64 bits, (unsigned) long long is perfectly fine, as would be (u)int_least64_t. I think your code is little gobbledygook. Show what you exactly want without simplification.
/*
* ascii-to-longlong conversion
*
* no error checking; assumes decimal digits
*
* efficient conversion:
* start with value = 0
* then, starting at first character, repeat the following
* until the end of the string:
*
* new value = (10 * (old value)) + decimal value of next character
*
*/
long long my_atoll(char *instr)
{
if(str[0] == '\0')
return -1;
long long retval;
int i;
retval = 0;
for (; *instr; instr++) {
retval = 10*retval + (*instr - '0');
}
return retval;
}

Fastest way to downcast an array short to char

I have to process roughly 2000, 100 element arrays every second. The arrays come to me as shorts, w/ the data in the upper bits and need to be shifted and cast to chars. Is this as efficient as I can get, or is there a faster way to perform this operation? (I have to skip 2 of the values)
for(int i = 0; i < 48; i++)
{
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}

Even if shift and bitwise operation are fast, you can try to process the short array as a char pointer as other advised in comments. It is allowed per standard and for common architectures does what is expected - left the endianness problem.
So you could try to first determine your endianness:
bool isBigEndian() {
short i = 1; // sets only lowest order bit
char *ix = reinterpret_cast<char *>(&i);
return (*ix == 0); // will be 1 if little endian
}
Your loop now becomes:
int shft = isBigEndian()? 0 : 1;
char * pb = reinterpret_cast<char *>(b);
for(int i = 0; i < 48; i++)
{
a[i] = pt[2 * i + shft];
a[i+48] = pt[2 * i + 50 + shft];
}
But as always for low level optimisation, this has to be benchmarked with the compiler and compiler options that will be used in production code.

You could put a wrapper class around these arrays, so code that accesses elements of the wrapper in order actually accesses every other byte of the underlying memory.
This will probably defeat auto-vectorization, though. Other than that, having all the code that would read a actually read b and increment its pointers by two instead of one shouldn't change the cost at all.
The two skipped elements are a problem, though. Having your operator[] do if (i>=48) i+=2 might kill this idea. memmove will often be much faster than storing one byte at a time, so you could consider using memmove to make a contiguous array of shorts that you can index even though it seems silly to copy without storing in a better format.
The trick will be to write a wrapper that completely optimizes away to no extra instructions in loops over your arrays. This is possible on x86, where scaled indexing is available in normal effective-addresses in asm instructions, so if the compiler understands what's going on, it can make code that's just as efficient.
Having arrays of shorts does take twice as much memory, so cache effects could matter.
It all depends on what you need to do with the byte arrays.
If you do need to convert, use SIMD
For x86 targets, you can get a big speedup with SIMD vectors instead of looping one char at a time. For other compile targets you care about, you can write similar special versions. I assume ARM NEON has similar shuffling capability, for example.
When writing a platform-specific version, you also get to make all the endian and unaligned-access assumptions that are true on that platform.
#ifdef __SSE2__ // will be true for all x86-64 builds and most i386 builds
#include <immintrin.h>
static __m128i pack2(const short *p) {
__m128i lo = _mm_loadu_si128((__m128i*)p);
__m128i hi = _mm_loadu_si128((__m128i*)(p + 8));
lo = _mm_srli_epi16(lo, 8); // logical shift, not arithmetic, because we need the high byte to be zero
hi = _mm_srli_epi16(hi, 8);
return _mm_packus_epi16(lo, hi); // treats input as signed, saturates to unsigned 0x0 .. 0xff range
}
#endif // SSE2
void conv(char *a, const short *b) {
#ifdef __SSE2__
for(int i = 0; i < 48; i+=16) {
__m128i low = pack2(b+i);
_mm_storeu_si128((__m128i *)(a+i), low);
__m128i high = pack2(b+i + 50);
_mm_storeu_si128((__m128i *)(a+i + 48), high);
}
#else
/******* Fallback C version *******/
for(int i = 0; i < 48; i++) {
a[i] = (char)(b[i] >> 8);
a[i+48] = (char)(b[i+50] >> 8);
}
#endif
}
As you can see on the Godbolt Compiler Explorer, gcc fully unrolls the loop since it's only a few iterations when storing 16B at a time.
This should perform ok, but on pre-Skylake will bottleneck on shifting both vectors of shorts before the store. Haswell can only sustain one psrli per clock. (Skylake can sustain one per 0.5c when the shift-count is an immediate. See Agner Fog's guide and insn tables, links at the x86 tag wiki.)
You might get better results from loading from (__m128i*)(1 + (char*)p) so the bytes we want are already in the low half of each 16bit element. We'd still have to mask off the high half of each element with _mm_and_si128 instead of shifting, but PAND can run on any vector execution port, so it has three per clock throughput.
More importantly, with AVX it can be combined with an unaligned load. e.g. vpand xmm0, xmm5, [rsi], where xmm5 is a mask of _mm_set1_epi16(0x00ff), and [rsi] holds 2*i + 1 + (char*)b. fused-domain uop throughput is probably going to be an issue, like is common for code with a lot of loads/stores as well as computation.
Unaligned accesses are slightly slower than aligned accesses, but at least half your vector accesses will be unaligned anyway (since skipping two shorts means skipping 4B). On Intel SnB-family CPUs, I don't think it's slower to have loads that are split across a cache-line boundary in a 15:1 split compared to a 12:4 split. (The no-split case is definitely faster, though.) If b is 16B-aligned, then it'll be worth testing the mask version against the shift version.
I didn't write up complete code for this version, because you'll end up reading one byte past the end of b unless you take special precautions. This is fine if you make sure b has padding of some sort so it doesn't go right to the end of a memory page.
AVX2
With AVX2, vpackuswb ymm operates in two separate lanes. IDK if there's anything to gain from doing the load and mask (or shift) on 256b vectors and then using a vextracti128 and 128b pack on the two halves of the 256b vector.
Or maybe do a 256b pack between two vectors and then a vpermq (_mm256_permute4x64_epi64) to sort things out:
lo = _mm256_loadu(b..); // { b[15..8] | b[7..0] }
hi = // { b[31..24] | b[23..16] }
// mask or shift
__m256i packed = _mm256_packus_epi16(lo, hi); // [ a31..24 a15..8 | a23..16 a7..0 ]
packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
Of course, use any portable optimizations you can in the C version. e.g. Serge Ballesta's suggestion of just copying the desired bytes after figuring out their location from the endianness of the machine. (Preferably at compile time by checking GNU C's __BYTE_ORDER__ macro.

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

I want to calculate y = ax + b, where x and y is a pixel value [i.e, byte with value range is 0~255], while a and b is a float
Since I need to apply this formula for each pixel in image, in addition, a and b is different for different pixel. Direct calculation in C++ is slow, so I am kind of interest to know the sse2 instruction in c++..
After searching, I find that the multiplication and addition in float with sse2 is just as _mm_mul_ps and _mm_add_ps. But in the first place I need to convert the x in byte to float (4 byte).
The question is, after I load the data from byte-data source (_mm_load_si128), how can I convert the data from byte to float?

a and b are different for each pixel? That's going to make it difficult to vectorize, unless there's a pattern or you can generate them in vectors.
Is there any way you can efficiently generate a and b in vectors, either as fixed-point or floating point? If not, inserting 4 FP values, or 8 16bit integers, might be worse than just scalar ops.
Fixed point
If a and b can be reused at all, or generated with fixed-point in the first place, this might be a good use-case for fixed-point math. (i.e. integers that represent value * 2^scale). SSE/AVX don't have a 8b*8b->16b multiply; the smallest elements are words, so you have to unpack bytes to words, but not all the way to 32bit. This means you can process twice as much data per instruction.
There's a _mm_maddubs_epi16 instruction which might be useful if b and a change infrequently enough, or you can easily generate a vector with alternating a2^4 and b2^1 bytes. Apparently it's really handy for bilinear interpolation, but it still gets the job done for us with minimal shuffling, if we can prepare an a and b vector.
float a, b;
const int logascale = 4, logbscale=1;
const int ascale = 1<<logascale; // fixed point scale for a: 2^4
const int bscale = 1<<logbscale; // fixed point scale for b: 2^1
const __m128i brescale = _mm_set1_epi8(1<<(logascale-logbscale)); // re-scale b to match a in the 16bit temporary result
for (i=0 ; i<n; i+=16) {
//__m128i avec = get_scaled_a(i);
//__m128i bvec = get_scaled_b(i);
//__m128i ab_lo = _mm_unpacklo_epi8(avec, bvec);
//__m128i ab_hi = _mm_unpackhi_epi8(avec, bvec);
__m128i abvec = _mm_set1_epi16( ((int8_t)(bscale*b) << 8) | (int8_t)(ascale*a) ); // integer promotion rules might do sign-extension in the wrong place here, so check this if you actually write it this way.
__m128i block = _mm_load_si128(&buf[i]); // call this { v[0] .. v[15] }
__m128i lo = _mm_unpacklo_epi8(block, brescale); // {v[0], 8, v[1], 8, ...}
__m128i hi = _mm_unpackhi_epi8(block, brescale); // {v[8], 8, v[9], 8, ...
lo = _mm_maddubs_epi16(lo, abvec); // first arg is unsigned bytes, 2nd arg is signed bytes
hi = _mm_maddubs_epi16(hi, abvec);
// lo = { v[0]*(2^4*a) + 8*(2^1*b), ... }
lo = _mm_srli_epi16(lo, logascale); // truncate from scaled fixed-point to integer
hi = _mm_srli_epi16(hi, logascale);
// and re-pack. Logical, not arithmetic right shift means sign bits can't be set
block = _mm_packuswb(lo, hi);
_mm_store_si128(&buf[i], block);
}
// then a scalar cleanup loop
2^4 is an arbitrary choice. It leaves 3 non-sign bits for the integer part of a, and 4 fraction bits. So it effectively rounds a to the nearest 16th, and overflows if it has a magnitude greater than 8 and 15/16ths. 2^6 would give more fractional bits, and allow a from -2 to +1 and 63/64ths.
Since b is being added, not multiplied, its useful range is much larger, and fractional part much less useful. To represent it in 8 bits, rounding it to the nearest half still keeps a little bit of fractional information, but allows it to be [-64 : 63.5] without overflowing.
For more precision, 16b fixed-point is a good choice. You can scale a and b up by 2^7 or something, to have 7b of fractional precision and still allow the integer part to be [-256 .. 255]. There's no multiply-and-add instruction for this case, so you'd have to do that separately. Good options for doing the multiply include:
_mm_mulhi_epu16: unsigned 16b*16b->high16 (bits [31:16]). Useful if a can't be negative
_mm_mulhi_epi16: signed 16b*16b->high16 (bits [31:16]).
_mm_mulhrs_epi16: signed 16b*16b->bits [30:15] of the 32b temporary, with rounding. With a good choice of scaling factor for a, this should be nicer. As I understand it, SSSE3 introduced this instruction for exactly this kind of use.
_mm_mullo_epi16: signed 16b*16b->low16 (bits [15:0]). This only allows 8 significant bits for a before the low16 result overflows, so I think all you gain over the _mm_maddubs_epi16 8bit solution is more precision for b.
To use these, you'd get scaled 16b vectors of a and b values, then:
unpack your bytes with zero (or pmovzx byte->word), to get signed words still in the [0..255] range
left shift the words by 7.
multiply by your a vector of 16b words, taking the upper half of each 16*16->32 result. (e.g. mul
right shift here if you wanted different scales for a and b, to get more fractional precision for a
add b to that.
right shift to do the final truncation back from fixed point to [0..255].
With a good choice of fixed-point scale, this should be able to handle a wider range of a and b, as well as more fractional precision, than 8bit fixed point.
If you don't left-shift your bytes after unpacking them to words, a has to be full-range just to get 8bits set in the high16 of the result. This would mean a very limited range of a that you could support without truncating your temporary to less than 8 bits during the multiply. Even _mm_mulhrs_epi16 doesn't leave much room, since it starts at bit 30.
expand bytes to floats
If you can't efficiently generate fixed-point a and b values for every pixel, it may be best to convert your pixels to floats. This takes more unpacking/repacking, so latency and throughput are worse. It's worth looking into generating a and b with fixed point.
For packed-float to work, you still have to efficiently build a vector of a values for 4 adjacent pixels.
This is a good use-case for pmovzx (SSE4.1), because it can go directly from 8b elements to 32b. The other options are SSE2 punpck[l/h]bw/punpck[l/h]wd with multiple steps, or SSSE3 pshufb to emulate pmovzx. (You can do one 16B load and shuffle it 4 different ways to unpack it to four vectors of 32b ints.)
char *buf;
// const __m128i zero = _mm_setzero_si128();
for (i=0 ; i<n; i+=16) {
__m128 a = get_a(i);
__m128 b = get_b(i);
// IDK why there isn't an intrinsic for using `pmovzx` as a load, because it takes a m32 or m64 operand, not m128. (unlike punpck*)
__m128i unsigned_dwords = _mm_cvtepu8_epi32( _mm_loadu_si32(buf+i)); // load 4B at once.
// Current GCC has a bug with _mm_loadu_si32, might want to use _mm_load_ss and _mm_castps_si128 until it's fixed.
__m128 floats = _mm_cvtepi32_ps(unsigned_dwords);
floats = _mm_fmadd_ps(floats, a, b); // with FMA available, this might as well be 256b vectors, even with the inconvenience of the different lane-crossing semantics of pmovzx vs. punpck
// or without FMA, do this with _mm_mul_ps and _mm_add_ps
unsigned_dwords = _mm_cvtps_epi32(floats);
// repeat 3 more times for buf+4, buf+8, and buf+12, then:
__m128i packed01 = _mm_packss_epi32(dwords0, dwords1); // SSE2
__m128i packed23 = _mm_packss_epi32(dwords2, dwords3);
// packuswb wants SIGNED input, so do signed saturation on the first step
// saturate into [0..255] range
__m12i8 packedbytes=_mm_packus_epi16(packed01, packed23); // SSE2
_mm_store_si128(buf+i, packedbytes); // or storeu if buf isn't aligned.
}
// cleanup code to handle the odd up-to-15 leftover bytes, if n%16 != 0
(Re: a load that can be a memory source operand for pmovzxbd, see also Loading 8 chars from memory into an __m256 variable as packed single precision floats re: the problems compilers have with this.) And see also GCC bug 99754 - wrong code for _mm_loadu_si32 - reversed vector elements.
The previous version of this answer went from float->uint8 vectors with packusdw/packuswb, and had a whole section on workarounds for without SSE4.1. None of that masking-the-sign-bit after an unsigned pack is needed if you simply stay in the signed integer domain until the last pack. I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packuswd is only useful if your final goal is uint16_t, rather than further packing.
The last CPU to not have SSE4.1 was Intel Conroe/merom (first gen Core2, from before late 2007), and AMD pre Barcelona (before late 2007). If working-but-slow is acceptable for those CPUs, just write a version for AVX2, and a version for SSE4.1. Or SSSE3 (with 4x pshufb to emulate pmovzxbd of the four 32b elements of a register) pshufb is slow on Conroe, though, so if you care about CPUs without SSE4.1, write a specific version. Actually, Conroe/merom also has slow xmm punpcklbw and so on (except for q->dq). 4x slow pshufb should still beats 6x slow unpacks. Vectorizing is a lot less of a win on pre-Wolfdale, because of the slow shuffles for unpacking and repacking. The fixed point version, with a lot less unpacking/repacking, will have an even bigger advantage there.
See the edit history for an unfinished attempt at using punpck before I realized how many extra instructions it was going to need. Removed it because this answer is long already, and another code block would be confusing.

I guess you're looking fro the __m128 _mm_cvtpi8_ps(__m64 a ) composite intrinsic.
Here is a minimal example:
#include <xmmintrin.h>
#include <stdio.h>
int main() {
unsigned char a[4] __attribute__((aligned(32)))= {1,2,3,4};
float b[4] __attribute__((aligned(32)));
_mm_store_ps(b, _mm_cvtpi8_ps(*(__m64*)a));
printf("%f %f, %f, %f\n", b[0], b[1], b[2], b[3]);
return 0;
}

Fast implementation of a large integer counter (in C/C++)

My goal is as the following,
Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter start the same sequence again. The main point here is that, all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simple 0, 1, 2, 3,..., or in other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64-bits. For example, in one place, I need to generated 256-bits sequence. A simple implementation is like the following, just to illustrate what I am trying to do,
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
if (ctr[0] < max) {++ctr[0]; return;}
if (ctr[1] < max) {++ctr[1]; return;}
if (ctr[2] < max) {++ctr[2]; return;}
if (ctr[3] < max) {++ctr[3]; return;}
ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr start with all zeros, then first ctr[0] is increased one by one until it reach max, and then ctr[1], and so on. If all 256-bits are set, then we reset it to all zero, and start again.
The problem is that, such implementation is surprisingly slow. My current improved version is sort of equivalent to the following,
void increment (ctr_type &ctr)
{
std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]))
if (k < 4)
++ctr[k];
else
memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always start with zero, then ctr[k] == 0 if ctr[k - 1] == 0. And thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since branch mis-prediction shall happen only once in every 2^64 iterations. The second, though mis-predication only happen every 2^256 iterations, it shall not make a difference. And apart from the branching, it needs four bitwise negation, four boolean negation, and three addition. Which might cost much more than the first.
However, both clang, gcc, or intel icpc generate binaries that the second was much faster.
My main question is that does anyone know if there any faster way to implement such a counter? It does not matter if the counter start by increasing the first integers or if it is implemented as an array of integers at all, as long as the algorithm generate all 2^256 combinations of 256-bits.
What makes things more complicated, I also need non uniform increment. For example, each time the counter is incremented by K where K > 1, but almost always remain a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is using them as input to AES-NI aesenc instructions. So distinct 128-bits integer (loaded into __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instructions, a distinct 128-bits integer is generated. If I generate one __m128i integer at once, then the cost of increment matters little. However, since aesenc has quite a bit latency, I generate integers by blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalent to the following,
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4, and generate 4 __m128i output at once. By interleaving instructions, overall I was able to increase the throughput, and reduce the cycles per bytes of output (cpB) from 6 to 0.9 when using 2 64-bits integers as the counter and 8 blocks. However, if instead, use 4 32-bits integers as counter, the throughput, measured as bytes per sec is reduced to half. I know for a fact that on x86-64, 64-bits integers could be faster than 32-bits in some situations. But I did not expect such simple increment operation makes such a big difference. I have carefully benchmarked the application, and the increment is indeed the one slow down the program. Since the loading into __m128i and store the __m128i output into usable 32-bits or 64-bits integers are done through aligned pointers, the only difference between the 32-bits and 64-bits version is how the counter is incremented. I expected that the AES-NI expected, after loading the integers into __m128i, shall dominate the performance. But when using 4 or 8 blocks, it was clearly not the case.
So to summary, my main question is that, if anyone know a way to improve the above counter implementation.

It's not only slow, but impossible. The total energy of universe is insufficient for 2^256 bit changes. And that would require gray counter.
Next thing before optimization is to fix the original implementation
void increment (ctr_type &ctr)
{
if (++ctr[0] != 0) return;
if (++ctr[1] != 0) return;
if (++ctr[2] != 0) return;
++ctr[3];
}
If each ctr[i] was not allowed to overflow to zero, the period would be just 4*(2^32), as in 0-9, 19,29,39,49,...99, 199,299,... and 1999,2999,3999,..., 9999.
As a reply to the comment -- it takes 2^64 iterations to have the first overflow. Being generous, upto 2^32 iterations could take place in a second, meaning that the program should run 2^32 seconds to have the first carry out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)
{
counter++; // Move to next word
if (counter > tail_of_array) {
counter = head_of_array;
memset(counter,0, 16);
break;
}
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.

If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c[4]{};
c[0]++;
if (c[0] == 0)
{
c[1]++;
if (c[1] == 0)
{
c[2]++;
if (c[2] == 0)
{
c[3]++;
}
}
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately most high level languages don't have means to access it directly. Some compilers do have intrinsics for adding with carry like __builtin_uaddll_overflow in GCC and __builtin_addcll
Anyway this is rather wasting time since the total number of particles in the universe is only about 1080 and you cannot even count up the 64-bit counter in your life

Neither of your counter versions increment correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that has reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.

You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
the sequence x = (x*1 + 1) % (power_of_2), you thought about it, this are simply sequential numbers.
the sequence x = (x*13 + 137) % (power of 2) , this generates unique numbers with a predictable period (power_of_2 - 1) and the unique numbers look more "random", kind of pseudo-random. You need to resort to arbitrary precision arithmetic to get it working, and also all the trickeries of multiplications by constants. This will get you a nice way to start.
You also complain that your simple code is "slow"
At 4.2 GHz frequency, running 4 intructions per cycle and using AVX512 vectorizations, on a 64-core computer with a multithreaded version of your program doing nothing else than increments, you get only 64x8x4*232=8796093022208 increments per second, that is 264 increments reached in 25 days. This post is old, you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
getting out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe

Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if their was unsigned overflow
ADDC adds two words plus carry (from a previous addition)
ADDCcc adds two words plus carry, and sets the carry if their was unsigned overflow
A multi-word addition with two words uses ADDcc of the least significant words followed by ADCC of the most significant words. A multi-word addition with more than two words forms sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without conditions code and therefore without carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#define ADDCcc(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
(t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
(t0=(b)+cy, t1=(a), t0+t1)
typedef uint16_t T;
/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
T cy, t0, t1;
counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
for (int i = 1; i < (n - 1); i++) {
counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
}
counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}
#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)
int main (void)
{
uint32_t count32 = 0, incr32 = INCREMENT;
T count_arr2 [UINT32_ARRAY_LEN] = {0};
T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
do {
count32 = count32 + incr32;
inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
} while (count32 < (0U - INCREMENT - 1));
printf ("count32 = %08x arr_count = %08x\n",
count32, (((uint32_t)count_arr2 [1] << 16) +
((uint32_t)count_arr2 [0] << 0)));
uint64_t count64 = 0, incr64 = INCREMENT;
T count_arr4 [UINT64_ARRAY_LEN] = {0};
T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
do {
count64 = count64 + incr64;
inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
} while (count64 < 0xa987654321ULL);
printf ("count64 = %016llx arr_count = %016llx\n",
count64, (((uint64_t)count_arr4 [3] << 48) +
((uint64_t)count_arr4 [2] << 32) +
((uint64_t)count_arr4 [1] << 16) +
((uint64_t)count_arr4 [0] << 0)));
return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js