Bit string nearest neighbour searching - compression

I have hundreds of thousands of sparse bit strings of length 32 bits.
I'd like to do a nearest neighbour search on them, and look-up performance is critical. I've been reading up on various algorithms, but they seem to target text strings rather than binary strings. I think either locality-sensitive hashing or spectral hashing seem like good candidates, or I could look into compression. Will any of these work well for my bit string problem? Any direction or guidance would be greatly appreciated.

Here's a fast and easy method,
then a variant with better performance at the cost of more memory.
In: array Uint X[], e.g. 1M 32-bit words
Wanted: a function near( Uint q ) --> j with small hammingdist( q, X[j] )
Method: binary search q in sorted X,
then linear search a block around that.
Pseudocode:
def near( q, X, Blocksize=100 ):
    preprocess: sort X
    Uint* p = binsearch( q, X )  # match q in leading bits
    linear-search Blocksize words around p
    return the hamming-nearest of these.
This is fast --
Binary search 1M words
+ nearest hammingdist in a block of size 100
takes < 10 us on my Mac ppc.
(This is highly cache-dependent — your mileage will vary.)
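A minimal C++ rendering of the pseudocode, assuming X is pre-sorted (the names are mine; __builtin_popcount is GCC/Clang, use std::popcount in C++20):

#include <algorithm>
#include <cstdint>
#include <vector>

size_t near(uint32_t q, const std::vector<uint32_t> &X, size_t blocksize = 100) {
    // binary search: index of the first element >= q (matches q in leading bits)
    size_t mid = std::lower_bound(X.begin(), X.end(), q) - X.begin();

    // linear-search a block of `blocksize` words around that position
    size_t lo = mid > blocksize / 2 ? mid - blocksize / 2 : 0;
    size_t hi = std::min(X.size(), lo + blocksize);

    size_t best = lo;
    int bestdist = 33;   // larger than any possible 32-bit Hamming distance
    for (size_t j = lo; j < hi; ++j) {
        int d = __builtin_popcount(q ^ X[j]);   // Hamming distance to X[j]
        if (d < bestdist) { bestdist = d; best = j; }
    }
    return best;   // index of the Hamming-nearest word in the block
}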
How close does this come to finding the true nearest X[j] ?
I can only experiment, can't do the math:
for 1M random queries in 1M random words,
the nearest match is on average 4-5 bits away,
vs. 3 away for the true nearest (linear scan all 1M):
near32 N 1048576 Nquery 1048576 Blocksize 100
binary search, then nearest +- 50
7 usec
distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0
near32 N 1048576 Nquery 100 Blocksize 1048576
linear scan all 1048576
38701 usec
distance distribution: 0 0 7 58 35 0
Run your data with blocksizes say 50 and 100
to see how the match distances drop.
To get even nearer, at the cost of twice the memory,
make a copy Xswap of X with upper / lower halfwords swapped,
and return the better of
near( q, X, Blocksize )
near( swap q, Xswap, Blocksize )
With lots of memory, one can use many more bit-shuffled copies of X,
e.g. 32 rotations.
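The halfword swap itself is a single 16-bit rotate; for example:

#include <cstdint>

// Swap the 16-bit halves, so the low-order bits (which the leading-bits
// binary search ignores) lead the search in the second, swapped copy.
uint32_t swap16(uint32_t q) {
    return (q >> 16) | (q << 16);
}

// Then keep the Hamming-nearer of near(q, X, B) and near(swap16(q), Xswap, B).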
I have no idea how performance varies with Nshuffle and Blocksize —
a question for LSH theorists.
(Added): To near-match bit strings of say 320 bits, 10 words,
make 10 arrays of pointers, sorted on word 0, word 1 ...
and search blocks with binsearch as above:
nearest( query word 0, Sortedarray0, 100 ) -> min Hammingdist e.g. 42 of 320
nearest( query word 1, Sortedarray1, 100 ) -> min Hammingdist 37
nearest( query word 2, Sortedarray2, 100 ) -> min Hammingdist 50
...
-> e.g. the 37.
This will of course miss near-matches where no single word is close,
but it's very simple, and sort and binsearch are blazingly fast.
The pointer arrays take exactly as much space as the data bits.
100 words, 3200 bits would work in exactly the same way.
But: this works only if there are roughly equal numbers of 0 bits and 1 bits,
not 99 % 0 bits.

I just came across a paper that addresses this problem.
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering (Ravichandran et al., 2005)
The basic idea is similar to Denis's answer (sort lexicographically by different permutations of the bits) but it includes a number of additional ideas and further references for articles on the topic.
It is actually implemented in https://github.com/soundcloud/cosine-lsh-join-spark which is where I found it.

Mapping a continuous range into discrete bins in C++

I've inherited maintenance of a function that takes as parameter a value between 0 and 65535 (inclusive):
MyClass::mappingFunction(unsigned short headingIndex);
headingIndex can be converted to degrees using the following formula: degrees = headingIndex * 360 / 65536
The role of this function is to translate the headingIndex into 1 of 36 symbols representing various degrees of rotation, i.e. there is a symbol for 10 degrees, a symbol for 20 degrees etc, up to 360 degrees in units of 10 degrees.
A headingIndex of 0 would translate to displaying the 0 (360) degree symbol.
The function performs the following which I can't seem to get my head around:
const int MAX_INTEGER = 65536;
const int NUM_SYMBOLS = 36;
int symbolRange = NUM_SYMBOLS - 1;
int roundAmount = MAX_INTEGER / (symbolRange + 1) - 1;
int roundedIndex = headingIndex + roundAmount;
int symbol = (symbolRange * roundedIndex) / MAX_INTEGER;
I'm confused about the algorithm that is being used here, specifically with regard to the following:
What is the intention behind roundAmount? I understand it essentially divides the maximum input range into discrete chunks, but then adding it onto the headingIndex seems a strange thing to do.
Is roundedIndex then the original value, offset (rotated clockwise) by some amount?
The algorithm produces results such as:
headingIndex of 0 --> symbol 0
headingIndex of 100 --> symbol 1
headingIndex of 65500 --> symbol 35
I'm thinking there must be a better way of doing this?
The shown code looks very convoluted (it is possibly a guard against integer overflow). A far simpler way to determine the symbol number would be code like the following:
symbol = (headingIndex * 36u) / 65536u;
However, if this does present problems with integer overflow, then the calculation could be done in double precision, converting the result back to int after rounding:
symbol = static_cast<int>( ((headingIndex * 36.0) / 65536.0) + 0.5 ); // Add 0.5 for rounding.
You have 65536 possible inputs (0..65535) and 36 outputs (0..35). That means each output bin should represent about 1820 inputs if they are divided equally.
The original code doesn't do that. Only the first 54 values are in bin 0, then the rest are equally divided across the remaining 35 bins (MAX_INTEGER / symbolRange), about 1872 per bin.
To show this, solve for the lowest headingIndex where symbol becomes 1: 35 * (headingIndex + 1819) >= 65536 gives headingIndex >= 54, so indices 0 through 53 all map to symbol 0.
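To see both bucket distributions concretely, here's a quick exhaustive check built from the formulas above:

#include <stdio.h>

int main() {
    int orig[36] = {0}, simple[36] = {0};
    for (int h = 0; h < 65536; ++h) {
        orig[(35 * (h + 1819)) / 65536]++;   // the inherited code
        simple[(36 * h) / 65536]++;          // symbol = headingIndex * 36 / 65536
    }
    for (int s = 0; s < 36; ++s)
        printf("symbol %2d: original %4d, simple %4d\n", s, orig[s], simple[s]);
}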
If you want to keep the output the same but just tidy the code up: walk away.
There are odd features of that method that may or may not be what is desired.
The range for headingIndex of 0 - 53 gives a symbol of 0. That's a bucket (AKA bin) of 54 values.
The range of 63717 - 65535 gives 35. That bucket is 1819 values.
All the other buckets are either 1872 or 1873 values so seem 'big'.
We can't have equal sized buckets because number of values is 65536 and 65536/36 is 1820 and 16 remainder.
So we need to bury the 16 among the buckets. They have to be uneven in size.
Notice the constant MAX_INTEGER is a red herring. The max is 65535. 65536 is the range. The chosen name is misleading from the start.
Why:
int symbolRange = NUM_SYMBOLS - 1;
int roundAmount = MAX_INTEGER / (symbolRange + 1) - 1;
when the second line could be int roundAmount = MAX_INTEGER / NUM_SYMBOLS - 1;
It doesn't look quite thought through is all I'm saying. But looks can be deceptive.
What also bothers me is that the 'obvious' method proposed in other answers works great!
int symbol = (NUM_SYMBOLS * headingIndex) / MAX_INTEGER;
Gives us buckets of either 1820 or 1821 with an even distribution. I'd say that's the natural solution to the head question.
So why the current method? Is it some artefact of some measuring device?
I'll put money the maximum value is 65535 because that's the maximum value of an unsigned 16-bit integer.
It's right to wonder about overflow. But if you're working in 16-bits it's already broken. So I wonder about a device that is recording 16-bits. That's quite realistic.
This is similar to what I know as "The Instalments Problem".
We want the customer to pay £655.36 over 36 months. Do they pay £18.20 a month, totalling £655.20, and we forget the 16p? They won't pay £18.21, totalling £655.56, and overpay 20p. A bigger first payment of £18.36 and then 35 of £18.20?
People wrestle with this one. The business answers are 'get the money' - bigger first payment. Avoid complaints if they owe you money (big last payment) and forget the pennies (all the same - we're bigger than a few pence!).
In arithmetic terms, for a measurement (such as degrees) I'd say the 'sprinkled' method offered in other answers is the most natural and even: it distributes the anomaly evenly.
But it's not the only answer. Up to you. Hint: if you haven't been asked to fix this and just think it's ugly - walk away. Walk away now.

Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)
uint8_t MyArray[10000000];
when the value at any position in the array is
0 or 1 for 95% of all cases,
2 in 4% of cases,
between 3 and 255 in
the other 1% of cases?
So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth, so when having more than a few threads doing that at the same time for different arrays, currently the whole RAM bandwidth is quickly saturated.
I'm asking since it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. So when 95% of all values in the array would only actually need 1 bit instead of 8 bit, this would reduce memory usage by almost an order of magnitude.
It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.
A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value for the other ones (24 bits for the original element index, 8 bits for the actual value, i.e. (idx << 8) | value).
When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log n), with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.
#include <algorithm>
#include <cstdlib>
#include <stdint.h>
#include <vector>

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx >> 2] >> (2 * (idx & 3))) & 3;

    // usual (likely) case: value between 0 and 2
    if (v != 3) return v;

    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx << 8);

#ifdef _DEBUG
    // some coherency checks
    if (ptr == sec_arr.end()) std::abort();
    if ((*ptr >> 8) != idx) std::abort();
#endif

    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();

    // size the main storage (round up)
    main_arr.resize((size + 3) / 4);

    for (size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx >> 2];

        // if the input doesn't fit, cap to 3 and put in secondary storage
        if (in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }

        // store in the target according to the position
        target |= in << ((idx & 3) * 2);
    }
}
void populate(uint8_t *source, size_t size) {
main_arr.clear(); sec_arr.clear();
// size the main storage (round up)
main_arr.resize((size+3)/4);
for(size_t idx = 0; idx < size; ++idx) {
uint8_t in = source[idx];
uint8_t &target = main_arr[idx>>2];
// if the input doesn't fit, cap to 3 and put in secondary storage
if(in >= 3) {
// top 24 bits: index; low 8 bit: value
sec_arr.push_back((idx << 8) | in);
in = 3;
}
// store in the target according to the position
target |= in << ((idx & 3)*2);
}
}
For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).
If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
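That extension might look like this (a sketch with illustrative names, not part of the original code):

#include <algorithm>
#include <cstdint>
#include <vector>

// Shard the secondary storage by the top 8 bits of the index; each shard is a
// sorted (low24_index << 8 | value) array, exactly like sec_arr above.
std::vector<uint32_t> sec_shard[256];

uint8_t lookup_secondary(uint64_t idx) {
    auto &shard = sec_shard[(idx >> 24) & 0xff];
    uint32_t key = (uint32_t)(idx & 0xffffff) << 8;
    // the caller guarantees the entry exists (the main array held a 3)
    auto ptr = std::lower_bound(shard.begin(), shard.end(), key);
    return (*ptr) & 0xff;
}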
Quick benchmark
#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <cstdlib>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fails some test suites
struct XorShift32 {
    /// This stuff allows you to use this class wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if (y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y << 13);
        y ^= (y >> 17);
        y ^= (y << 15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)() % limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean += (x - rmean) / count;
        rvariance += (x - ormean) * (x - rmean);
    }

    double mean() const { return rmean; }
    double variance() const { return rvariance / (count - 1); }
    double stddev() const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx >> 2] >> (2 * (idx & 3))) & 3;

    // usual (likely) case: value between 0 and 2
    if (v != 3) return v;

    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx << 8);

#ifdef _DEBUG
    // some coherency checks
    if (ptr == sec_arr.end()) std::abort();
    if ((*ptr >> 8) != idx) std::abort();
#endif

    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();

    // size the main storage (round up)
    main_arr.resize((size + 3) / 4);

    for (size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx >> 2];

        // if the input doesn't fit, cap to 3 and put in secondary storage
        if (in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }

        // store in the target according to the position
        target |= in << ((idx & 3) * 2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;

    for (int i = 0; i < size; ++i) {
        uint32_t v = xs();
        if (v < 1825361101) v = 0;          // 42.5%
        else if (v < 4080218931) v = 1;     // 95.0%
        else if (v < 4252017623) v = 2;     // 99.0%
        else {
            while ((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }

    populate(vec.data(), vec.size());

    mean_variance lk_t, arr_t;
    for (int i = 0; i < 50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for (int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now() - beg) / microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for (int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now() - beg) / microseconds(1);
            fprintf(stderr, "array: %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup | ± | array | ± | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
           lk_t.mean(), lk_t.stddev(),
           arr_t.mean(), arr_t.stddev(),
           arr_t.mean() / lk_t.mean());
    return 0;
}
(code and data always updated in my Bitbucket)
The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:
performs a random lookup of 10M elements with my data structure
does the same through the original array.
(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)
These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).
I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer one. I don't think the OS should be relevant at all in this case.
CPU | cache | lookup (µs) | array (µs) | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB | 60011 ± 3667 | 29313 ± 2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB | 66571 ± 7477 | 33197 ± 3619 | 0.50
Celeron G1610T @ 2.30GHz | 2048 KB | 172090 ± 629 | 162328 ± 326 | 0.94
Core i3-3220T @ 2.80GHz | 3072 KB | 111025 ± 5507 | 114415 ± 2528 | 1.03
Core i5-7200U @ 2.50GHz | 3072 KB | 92447 ± 1494 | 95249 ± 1134 | 1.03
Xeon X3430 @ 2.40GHz | 8192 KB | 111303 ± 936 | 127647 ± 1503 | 1.15
Core i7 920 @ 2.67GHz | 8192 KB | 123161 ± 35113 | 156068 ± 45355 | 1.27
Xeon X5650 @ 2.67GHz | 12288 KB | 106015 ± 5364 | 140335 ± 6739 | 1.32
Core i7 870 @ 2.93GHz | 8192 KB | 77986 ± 429 | 106040 ± 1043 | 1.36
Core i7-6700 @ 3.40GHz | 8192 KB | 47854 ± 573 | 66893 ± 1367 | 1.40
Core i3-4150 @ 3.50GHz | 3072 KB | 76162 ± 983 | 113265 ± 239 | 1.49
Xeon X5650 @ 2.67GHz | 12288 KB | 101384 ± 796 | 152720 ± 2440 | 1.51
Core i7-3770T @ 2.50GHz | 8192 KB | 69551 ± 1961 | 128929 ± 2631 | 1.85
The results are... mixed!
In general, on most of these machines there is some kind of speedup, or at least they are on a par.
The two cases where the array truly trumps the "smart structure" lookup are machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a nightly-build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high-performance calculations, also measured in an idle moment. It makes sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
At the opposite end of the "performance spectrum" - but where again the array is slightly faster - there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with small enough caches perform similarly.
The Xeon X5650 entries must be taken with some caution - they are virtual machines on a quite busy dual-socket server; it may well be that, although nominally it has a decent amount of cache, during the test it gets preempted by completely unrelated virtual machines several times.
Another option could be
check if the result is 0, 1 or 2
if not do a regular lookup
In other words something like:
unsigned char lookup(int index) {
    int code = (bmap[index >> 2] >> (2 * (index & 3))) & 3;
    if (code != 3) return code;
    return full_array[index];
}
where bmap uses 2 bits per element with the value 3 meaning "other".
This structure is trivial to update, uses 25% more memory but the big part is looked up only in 5% of the cases. Of course, as usual, if it's a good idea or not depends on a lot of other conditions so the only answer is experimenting with real usage.
This is more of a "long comment" than a concrete answer
Unless your data is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".
For not-well-known patterns like what I think yours is, nobody will KNOW directly which is better, and it depends on the details: how random is the random access - is the system accessing clusters of data items, or is it completely random, as if from a uniform random number generator? Is the table data completely random, or are there sequences of 0 then sequences of 1, with a scattering of other values? Run-length encoding would work well if you have reasonably long sequences of 0 and 1, but won't work if you have a "checkerboard of 0/1". Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly.
I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that cache and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as one who recently called someone that there is no pre-loaded data of any kind; it's just purely random. Big page tables are the best optimisation for that type of access.
In a lot of cases, compromising between "speed and small size" is one of those things you have to pick between in software engineering [in other engineering it's not necessarily so much of a compromise]. So "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have a "better" use for the RAM, then optimising for size of the table would give you sufficient performance and a good improvement on size. There are lots of different ways you could achieve this - as suggested in a comment, a 2-bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash table would be my first approach, but a list or binary tree may work too - again, it depends on the patterns of where your "not 0, 1 or 2" values are. Again, it depends on how the values are "scattered" in the table - are they in clusters, or are they more of an evenly distributed pattern?
But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".
The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.
In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.
What I've done in the past is use a hashmap in front of a bitset.
This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).
Often, however, "cache is king".
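A minimal sketch of that layout (the names are mine; the hash map is consulted first, so its speed dominates the exceptional path):

#include <cstdint>
#include <unordered_map>
#include <vector>

// One bit per element for the 0/1 majority, plus a hash map for everything else.
// Half the bitmap storage of the 2-bits-per-value scheme, at the cost of a hash
// probe on every access to detect exceptions.
std::vector<uint64_t> bits;                   // packed 0/1 values, size (n + 63) / 64
std::unordered_map<uint32_t, uint8_t> other;  // the ~5% of values >= 2

uint8_t get_value(uint32_t idx) {
    auto it = other.find(idx);                // exceptions first
    if (it != other.end()) return it->second;
    return (bits[idx >> 6] >> (idx & 63)) & 1;   // common case: one bit
}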
Unless there is a pattern to your data it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targeting a normal computer - 10 MB isn't that big a deal anyway.
There are two assumptions in your questions:
The data is being poorly stored because you aren't using all the bits
Storing it better would make things faster.
I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error prone. To divert from this general principle you need a stronger reason than potentially six "wasted" bits on 95% of your data.
For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array in either case, the processor will struggle to know which bits of data to cache and keep in either case.
If the data and accesses are uniformly randomly distributed, performance is probably going to depend upon what fraction of accesses avoid an outer-level cache miss. Optimizing that will require knowing what size array can be reliably accommodated in cache. If your cache is large enough to accommodate one byte for every five cells, the simplest approach may be to have one byte hold five base-3 encoded values in the range 0-2 (there are 243 combinations of 5 such values, so that will fit in a byte), along with a 10,000,000 byte array that would be queried whenever the base-3 value indicates "2".
If the cache isn't that big, but could accommodate one byte per 8 cells, then it would not be possible to use one byte value to select from among all 6,561 possible combinations of eight base-3 values, but since the only effect of changing a 0 or 1 to a 2 would be to cause an otherwise-unnecessary lookup, correctness wouldn't require supporting all 6,561. Instead, one could focus on the 256 most "useful" values.
Especially if 0 is more common than 1, or vice versa, a good approach might be to use 219 values to encode the combinations of 0 and 1 that contain 5 or fewer 1's, 16 values to encode xxxx0000 through xxxx1111, 16 to encode 0000xxxx through 1111xxxx, and one for xxxxxxxx. Four values would remain for whatever other use one might find. If the data are randomly distributed as described, a slight majority of all queries would hit bytes which contained just zeroes and ones (in about 2/3 of all groups of eight, all bits would be zeroes and ones, and about 7/8 of those would have five or fewer 1 bits); the vast majority of those that didn't would land in a byte which contained four x's, and would have a 50% chance of landing on a zero or a one. Thus, only about one in four queries would necessitate a large-array lookup.
If the data are randomly distributed but the cache isn't big enough to handle one byte per eight elements, one could try to use this approach with each byte handling more than eight items, but unless there is a strong bias toward 0 or toward 1, the fraction of values that can be handled without having to do a lookup in the big array will shrink as the number handled by each byte increases.
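To make the five-cells-per-byte variant concrete, a hedged sketch of the base-3 packing (my names; the fallback array is the 10,000,000-byte array mentioned above):

#include <cstdint>
#include <vector>

// Five base-3 digits per byte (3^5 = 243 <= 256); digit 2 doubles as the
// "consult the fallback array" marker for values >= 2.
static const uint16_t POW3[5] = {1, 3, 9, 27, 81};

std::vector<uint8_t> packed;   // one byte per five cells, size (n + 4) / 5
std::vector<uint8_t> big;      // full-size fallback, queried on a 2-digit

uint8_t get(size_t idx) {
    uint8_t digit = (packed[idx / 5] / POW3[idx % 5]) % 3;
    return digit < 2 ? digit : big[idx];
}

void set(size_t idx, uint8_t v) {
    uint8_t d = v < 2 ? v : 2;                       // clamp 2..255 to the marker
    uint8_t old = (packed[idx / 5] / POW3[idx % 5]) % 3;
    packed[idx / 5] += (d - old) * POW3[idx % 5];    // replace one digit in place
    big[idx] = v;
}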
I'll add to @o11c's answer, as his wording might be a bit confusing.
If I need to squeeze the last bit and CPU cycle I'd do the following.
We will start by constructing a balanced binary search tree that holds the 5% "something else" cases. For every lookup, you walk the tree quickly: you have 10000000 elements, 5% of which are in the tree, hence the tree data structure holds 500000 elements. Walking this in O(log n) time gives you 19 iterations. I'm no expert at this, but I guess there are some memory-efficient implementations out there. Let's guesstimate:
Balanced tree, so subtree position can be calculated (indices do not need to be stored in the nodes of the tree). The same way a heap (data structure) is stored in linear memory.
1 byte value (2 to 255)
3 bytes for the index (10000000 takes 23 bits, which fits 3 bytes)
Totalling 4 bytes per node: 500000 * 4 = 1953 kB. Fits the cache!
For all the other cases (0 or 1), you can use a bitvector. Note that you cannot leave out the 5% other cases for random access: 1.19 MB.
The combination of these two uses approximately 3.099 MB. Using this technique, you will save a factor of 3.08 in memory.
However, this doesn't beat the answer of @Matteo Italia (which uses 2.76 MB), a pity. Is there anything extra we can do? The most memory-consuming part is the 3 bytes of index in the tree. If we can get this down to 2, we would save 488 kB and the total memory usage would be 2.622 MB, which is smaller!
How do we do this? We have to reduce the indexing to 2 bytes. Again, 10000000 takes 23 bits. We need to be able to drop 7 bits. We can simply do this by partitioning the range of 10000000 elements into 2^7 (= 128) regions of 78125 elements. Now we can build a balanced tree for each of these regions, with 3906 elements on average. Picking the right tree is done by a simple division of the target index by 2^7 (or a bitshift >> 7). Now the required index to store can be represented by the remaining 16 bits. Note that there is some overhead for the length of the tree that needs to be stored, but this is negligible. Also note that this splitting mechanism shortens the tree walk by the 7 bits we dropped: only 12 iterations are left.
Note that you could theoretically repeat the process to cut off the next 8 bits, but this would require you to create 2^15 balanced trees, with ~15 elements each on average. This would result in 2.143 MB, with only 4 iterations to walk the tree, which is a considerable speedup compared to the 19 iterations we started with.
As a final conclusion: this beats the 2-bit vector strategy by a tiny bit of memory usage, but is a whole struggle to implement. But if it can make the difference between fitting the cache or not, it might be worth a try.
If you only perform read operations it would be better not to assign a value to a single index but to an interval of indices.
For example:
[0, 15000] = 0
[15001, 15002] = 153
[15003, 26876] = 2
[26877, 31578] = 0
...
This can be done with a struct. You also might want to define a class similar to this if you like an OO approach.
class Interval {
private:
    uint32_t start;   // First element of interval
    uint32_t end;     // Last element of interval
    uint8_t value;    // Assigned value

public:
    Interval(uint32_t start, uint32_t end, uint8_t value);
    bool isInInterval(uint32_t item);   // Checks if item lies within interval
    uint8_t getValue();                 // Returns the assigned value
};
Now you just have to iterate through a list of intervals and check if your index lies within one of them, which can be much less memory intensive on average but costs more CPU resources.
Interval intervals[INTERVAL_COUNT];
intervals[0] = Interval(0, 15000, 0);
intervals[1] = Interval(15001, 15002, 153);
intervals[2] = Interval(15003, 26876, 2);
intervals[3] = Interval(26877, 31578, 0);
...

uint8_t checkIntervals(uint32_t item)
{
    for (int i = 0; i < INTERVAL_COUNT; i++)
    {
        if (intervals[i].isInInterval(item))
        {
            return intervals[i].getValue();
        }
    }
    return DEFAULT_VALUE;
}
If you order the intervals by descending size you increase the probability that the item you are looking for is found early which further decreases your average memory and CPU resource usage.
You could also remove all intervals with a size of 1. Put the corresponding values into a map and check them only if the item you are looking for wasn't found in the intervals. This should also raise the average performance a bit.
Long long time ago, I can just remember...
In university we got a task to accelerate a ray tracer program that has to read by algorithm over and over again from buffer arrays. A friend told me to always use RAM reads that are multiples of 4 bytes. So I changed the array from a pattern of [x1,y1,z1,x2,y2,z2,...,xn,yn,zn] to a pattern of [x1,y1,z1,0,x2,y2,z2,0,...,xn,yn,zn,0], meaning I added an empty field after each 3D coordinate. After some performance testing: it was faster.
So, long story short: read multiples of 4 bytes from your array in RAM, and maybe also from the right starting position, so you read a little cluster that the searched index is in, and read the searched index from this little cluster in the CPU. (In your case you will not need to insert fill fields, but the concept should be clear.)
Maybe other multiples could also be the key in newer systems.
I don't know if this will work in your case, so if it doesn't: sorry. If it works, I would be happy to hear about some test results.
PS: Oh, and if there is any access pattern or nearby accessed indices, you can reuse the cached cluster.
PPS: It could be that the multiple factor was more like 16 bytes or something like that; it's too long ago for me to remember exactly.
Looking at this, you could split your data, for example:
a bitset which gets indexed and represents the value 0 (std::vector<bool> would be useful here)
a bitset which gets indexed and represents the value 1
a std::vector for the values of 2, containing the indexes which refer to this value
a map for the other values (or a sorted std::vector of (index, value) pairs)
In this case, all values appear up to a given index, so you could even remove one of the bitsets and represent the value as it being missing from the other ones.
This will save you some memory for this case, though would make the worst case worse.
You'll also need more CPU power to do the lookups.
Make sure to measure!
Like Mats mentions in his comment-answer, it is hard to say what is actually the best solution without knowing specifically what kind of data you have (e.g., are there long runs of 0's, and so on), and what your access pattern looks like (does "random" mean "all over the place" or just "not strictly in completely linear fashion" or "every value exactly once, just randomized" or ...).
That said, there are two mechanisms coming to mind:
Bit arrays; i.e., if you only had two values, you could trivially compress your array by a factor of 8; if you have 4 values (or "3 values + everything else") you can compress by a factor of two. Which might just not be worth the trouble and would need benchmarks, especially if you have really random access patterns which escape your caches and hence do not change the access time at all.
(index,value) or (value,index) tables. I.e., have one very small table for the 1% case, maybe one table for the 5% case (which only needs to store the indexes as all have the same value), and a big compressed bit array for the final two cases. And with "table" I mean something which allows relatively quick lookup; i.e., maybe a hash, a binary tree, and so on, depending on what you have available and your actual needs. If these subtables fit into your 1st/2nd level caches, you might get lucky.
I am not very familiar with C, but in C++ you can use unsigned char to represent an integer in the range 0 - 255.
Compared to a normal int (again, I am coming from the Java and C++ world), in which 4 bytes (32 bits) are required, an unsigned char requires 1 byte (8 bits).
So it might reduce the total size of the array by 75%.
You have succinctly described all the distribution characteristics of your array; toss the array.
You can easily replace the array with a randomized method that produces the same probabilistic output as the array.
If consistency matters (producing the same value for the same random index), consider using a bloom filter and/or hash map to track repeat hits. If your array accesses really are random, though, this is totally unnecessary.

Fast way to generate pseudo-random bits with a given probability of 0 or 1 for each bit

Normally, a random number generator returns a stream of bits for which the probability to observe a 0 or a 1 in each position is equal (i.e. 50%). Let's call this an unbiased PRNG.
I need to generate a string of pseudo-random bits with the following property: the probability to see a 1 in each position is p (i.e. the probability to see a 0 is 1-p). The parameter p is a real number between 0 and 1; in my problem it happens that it has a resolution of 0.5%, i.e. it can take the values 0%, 0.5%, 1%, 1.5%, ..., 99.5%, 100%.
Note that p is a probability and not an exact fraction. The actual number of bits set to 1 in a stream of n bits must follow the binomial distribution B(n, p).
There is a naive method that can use an unbiased PRNG to generate the value of each bit (pseudocode):
generate_biased_stream(n, p):
    result = []
    for i in 1 to n:
        if random_uniform(0, 1) < p:
            result.append(1)
        else:
            result.append(0)
    return result
Such an implementation is much slower than one generating an unbiased stream, since it calls the random number generator function once per bit, while an unbiased stream generator calls it once per word size (e.g. it can generate 32 or 64 random bits with a single call).
I want a faster implementation, even if it sacrifices randomness slightly. An idea that comes to mind is to precompute a lookup table: for each of the 200 possible values of p, compute C 8-bit values using the slower algorithm and save them in a table. Then the fast algorithm would just pick one of these at random to generate 8 skewed bits.
A back of the envelope calculation to see how much memory is needed:
C should be at least 256 (the number of possible 8-bit values), probably more to avoid sampling effects; let's say 1024. Maybe the number should vary depending on p, but let's keep it simple and say the average is 1024.
Since there are 200 values of p => total memory usage is 200 KB. This is not bad, and might fit in the L2 cache (256 KB). I still need to evaluate it to see if there are sampling effects that introduce biases, in which case C will have to be increased.
A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions.
I would like to know if there is a faster method, based on bit operations instead of lookup tables. For example modifying the random number generation code directly to introduce a bias for each bit. This would achieve the same performance as an unbiased PRNG.
Edit March 5
Thank you all for your suggestions, I got a lot of interesting ideas and suggestions. Here are the top ones:
Change the problem requirements so that p has a resolution of 1/256 instead of 1/200. This allows using bits more efficiently, and also gives more opportunities for optimization. I think I can make this change.
Use arithmetic coding to efficiently consume bits from an unbiased generator. With the above change of resolution this becomes much easier.
A few people suggested that PRNGs are very fast, thus using arithmetic coding might actually make the code slower due to the introduced overhead. Instead I should always consume the worst-case number of bits and optimize that code. See the benchmarks below.
@rici suggested using SIMD. This is a nice idea, which works only if we always consume a fixed number of bits.
Benchmarks (without arithmetic decoding)
Note: as many of you have suggested, I changed the resolution from 1/200 to 1/256.
I wrote several implementations of the naive method that simply takes 8 random unbiased bits and generates 1 biased bit:
Without SIMD
With SIMD using Agner Fog's vectorclass library, as suggested by @rici
With SIMD using intrinsics
I use two unbiased pseudo-random number generators:
xorshift128plus
Ranvec1 (Mersenne Twister-like) from Agner Fog's library.
I also measure the speed of the unbiased PRNG for comparison. Here are the results:
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 16.081 16.125 16.093 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.778 0.783 0.812 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.176 2.184 2.145 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.129 2.151 2.183 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
SIMD increases performance by a factor of 3 compared to the scalar method. It is 8 times slower than the unbiased generator, as expected.
The fastest biased generator achieves 2.1 Gb/s.
RNG: xorshift128plus
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.300 21.486 21.483 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 22.660 22.661 24.662 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.065 1.102 1.078 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 4.972 4.971 4.970 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 4.955 4.971 4.971 [Gb/s]
Number of ones: 104,869,407 104,869,407 104,869,407
Theoretical : 104,857,600
For xorshift, SIMD increases performance by a factor of 5 compared to the scalar method. It is 4 times slower than the unbiased generator. Note that this is a scalar implementation of xorshift.
The fastest biased generator achieves 4.9 Gb/s.
RNG: xorshift128plus_avx2
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 18.754 21.494 21.878 [Gb/s]
Number of ones: 536,867,655 536,867,655 536,867,655
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 54.126 54.071 54.145 [Gb/s]
Number of ones: 536,874,540 536,880,718 536,891,316
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 1.093 1.103 1.063 [Gb/s]
Number of ones: 104,868,930 104,868,930 104,868,930
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 19.567 19.578 19.555 [Gb/s]
Number of ones: 104,836,115 104,846,215 104,835,129
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 19.551 19.589 19.557 [Gb/s]
Number of ones: 104,831,396 104,837,429 104,851,100
Theoretical : 104,857,600
This implementation uses AVX2 to run 4 unbiased xorshift generators in parallel.
The fastest biased generator achieves 19.5 Gb/s.
Benchmarks for arithmetic decoding
Simple tests show that the arithmetic decoding code is the bottleneck, not the PRNG. So I am only benchmarking the most expensive PRNG.
RNG: Ranvec1(Mersenne Twister for Graphics Processors + Multiply with Carry)
Method: Arithmetic decoding (floating point)
Gbps/s: 0.068 0.068 0.069 [Gb/s]
Number of ones: 10,235,580 10,235,580 10,235,580
Theoretical : 10,240,000
Method: Arithmetic decoding (fixed point)
Gbps/s: 0.263 0.263 0.263 [Gb/s]
Number of ones: 10,239,367 10,239,367 10,239,367
Theoretical : 10,240,000
Method: Unbiased with 1/1 efficiency (incorrect, baseline)
Gbps/s: 12.687 12.686 12.684 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Unbiased with 1/1 efficiency, SIMD=vectorclass (incorrect, baseline)
Gbps/s: 14.536 14.536 14.536 [Gb/s]
Number of ones: 536,875,204 536,875,204 536,875,204
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency
Gbps/s: 0.754 0.754 0.754 [Gb/s]
Number of ones: 104,867,269 104,867,269 104,867,269
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=vectorclass
Gbps/s: 2.094 2.095 2.094 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
Method: Biased with 1/8 efficiency, SIMD=intrinsics
Gbps/s: 2.094 2.094 2.095 [Gb/s]
Number of ones: 104,859,067 104,859,067 104,859,067
Theoretical : 104,857,600
The simple fixed point method achieves 0.25 Gb/s, while the naive scalar method is 3x faster, and the naive SIMD method is 8x faster. There might be ways to optimize and/or parallelize the arithmetic decoding method further, but due to its complexity I have decided to stop here and choose the naive SIMD implementation.
Thank you all for the help.
One thing you can do is to sample from the underlying unbiased generator multiple times, getting several 32-bit or 64-bit words, and then performing bitwise boolean arithmetic. As an example, for 4 words b1,b2,b3,b4, you can get the following distributions:
expression | p(bit is 1)
-----------------------+-------------
b1 & b2 & b3 & b4 | 6.25%
b1 & b2 & b3 | 12.50%
b1 & b2 & (b3 | b4) | 18.75%
b1 & b2 | 25.00%
b1 & (b2 | (b3 & b4)) | 31.25%
b1 & (b2 | b3) | 37.50%
b1 & (b2 | b3 | b4) | 43.75%
b1 | 50.00%
Similar constructions can be made for finer resolutions. It gets a bit tedious and still requires more generator calls, but at least not one per bit. This is similar to a3f's answer, but is probably easier to implement and, I suspect, faster than scanning words for 0xF nybbles.
Note that for your desired 0.5% resolution, you would need 8 unbiased words for one biased word, which would give you a resolution of 0.5^8 = 1/256 ≈ 0.39%.
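The same construction can be mechanised for any p with 8-bit resolution: fold one fresh unbiased word per bit of p, ORing for a 1-bit and ANDing for a 0-bit. A sketch (next_u64 here is a stand-in xorshift64, not anything from the answer above):

#include <cstdint>

static uint64_t rng_state = 0x9E3779B97F4A7C15ull;   // any nonzero seed

static uint64_t next_u64() {                         // xorshift64, stand-in PRNG
    uint64_t x = rng_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return rng_state = x;
}

// 64 bits at once, each 1 with probability exactly p/256, using 8 unbiased words.
// Folding from the LSB of p upward: a 1-bit in p ORs in a fresh word
// (P' = 1/2 + P/2), a 0-bit ANDs one in (P' = P/2), so the result is p/256.
uint64_t biased_word(uint8_t p) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t w = next_u64();
        r = ((p >> i) & 1) ? (w | r) : (w & r);
    }
    return r;
}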
If you're prepared to approximate p based on 256 possible values, and you have a PRNG which can generate uniform values in which the individual bits are independent of each other, then you can use vectorized comparison to produce multiple biased bits from a single random number.
That's only worth doing if (1) you worry about random number quality and (2) you are likely to need a large number of bits with the same bias. The second requirement seems to be implied by the original question, which criticizes a proposed solution, as follows: "A deficiency of this solution is that it can generate only 8 bits at once, even that with a lot of work, while an unbiased PRNG can generate 64 at once with just a few arithmetic instructions." Here, the implication seems to be that it is useful to generate a large block of biased bits in a single call.
Random-number quality is a difficult subject. It's hard if not impossible to measure, and therefore different people will propose different metrics which emphasize and/or devalue different aspects of "randomness". It is generally possible to trade off speed of random-number generation for lower "quality"; whether this is worth doing depends on your precise application.
The simplest possible tests of random number quality involve the distribution of individual values and the cycle length of the generator. Standard implementations of the C library rand and Posix random functions will typically pass the distribution test, but the cycle lengths are not adequate for long-running applications.
These generators are typically extremely fast, though: the glibc implementation of random requires only a few cycles, while the classic linear congruential generator (LCG) requires a multiply and an addition. (Or, in the case of the glibc implementation, three of the above to generate 31 bits.) If that's sufficient for your quality requirements, then there is little point trying to optimize, particularly if the bias probability changes frequently.
Bear in mind that the cycle length should be a lot longer than the number of samples expected; ideally, it should be greater than the square of that number, so a linear congruential generator (LCG) with a cycle length of 2^31 is not appropriate if you expect to generate gigabytes of random data. Even the Gnu trinomial nonlinear additive-feedback generator, whose cycle length is claimed to be approximately 2^35, shouldn't be used in applications which will require millions of samples.
Another quality issue, which is much harder to test, relates to the independence of consecutive samples. Short cycle lengths completely fail on this metric, because once the repeat starts, the generated random numbers are precisely correlated with historical values. The Gnu trinomial algorithm, although its cycle is longer, has a clear correlation as a result of the fact that the i-th random number generated, r[i], is always one of the two values r[i-3] + r[i-31] or r[i-3] + r[i-31] + 1. This can have surprising or at least puzzling consequences, particularly with Bernoulli experiments.
Here's an implementation using Agner Fog's useful vector class library, which abstracts away a lot of the annoying details in SSE intrinsics, and also helpfully comes with a fast vectorized random number generator (found in special.zip inside the vectorclass.zip archive), which lets us generate 256 bits from eight calls to the 256-bit PRNG. You can read Dr. Fog's explanation of why he finds even the Mersenne twister to have quality issues, and his proposed solution; I'm not qualified to comment, really, but it does at least appear to give expected results in the Bernoulli experiments I have tried with it.
#include "vectorclass/vectorclass.h"
#include "vectorclass/ranvec1.h"
class BiasedBits {
public:
// Default constructor, seeded with fixed values
BiasedBits() : BiasedBits(1) {}
// Seed with a single seed; other possibilities exist.
BiasedBits(int seed) : rng(3) { rng.init(seed); }
// Generate 256 random bits, each with probability `p/256` of being 1.
Vec8ui random256(unsigned p) {
if (p >= 256) return Vec8ui{ 0xFFFFFFFF };
Vec32c output{ 0 };
Vec32c threshold{ 127 - p };
for (int i = 0; i < 8; ++i) {
output += output;
output -= Vec32c(Vec32c(rng.uniform256()) > threshold);
}
return Vec8ui(output);
}
private:
Ranvec1 rng;
};
In my test, that produced and counted 268435456 bits in 260 ms, or one bit per nanosecond. The test machine is an i5, so it doesn't have AVX2; YMMV.
In the actual use case, with 201 possible values for p, the computation of 8-bit threshold values will be annoyingly imprecise. If that imprecision is undesired, you could adapt the above to use 16-bit thresholds, at the cost of generating twice as many random numbers.
Alternatively, you could hand-roll a vectorization based on 10-bit thresholds, which would give you a very good approximation to 0.5% increments, using the standard bit-manipulation hack of doing the vectorized threshold comparison by checking for borrow on every 10th bit of the subtraction of the vector of values and the repeated threshold. Combined with, say, std::mt19937_64, that would give you an average of six bits each 64-bit random number.
From an information-theoretic point of view, a biased stream of bits (with p != 0.5) has less information in it than an unbiased stream, so in theory it should take (on average) less than 1 bit of the unbiased input to produce a single bit of the biased output stream. For example, the entropy of a Bernoulli random variable with p = 0.1 is -0.1 * log2(0.1) - 0.9 * log2(0.9) bits, which is around 0.469 bits. That suggests that for the case p = 0.1 we should be able to produce a little over two bits of the output stream per unbiased input bit.
Below, I give two methods for producing the biased bits. Both achieve close to optimal efficiency, in the sense of requiring as few input unbiased bits as possible.
Method 1: arithmetic (de)coding
A practical method is to decode your unbiased input stream using arithmetic (de)coding, as already described in the answer from alexis. For this simple a case, it's not hard to code something up. Here's some unoptimised pseudocode (cough, Python) that does this:
import random

def random_bits():
    """
    Infinite generator generating a stream of random bits,
    with 0 and 1 having equal probability.
    """
    global bit_count  # keep track of how many bits were produced
    while True:
        bit_count += 1
        yield random.choice([0, 1])

def bernoulli(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    """
    bits = random_bits()
    low, high = 0.0, 1.0
    while True:
        if high <= p:
            # Generate 1, rescale to map [0, p) to [0, 1)
            yield 1
            low, high = low / p, high / p
        elif low >= p:
            # Generate 0, rescale to map [p, 1) to [0, 1)
            yield 0
            low, high = (low - p) / (1 - p), (high - p) / (1 - p)
        else:
            # Use the next random bit to halve the current interval.
            mid = 0.5 * (low + high)
            if next(bits):
                low = mid
            else:
                high = mid
Here's an example usage:
import itertools
bit_count = 0
# Generate a million deviates.
results = list(itertools.islice(bernoulli(0.1), 10**6))
print("First 50:", ''.join(map(str, results[:50])))
print("Biased bits generated:", len(results))
print("Unbiased bits used:", bit_count)
print("mean:", sum(results) / len(results))
The above gives the following sample output:
First 50: 00000000000001000000000110010000001000000100010000
Biased bits generated: 1000000
Unbiased bits used: 469036
mean: 0.100012
As promised, we've generated 1 million bits of our output biased stream using fewer than five hundred thousand from the source unbiased stream.
For optimisation purposes, when translating this into C / C++ it may make sense to code this up using integer-based fixed-point arithmetic rather than floating-point.
Method 2: integer-based algorithm
Rather than trying to convert the arithmetic decoding method to use integers directly, here's a simpler approach. It's not quite arithmetic decoding any more, but it's not totally unrelated, and it achieves close to the same output-biased-bit / input-unbiased-bit ratio as the floating-point version above. It's organised so that all quantities fit into an unsigned 32-bit integer, so should be easy to translate to C / C++. The code is specialised to the case where p is an exact multiple of 1/200, but this approach would work for any p that can be expressed as a rational number with reasonably small denominator.
def bernoulli_int(p):
    """
    Infinite generator generating 1-bits with probability p
    and 0-bits with probability 1 - p.
    p should be an integer multiple of 1/200.
    """
    bits = random_bits()
    # Assuming that p has a resolution of 0.005, find p / 0.005.
    p_int = int(round(200*p))
    value, high = 0, 1
    while True:
        if high < 2**31:
            high = 2 * high
            value = 2 * value + next(bits)
        else:
            # Throw out everything beyond the last multiple of 200, to
            # avoid introducing a bias.
            discard = high - high % 200
            split = high // 200 * p_int
            if value >= discard:  # rarer than 1 time in 10 million
                value -= discard
                high -= discard
            elif value >= split:
                yield 0
                value -= split
                high = discard - split
            else:
                yield 1
                high = split
With the same test code as before, but using bernoulli_int in place of bernoulli, I get the following results for p=0.1:
First 50: 00000010000000000100000000000000000000000110000100
Biased bits generated: 1000000
Unbiased bits used: 467997
mean: 0.099675
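For what it's worth, translating method 2 to C++ with the 1/256 resolution suggested earlier (so the division and modulo become shifts and masks) might look like this. It's a hedged sketch, not Mark Dickinson's own code; p is in units of 1/256 and should be in [1, 255]:

#include <cstdint>
#include <random>

class BiasedBitGen {
    std::mt19937_64 rng;
    uint64_t buf = 0;
    int avail = 0;
    uint32_t value = 0, high = 1;

    int next_unbiased() {                        // one unbiased bit at a time
        if (avail == 0) { buf = rng(); avail = 64; }
        int b = (int)(buf & 1); buf >>= 1; --avail;
        return b;
    }

public:
    explicit BiasedBitGen(uint64_t seed) : rng(seed) {}

    // Returns a bit that is 1 with probability p/256, p in [1, 255].
    int next(uint32_t p) {
        for (;;) {
            if (high < (1u << 31)) {             // value stays uniform on [0, high)
                high <<= 1;
                value = (value << 1) | (uint32_t)next_unbiased();
            } else {
                uint32_t discard = high & ~255u;     // high - high % 256
                uint32_t split = (high >> 8) * p;    // discard * p / 256
                if (value >= discard) {              // rare: retry on the remainder
                    value -= discard;
                    high -= discard;
                } else if (value >= split) {
                    value -= split;
                    high = discard - split;
                    return 0;
                } else {
                    high = split;
                    return 1;
                }
            }
        }
    }
};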
Let's say the probability of a 1 appearing is 6.25% (1/16). There are 16 possible bit patterns for a 4-bit number:
0000, 0001, ..., 1110, 1111.
Now, just generate a random number like you used to and replace every 1111 at a nibble boundary with a 1 and turn everything else to a 0.
Adjust accordingly for other probabilities.
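As bit operations, that nibble scan might look like this (a sketch; the eight result bits land at every fourth position and would still need compacting):

#include <cstdint>

// AND the four bits of each nibble together: the low bit of each nibble of the
// result is 1 iff that nibble of x was 1111, i.e. with probability 1/16.
uint32_t ones_per_nibble(uint32_t x) {
    return x & (x >> 1) & (x >> 2) & (x >> 3) & 0x11111111u;
}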
You'll get theoretically optimal behavior, i.e. make truly minimal use of the random number generator and be able to model any probability p exactly, if you approach this using arithmetic coding.
Arithmetic coding is a form of data compression that represents the message as a sub-interval of a number range. It provides theoretically optimal encoding, and can use a fractional number of bits for each input symbol.
The idea is this: Imagine that you have a sequence of random bits, which are 1 with probability p. For convenience, I will instead use q for the probability of the bit being zero (q = 1-p). Arithmetic coding assigns to each bit part of the number range. For the first bit, assign the interval [0, q) if the input is 0, and the interval [q, 1) if the input is 1. Subsequent bits assign proportional sub-intervals of the current range. For example, suppose that q = 1/3. The input 1 0 0 will be encoded like this:
Initially [0, 1), range = 1
After 1 [0.333, 1), range = 0.6666
After 0 [0.333, 0.5555), range = 0.2222
After 0 [0.333, 0.407407), range = 0.074074
The first digit, 1, selects the top two-thirds (1-q) of the range; the second digit, 0, selects the bottom third of that, and so on.
After the first and second step, the interval straddles the midpoint; but after the third step it is entirely below the midpoint, so the first compressed digit can be output: 0. The process continues, and a special EOF symbol is added as a terminator.
What does this have to do with your problem? The compressed output will have random zeros and ones with equal probability. So, to obtain bits with probability p, just pretend that the output of your RNG is the result of arithmetic coding as above, and apply the decoder process to it. That is, read bits as if they subdivide the line interval into smaller and smaller pieces. For example, after we read 01 from the RNG, we will be in the range [0.25, 0.5). Keep reading bits until enough output is "decoded". Since you're mimicking decompressing, you'll get more random bits out than you put in. Because arithmetic coding is theoretically optimal, there's no possible way to turn the RNG output into more biased bits without sacrificing randomness: you're getting the true maximum.
The catch is that you can't do this in a couple of lines of code, and I don't know of a library I can point you to (though there must be some you could use). Still, it's pretty simple. The above article provides code for a general-purpose encoder and decoder, in C. It's pretty straightforward, and it supports multiple input symbols with arbitrary probabilities; in your case a far simpler implementation is possible (as Mark Dickinson's answer now shows), since the probability model is trivial. For extended use, a bit more work would be needed to produce a robust implementation that does not do a lot of floating-point computation for each bit.
Wikipedia also has an interesting discussion of arithmetic encoding considered as change of radix, which is another way to view your task.
Uh, pseudo-random number generators are generally quite fast. I'm not sure what language this is (Python, perhaps), but "result.append" (which almost certainly contains memory allocation) is likely slower than "random_uniform" (which just does a little math).
If you want to optimize the performance of this code:
Verify that it is a problem. Optimizations are a bit of work and make the code harder to maintain. Don't do them unless necessary.
Profile it. Run some tests to determine which parts of the code are actually the slowest. Those are the parts you need to speed up.
Make your changes, and verify that they actually are faster. Compilers are pretty smart; often clear code will compile into better code than something complex that might appear faster.
If you are working in a compiled language (even JIT compiled), you take a performance hit for every transfer of control (if, while, function call, etc). Eliminate what you can. Memory allocation is also (usually) quite expensive.
If you are working in an interpreted language, all bets are off. The simplest code is very likely the best. The overhead of the interpreter will dwarf whatever you are doing, so reduce its work as much as possible.
I can only guess where your performance problems are:
Memory allocation. Pre-allocate the array at its full size and fill in the entries later. This ensures that the memory won't need to be reallocated while you're adding the entries.
Branches. You might be able to avoid the "if" by casting the result or something similar. This will depend a lot on the compiler. Check the assembly (or profile) to verify that it does what you want.
Numeric types. Find out the type your random number generator uses natively, and do your arithmetic in that type. For example, if the generator naturally returns 32-bit unsigned integers, scale "p" to that range first, then use it for the comparison.
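For instance, here is a hedged sketch of the last two points combined (rng32() stands in for whatever your generator natively returns): p is scaled once to the generator's integer range, and the comparison itself produces the 0/1 result with no if.
#include <stddef.h>
#include <stdint.h>

extern uint32_t rng32(void);          /* assumed native 32-bit generator */

void fill_biased(uint8_t *out, size_t n, double p)   /* requires p < 1.0 */
{
    uint32_t threshold = (uint32_t)(p * 4294967296.0);  /* p scaled to 2^32 */
    for (size_t i = 0; i < n; i++)
        out[i] = rng32() < threshold; /* comparison yields 0 or 1, branch-free */
}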
By the way, if you really want to use the fewest bits of randomness possible, use "arithmetic coding" to decode your random stream. It won't be fast.
One way that would give a precise result is to first randomly generate for a k-bit block the number of 1 bits following the binomial distribution, and then generate a k-bit word with exactly that many bits using one of the methods here. For example the method by mic006 needs only about log k k-bit random numbers, and mine needs only one.
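The second step of that scheme, generating a k-bit word with an exact number of set bits, might look like the sketch below, using a partial Fisher-Yates shuffle over bit positions (rand_below(n) is an assumed helper returning a uniform integer in [0, n)); drawing the bit count from the binomial distribution is left to the methods linked above.
#include <stdint.h>

extern uint32_t rand_below(uint32_t n);   /* assumed uniform in [0, n) */

uint32_t word_with_m_bits(int k, int m)   /* k <= 32, 0 <= m <= k */
{
    uint8_t pos[32];
    for (int i = 0; i < k; i++)
        pos[i] = (uint8_t)i;
    uint32_t w = 0;
    for (int i = 0; i < m; i++) {         /* pick m distinct bit positions */
        uint32_t j = (uint32_t)i + rand_below((uint32_t)(k - i));
        uint8_t t = pos[i]; pos[i] = pos[j]; pos[j] = t;
        w |= 1u << pos[i];
    }
    return w;
}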
If p is close to 0, you can calculate the probability that the n-th bit is the first bit that is 1; then you calculate a random number between 0 and 1 and pick n accordingly. For example if p = 0.005 (0.5%), and the random number is 0.638128, you might calculate (I'm guessing here) n = 321, so you fill with 321 0 bits and one bit set.
If p is close to 1, use 1-p instead of p, and set 1 bits plus one 0 bit.
If p isn't close to 1 or 0, make a table of all 256 sequences of 8 bits, calculate their cumulative probabilities, then get a random number, do a binary search in the array of cumulative probabilities, and you can set 8 bits.
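For the p-close-to-0 case, the position of the next 1 bit follows a geometric distribution and can be drawn exactly by inversion; here is a sketch (uniform01() is an assumed source of uniform doubles in (0,1)).
#include <math.h>

extern double uniform01(void);   /* assumed uniform in (0,1) */

/* Number of 0 bits to emit before the next 1 bit, for small p. */
long next_one_gap(double p)
{
    return (long)floor(log1p(-uniform01()) / log1p(-p));
}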
Assuming that you have access to a generator of random bits, you can generate a value to compare with p bit by bit, and abort as soon as you can prove that the generated value is less than p, or greater than or equal to p.
Proceed as follows to create one item in a stream with given probability p:
1. Start with 0. in binary.
2. Append a random bit; assuming that a 1 has been drawn, you'll get 0.1.
3. If the result (in binary notation) is provably smaller than p, output a 1.
4. If the result is provably larger than or equal to p, output a 0.
5. Otherwise (if neither can be ruled out), proceed with step 2.
Let's assume that p in binary notation is 0.1001101...; if this process generates any of 0.0, 0.1000, 0.10010, ..., the value cannot become larger than or equal to p anymore; if any of 0.11, 0.101, 0.100111, ... is generated, the value cannot become smaller than p.
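A sketch of this procedure in C, with p supplied as a 32-bit fixed-point fraction p32/2^32 (rand_bit() is an assumed fair-bit source): the first random bit that differs from the corresponding bit of p settles the comparison.
#include <stdint.h>

extern int rand_bit(void);        /* assumed fair coin: 0 or 1 */

int bit_by_bit(uint32_t p32)      /* outputs 1 with probability p32 / 2^32 */
{
    for (int i = 31; i >= 0; i--) {
        int pb = (p32 >> i) & 1;  /* next bit of p after the binary point */
        int rb = rand_bit();
        if (rb != pb)
            return rb < pb;       /* generated value < p exactly when rb=0, pb=1 */
    }
    return 0;                     /* all 32 bits matched: value == p */
}
Each iteration settles the outcome with probability 1/2, which matches the two-bits-in-expectation estimate below.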
To me, it looks like this method uses about two random bits in expectation. Arithmetic coding (as shown in the answer by Mark Dickinson) consumes at most one random bit per biased bit (on average) for fixed p; the cost of modifying p is unclear.
What it does
This implementation makes a single call to the random device kernel module via the "/dev/urandom" special character file interface to get the amount of random data needed to represent all values at the given resolution. The maximum possible resolution is 1/256^2, so that 0.005 can be represented by:
328/256^2,
i.e.:
resolution: 256*256
x: 328
with error 0.000004883.
How it does that
The implementation calculates bits_per_byte, the number of uniformly distributed bits needed to handle the given resolution, i.e. to represent all resolution values. It then makes a single call to the randomization device ("/dev/urandom" if URANDOM_DEVICE is defined; otherwise "/dev/random", which uses additional noise from device drivers and may block if there is not enough entropy) to get the required number of uniformly distributed bytes, filling the array rnd_bytes. Finally, it reads the needed number of bits for each Bernoulli sample from each bytes_per_byte-byte chunk of rnd_bytes and compares the integer value of those bits to the probability of success in a single Bernoulli outcome, given by x/resolution. If the value hits, i.e. falls in the segment of length x/resolution, which we arbitrarily choose to be [0, x/resolution), then we note a success and insert 1 into the resulting array.
Read from random device:
/* if defined use /dev/urandom (will not block),
 * if not defined use /dev/random (may block) */
#define URANDOM_DEVICE 1

/*
 * @brief Read @outlen bytes from the random device
 *        into array @out.
 */
int
get_random_samples(char *out, size_t outlen)
{
    ssize_t res;
#ifdef URANDOM_DEVICE
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd == -1) return -1;
    res = read(fd, out, outlen);
    if (res < 0) {
        close(fd);
        return -2;
    }
#else
    size_t read_n;
    int fd = open("/dev/random", O_RDONLY);
    if (fd == -1) return -1;
    read_n = 0;
    while (read_n < outlen) {   /* /dev/random may deliver partial reads */
        res = read(fd, out + read_n, outlen - read_n);
        if (res < 0) {
            close(fd);
            return -3;
        }
        read_n += res;
    }
#endif /* URANDOM_DEVICE */
    close(fd);
    return 0;
}
Fill in vector of Bernoulli samples:
/*
 * @brief Draw a vector of Bernoulli samples.
 * @details @x and @resolution determine the probability
 *          of success in the Bernoulli distribution
 *          and the accuracy of the results: p = x/resolution.
 * @param resolution: number of segments per sample of the output array
 *        as a power of 2: max resolution supported is 2^24=16777216
 * @param x: determines the probability used, x = [0, resolution - 1]
 * @param n: number of samples in the result vector
 */
int
get_bernoulli_samples(char *out, uint32_t n, uint32_t resolution, uint32_t x)
{
    int res;
    size_t i, j;
    uint32_t bytes_per_byte, word;
    unsigned char *rnd_bytes;
    uint32_t uniform_byte;
    uint8_t bits_per_byte;

    if (out == NULL || n == 0 || resolution == 0 || x > (resolution - 1))
        return -1;

    bits_per_byte = log_int(resolution);
    bytes_per_byte = bits_per_byte / BITS_PER_BYTE +
                     (bits_per_byte % BITS_PER_BYTE ? 1 : 0);
    rnd_bytes = malloc(n * bytes_per_byte);
    if (rnd_bytes == NULL)
        return -2;
    res = get_random_samples(rnd_bytes, n * bytes_per_byte);
    if (res < 0)
    {
        free(rnd_bytes);
        return -3;
    }

    i = 0;
    while (i < n)
    {
        /* get one Bernoulli sample */
        /* assemble a word from the next bytes_per_byte random bytes */
        j = 0;
        word = 0;
        while (j < bytes_per_byte)
        {
            word |= (rnd_bytes[i * bytes_per_byte + j] << (BITS_PER_BYTE * j));
            ++j;
        }
        uniform_byte = word & ((1u << bits_per_byte) - 1);
        /* decision */
        if (uniform_byte < x)
            out[i] = 1;
        else
            out[i] = 0;
        ++i;
    }
    free(rnd_bytes);
    return 0;
}
Usage:
int
main(void)
{
    int res;
    char c[256];

    res = get_bernoulli_samples(c, sizeof(c), 256*256, 328); /* 328/(256^2) = 0.0050 */
    if (res < 0) return -1;

    return 0;
}
Complete code, results.
Although this question is 5 years old, I believe I have something of value to add. While SIMD and arithmetic decoding are undoubtedly great techniques, it's hard to ignore that the bitwise boolean arithmetic suggested by @mindriot is very simple and easy to grasp.
However, it's not immediately apparent how you would go about efficiently and quickly implementing this solution. For 256 bits (0.00390625) of resolution, you could write a switch statement with 256 cases and then manually determine the required boolean expression by hand for each case. It would take a while to program this but it will compile down to a very fast jump table in C/C++.
But, what if you want 2^16 bits of resolution, or even 2^64? The latter is a resolution of 5.4210109E-20, more precise than most of us would ever need. The task is absolutely impossible by hand, but we can actually construct a small virtual machine to do this quickly in just 30 lines of C code.
Let's construct the machine for 256 bits of resolution. I'll define probability = resolution/256. e.g., when resolution = 64, then probability = 0.25. As it turns out, the numerator (resolution) actually implicitly encodes the required boolean operations in its binary representation.
For example, what expression generates probability = 0.69140625 = 177/256? The resolution is 177, which in binary is 10110001. Let AND = 0 and OR = 1. We start after the first nonzero least significant bit and read toward the most significant bit. Map the 0/1 to AND/OR. Thus, starting from b1 and reading right to left, we generate the boolean expression (((((((b1 and b2) and b3) and b4) or b5) or b6) and b7) or b8). A computer-generated truth table will confirm that 177 of the 256 cases yield True. To give another example, probability = 0.4375 = 112/256 gives the resolution in binary as 01110000. Reading the 3 bits in order after the first nonzero LSB (011) gives ((b1 | b2) | b3) & b4.
Since all we need are the two AND and OR operations, and since the resolution encodes the exact boolean expression we need, a virtual machine can be programmed which interprets the resolution as bitcode. AND and OR are just opcodes that act immediately on the output of an unbiased random number generator. Here is my sample C code:
uint64_t rng_bias (uint64_t *state, const uint8_t resolution)
{
    if (state == NULL) return 0;

    //registers
    uint64_t R0 = 0;
    uint8_t PC = __builtin_ctz(resolution|0x80);

    //opcodes
    enum
    {
        OP_ANDI = 0,
        OP_ORI = 1,
    };

    //execute instructions in sequence from LSB -> MSB
    while (PC != (uint8_t) 0x8)
    {
        switch((resolution >> PC++) & (uint8_t) 0x1)
        {
            case OP_ANDI:
                R0 &= rng_generator(state);
                break;

            case OP_ORI:
                R0 |= rng_generator(state);
                break;
        }
    }

    return R0;
}
The virtual machine is nothing more than 2 registers and 2 opcodes. I am using GCC's builtin function ctz, which counts the trailing zero bits, so that I can easily find the first nonzero LSB. I bitwise-or the ctz argument with 0x80 because passing zero is undefined. Any other decent compiler should have a similar function. Notice that, unlike the examples I showed by hand, the VM interprets the bitcode starting on the first nonzero LSB, not after. This is because I need to make at least one call to the PRNG to generate the base p=0.5 and p=0.0 cases.
The state pointer and rng_generator() calls are used to interface with your random number generator. For example, for demonstration purposes I can use Marsaglia's Xorshift64:
uint64_t rng_generator(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}
All the user/you need to do is manage a separate uint64_t state variable, which must be appropriately seeded prior to using either function.
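Putting the two functions together, usage might look like this (the seed is an arbitrary nonzero example):
uint64_t state = 0x9E3779B97F4A7C15ULL;   /* any nonzero seed will do */
uint64_t words = rng_bias(&state, 177);   /* 64 bits, each 1 with probability 177/256 */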
It is extremely easy to scale to 2^64 bits or whatever other arbitrary resolution is desired: use __builtin_ctzll instead for unsigned long long arguments, change the uint8_t types to uint64_t, and change the while-loop check from 8 to 64. That's it! Now, with at most 64 calls to the PRNG, which is fairly fast, we have access to 5.4210109E-20 resolution.
The key here is that we get the bitcode practically for free. No lexing, parsing, or any other typical VM interpreter tasks. The user provides it via the resolution, without ever realizing it. As far as they're concerned, it's just the numerator of the probability. As far as we, the implementers, are concerned, it's nothing more than a string of bitcode for our VM to interpret.
Explaining why the bitcode works requires a whole different and much longer essay. In probability theory, the problem is to determine the generating event (the set of all sample points) of a given probability. It is not unlike the usual inverse CDF problem for generating random numbers from a density function. From a computer science viewpoint, in the 256-bit-resolution case, we are traversing a depth-8 binary tree where each node represents a probability. The parent node is p=0.5. Left traversal indicates AND operations, right traversal indicates OR. The traversal and node depth map directly to the LSB->MSB bit encoding that we discussed several paragraphs before.

Calculate and Store Power of very large Number

I am finding pow(2,i) where i can range: 0<=i<=100000.
Apart from that, I have MOD=1000000007.
int powers[100001];   /* 100001 slots, for i = 0..100000 */
powers[0] = 1;
for (i = 1; i <= 100000; ++i)
{
    powers[i] = (powers[i-1] * 2) % MOD;
}
For i=100000, won't the power value become greater than MOD?
How do I store the power correctly?
The operation doesn't look feasible to me.
I am getting correct values only up to about i=70, I guess.
I have to find sum += ar[i]*power(2,i) and finally print sum % 1000000007, where ar[i] is an additional array with some big numbers, up to 10^5 of them.
As long as your modulus value is less than half the capacity of your data type, it will never be exceeded. That's because you take the previous value in the range 0..1000000006, double it, then re-modulo it bringing it back to that same range.
However, I can't guarantee that higher values won't cause you troubles, it's more mathematical analysis than I'm prepared to invest given the simple alternative. You could spend a lot of time analysing, checking and debugging, but it's probably better just to not allow the problem to occur in the first place.
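Concretely, the loop's invariant can be written down as a one-line sketch; the long long cast is only belt and braces, since 2*(MOD-1) = 2000000012 still fits in a 32-bit int.
#define MOD 1000000007

int next_power(int prev)   /* prev is in [0, MOD) */
{
    return (int)(((long long)prev * 2) % MOD);   /* result lands back in [0, MOD) */
}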
The alternative? I'd tend to use the pre-generation method (having a program do the gruntwork up front, inserting the pre-generated values into an array easily and speedily accessible from your real program).
With this method, you can use tools that are well tested and known to work with massive values. Since this data is not going to change, it's useless calculating it every time your program starts.
If you want an easy (and efficient) way to do this, the following bash script in conjunction with bc and awk can do this:
#!/usr/bin/bash
bc >nums.txt <<EOF
i = 1;
for (x = 0; x <= 100000; x++) {
    i % 1000000007;
    i = i * 2 % 1000000007;
}
EOF
awk 'BEGIN { printf "static int array[] = {" }
{ if (NR % 5 == 1) printf "\n ";
printf "%s, ",$0;
next
}
END { print "\n};" }' nums.txt
The bc part is the "meat" of the matter, it creates the large powers of two and outputs them modulo the number you provided. The awk part is simply to format them in C-style array elements, five per line.
Just take the output of that and put it into your code and, voila, there you have it, a compile-time-expensed array that you can use for fast lookup.
It takes only a second and a half on my box to generate the array and then you never need to do it again. You also won't have to concern yourself with the vagaries of modulo math :-)
static int array[] = {
1,2,4,8,16,
32,64,128,256,512,
1024,2048,4096,8192,16384,
32768,65536,131072,262144,524288,
1048576,2097152,4194304,8388608,16777216,
33554432,67108864,134217728,268435456,536870912,
73741817,147483634,294967268,589934536,179869065,
359738130,719476260,438952513,877905026,755810045,
511620083,23240159,46480318,92960636,185921272,
371842544,743685088,487370169,974740338,949480669,
898961331,797922655,595845303,191690599,383381198,
766762396,533524785,67049563,134099126,268198252,
536396504,72793001,145586002,291172004,582344008,
164688009,329376018,658752036,317504065,635008130,
270016253,540032506,80065005,160130010,320260020,
640520040,281040073,562080146,124160285,248320570,
:
861508356,723016705,446033403,892066806,784133605,
568267203,136534399,273068798,546137596,92275185,
184550370,369100740,738201480,476402953,952805906,
905611805,
};
Notice that your modulus can be stored in an int: MOD=1000000007 (decimal) is 0b00111011100110101100101000000111 in binary and fits in 32 bits.
i    pow(2,i)     bit representation
0    1            0b00000000000000000000000000000001
1    2            0b00000000000000000000000000000010
2    4            0b00000000000000000000000000000100
3    8            0b00000000000000000000000000001000
...
29   536870912    0b00100000000000000000000000000000
The tricky part starts when pow(2,i) is greater than MOD=1000000007, but if you know that the current pow(2,i) will be greater than MOD, you can actually see what the bits look like after taking the MOD:
i    pow(2,i)      pow(2,i)%MOD   bit representation
30   1073741824    73741817       0b000100011001010011010111111001
31   2147483648    147483634      0b001000110010100110101111110010
32   4294967296    294967268      0b010001100101001101011111100100
33   8589934592    589934536      0b100011001010011010111111001000
So if you have pow(2,i-1)%MOD, you can simply double that, reducing modulo MOD again whenever the result would exceed it.
For example, for i=34 you will use (589934536*2) % 1000000007 instead of (8589934592*2) % 1000000007, because 8589934592 can't be stored in an int.
Additionally, you can use a bit operation instead of the multiplication for pow(2,i): multiplying by 2 is the same as a left shift by one bit.
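Putting both remarks together, a possible sketch: double by shifting left, then subtract MOD once if the result passed it (the intermediate value stays below 2*MOD, which fits comfortably in a 32-bit unsigned int).
#include <stdint.h>

#define MOD 1000000007u

uint32_t powers[100001];

void fill_powers(void)
{
    powers[0] = 1;
    for (int i = 1; i <= 100000; i++) {
        uint32_t x = powers[i - 1] << 1;        /* *2 as a left shift */
        powers[i] = (x >= MOD) ? x - MOD : x;   /* reduce without division */
    }
}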

Data structure for fast range searches of dense dataset 4D vectors

I have millions of unstructured 3D vectors associated with arbitrary values, making for a set of 4D vectors. To make it simpler to understand: I have unix time stamps associated with hundreds of thousands of 3D vectors. And I have many time stamps, making for a very large dataset; upwards of 30 million vectors.
I need to search the particular datasets belonging to specific time stamps.
So let's say I have the following data:
For time stamp 1407633943:
(0, 24, 58, 1407633943)
(9, 2, 59, 1407633943)
...
For time stamp 1407729456:
(40, 1, 33, 1407729456)
(3, 5, 7, 1407729456)
...
etc etc
And I wish to make a very fast query along the lines of:
Query Example 1:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99
Give me a list of those vectors, so I can find the timestamps.
Query Example 2:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99 && W (timestamp) = 1407729456
So far I've used SQLite for the task, but even after column indexing, queries take between 500 ms and 7 s each. I'm looking for a solution in the range of 50-200 ms per query.
What sort of structures or techniques can I use to speed the query up?
Thank you.
kd-trees can be helpful here. Range search in a kd-tree is a well-known problem. The time complexity of one query depends on the output size, of course (in the worst case the whole tree will be traversed if every vector matches). But it can work pretty fast on average.
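A minimal sketch of such a range query over a 3D kd-tree (node layout and names are illustrative, not a tuned implementation):
#include <stddef.h>

typedef struct Node {
    float pt[3];                  /* x, y, z */
    long  ts;                     /* timestamp payload */
    int   axis;                   /* splitting dimension, 0..2 */
    struct Node *left, *right;    /* left subtree: coordinate <= pt[axis] */
} Node;

/* Report every point inside the axis-aligned box [lo, hi]. */
void range_query(const Node *n, const float lo[3], const float hi[3],
                 void (*emit)(const Node *))
{
    if (n == NULL)
        return;
    if (n->pt[0] >= lo[0] && n->pt[0] <= hi[0] &&
        n->pt[1] >= lo[1] && n->pt[1] <= hi[1] &&
        n->pt[2] >= lo[2] && n->pt[2] <= hi[2])
        emit(n);
    /* Descend only into children whose half-space can intersect the box. */
    if (lo[n->axis] <= n->pt[n->axis]) range_query(n->left, lo, hi, emit);
    if (hi[n->axis] >= n->pt[n->axis]) range_query(n->right, lo, hi, emit);
}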
I would use an octree. In each node I would store the vectors in arrays held in a hashtable that uses the timestamp as a key, roughly as sketched below.
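The node layout might look roughly like this (a sketch only; the bucket count and field names are arbitrary choices):
#include <stdint.h>
#include <stddef.h>

#define TS_BUCKETS 64

typedef struct TsEntry {           /* all vectors sharing one timestamp */
    int64_t  timestamp;
    float  (*vecs)[3];             /* the 3D vectors for this timestamp */
    size_t   count;
    struct TsEntry *next;          /* chaining on hash collisions */
} TsEntry;

typedef struct OctNode {
    float center[3], half;         /* the cube this node covers */
    struct OctNode *child[8];      /* all NULL for a leaf */
    TsEntry *table[TS_BUCKETS];    /* hashtable keyed by timestamp */
} OctNode;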
To further increase the performance you can use CUDA, OpenCL, OpenACC, OpenMP and implement the algorithms to be executed in parallel on the GPU or a multi-core CPU.
BKaun: please accept my attempt at giving you some insight into the problem at hand. I suppose you have thought of every one of my points, but maybe seeing them here will help.
Regardless of how the ingested data is presented, consider that, using the C programming language, you can reduce the storage size of the data to minimize space and search time. You will be searching for, loading, and parsing single bits of a vector instead of, say, a SHORT INT, which is 2 bytes for every entry, or a FLOAT, which is much more. The object, as I understand it, is to search the given data for given values of X, Y, and Z and then find the timestamp associated with these 3 while optimizing the search. My solution does not go into the search itself, but merely the data that is used in a search.
To illustrate my hints simply, I'm considering that the data consists of 4 vectors:
X between -2 and 7,
Y between 0.17 and 3.08,
Z between 0 and 50,
timestamp (many of same size - 10 digits)
To optimize, consider how many various numbers each vector can have in it:
1. X can be only 10 numbers (including 0)
2. Y can be 3.08 minus 0.17 = 2.91 x 100 + 1 = 292 numbers (counting both endpoints)
3. Z can be 51 numbers
4. timestamp can be many (but in this scenario,
you are not searching for a certain one)
Consider how each variable is stored as a binary:
1. Each entry in Vector X COULD be stored in 4 bits, using the first bit=1 for
the negative sign:
7="0111"
6="0110"
5="0101"
4="0100"
3="0011"
2="0010"
1="0001"
0="0000"
-1="1001"
-2="1010"
However, the original data that you are searching through may range
from -10 to 20!
Therefore, adding another 2 bits gives you a table like this:
-10="101010"
-9="101001" ...
...
-2="100010"
-1="100001" ...
...
8="001000"
9="001001" ...
...
19="010011"
20="010100"
And that's only 6 bits to store each X vector entry for integers from -10 to 20
For search purposes on a range of -10 to 20, there are 31 different X Vector entries
possible to search through.
Each entry in Vector Y COULD be stored in 9 bits (no extra sign bit is needed)
The 1's and 0's COULD be stored (accessed, really) in 2 parts
(the integer part, and a 2-digit decimal).
Part 1 can be 0, 1, 2, or 3 (four 2-bit patterns from "00" to "11")
However, if the range of the entire Y dataset is 0 to 10,
part 1 can be 0, 1, ...9, 10 (which is eleven 4-bit patterns
from "0000" to "1010")
Part 2 can be 00, 01,...98, 99 (one hundred 7-bit patterns from "0000000" to "1100100")
Total storage bits for Vector Y entries is 4 + 7 = 11 bits in the
range 00.00 to 10.99
For search purposes on a range 00.00 to 10.99, there are 1100 different Y Vector
entries possible to search through (11 x 100)
Each entry in Vector Z in the range of 0 to 50 COULD be stored in 6 bits
("000000" to "110010").
Again, the actual data range may be 7 bits long: for simplicity's sake,
0 to 64 ("0000000" to "1000000")
For search purposes on a range of 0 to 64, there are 65 different Z Vector entries
possible to search through.
Consider that you will be storing the data in this optimized format, in a single
succession of bits:
X=4 bits + 2 range bits = 6 bits
+ Y=4 bits part 1 and 7 bits part 2 = 11 bits
+ Z=7 bits
+ timestamp (10 numbers - each from 0 to 9 ("0000" to "1001") 4 bits each = 40 bits)
= TOTAL BITS: 6 + 11 + 7 + 40 = 64 stored bits for each 4D vector
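In C, that 64-bit packing might look like the sketch below (the field widths follow the totals above; names and field order are illustrative).
#include <stdint.h>

typedef uint64_t PackedVec;   /* layout: [X:6][Y:11][Z:7][timestamp:40], MSB first */

PackedVec pack(uint32_t x6, uint32_t y11, uint32_t z7, uint64_t ts40)
{
    return ((uint64_t)(x6  & 0x3Fu)  << 58) |
           ((uint64_t)(y11 & 0x7FFu) << 47) |
           ((uint64_t)(z7  & 0x7Fu)  << 40) |
           (ts40 & ((1ULL << 40) - 1));
}

uint32_t unpack_x(PackedVec v) { return (uint32_t)(v >> 58) & 0x3Fu; }
uint32_t unpack_z(PackedVec v) { return (uint32_t)(v >> 40) & 0x7Fu; }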
THE SEARCH:
Input xx, yy, zz to search for in arrays X, Y and Z (which are stored in binary)
Change xx, yy, and zz to binary bit strings per optimized format above.
function(xx, yy, zz)
Search for X first, since it has 31 possible outcomes (range is -10 to 20)
- the lowest number of any array
First search for positive targets (there are 8 of them and better chance
of finding one)
These all start with "000"
7="000111"
6="000110"
5="000101"
4="000100"
3="000011"
2="000010"
1="000001"
0="000000"
So you can check if the first 3 bits = "000". If so, you have a number
between 0 and 7.
Found: search for Z
Else search for xx=-2 or -1: does X = -2="100010" or -1="100001" ?
(do second because there are only 2 of them)
Found: Search for Z
NotFound: next X
Search for Z after X is Found: (Z second, since it has 65 possible outcomes
- range is 0 to 64)
You are searching for 6 bits of a 7-bit binary number
("0000000" to "1000000"). If bits 1,2,3,4,5,6 are all "0", analyze bit 0 (the MSB in this numbering).
If it is "1" (the value is 64), next Z
Else begin searching 6 bits ("000000" to "110010") with LSB first
Found: Search for Y
NotFound: Next X
Search for Y (Y last, since it has 1100 possible outcomes - range is 0.00 to 10.99)
Search for Part 1 (integer part) bits (you are searching for
"0000", "0001", "0010" or "0011" only, so use yyPt1=YPt1)
Found: Search for Part 2 ("0000000" to "1100100") using yyPt2=YPt2
(direct comparison)
Found: Print out X, Y, Z, and timestamp
NotFound: Search criteria for X, Y, and Z not found in data.
Print X,Y,Z,"timestamp not found". Ask for new X, Y, Z. New search.