Mapping a continuous range into discrete bins in C++ - c++

I've inherited maintenance of a function that takes as parameter a value between 0 and 65535 (inclusive):
MyClass::mappingFunction(unsigned short headingIndex);
headingIndex can be converted to degrees using the following formula: degrees = headingIndex * 360 / 65536
The role of this function is to translate the headingIndex into 1 of 36 symbols representing various degrees of rotation, i.e. there is a symbol for 10 degrees, a symbol for 20 degrees etc, up to 360 degrees in units of 10 degrees.
A headingIndex of 0 would translate to displaying the 0 (360) degree symbol.
The function performs the following which I can't seem to get my head around:
const int MAX_INTEGER = 65536;
const int NUM_SYMBOLS = 36;
int symbolRange = NUM_SYMBOLS - 1;
int roundAmount = MAX_INTEGER / (symbolRange + 1) - 1;
int roundedIndex = headingIndex + roundAmount;
int symbol = (symbolRange * roundedIndex) / MAX_INTEGER;
I'm confused about the algorithm that is being used here, specifically with regard to the following:
The intention behind roundAmount? I understand it is essentially dividing the maximum input range into discrete chunks but to then add it on to the headingIndex seems a strange thing to do.
roundedIndex is then the original value now offset or rotated by some offset in a clockwise direction?
The algorithm produces results such as:
headingIndex of 0 --> symbol 0
headingIndex of 100 --> symbol 1
headingIndex of 65500 --> symbol 35
I'm thinking there must be a better way of doing this?

The shown code looks very convoluted (it is possibly a guard against integer overflow). A far simpler way to determine the symbol number would be code like the following:
symbol = (headingIndex * 36u) / 65536u;
However, if this does present problems with integer overflow, then the calculation could be done in double precision, converting the result back to int after rounding:
symbol = static_cast<int>( ((headingIndex * 36.0) / 65536.0) + 0.5 ); // Add 0.5 for rounding.

You have 65536 possible inputs (0..65535) and 36 outputs (0..35). That means each output bin should represent about 1820 inputs if they are divided equally.
The above formula doesn't do that. Only the first 54 values are in bin 0, then they are equally divided across the remaining 35 bins (MAX_INTEGER/symbolRange). About 1872 per bin.
To show this, solve for the boundary where symbol becomes 1: 1 * 65536 = 35 * (headingIndex + 1819) gives headingIndex ≈ 53.5, so headingIndex values 0 to 53 map to symbol 0 and 54 is the lowest value that maps to symbol 1.
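For anyone who wants to check the distribution claims, here is a small sketch (mine, not from the inherited code) that simply counts how many headingIndex values fall into each symbol for both formulas:

#include <iostream>
#include <map>

int main() {
    const int MAX_INTEGER = 65536;   // size of the input range, 0..65535
    const int NUM_SYMBOLS = 36;

    std::map<int, int> oldCounts, newCounts;   // symbol -> how many inputs map to it
    for (int headingIndex = 0; headingIndex < MAX_INTEGER; ++headingIndex) {
        const int symbolRange = NUM_SYMBOLS - 1;
        const int roundAmount = MAX_INTEGER / (symbolRange + 1) - 1;
        const int oldSymbol = (symbolRange * (headingIndex + roundAmount)) / MAX_INTEGER;
        const int newSymbol = (NUM_SYMBOLS * headingIndex) / MAX_INTEGER;
        ++oldCounts[oldSymbol];
        ++newCounts[newSymbol];
    }
    std::cout << "inherited: bucket 0 holds " << oldCounts[0]
              << " values, bucket 35 holds " << oldCounts[35] << "\n";
    std::cout << "simple:    bucket 0 holds " << newCounts[0]
              << " values, bucket 35 holds " << newCounts[35] << "\n";
}

It prints 54 and 1819 for the inherited formula (the small first and last buckets) and 1821 and 1820 for the simple one.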

If you want to keep the output the same but just tidy it up: walk away.
There are odd features of that method that may or may not be what is desired.
The range for headingIndex of 0 - 53 gives a symbol of 0. That's a bucket (AKA bin) of 54 values.
The range of 63717 - 65535 gives 35. That bucket is 1819 values.
All the other buckets are either 1872 or 1873 values so seem 'big'.
We can't have equal sized buckets because number of values is 65536 and 65536/36 is 1820 and 16 remainder.
So we need to bury the 16 among the buckets. They have to be uneven in size.
Notice the constant MAX_INTEGER is a red herring. The max is 65535. 65536 is the range. The chosen name is misleading from the start.
Why:
int symbolRange = NUM_SYMBOLS - 1;
int roundAmount = MAX_INTEGER / (symbolRange + 1) - 1;
when the second line could be int roundAmount = MAX_INTEGER / NUM_SYMBOLS - 1;
It doesn't look quite thought through is all I'm saying. But looks can be deceptive.
What also bothers me is the 'obvious' method proposed in other answers works great!
int symbol=(NUM_SYMBOLS*headingIndex)/(MAX_INTEGER);
Gives us buckets of either 1820 or 1821 values with an even distribution. I'd say that's the natural solution to the question as asked.
So why the current method? Is it some artefact of some measuring device?
I'll put money the maximum value is 65535 because that's the maximum value of an unsigned 16-bit integer.
It's right to wonder about overflow. But if you're working in 16-bits it's already broken. So I wonder about a device that is recording 16-bits. That's quite realistic.
This is similar to what I know as "The Instalments Problem".
We want the customer to pay £655.36 over 36 months. Do they pay £18.20 a month, totalling £655.20, and we forget the 16p? They won't pay £18.21, totalling £655.56, and overpay 20p. A bigger first payment of £18.36 and then 35 of £18.20?
People wrestle with this one. The business answers are 'get the money' (bigger first payment), avoid complaints if they owe you money (big last payment), and forget the pennies (all the same; we're bigger than a few pence!).
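As a purely illustrative sketch (working in pence to stay in integers, with the figures from the example above), the three schemes work out like this:

#include <iostream>

int main() {
    const int total = 65536;   // £655.36 in pence
    const int months = 36;

    const int even = total / months;                       // 1820p = £18.20, loses 16p overall
    const int roundedUp = even + 1;                        // 1821p = £18.21, overpays 20p overall
    const int bigFirst = total - even * (months - 1);      // 1836p first, then 35 x 1820p

    std::cout << "forget the pennies: 36 x " << even
              << "p = " << even * months << "p\n";
    std::cout << "round up:           36 x " << roundedUp
              << "p = " << roundedUp * months << "p\n";
    std::cout << "big first payment:  " << bigFirst << "p + 35 x " << even
              << "p = " << bigFirst + even * (months - 1) << "p\n";
}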
In arithmetic terms, for a measurement (such as degrees) I'd say the sprinkled method offered is the most natural: it's even and distributes the anomaly evenly.
But it's not the only answer. Up to you. Hint: if you haven't been asked to fix this and just think it's ugly, walk away. Walk away now.

Related

How to write this floating point code in a portable way?

I am working on a cryptocurrency and there is a calculation that nodes must make:
average /= total;
double ratio = average/DESIRED_BLOCK_TIME_SEC;
int delta = -round(log2(ratio));
It is required that every node gets the exact same result no matter what architecture or stdlib the system uses. My understanding is that log2 might have different implementations that yield very slightly different results, or that flags like -ffast-math could affect the output.
Is there a simple way to convert the above calculation to something that is verifiably portable across different architectures (fixed point?), or am I overthinking the precision that is needed (given that I round the answer at the end)?
EDIT: Average is a long and total is an int... so average ends up rounded to the closest second.
DESIRED_BLOCK_TIME_SEC = 30.0 (it's a float) that is #defined
For this kind of calculation to be exact, one must either calculate all the divisions and logarithms exactly -- or one can work backwards.
-round(log2(x)) == round(log2(1/x)), meaning that one of the divisions can be turned around to get (1/x) >= 1.
round(log2(x)) == floor(log2(x * sqrt(2))) == binary_log((int)(x*sqrt(2))).
One minor detail here is whether (double)sqrt(2) rounds down or up. If it rounds up, then there might exist one or more values where x * sqrt(2) == 2^n + epsilon (after rounding), whereas if it rounded down, we would get 2^n - epsilon. One would give the integer value n, the other n-1. Which is correct?
Naturally, the correct one is the one whose ratio to the theoretical midpoint x * sqrt(2) is smaller.
x * sqrt(2) / 2^(n-1) < 2^n / (x * sqrt(2)) -- multiply by x*sqrt(2)
x^2 * 2 / 2^(n-1) < 2^n -- multiply by 2^(n-1)
x^2 * 2 < 2^(2*n-1)
In order for this comparison to be exact, x^2 or pow(x,2) must also be exact on the boundary, and it matters what range the original values are in. A similar analysis can and should be done when expanding x = a/b, so that the inexactness of the division can be mitigated at the cost of possible overflow in the multiplication...
Then again, I wonder how all the other similar applications handle the corner cases, which may not even exist -- and those could be brute force searched assuming that average and total are small enough integers.
EDIT
Because average is an integer, it makes sense to tabulate those exact integer values which are on the boundaries of -round(log2(average/30.0)).
From Octave: d = -round(log2((1:1000000)/30.0)); [1, find(d(2:end) ~= d(1:end-1)) + 1]
1 2 3 6 11 22 43 85 170 340 679 1358 2716
5431 10862 21723 43445 86890 173779 347558 695115
All the averages in [1, 2) -> 5
All the averages in [2, 3) -> 4
All the averages in [3, 6) -> 3
...
All the averages in [43445, 86890) -> -11
int a = find_lower_bound(average, table); // linear or binary search
return 5 - a;
No floating point arithmetic needed
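Here is a minimal C++ sketch of that idea, assuming the boundary table from the Octave output above; the helper name delta_from_table and the std::upper_bound-based search are my own choices, not anything from the original code:

#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Boundary table from the Octave run above: the first 'average' value of each
// bucket of -round(log2(average / 30.0)); the first bucket maps to 5.
static const std::vector<long> kBoundaries = {
    1, 2, 3, 6, 11, 22, 43, 85, 170, 340, 679, 1358, 2716,
    5431, 10862, 21723, 43445, 86890, 173779, 347558, 695115
};

// Integer-only replacement for delta = -round(log2(average / 30.0)),
// where 'average' is already the integer seconds value (average /= total done).
int delta_from_table(long average) {
    assert(average >= 1);
    // index of the bucket that contains 'average'
    int a = static_cast<int>(std::upper_bound(kBoundaries.begin(),
                                              kBoundaries.end(),
                                              average) - kBoundaries.begin()) - 1;
    return 5 - a;
}

int main() {
    // spot-check against the floating point expression
    for (long avg : {1L, 2L, 5L, 30L, 43445L, 86889L}) {
        int fp = static_cast<int>(-std::round(std::log2(avg / 30.0)));
        assert(delta_from_table(avg) == fp);
    }
}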

Calculate and Store Power of very large Number

I am finding pow(2,i) where i can range over 0 <= i <= 100000.
Apart from that I have MOD = 1000000007
const int MOD = 1000000007;
int powers[100001];
powers[0] = 1;
for (int i = 1; i <= 100000; ++i)
{
    powers[i] = (powers[i-1] * 2) % MOD;
}
For i=100000, won't the power value become greater than MOD?
How do I store the power correctly?
The operation doesn't look feasible to me.
I am getting correct values only up to about i=70, I guess.
I have to find sum+= ar[i]*power(2,i) and finally print sum%1000000007 where ar[i] is an additional array with some big numbers up to 10^5
As long as your modulus value is less than half the capacity of your data type, it will never be exceeded. That's because you take the previous value in the range 0..1000000006, double it, then re-modulo it bringing it back to that same range.
However, I can't guarantee that higher values won't cause you troubles, it's more mathematical analysis than I'm prepared to invest given the simple alternative. You could spend a lot of time analysing, checking and debugging, but it's probably better just to not allow the problem to occur in the first place.
The alternative? I'd tend to use the pre-generation method (having a program do the gruntwork up front, inserting the pre-generated values into an array easily and speedily accessible from your real program).
With this method, you can use tools that are well tested and known to work with massive values. Since this data is not going to change, it's useless calculating it every time your program starts.
If you want an easy (and efficient) way to do this, the following bash script in conjunction with bc and awk can do this:
#!/usr/bin/bash
bc >nums.txt <<EOF
i = 1;
for (x = 0;x <= 10000; x++) {
i % 1000000007;
i = i * 2;
}
EOF
awk 'BEGIN { printf "static int array[] = {" }
{ if (NR % 5 == 1) printf "\n ";
printf "%s, ",$0;
next
}
END { print "\n};" }' nums.txt
The bc part is the "meat" of the matter, it creates the large powers of two and outputs them modulo the number you provided. The awk part is simply to format them in C-style array elements, five per line.
Just take the output of that and put it into your code and, voila, there you have it, a compile-time-expensed array that you can use for fast lookup.
It takes only a second and a half on my box to generate the array and then you never need to do it again. You also won't have to concern yourself with the vagaries of modulo math :-)
static int array[] = {
1,2,4,8,16,
32,64,128,256,512,
1024,2048,4096,8192,16384,
32768,65536,131072,262144,524288,
1048576,2097152,4194304,8388608,16777216,
33554432,67108864,134217728,268435456,536870912,
73741817,147483634,294967268,589934536,179869065,
359738130,719476260,438952513,877905026,755810045,
511620083,23240159,46480318,92960636,185921272,
371842544,743685088,487370169,974740338,949480669,
898961331,797922655,595845303,191690599,383381198,
766762396,533524785,67049563,134099126,268198252,
536396504,72793001,145586002,291172004,582344008,
164688009,329376018,658752036,317504065,635008130,
270016253,540032506,80065005,160130010,320260020,
640520040,281040073,562080146,124160285,248320570,
:
861508356,723016705,446033403,892066806,784133605,
568267203,136534399,273068798,546137596,92275185,
184550370,369100740,738201480,476402953,952805906,
905611805,
};
Notice that your modulus can be stored in an int: MOD=1000000007 (decimal) is equivalent to 0b00111011100110101100101000000111 and fits in 32 bits.
i    pow(2,i)     bit representation
0    1            0b00000000000000000000000000000001
1    2            0b00000000000000000000000000000010
2    4            0b00000000000000000000000000000100
3    8            0b00000000000000000000000000001000
...
29   536870912    0b00100000000000000000000000000000
The tricky part starts when pow(2,i) is greater than your MOD=1000000007, but if you know that the current pow(2,i) will be greater than your MOD, you can see what the bits look like after applying MOD:
i    pow(2,i)      pow(2,i)%MOD   bit representation
30   1073741824    73741817       0b000100011001010011000000000000
31   2147483648    147483634      0b001000110010100110000000000000
32   4294967296    294967268      0b010001100101001100000000000000
33   8589934592    589934536      0b100011001010011000000000000000
So if you have pow(2,i-1)%MOD, you can simply do *2 on pow(2,i-1)%MOD (and reduce mod MOD again), instead of ever forming the full pow(2,i) once it would be greater than MOD.
For example, for i=34 you use (589934536*2) % 1000000007 instead of (8589934592*2) % 1000000007, because 8589934592 can't be stored in an int.
Additionally, you can use bit operations instead of multiplication for pow(2,i): multiplying by 2 is the same as a left shift by one bit.
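Putting the pieces together, here is a hedged sketch of the whole computation the question asks for (powers built by repeated doubling mod MOD, then the sum reduced mod MOD as it accumulates; the names pow2 and ar are illustrative):

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    const int64_t MOD = 1000000007;
    const int N = 100001;                  // i = 0 .. 100000

    // Powers of two mod MOD, built by repeated doubling; each value stays
    // below MOD, so doubling never overflows 64 bits (or even 32 bits here).
    std::vector<int64_t> pow2(N);
    pow2[0] = 1;
    for (int i = 1; i < N; ++i)
        pow2[i] = (pow2[i - 1] * 2) % MOD;

    // Example coefficients; in the real problem ar[] comes from input.
    std::vector<int64_t> ar(N, 1);

    int64_t sum = 0;
    for (int i = 0; i < N; ++i)
        sum = (sum + (ar[i] % MOD) * pow2[i]) % MOD;   // each product stays < MOD^2 < 2^63

    std::cout << sum << '\n';
}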

Data structure for fast range searches of dense dataset 4D vectors

I have millions of unstructured 3D vectors associated with arbitrary values, making for a set of 4D vectors. To make it simpler to understand: I have unix timestamps associated with hundreds of thousands of 3D vectors, and I have many timestamps, making for a very large dataset; upwards of 30 million vectors.
I need to search particular datasets for specific timestamps.
So let's say I have the following data:
For time stamp 1407633943:
(0, 24, 58, 1407633943)
(9, 2, 59, 1407633943)
...
For time stamp 1407729456:
(40, 1, 33, 1407729456)
(3, 5, 7, 1407729456)
...
etc etc
And I wish to make a very fast query along the lines of:
Query Example 1:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99
Give me list of those vectors, so I can find the timestamps.
Query Example 2:
Give me vectors between:
X > 4 && X < 9 && Y > -29 && Y < 100 && Z > 0.58 && Z < 0.99 && W (timestamp) = 1407729456
So far I've used SQLite for the task, but even after column indexing, a query takes between 500 ms and 7 s. I'm looking for a solution somewhere in the 50-200 ms per query range.
What sort of structures or techniques can I use to speed the query up?
Thank you.
kd-trees can be helpful here. Range search in a kd-tree is a well-known problem. The time complexity of one query depends on the output size, of course (in the worst case the whole tree is traversed if every vector matches). But it can work pretty fast on average.
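For illustration, a minimal kd-tree range-search sketch (my own, with inclusive bounds for simplicity; the Point, build and query names are just placeholders). Timestamp filtering would be applied to the hits afterwards, or you could keep one tree per timestamp:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical point type for the question's data: 3 coordinates + timestamp.
struct Point { float x, y, z; uint32_t t; };

// A minimal in-place kd-tree: the vector is recursively partitioned by median
// along axes x, y, z in rotation; node [mid] of each sub-range is the median
// element. Build once, then range-query many times.
static void build(std::vector<Point>& pts, int lo, int hi, int axis) {
    if (hi - lo <= 1) return;
    int mid = (lo + hi) / 2;
    auto key = [axis](const Point& p) { return axis == 0 ? p.x : axis == 1 ? p.y : p.z; };
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [&](const Point& a, const Point& b) { return key(a) < key(b); });
    build(pts, lo, mid, (axis + 1) % 3);
    build(pts, mid + 1, hi, (axis + 1) % 3);
}

static void query(const std::vector<Point>& pts, int lo, int hi, int axis,
                  const float lo3[3], const float hi3[3], std::vector<Point>& out) {
    if (hi <= lo) return;
    int mid = (lo + hi) / 2;
    const Point& p = pts[mid];
    float c = axis == 0 ? p.x : axis == 1 ? p.y : p.z;
    // descend only into the halves that can still contain matches
    if (lo3[axis] <= c) query(pts, lo, mid, (axis + 1) % 3, lo3, hi3, out);
    if (c <= hi3[axis]) query(pts, mid + 1, hi, (axis + 1) % 3, lo3, hi3, out);
    if (lo3[0] <= p.x && p.x <= hi3[0] && lo3[1] <= p.y && p.y <= hi3[1] &&
        lo3[2] <= p.z && p.z <= hi3[2])
        out.push_back(p);
}

// Usage: build(data, 0, data.size(), 0); then
//   float lo3[3] = {4, -29, 0.58f}, hi3[3] = {9, 100, 0.99f};
//   std::vector<Point> hits; query(data, 0, data.size(), 0, lo3, hi3, hits);
// Filter hits by p.t afterwards for the "W = timestamp" variant of the query.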
I would use an octree. In each node I would store arrays of vectors in a hash table using the timestamp as a key.
To further increase the performance you can use CUDA, OpenCL, OpenACC, OpenMP and implement the algorithms to be executed in parallel on the GPU or a multi-core CPU.
BKaun: please accept my attempt at giving you some insight into the problem at hand. I suppose you have thought of every one of my points, but maybe seeing them here will help.
Regardless of how the ingest data is presented, consider that, using the C programming language, you can reduce the storage size of the data to minimize space and search time. You would be searching for, loading, and parsing single bits of a vector instead of, say, a short int, which is 2 bytes for every entry, or a float, which is much more. The object, as I understand it, is to search the given data for given values of X, Y, and Z and then find the timestamp associated with these three, while optimizing the search. My solution does not go into the search itself, merely the data that is used in a search.
To illustrate my hints simply, I'm considering that the data consists of 4 vectors:
X between -2 and 7,
Y between 0.17 and 3.08,
Z between 0 and 50,
timestamp (many of same size - 10 digits)
To optimize, consider how many various numbers each vector can have in it:
1. X can be only 10 numbers (including 0)
2. Y can be 3.08 minus 0.17 = 2.91 x 100 = 291 numbers
3. Z can be 51 numbers
4. timestamp can be many (but in this scenario,
you are not searching for a certain one)
Consider how each variable is stored as a binary:
1. Each entry in Vector X COULD be stored in 4 bits, using the first bit=1 for
the negative sign:
7="0111"
6="0110"
5="0101"
4="0100"
3="0011"
2="0010"
1="0001"
0="0000"
-1="1001"
-2="1010"
However, the original data that you are searching through may range
from -10 to 20!
Therefore, adding another 2 bits gives you a table like this:
-10="101010"
-9="101001" ...
...
-2="100010"
-1="100001" ...
...
8="001000"
9="001001" ...
...
19="010011"
20="010100"
And that's only 6 bits to store each X vector entry for integers from -10 to 20
For search purposes on a range of -10 to 20, there are 21 different X Vector entries
possible to search through.
Each entry in Vector Y COULD be stored in 9 bits (no extra sign bit is needed)
The 1's and 0's COULD be stored (accessed, really) in 2 parts
(tens place, and a 2 digit decimal).
Part 1 can be 0, 1, 2, or 3 (4 2-place bits from "00" to "11")
However, if the range of the entire Y dataset is 0 to 10,
part 1 can be 0, 1, ...9, 10 (which is 11 4-place bits
from "0000" to "1010"
Part 2 can be 00, 01,...98, 99 (100 7-place bits from "0000000" to "1100100"
Total storage bits for Vector Y entries is 4 + 7 = 11 bits in the
range 00.00 to 10.99
For search purposes on a range 00.00 to 10.99, there are 1100 different Y Vector
entries possible to search through (11 x 100)
Each entry in Vector Z in the range of 0 to 50 COULD be stored in 6 bits
("000000" to "110010").
Again, the actual data range may be 7 bits long (for simplicity's sake)
0 to 64 ("0000000" to "1000000")
For search purposes on a range of 0 to 64, there are 65 different Z Vector entries
possible to search through.
Consider that you will be storing the data in this optimized format, in a single
succession of bits:
X=4 bits + 2 range bits = 6 bits
+ Y=4 bits part 1 and 7 bits part 2 = 11 bits
+ Z=7 bits
+ timestamp (10 numbers - each from 0 to 9 ("0000" to "1001") 4 bits each = 40 bits)
= TOTAL BITS: 6 + 11 + 7 + 40 = 64 stored bits for each 4D vector
THE SEARCH:
Input xx, yy, zz to search for in arrays X, Y and Z (which are stored in binary)
Change xx, yy, and zz to binary bit strings per optimized format above.
function(xx, yy, zz)
Search for X first, since it has 21 possible outcomes (range is -10 to 10)
- the lowest number of any array
First search for positive targets (there are 8 of them and better chance
of finding one)
These all start with "000"
7="000111"
6="000110"
5="000101"
4="000100"
3="000011"
2="000010"
1="000001"
0="000000"
So you can check if the first 3 bits = "000". If so, you have a number
between 0 and 7.
Found: search for Z
Else search for xx=-2 or -1: does X = -2="100010" or -1="100001" ?
(do second because there are only 2 of them)
Found: Search for Z
NotFound: next X
Search for Z after X is Found: (Z second, since it has 65 possible outcomes
- range is 0 to 64)
You are searching for 6 bits of a 7 bit binary number
("0000000" to "1000000") If bits 1,2,3,4,5,6 are all "0", analyze bit 0.
If it is "1" (it's 64), next Z
Else begin searching 6 bits ("000000" to "110010") with LSB first
Found: Search for Y
NotFound: Next X
Search for Y (Y last, since it has 1089 possible outcomes - range is 0.00 to 10.99)
Search for Part 1 (decimal place) bits (you are searching for
"0000", "0001" or "0011" only, so use yyPt1=YPt1)
Found: Search for Part 2 ("0000000" to "1100100") using yyPt2=YPt2
(direct comparison)
Found: Print out X, Y, Z, and timestamp
NotFound: Search criteria for X, Y, and Z not found in data.
Print X,Y,Z,"timestamp not found". Ask for new X, Y, Z. New search.

What is correct by common sense: (int) blabla * 255.99999999999997 or round(blabla*255)?

Recently I found this interesting thing in webkit sources, related to color conversions (hsl to rgb):
http://osxr.org/android/source/external/webkit/Source/WebCore/platform/graphics/Color.cpp#0111
const double scaleFactor = nextafter(256.0, 0.0); // it's here something like 255.99999999999997
// .. some code skipped
return makeRGBA(static_cast<int>(calcSomethingFrom0To1(blablabla) * scaleFactor),
Same I found here: http://www.filewatcher.com/p/kdegraphics-4.6.0.tar.bz2.5101406/kdegraphics-4.6.0/kolourpaint/imagelib/effects/kpEffectHSV.cpp.html
(int)(value * 255.999999)
Is it correct to use such a technique at all? Why don't they use something straightforward like round(blabla * 255)?
Is this a feature of C/C++? As I see it, strictly speaking it will not always return correct results, in 27 cases out of 100. See the spreadsheet at https://docs.google.com/spreadsheets/d/1AbGnRgSp_5FCKAeNrELPJ5j9zON9HLiHoHC870PwdMc/edit?usp=sharing
Somebody pls explain — I think it should be something basic.
Normally we want to map a real value x in the (closed) interval [0,1] to an integer value j in the range [0 ...255].
And we want to do it in a "fair" way, so that, if the reals are uniformly distributed in the range, the discrete values will be approximately equiprobable: each of the 256 discrete values should get "the same share" (1/256) from the [0,1] interval. That is, we want a mapping like this:
[0 , 1/256) -> 0
[1/256, 2/256) -> 1
...
[254/256, 255/256) -> 254
[255/256, 1] -> 255
We are not much concerned about the transition points [*], but we do want to cover the full range [0,1]. How to accomplish that?
If we simply do j = (int)(x *255): the value 255 would almost never appear (only when x=1); and the rest of the values 0...254 would each get a share of 1/255 of the interval. This would be unfair, regardless of the rounding behaviour at the limit points.
If we instead do j = (int)(x * 256): this partition would be fair, except for a single problem: we would get the value 256 (out of range!) when x=1 [**]
That's why j = (int)(x * 255.9999...) (where 255.9999... is actually the largest double less than 256) will do.
An alternative implementation (also reasonable, almost equivalent) would be
j = (int)(x * 256);
if(j == 256) j = 255;
// j = x == 1.0 ? 255 : (int)(x * 256); // alternative
but this would be more clumsy and probably less efficient.
round() does not help here. For example, j = (int)round(x * 255) would give a 1/255 share to the integers j=1...254 and half that value to the extreme points j=0, j=255.
[*] I mean: we are not extremely interested in what happens in the 'small' neighbourhood of, say, 3/256: rounding might give 2 or 3, it doesn't matter. But we are interested in the extrema: we want to get 0 and 255, for x=0 and x=1, respectively.
[**] The IEEE floating point standard guarantees that there's no rounding ambiguity here: integers admit an exact floating point representation, the product will be exact, and the casting will give always 256. Further, we are guaranteed that 1.0 * z = z.
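A small self-contained sketch (mine, not from the WebKit source) that counts how a fine grid of x values in [0,1] is distributed over the 256 buckets under both schemes, which makes the fairness argument concrete:

#include <array>
#include <cmath>
#include <cstdio>

int main() {
    const double scaleFactor = std::nextafter(256.0, 0.0);  // 255.99999999999997
    std::array<long, 256> viaScale{}, viaRound{};

    const long N = 10000000;                   // uniform grid over [0, 1]
    for (long i = 0; i <= N; ++i) {
        double x = static_cast<double>(i) / N;
        ++viaScale[static_cast<int>(x * scaleFactor)];          // never reaches 256
        ++viaRound[static_cast<int>(std::lround(x * 255.0))];   // round-based mapping
    }
    std::printf("bucket   0: scale %ld  round %ld\n", viaScale[0], viaRound[0]);
    std::printf("bucket   1: scale %ld  round %ld\n", viaScale[1], viaRound[1]);
    std::printf("bucket 255: scale %ld  round %ld\n", viaScale[255], viaRound[255]);
}

With the scale-factor method every bucket gets roughly N/256 samples, while with round() buckets 0 and 255 get only about half as many as the others.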
In general, I'd say (int)(blabla * 255.99999999999997) is more correct than using round().
Why?
Because with round(), 0 and 255 only have "half" the range that 1-254 do. If you round(), then 0 to 0.00196078431 gets mapped to 0, while 0.00196078431 to 0.00588235293 gets mapped to 1. This means that 1 is twice as likely to occur as 0, which is, strictly speaking, an unfair bias.
If, instead, one multiplies by 255.99999999999997 and then floors (which is what casting to an integer does here, since it truncates), then each integer from 0 to 255 is equally likely.
Your spreadsheet might show this better if it counted in fractional percentages (i.e. if it counted by 0.01% instead of 1% each time). I've made a simple spreadsheet to show this. If you look at that spreadsheet, you'll see that 0 is unfairly biased against when round()ing, but with the other method things are fair and equal.
Casting to int truncates toward zero, which for non-negative values has the same effect as the floor function. When you call round, it, well, rounds to the nearest integer.
They do different things, so choose the one you need.

Bit string nearest neighbour searching

I have hundreds of thousands of sparse bit strings of length 32 bits.
I'd like to do a nearest neighbour search on them, and look-up performance is critical. I've been reading up on various algorithms, but they seem to target text strings rather than binary strings. I think either locality-sensitive hashing or spectral hashing seem good candidates, or I could look into compression. Will any of these work well for my bit string problem? Any direction or guidance would be greatly appreciated.
Here's a fast and easy method,
then a variant with better performance at the cost of more memory.
In: array Uint X[], e.g. 1M 32-bit words
Wanted: a function near( Uint q ) --> j with small hammingdist( q, X[j] )
Method: binary search q in sorted X,
then linear search a block around that.
Pseudocode:
def near( q, X, Blocksize=100 ):
    preprocess: sort X
    Uint* p = binsearch( q, X )   # match q in leading bits
    linear-search Blocksize words around p
    return the hamming-nearest of these.
This is fast --
Binary search 1M words
+ nearest hammingdist in a block of size 100
takes < 10 us on my Mac ppc.
(This is highly cache-dependent — your mileage will vary.)
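A hedged C++20 sketch of that method (sort once, std::lower_bound, then scan a block either side with std::popcount); the function name and default block size are illustrative, and it assumes a non-empty array:

#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Approximate nearest-by-Hamming-distance: binary search q in the sorted array
// (which matches q's leading bits as far as possible), then linearly scan
// blocksize entries on each side and keep the closest word found.
uint32_t near(uint32_t q, const std::vector<uint32_t>& sortedX, int blocksize = 100) {
    auto mid = std::lower_bound(sortedX.begin(), sortedX.end(), q);
    auto lo  = (mid - sortedX.begin() > blocksize) ? mid - blocksize : sortedX.begin();
    auto hi  = (sortedX.end() - mid > blocksize) ? mid + blocksize : sortedX.end();

    uint32_t best = *lo;
    int bestDist = std::popcount(q ^ best);
    for (auto it = lo; it != hi; ++it) {
        int d = std::popcount(q ^ *it);          // Hamming distance
        if (d < bestDist) { bestDist = d; best = *it; }
    }
    return best;
}

// Usage (X must be sorted once up front):
//   std::vector<uint32_t> X = ...;
//   std::sort(X.begin(), X.end());
//   uint32_t approxNearest = near(query, X);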
How close does this come to finding the true nearest X[j] ?
I can only experiment, can't do the math:
for 1M random queries in 1M random words,
the nearest match is on average 4-5 bits away,
vs. 3 away for the true nearest (linear scan all 1M):
near32 N 1048576 Nquery 1048576 Blocksize 100
binary search, then nearest +- 50
7 usec
distance distribution: 0 4481 38137 185212 443211 337321 39979 235 0
near32 N 1048576 Nquery 100 Blocksize 1048576
linear scan all 1048576
38701 usec
distance distribution: 0 0 7 58 35 0
Run your data with blocksizes say 50 and 100
to see how the match distances drop.
To get even nearer, at the cost of twice the memory,
make a copy Xswap of X with upper / lower halfwords swapped,
and return the better of
near( q, X, Blocksize )
near( swap q, Xswap, Blocksize )
With lots of memory, one can use many more bit-shuffled copies of X,
e.g. 32 rotations.
I have no idea how performance varies with Nshuffle and Blocksize —
a question for LSH theorists.
(Added): To near-match bit strings of say 320 bits, 10 words,
make 10 arrays of pointers, sorted on word 0, word 1 ...
and search blocks with binsearch as above:
nearest( query word 0, Sortedarray0, 100 ) -> min Hammingdist e.g. 42 of 320
nearest( query word 1, Sortedarray1, 100 ) -> min Hammingdist 37
nearest( query word 2, Sortedarray2, 100 ) -> min Hammingdist 50
...
-> e.g. the 37.
This will of course miss near-matches where no single word is close,
but it's very simple, and sort and binsearch are blazingly fast.
The pointer arrays take exactly as much space as the data bits.
100 words, 3200 bits would work in exactly the same way.
But: this works only if there are roughly equal numbers of 0 bits and 1 bits,
not 99 % 0 bits.
I just came across a paper that addresses this problem.
Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering (Ravichandran et al, 2005)
The basic idea is similar to Denis's answer (sort lexicographically by different permutations of the bits) but it includes a number of additional ideas and further references for articles on the topic.
It is actually implemented in https://github.com/soundcloud/cosine-lsh-join-spark which is where I found it.