How should I store and compute Hamming distance between binary codes?

How can I efficiently store binary codes? For certain fixed sizes, such as 32 bits, there are primitive types that can be used. But what if my binary codes are much longer?
What is the fastest way to compute the Hamming distance between two binary codes?

Use std::bitset<N>, defined in the <bitset> header, where N is the number of bits (not bytes).
Compute the Hamming distance between two binary codes a and b using (a ^ b).count().
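As a minimal sketch of the answer above (the helper name hamming_distance and the string-based overload are my own additions for illustration, not part of the answer), the XOR-and-count idea could look like this in C++11:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hamming distance between two N-bit codes: XOR marks the differing
// positions, count() returns how many bits are set.
template <std::size_t N>
std::size_t hamming_distance(const std::bitset<N>& a, const std::bitset<N>& b) {
    return (a ^ b).count();
}

// Convenience overload (an assumption for illustration): the codes arrive
// as strings of exactly N characters, each '0' or '1'.
template <std::size_t N>
std::size_t hamming_distance(const std::string& a, const std::string& b) {
    assert(a.size() == N && b.size() == N);
    return hamming_distance(std::bitset<N>(a), std::bitset<N>(b));
}
```

N has to be a compile-time constant with std::bitset, so this only fits codes whose length is known when the program is built.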

Related

How to parse the division of two very large numbers into a double?

I have a geometric algorithm which takes as input a polygon. However, the files I am supposed to use as input files store the coordinates of the polygons in a rather peculiar way. Each file consists of one line, a counterclockwise sequence of the vertices. Each vertex is represented by its x and y coordinates each of which is written as the quotient of two integers int/int. However, these integers are incredibly large. I wrote a program that parses them from a string into long long using the function std::stoll. However, it appears that some of the numbers in the input file are larger than 2^64.
The output coordinates are usually quite small, in the range 0-1000. How do I go about parsing these numbers and then dividing them, obtaining doubles? Is there any standard library way of doing this, or should I use something like the boost library?

If you are after the ratio of two large numbers given as strings, you can shorten the strings:
"194725681173571753193674" divided by "635482929374729202" is the same as
"1947256811735717" divided by "6354829293" to at least 9 digits (I just removed the same number of trailing digits, 8, from both). Depending on the needed precision, this might be the simplest solution. Just remove digits before converting to long long.
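A rough sketch of that truncation idea, assuming non-negative inputs without leading zeros and a quotient of moderate size (as in the question, where results are in the range 0-1000). The function name approx_ratio and the choice of keeping 18 digits are mine, not from the answer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Approximates num/den for decimal digit strings by dropping the same
// number of trailing digits from both, so that each fits in a long long.
double approx_ratio(std::string num, std::string den) {
    assert(!num.empty() && !den.empty());
    const std::size_t keep = 18;  // any 18-digit decimal number fits in long long
    const std::size_t longest = std::max(num.size(), den.size());
    const std::size_t drop = longest > keep ? longest - keep : 0;

    // Drop the same number of low-order digits from both strings.
    num = num.size() > drop ? num.substr(0, num.size() - drop) : "0";
    den = den.size() > drop ? den.substr(0, den.size() - drop) : "0";

    const long long n = std::stoll(num);
    const long long d = std::stoll(den);
    assert(d != 0);  // the denominator must keep at least one nonzero digit
    return static_cast<double>(n) / static_cast<double>(d);
}
```

The shorter of the two numbers keeps fewer than 18 digits, so the precision of the result depends on how different the two lengths are.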

You can parse the inputs directly into a long double I believe. However, that approach will introduce precision errors. If precision is important, then avoid this.
A general solution for precise results is to represent the large integer with an array of integers, where one integer holds the lowest-order bytes, the next integer holds the next-higher bytes, and so on. This is generally called arbitrary precision arithmetic.
"Is there any standard library way of doing this"
No, other than basic building blocks such as vector for storing the array.
"or should I use something like the boost library?"
That's often a good place to start. Boost happens to have a library for this.
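For the first suggestion (parsing straight into long double), a minimal sketch could use std::stold from <string>. The function name is hypothetical, and as noted above each operand is rounded to the precision of long double before the division, so this is only suitable when that loss is acceptable:

```cpp
#include <cassert>
#include <string>

// Parses both decimal strings as long double and divides them.
// Each operand is rounded to long double precision on parsing.
long double ratio_via_long_double(const std::string& num, const std::string& den) {
    const long double n = std::stold(num);
    const long double d = std::stold(den);
    assert(d != 0.0L);
    return n / d;
}
```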

convert very big int (written as string) to binary string in c/c++

I have a number in base 10 which has around 10k digits. I want to convert it into base 2 (1010101001...). All I can think of is a primitive algorithm:
take last digit mod 2 -> write down bit
number divide by 2;
It shouldn't be hard to implement primary school division on a string, but I think it is very inefficient. If I'm right, it will be O(l^2), where l is the length of the number in base 10. Can that be done faster?

From what I understand you have your big number represented as a sequence of decimal digits. If that is so, you can compute a "binary" representation using multiplication and addition:
value = sum(i in 0...n-1) 10^i * digit_i
This computation can be split into parts in a divide-and-conquer way, although I'm not sure if you can arrive at an O(n log n) algorithm.
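A straightforward way to evaluate that sum is Horner's rule: walk the decimal digits from most significant to least and repeatedly compute value = value * 10 + digit on a number that is already stored in binary. The sketch below does this with base-2^32 limbs; the function name and the limb layout are my own choices, and this simple version is still quadratic, not the divide-and-conquer variant:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Converts a non-negative decimal digit string to a binary digit string.
std::string decimal_to_binary(const std::string& dec) {
    assert(!dec.empty());
    std::vector<std::uint32_t> limbs{0};  // little-endian, base 2^32

    for (char c : dec) {
        assert(c >= '0' && c <= '9');
        // value = value * 10 + digit, propagating the carry through the limbs.
        std::uint64_t carry = static_cast<std::uint64_t>(c - '0');
        for (std::uint32_t& limb : limbs) {
            const std::uint64_t cur = static_cast<std::uint64_t>(limb) * 10 + carry;
            limb = static_cast<std::uint32_t>(cur);  // keep the low 32 bits
            carry = cur >> 32;
        }
        if (carry != 0) limbs.push_back(static_cast<std::uint32_t>(carry));
    }

    // Emit bits from the most significant limb down, then strip leading zeros.
    std::string bits;
    for (std::size_t i = limbs.size(); i-- > 0;) {
        for (int b = 31; b >= 0; --b) {
            bits.push_back(((limbs[i] >> b) & 1u) ? '1' : '0');
        }
    }
    const std::size_t first_one = bits.find('1');
    return first_one == std::string::npos ? "0" : bits.substr(first_one);
}
```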

If you are working with big numbers, I really suggest you use a multi-precision library. Try GMP or MPFR or something similar. -Øystein

Division by 2 is the same as multiplication by 1/2. For the latter you can use some of the well-known fast multiplication algorithms (Toom–Cook, Schönhage–Strassen, etc.).

Fastest way to add numbers in a very large arithmetic series?

I'm trying to minimize overhead as much as possible when adding numbers in an arithmetic series. I'm talking about a very large set, such as from 1 to 2^128. Is there any fast way of doing this? If so, what would it be without actually using the arithmetic sequence sum formula? Just as a reference, the sum from 1 to 2^128 - 1 is:
57896044618658097711785492504343953926464851149359812787997104700240680714240

The only fast way is to use the formula:
n * (n+1) / 2
Any other method (adding naively) will take way too long! (Even if you had a million years on a supercomputer, you wouldn't finish the calculation).
For such a large integer though, you cannot use normal integers. You will need to use a big integer object. So get a big integer library (a Google search turns up several), e.g. https://mattmccutchen.net/bigint/.
Note: a 256-bit integer may be able to hold results up to around that scale, but it is quite platform and compiler-dependent, as to whether 256-bit integers are readily available, and how they are used.
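To illustrate the formula without pulling in a library, here is a sketch that works for 64-bit bounds and returns a 128-bit result. It relies on unsigned __int128, which is a GCC/Clang extension rather than standard C++, and it generalizes n * (n+1) / 2 to an arbitrary range as count * (first + last) / 2. The function name is hypothetical. For bounds as large as 2^128 the same arithmetic would have to be done with a big integer type instead:

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;  // compiler extension (GCC/Clang), not standard C++

// Sum of all integers in [first, last], computed in constant time.
u128 arithmetic_sum(std::uint64_t first, std::uint64_t last) {
    assert(first <= last);
    u128 count = static_cast<u128>(last) - first + 1;  // number of terms
    u128 ends = static_cast<u128>(first) + last;       // first term + last term
    // Exactly one of the two factors is even; halve it before multiplying
    // so the intermediate product cannot overflow 128 bits.
    if (count % 2 == 0) {
        count /= 2;
    } else {
        ends /= 2;
    }
    return count * ends;
}
```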

Hamming distance and CRC

How do I find the Hamming distance of a code generated by a certain CRC?
Assume that I have a generating polynomial of order, say, 4 and 11 bits of data.
How can I compute the HD based only on this information?

You should be able to pad your results with zeros, making both values 11 bits long. Computing an XOR on the two bit strings and counting the ones should yield the Hamming distance for your data set.
Hope this helps...
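The XOR-and-count step above gives the distance between two specific words. If what is wanted is the minimum Hamming distance of the whole code, one approach that is feasible at this size (this is not from the answer above, just a brute-force sketch) uses the fact that a plain CRC is a linear code, so the minimum distance equals the smallest number of set bits in any nonzero codeword. The sketch assumes a CRC with zero initial value and no final XOR; the function names and the example polynomial x^4 + x + 1 are my own choices:

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Remainder of data(x) * x^r divided by gen(x) over GF(2).
// gen has degree r and is given with its top bit included (e.g. 0b10011).
std::uint32_t crc_remainder(std::uint32_t data, int data_bits, std::uint32_t gen, int r) {
    std::uint32_t reg = data << r;
    for (int bit = data_bits + r - 1; bit >= r; --bit) {
        if (reg & (1u << bit)) reg ^= gen << (bit - r);  // cancel the leading term
    }
    return reg;  // only the low r bits can still be set
}

// Minimum Hamming distance of the code {data bits followed by CRC bits},
// found by enumerating every nonzero data word.
int crc_min_distance(std::uint32_t gen, int r, int data_bits) {
    assert(r >= 1 && data_bits >= 1 && data_bits + r <= 24);
    assert((gen >> r) == 1u);  // gen must have degree exactly r
    int best = data_bits + r;
    for (std::uint32_t data = 1; data < (1u << data_bits); ++data) {
        const std::uint32_t codeword = (data << r) | crc_remainder(data, data_bits, gen, r);
        const int weight = static_cast<int>(std::bitset<32>(codeword).count());
        if (weight < best) best = weight;
    }
    return best;
}
```

For example, crc_min_distance(0x13, 4, 11) evaluates the generator x^4 + x + 1 with 11 data bits. The enumeration grows as 2^data_bits, so this only works for short messages.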

How do I compress a large number of similar doubles?

I want to store billions (10^9) of double precision floating point numbers in memory and save space. These values are grouped in thousands of ordered sets (they are time series), and within a set, I know that the difference between values is usually not large (compared to their absolute value). Also, the closer to each other, the higher the probability of the difference being relatively small.
A perfect fit would be a delta encoding that stores only the difference of each value to its predecessor. However, I want random access to subsets of the data, so I can't depend on going through a complete set in sequence. I'm therefore using deltas to a set-wide baseline that yields deltas which I expect to be within 10 to 50 percent of the absolute value (most of the time).
I have considered the following approaches:
divide the smaller value by the larger one, yielding a value between 0 and 1 that could be stored as an integer of some fixed precision plus one bit for remembering which number was divided by which. This is fairly straightforward and yields satisfactory compression, but is not a lossless method and thus only a secondary choice.
XOR the IEEE 754 binary64 encoded representations of both values and store the length of the long stretches of zeroes at the beginning of the exponent and mantissa plus the remaining bits which were different. Here I'm quite unsure how to judge the compression, although I think it should be good in most cases.
Are there standard ways to do this? What might be problems about my approaches above? What other solutions have you seen or used yourself?

Rarely are all the bits of a double-precision number meaningful.
If you have billions of values that are the result of some measurement, find the calibration and error of your measurement device. Quantize the values so that you only work with meaningful bits.
Often, you'll find that you only need 16 bits of actual dynamic range. You can probably compress all of this into arrays of "short" that retain all of the original input.
Use a simple "Z-score technique" where every value is really a signed fraction of the standard deviation.
So a sequence of samples with a mean of m and a standard deviation of s gets transformed into a bunch of Z-scores. Normal Z-score transformations use a double, but you should use a fixed-point version of that double, with a step of s/1000 or s/16384 or something that retains only the actual precision of your data, not the noise bits on the end.
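# encode: quantize each sample to a fixed-point Z-score
scaled_samples = [int(16384 * (u - m) / s) for u in samples]
# decode: recover an approximation of each original sample
restored = [s * (z / 16384.0) + m for z in scaled_samples]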
Your Z-scores retain a pleasant easy-to-work with statistical relationship with the original samples.
Let's say you use a signed 16-bit Z-score. You have +/- 32,768. Scale this by 16,384 and your Z-scores have an effective resolution of 0.000061 decimal.
If you use a signed 24-bit Z-score, you have +/- 8 million. Scale this by 4,194,304 and you have a resolution of 0.00000024.
I seriously doubt you have measuring devices this accurate. Further, any arithmetic done as part of filter, calibration or noise reduction may reduce the effective range because of noise bits introduced during the arithmetic. A badly thought-out division operator could make a great many of your decimal places nothing more than noise.
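A C++ sketch of the 16-bit variant described above, under a few assumptions of mine: the struct and function names are hypothetical, values are rounded to the nearest step rather than truncated, and values that fall outside the representable range are saturated. With a scale of 16,384 a signed 16-bit Z-score only covers about +/- 2 standard deviations, so outliers beyond that would need a wider type or a smaller scale:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr double kScale = 16384.0;  // fixed-point steps per standard deviation

struct ZScoreSeries {
    double mean;
    double stddev;
    std::vector<std::int16_t> z;  // one 16-bit fixed-point Z-score per sample
};

// Lossy encoding: each sample becomes a quantized multiple of stddev / kScale.
ZScoreSeries encode(const std::vector<double>& samples) {
    assert(!samples.empty());
    double mean = 0.0;
    for (double u : samples) mean += u;
    mean /= static_cast<double>(samples.size());

    double variance = 0.0;
    for (double u : samples) variance += (u - mean) * (u - mean);
    variance /= static_cast<double>(samples.size());
    double stddev = std::sqrt(variance);
    if (stddev == 0.0) stddev = 1.0;  // constant series: every Z-score is 0

    ZScoreSeries out{mean, stddev, {}};
    out.z.reserve(samples.size());
    for (double u : samples) {
        double scaled = std::round(kScale * (u - mean) / stddev);
        scaled = std::max(-32768.0, std::min(32767.0, scaled));  // saturate outliers
        out.z.push_back(static_cast<std::int16_t>(scaled));
    }
    return out;
}

// Recovers an approximation of the original samples.
std::vector<double> decode(const ZScoreSeries& series) {
    std::vector<double> out;
    out.reserve(series.z.size());
    for (std::int16_t z : series.z) {
        out.push_back(series.stddev * (z / kScale) + series.mean);
    }
    return out;
}
```

This cuts storage from 8 bytes to 2 bytes per value, at the cost of a reconstruction error of up to half a step (stddev / 32768) for values inside the representable range.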

Whatever compression scheme you pick, you can decouple it from the need to perform arbitrary seeks by compressing into fixed-size blocks and prepending to each block a header containing all the data required to decompress it. For a delta encoding scheme, for example, the header would hold a single uncompressed sample, and the block would contain deltas encoded in some fashion that takes advantage of their small magnitude to make them take less space (fewer bits for exponent/mantissa, conversion to a fixed-point value, Huffman encoding, etc.). Seeking then becomes a matter of cheaply selecting the appropriate block, then decompressing it.
If the compression ratio is so variable that much space is being wasted padding the compressed data to produce fixed size blocks, a directory of offsets into the compressed data could be built instead and the state required to decompress recorded in that.
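A toy sketch of the block idea with delta encoding, assuming the samples have already been converted to fixed-point integers and that every delta inside a block fits in 16 bits. The names Block, kBlockSize, compress and get are hypothetical, and the tiny block size is only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 4;  // samples per block (tiny, for illustration)

// Header: the first sample, stored uncompressed.
// Body: 16-bit deltas, each relative to the previous sample in the block.
struct Block {
    std::int64_t first;
    std::vector<std::int16_t> deltas;
};

std::vector<Block> compress(const std::vector<std::int64_t>& samples) {
    std::vector<Block> blocks;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (i % kBlockSize == 0) {
            blocks.push_back(Block{samples[i], {}});  // start a new block
        } else {
            const std::int64_t d = samples[i] - samples[i - 1];
            assert(d >= INT16_MIN && d <= INT16_MAX);  // assumption: deltas are small
            blocks.back().deltas.push_back(static_cast<std::int16_t>(d));
        }
    }
    return blocks;
}

// Random access: pick the block by index, then decode only within that block.
std::int64_t get(const std::vector<Block>& blocks, std::size_t index) {
    assert(index / kBlockSize < blocks.size());
    const Block& block = blocks[index / kBlockSize];
    const std::size_t offset = index % kBlockSize;
    assert(offset <= block.deltas.size());
    std::int64_t value = block.first;
    for (std::size_t k = 0; k < offset; ++k) value += block.deltas[k];
    return value;
}
```

The block size is the trade-off knob: larger blocks amortize the header better, smaller blocks make each lookup cheaper.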

If you know a group of doubles has the same exponent, you could store the exponent once, and only store the mantissa for each value.
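To check whether a group of doubles shares an exponent, the fields of each value have to be pulled apart first. The sketch below assumes the platform's double is IEEE 754 binary64 (1 sign bit, 11 exponent bits, 52 mantissa bits); the struct and function names are mine. With a shared exponent, each value would then need only its 52-bit mantissa plus a sign bit:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");

constexpr std::uint64_t kMantissaMask = (std::uint64_t{1} << 52) - 1;

struct DoubleParts {
    bool negative;           // sign bit
    std::uint16_t exponent;  // 11-bit biased exponent
    std::uint64_t mantissa;  // 52-bit fraction field
};

// Splits a double into its raw IEEE 754 binary64 fields.
DoubleParts split(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bits
    return {(bits >> 63) != 0,
            static_cast<std::uint16_t>((bits >> 52) & 0x7FF),
            bits & kMantissaMask};
}

// Reassembles the exact original double from its fields.
double join(const DoubleParts& parts) {
    assert(parts.exponent < 2048 && parts.mantissa <= kMantissaMask);
    const std::uint64_t bits = (parts.negative ? (std::uint64_t{1} << 63) : 0) |
                               (static_cast<std::uint64_t>(parts.exponent) << 52) |
                               parts.mantissa;
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Values share an exponent only when they lie in the same power-of-two interval, for example 8.0 and 15.99, so how often this applies depends on the data.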
