I hope some guru here can help me out.
I am writing C/C++ code to implement consistent hashing, using SHA-1 as the hashing algorithm.
I need to implement the modulo operation as follows:
0100 0011....0110 100 mod 0010 1001 = ?
If the divisor is a power of two, pow(2,n), then it is easy: the last n bits of the dividend are the result (e.g. for n = 4, just mask with 0000 1111).
A SHA-1 digest is 160 bits long (40 hex digits). My question is: how do I implement the modulo operation of a long bit string by another arbitrary bit string?
Thank you.
As user stark points out (more rudely than necessary, I think) in comments, this is just long division in base 2^32. Or in base 2, with more digits. And you can use the algorithm for long division that you learned in school for base 10.
There are fancier algorithms for doing division on much bigger numbers, but for your application I suspect you can afford to do it very simply and inefficiently, along the following lines (this is the base-2 version, which just amounts to subtracting off left-shifted versions of the denominator until you can no longer do so):
// Treat x, y as 160-bit numbers, where [0] is the least significant word
// and [4] the most significant. Compute x mod y, and put the result in out.
void mod160(unsigned int out[5], const unsigned int x[5], const unsigned int y[5]) {
    unsigned int temp[5];
    copy160(out, x);
    copy160(temp, y);
    int n = 0;
    // Find the position of the first 1-bit of the quotient.
    while (lessthanhalf160(temp, x)) { lshift160(temp); ++n; }
    while (n >= 0) {
        if (!less160(out, temp)) sub160(out, temp); // quotient bit is 1
        rshift160(temp); --n;                       // proceed to next quotient bit
    }
}
For the avoidance of doubt, the above is only a crude sketch:
It may be full of bugs.
I haven't written the implementations of the building-block functions like less160; a possible sketch of them follows this list.
In practice you'd probably just put the code for those inline rather than having separate functions. E.g., copy160 is just five assignments, or a short loop, or a memcpy.
It could surely be made more efficient; e.g., that first step might do better to count leading 0-bits and then do a single shift, instead of shifting one place at a time. (The right-shifting loop probably shouldn't, though, because half the time you will be doing only a single shift.)
The "base-2^32" version may well be faster, but the implementation will be a bit more complicated.
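For concreteness, here is one possible sketch of those building blocks, in the same spirit as the code above (untested, assuming 32-bit unsigned ints with word [0] least significant; the names simply match the pseudocode):

#include <string.h>

// Copy a 160-bit number.
static void copy160(unsigned int dst[5], const unsigned int src[5]) {
    memcpy(dst, src, 5 * sizeof(unsigned int));
}

// Return nonzero iff a < b.
static int less160(const unsigned int a[5], const unsigned int b[5]) {
    for (int i = 4; i >= 0; --i)        // compare from the most significant word down
        if (a[i] != b[i]) return a[i] < b[i];
    return 0;                           // equal
}

// a -= b, assuming a >= b.
static void sub160(unsigned int a[5], const unsigned int b[5]) {
    unsigned int borrow = 0;
    for (int i = 0; i < 5; ++i) {
        unsigned int d = a[i] - b[i] - borrow;
        borrow = (a[i] < b[i]) || (a[i] == b[i] && borrow); // did this word wrap?
        a[i] = d;
    }
}

// Shift left by one bit.
static void lshift160(unsigned int a[5]) {
    for (int i = 4; i > 0; --i)
        a[i] = (a[i] << 1) | (a[i - 1] >> 31);
    a[0] <<= 1;
}

// Shift right by one bit.
static void rshift160(unsigned int a[5]) {
    for (int i = 0; i < 4; ++i)
        a[i] = (a[i] >> 1) | (a[i + 1] << 31);
    a[4] >>= 1;
}

// "Is temp no more than half of x?", i.e. can temp be shifted left once
// more without overflowing 160 bits and still satisfy 2*temp <= x?
static int lessthanhalf160(const unsigned int temp[5], const unsigned int x[5]) {
    if (temp[4] >> 31) return 0;        // shifting would overflow 160 bits
    unsigned int t2[5];
    copy160(t2, temp);
    lshift160(t2);
    return !less160(x, t2);             // 2*temp <= x
}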
I would like to write a program that can compute integers having more than 2000 or 20000 digits (for Pi's decimals). I would like to do it in C++, without any libraries (no big-integer, Boost, ...). Can anyone suggest a way of doing it? Here are my thoughts:
using const char*, for holding the integer's digits;
representing the number like
( (1 * 10 + x) * 10 + x )...
The obvious answer works along these lines:
class integer {
    bool negative;
    std::vector<std::uint64_t> data;
};
Where the number is represented as a sign flag and an (unsigned) base-2**64 magnitude.
This means the absolute value of your number is:
data[0] + (data[1] << 64) + (data[2] << 128) + ....
Or, in other terms, you represent your number as a little-endian string of words, each as large as your target machine can reasonably work with. I chose 64-bit integers because you minimize the number of individual word operations that way (on an x64 machine).
To implement addition, you use a concept you learned in elementary school:

      a b
  +   x y
  -----------------------------------------
  (a+x+carry) (b+y reduced to one digit's length)
The reduction (modulo 2**64) happens automatically, and the carry can only ever be either zero or one. All that remains is to detect a carry, which is simple:
bool next_carry = false;
if ((x += y) < y) next_carry = true;           // unsigned wrap-around means a carry
if (prev_carry && ++x == 0) next_carry = true; // propagating the carry can wrap too
Subtraction can be implemented similarly using a borrow instead.
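Putting the pieces together, a minimal sketch of whole-magnitude addition for this representation could look like the following (add_magnitudes is an illustrative name, and sign handling is deliberately left out):

#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<std::uint64_t> add_magnitudes(const std::vector<std::uint64_t>& a,
                                          const std::vector<std::uint64_t>& b)
{
    std::vector<std::uint64_t> out;
    bool carry = false;
    const std::size_t n = std::max(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t x = i < a.size() ? a[i] : 0; // treat missing digits as 0
        std::uint64_t y = i < b.size() ? b[i] : 0;
        std::uint64_t sum = x + y;                 // reduction mod 2**64 is automatic
        bool c = sum < x;                          // wrap-around means a carry
        if (carry && ++sum == 0) c = true;         // the incoming carry can wrap too
        out.push_back(sum);
        carry = c;
    }
    if (carry) out.push_back(1);                   // a final carry becomes a new digit
    return out;
}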
Note that getting anywhere close to the performance of e.g. libgmp is... unlikely.
A long integer is usually represented by a sequence of digits (see positional notation). For convenience, use the little-endian convention: A[0] is the lowest digit, A[n-1] the highest. In the general case your number equals sum(A[i] * base^i) for some value of base.
The simplest value for base is ten, but it is not efficient. If you often need to print your answer for the user, you'd better use a power of ten as the base. For instance, you can use base = 10^9 and store each digit in an int32. If you want maximal speed, use a power-of-two base instead. For instance, base = 2^32 is the best possible base for a 32-bit compiler (however, you'll need assembly to make it work optimally).
There are two ways to represent negative integers. The first is to store the integer as a sign plus a digit sequence; in this case you'll have to handle all the different-sign cases yourself. The other option is to use a complement form, which works for both power-of-two and power-of-ten bases.
Since the length of the sequence may vary, you'd better store the digit sequence in a std::vector. Do not forget to remove leading zeroes in this case. An alternative is to always store a fixed number of digits (a fixed-size array).
The operations are implemented in a pretty straightforward way: just as you did them in school =)
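To illustrate the printing convenience of a power-of-ten base mentioned above, here is a hedged sketch for base = 10^9 with little-endian digits (print_bignum is an illustrative name; it assumes the vector is non-empty and has no leading zero digits):

#include <cstdio>
#include <vector>

const int BASE = 1000000000; // 10^9: nine decimal digits per stored digit

void print_bignum(const std::vector<int>& a) { // a[0] is the lowest digit
    std::printf("%d", a.back());               // top digit, no zero padding
    for (int i = (int)a.size() - 2; i >= 0; --i)
        std::printf("%09d", a[i]);             // lower digits padded to 9 places
    std::printf("\n");
}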
P.S. Alternatively, each integer (of bounded length) can be represented by its remainders for a set of different prime moduli, thanks to the CRT. Such a representation supports only a limited set of operations, and requires a nontrivial conversion if you want to print it.
I have implemented the Karatsuba multiplication algorithm for my educational goals. Now I am looking for further improvements. I have implemented some kind of long arithmetic, and it works well as long as the base of the integer representation is no more than 100.
With base 10, compiling with clang++ -O3, multiplication of two random integers in the range [10^50000, 10^50001] takes:
Naive algorithm took me 1967 cycles (1.967 seconds)
Karatsuba algorithm took me 400 cycles (0.4 seconds)
And the same numbers with base 100:
Naive algorithm took me 409 cycles (0.409 seconds)
Karatsuba algorithm took me 140 cycles (0.14 seconds)
Is there a way to improve these results?
Right now I use this function to finalize my result:
void finalize(vector<int>& res) {
    // note: res must have one spare high digit to absorb the final carry
    for (size_t i = 0; i + 1 < res.size(); ++i) {
        res[i + 1] += res[i] / base;
        res[i] %= base;
    }
}
As you can see, at each step it computes the carry and pushes it to the next digit. But if I take base >= 1000, the result overflows.
As you can see in my code, I use a vector of int to represent a long integer. According to my base, a number is divided into separate parts of the vector.
Now I see several options (a sketch of the first follows this list):
use long long as the vector's value type, but it might also overflow for integers of vast length
implement a representation of the carry in the long arithmetic
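For the first option, a minimal sketch of what accumulating in 64 bits could look like (multiply_naive is an illustrative name, not your code; note that with base = 10^7 each digit product is below 10^14, so even several thousand column summands stay far below 2^63):

#include <cstdint>
#include <vector>

std::vector<int> multiply_naive(const std::vector<int>& a,
                                const std::vector<int>& b, int base)
{
    std::vector<long long> acc(a.size() + b.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            acc[i + j] += (long long)a[i] * b[j]; // no per-step carry needed
    std::vector<int> res(acc.size());
    long long carry = 0;
    for (std::size_t i = 0; i < acc.size(); ++i) { // single carry pass at the end
        long long cur = acc[i] + carry;
        res[i] = (int)(cur % base);
        carry = cur / base;
    }
    return res; // may contain leading zero digits
}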
After I saw some comments, I decided to expand on the issue. Assume that we want to represent our long integer as a vector of ints. For instance:
ULLONG_MAX = 18446744073709551615
And for input we pass the 210th Fibonacci number, 34507973060837282187130139035400899082304280, which does not fit into any standard type. If we represent it in a vector of int with base 10000000, it will look like:
v[0]: 2304280
v[1]: 89908
v[2]: 1390354
v[3]: 2187130
v[4]: 6083728
v[5]: 5079730
v[6]: 34
And when we do the multiplication we may get (for simplicity, let it be two identical numbers)
(34507973060837282187130139035400899082304280)^2:
v[0] * v[0] = 5309706318400
...
v[0] * v[4] = 14018612755840
...
That was only the first row, and we have to do six steps like that. Certainly, some step will cause an overflow during the multiplication or after the carry calculation.
If I missed something, please let me know and I will change it.
If you want to see the full version, it is on my GitHub.
Base 2^64 and base 2^32 are the most popular bases for doing high precision arithmetic. Usually, the digits are stored in an unsigned integral type, because they have well-behaved semantics with regard to overflow.
For example, one can detect the carry from an addition as follows:
uint64_t x, y; // initialize somehow
uint64_t sum = x + y;
uint64_t carry = sum < x; // 1 if true, 0 if false
Also, assembly languages usually have a few "add with carry" instructions; if you can write inline assembly (or have access to intrinsics) you can take advantage of these.
For multiplication, most computers have machine instructions that compute a one-machine-word -> two-machine-word product; sometimes the instructions to get the two halves are called "multiply hi" and "multiply low". Getting at them normally takes assembly, although many compilers offer larger integer types whose use lets you access these instructions: e.g. in gcc you can implement multiply hi as
uint64_t mulhi(uint64_t x, uint64_t y)
{
    return ((__uint128_t) x * y) >> 64;
}
When people can't use this, they do their multiplication in base 2^32 instead, so that they can use the same approach to implement a portable mulhi, using uint64_t as the double-digit type.
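That portable variant might look like this (a sketch; mul32 is an illustrative name):

#include <stdint.h>

// Multiply two base-2^32 digits, returning both halves of the product.
static inline void mul32(uint32_t x, uint32_t y, uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)x * y; // exact 32x32 -> 64-bit product
    *lo = (uint32_t)p;            // low digit of the product
    *hi = (uint32_t)(p >> 32);    // high digit (the carry into the next column)
}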
If you want to write efficient code, you really need to take advantage of these bigger multiply instructions. Multiplying digits in base 2^32 is more than ninety times more powerful than multiplying digits in base 10 (one 32-bit word holds about 9.6 decimal digits, and schoolbook multiplication scales quadratically), and multiplying digits in base 2^64 is four times more powerful than that. And your computer can probably do these more quickly than whatever you implement for base-10 multiplication.
When reading Lua's source code, I noticed that Lua uses a macro to round double values to 32-bit int values. The macro is defined in the llimits.h header file and reads as follows:
union i_cast { double d; int i[2]; };

#define double2int(i, d, t) \
    { volatile union i_cast u; u.d = (d) + 6755399441055744.0; \
      (i) = (t)u.i[ENDIANLOC]; }
Here ENDIANLOC is defined according to endianness: 0 for little endian, 1 for big endian architectures; Lua carefully handles endianness. The t argument is substituted with an integer type like int or unsigned int.
I did a little research and found that there is a simpler form of that macro which uses the same technique:
#define double2int(i, d) \
    { double t = ((d) + 6755399441055744.0); i = *((int *)(&t)); }
Or, in a C++-style:
inline int double2int(double d)
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}
This trick can work on any machine using IEEE 754 (which means pretty much every machine today). It works for both positive and negative numbers, and the rounding follows Banker’s Rule. (This is not surprising, since it follows IEEE 754.)
I wrote a little program to test it:
#include <stdio.h>

int main()
{
    double d = -12345678.9;
    int i;
    double2int(i, d);
    printf("%d\n", i);
    return 0;
}
And it outputs -12345679, as expected.
I would like to understand how this tricky macro works in detail. The magic number 6755399441055744.0 is actually 2^51 + 2^52, or 1.5 × 2^52, and 1.5 in binary can be represented as 1.1. When any 32-bit integer is added to this magic number—
Well, I’m lost from here. How does this trick work?
Update
As @Mysticial points out, this method does not limit itself to a 32-bit int; it can also be expanded to a 64-bit int as long as the number is in the range of 2^52. (Although the macro needs some modification.)
Some materials say this method cannot be used in Direct3D.
When working with Microsoft assembler for x86, there is an even faster macro written in assembly code (the following is also extracted from Lua source):
#define double2int(i,n) __asm {__asm fld n __asm fistp i}
There is a similar magic number for single-precision numbers: 1.5 × 2^23.
A value of the double floating-point type is represented like so (1 sign bit, 11 exponent bits, and 52 mantissa bits, from most to least significant):

[ sign | exponent (11 bits) | mantissa (52 bits) ]

and it can be seen as two 32-bit integers; the int taken in all the versions of your code (supposing it's a 32-bit int) is the lower one, so what you are doing in the end is just taking the lowest 32 bits of the mantissa.
Now, to the magic number; as you correctly stated, 6755399441055744 is 2^51 + 2^52; adding such a number forces the double to go into the "sweet range" between 2^52 and 2^53, which, as explained by Wikipedia, has an interesting property:
Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992, the representable numbers are exactly the integers.
This follows from the fact that the mantissa is 52 bits wide.
The other interesting fact about adding 2^51 + 2^52 is that it affects the mantissa only in the two highest bits—which are discarded anyway, since we are taking only its lowest 32 bits.
Last but not least: the sign.
IEEE 754 floating point uses a magnitude and sign representation, while integers on “normal” machines use 2’s complement arithmetic; how is this handled here?
We talked only about positive integers; now suppose we are dealing with a negative number in the range representable by a 32-bit int, so less (in absolute value) than 2^31; call it −a. Such a number is obviously made positive by adding the magic number, and the resulting value is 2^52 + 2^51 + (−a).
Now, what do we get if we interpret the mantissa in 2's complement representation? It must be the result of the 2's complement sum of (2^52 + 2^51) and (−a). Again, the first term affects only the upper two bits; what remains in bits 0–50 is the 2's complement representation of (−a) (again, minus the upper two bits).
Since reducing a 2's complement number to a smaller width is done just by cutting away the extra bits on the left, taking the lower 32 bits correctly gives us (−a) in 32-bit, 2's complement arithmetic.
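To make this concrete, here is a small verification sketch (it uses memcpy to sidestep the aliasing issues of the pointer casts above, and assumes IEEE 754 doubles):

#include <cstdio>
#include <cstring>
#include <cstdint>

int main()
{
    double d = -12345678.9 + 6755399441055744.0; // add the magic number 2^52 + 2^51
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);         // well-defined way to read the bits
    // The lowest 32 bits of the mantissa hold the rounded result in 2's complement.
    std::printf("%d\n", (int)(std::uint32_t)bits); // prints -12345679
    return 0;
}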
This kind of "trick" comes from older x86 processors, which used the 8087 instructions/interface for floating point. On these machines, there's an instruction for converting floating point to integer, "fist", but it uses the current FP rounding mode. Unfortunately, the C spec requires that fp->int conversions truncate towards zero, while all other FP operations round to nearest, so doing an fp->int conversion requires first changing the FP rounding mode, then doing a fist, then restoring the FP rounding mode.
Now, on the original 8086/8087 this wasn't too bad, but on later processors that started to get superscalar and out-of-order execution, altering the FP rounding mode generally serializes the CPU core and is quite expensive. So on a CPU like a Pentium III or Pentium 4, the overall cost is quite high: a normal fp->int conversion is 10x or more as expensive as this add+store+load trick.
On x86-64, however, floating point is done with the xmm instructions, and the cost of converting fp->int is pretty small, so this "optimization" is likely slower than a normal conversion.
If this helps with visualization, that Lua magic value (2^52 + 2^51) is:

binary: 11 followed by 51 zeros
hex:    0x0018000000000000 (0x18 followed by 12 zeros)
octal:  0300000000000000000 (3 followed by 17 zeros)
Here is a simpler implementation of the above Lua trick:
/**
 * Round to the nearest integer.
 * Tie-breaks: round half to even (bankers' rounding).
 * Only works for inputs in the range [-2^51, 2^51].
 */
inline double rint(double d)
{
    double x = 6755399441055744.0; // 2^51 + 2^52
    return d + x - x;
}
The trick works for numbers with absolute value < 2^51.
This is a little program to test it: ideone.com
#include <cstdio>

int main()
{
    // round to nearest integer
    printf("%.1f, %.1f\n", rint(-12345678.3), rint(-12345678.9));
    // test the tie-breaking rule
    printf("%.1f, %.1f, %.1f, %.1f\n", rint(-24.5), rint(-23.5), rint(23.5), rint(24.5));
    return 0;
}
// output:
// -12345678.0, -12345679.0
// -24.0, -24.0, 24.0, 24.0
What is the best way to get the individual digits of an int with n digits, for use in a radix sort algorithm? I'm wondering if there is a particularly good way to do it in C/C++; if not, what is the general best solution?
Edit: just to clarify, I was looking for a solution other than converting it to a string and treating it like an array of digits.
Use digits of size 2^k. To extract the nth digit:
// k is assumed to be a compile-time constant, e.g. enum { k = 4 };
#define BASE (1 << k)
#define MASK (BASE - 1)

inline unsigned get_digit(unsigned word, int n) {
    return (word >> (n * k)) & MASK;
}
Using the shift and mask (enabled by base being a power of 2) avoids expensive integer-divide instructions.
After that, choosing the best base is an experimental question (a time/space tradeoff for your particular hardware). Probably k==3 (base 8) works well and limits the number of buckets, but k==4 (base 16) looks more attractive because it divides the word size. However, there is really nothing wrong with a base that does not divide the word size, and you might find that base 32 or base 64 performs better. It's an experimental question whose answer may well differ by hardware, according to how the cache behaves and how many elements are in your array.
Final note: if you are sorting signed integers, life is much more of a pain, because you want to treat the most significant bit as signed. I recommend treating everything as unsigned, and then, if you really need signed, in the last step of your radix sort you swap the buckets, so that buckets with a most significant 1 come before those with a most significant 0. This problem is definitely easier if k divides the word size. (A sketch of one sort pass follows.)
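For reference, one stable scatter/gather pass of an LSD radix sort built on this digit extraction might look like the following (a hedged sketch with k = 4 hard-coded; radix_pass is an illustrative name):

#include <vector>

void radix_pass(std::vector<unsigned>& a, int n) // n = digit index
{
    const int k = 4;                             // base 16
    std::vector<unsigned> buckets[1 << k];
    for (unsigned word : a)                      // stable scatter by the nth digit
        buckets[(word >> (n * k)) & ((1u << k) - 1)].push_back(word);
    a.clear();
    for (auto& b : buckets)                      // gather back in digit order
        a.insert(a.end(), b.begin(), b.end());
}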
Don't use base 10, use base 16.
for (int i = 0; i < 8; i++) {             // 8 four-bit digits in a 32-bit int
    printf("%d\n", (n >> (i * 4)) & 0xf);
}
Since integers are stored internally in binary, this will be more efficient than dividing by 10 to determine decimal digits.
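For contrast, a sketch of the decimal version this avoids, which needs a divide and a modulo per digit:

unsigned m = n;                 // work on a copy
for (int i = 0; i < 10; i++) {  // up to 10 decimal digits in 32 bits
    printf("%u\n", m % 10);     // extract the lowest decimal digit
    m /= 10;
}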
In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.
I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.
The MSVC++ 2008 compiler is being used, for reference... though I'd assume sqrt is not going to add much compiler-specific overhead.
See also here for similar discussion on modf function.
EDIT: for reference, this is one widely used method, but is it actually much quicker? How many cycles does SQRT take anyway these days?
Yes, it is possible even without trickery:
Sacrifice accuracy for speed: the sqrt algorithm is iterative; re-implement it with fewer iterations.
Lookup tables: either just for the starting point of the iteration, or combined with interpolation to get you all the way there.
Caching: are you always taking the sqrt of the same limited set of values? If so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes of the same size, so results can be usefully cached.
Hello from 11 years in the future.
Considering this still gets occasional votes, I thought I'd add a note about performance, which now, even more than then, is dramatically limited by memory accesses. You absolutely must use a realistic benchmark (ideally, your whole application) when optimising something like this: the memory access patterns of your application will have a dramatic effect on solutions like lookup tables and caches, and just comparing 'cycles' for your optimised version will lead you wildly astray. It is also very difficult to assign program time to individual instructions, and your profiling tool may mislead you here.
On a related note, consider using simd/vectorised instructions for calculating square roots, like _mm512_sqrt_ps or similar, if they suit your use case.
Take a look at section 15.12.3 of Intel's optimisation reference manual, which describes approximation methods, with vectorised instructions, that would probably translate pretty well to other architectures too.
There's a great comparison table here:
http://assemblyrequired.crashworks.org/timing-square-root/
Long story short, SSE2's sqrtss is about 2x faster than the FPU's fsqrt, and an approximation + iteration is about 4x faster than that (8x overall).
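For reference, that "approximation + iteration" usually means something along these lines (a sketch assuming SSE is available; accuracy is roughly 22 bits, and it is not valid at x == 0, where rsqrt returns infinity):

#include <xmmintrin.h> // SSE intrinsics

inline float approx_sqrt(float x)
{
    __m128 v  = _mm_set_ss(x);
    __m128 r  = _mm_rsqrt_ss(v);        // ~12-bit estimate of 1/sqrt(x)
    // One Newton-Raphson step: r' = r * (1.5 - 0.5 * x * r * r)
    __m128 h  = _mm_mul_ss(_mm_set_ss(0.5f), v);
    __m128 r2 = _mm_mul_ss(r, r);
    r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(1.5f), _mm_mul_ss(h, r2)));
    return x * _mm_cvtss_f32(r);        // sqrt(x) = x * (1/sqrt(x))
}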
Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.
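In C++ you can make the intent explicit (a sketch; the function name is illustrative):

#include <cmath>

void precision_check()
{
    float f = 2.0f;
    float a = std::sqrt(f);                // float overload: stays in single precision
    float b = (float)std::sqrt((double)f); // double round trip: the pattern to avoid
    (void)a; (void)b;
}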
You're very likely to gain more speed improvements by changing your algorithms than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)
Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.
Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this will still only bring your execution time down by 2.5%. For most applications users wouldn't even notice this, except with a stopwatch.
How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).
Don't know if you fixed this, but I've read about it before, and it seems that the fastest thing to do is replace the sqrt function with an inline assembly version;
you can see a description of a load of alternatives here.
The best is this snippet of magic:
double inline __declspec (naked) __fastcall sqrt(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
It's about 4.7x faster than the standard sqrt call with the same precision.
Here is a fast way with a lookup table of only 8 KB. The error is ~0.5% of the result. You can easily enlarge the table, thus reducing the error. It runs about 5 times faster than the regular sqrt().
#include <math.h>    // sqrtf
#include <stdint.h>  // uint32_t

// LUT for fast sqrt of floats. The table consists of 2 halves: one for sqrt(X) and one for sqrt(2X).
const int nBitsForSQRTprecision = 11; // Use only the 11 most significant bits of the float's 23-bit mantissa. 15 bits would give less error but take more memory.
const int nUnusedBits = 23 - nBitsForSQRTprecision; // Number of mantissa bits we disregard
const int tableSize = (1 << (nBitsForSQRTprecision + 1)); // 2^nBits * 2, because the table has 2 halves
static short sqrtTab[tableSize];
static unsigned char is_sqrttab_initialized = 0; // Set once the table is built

// Precalculate sqrt() for future fast lookups. Approximates the exact value with an error of about 0.5%.
// Note: to access the bits of a float in C quickly we must misuse pointers.
// More info in: http://en.wikipedia.org/wiki/Single_precision
void build_fsqrt_table(void) {
    unsigned short i;
    float f;
    uint32_t *fi = (uint32_t*)&f;
    if (is_sqrttab_initialized)
        return;
    const int halfTableSize = (tableSize >> 1);
    for (i = 0; i < halfTableSize; i++) {
        // Build a float with the bit pattern i as mantissa and an exponent of 0, stored as 127
        *fi = ((uint32_t)i << nUnusedBits) | (127u << 23);
        // Take the square root, then store the top 'nBitsForSQRTprecision' bits of the mantissa in the table
        f = sqrtf(f);
        sqrtTab[i] = (short)((*fi & 0x7fffff) >> nUnusedBits);
        // Repeat the process, this time with an exponent of 1, stored as 128
        *fi = ((uint32_t)i << nUnusedBits) | (128u << 23);
        f = sqrtf(f);
        sqrtTab[i + halfTableSize] = (short)((*fi & 0x7fffff) >> nUnusedBits);
    }
    is_sqrttab_initialized = 1;
}
// Compute a square root: halve the float's exponent and look up the sqrt of its mantissa in the precalculated table.
float fast_float_sqrt(float n) {
    if (n <= 0.f)
        return 0.f; // Return 0 for 0 or negative input
    uint32_t *num = (uint32_t*)&n;
    short e; // Exponent
    e = (short)((*num >> 23) - 127); // In 'float' the exponent is stored with 127 added
    *num &= 0x7fffff; // Keep only the mantissa
    // If the exponent is odd, we have to look it up in the second half of the lookup table, so we set the high bit of the index.
    const int halfTableSize = (tableSize >> 1);
    const int secondHalfTableIdBit = halfTableSize << nUnusedBits;
    if (e & 0x01)
        *num |= secondHalfTableIdBit;
    e >>= 1; // Halve the exponent (note that in C the shift operators are sign-preserving for signed operands)
    // Do the table lookup on the truncated mantissa, then reconstruct the result back into a float
    *num = ((uint32_t)(sqrtTab[*num >> nUnusedBits]) << nUnusedBits) | ((uint32_t)(e + 127) << 23);
    return n;
}