I found some interesting bit twiddling in the "source\common\unicode\utf.h" file of the ICU library (International Components for Unicode). The bit twiddling is intended for checking whether a number is in a particular range.
// Is a code point in a range of U+d800..U+dbff?
#define U_IS_LEAD(c) (((c)&0xfffffc00)==0xd800)
I have figured out that the magic number (0xfffffc00) comes from:
MagicNumber = 0xffffffff - (HighBound - LowBound)
However, I also found that the formula doesn't apply to every arbitrary range. Does somebody here know in what circumstances the formula works?
Is there another bit-twiddling trick for checking whether a number is in a particular range?
For these tricks to apply, the numbers must have some common features in their binary representation.
0xD800 == 0b1101_1000_0000_0000
0xDBFF == 0b1101_1011_1111_1111
What this test really does is to mask out the lower ten bits. This is usually written as
onlyHighBits = x & ~0x03FF
After this operation ("and not") the lower ten bits of onlyHighBits are guaranteed to be zero. That means that if this number equals the lower range of the interval now, it has been somewhere in the interval before.
This trick works in all cases where the lower and the higher limit of the interval start with the same digits in binary, and at some point the lower limit has only zeroes while the higher limit has only ones. In your example this is at the tenth position from the right.
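To make the rule concrete, here is the same trick written as a generic helper (my own sketch, not code from ICU); it assumes the bounds satisfy the condition just described:

#include <cstdint>

// Generic form of the trick: the bounds share their upper bits, and in the
// low block lo is all zeros while hi is all ones.
static inline bool in_range_masked(uint32_t c, uint32_t lo, uint32_t hi) {
    uint32_t mask = ~(hi - lo);   // e.g. ~(0xDBFF - 0xD800) == 0xFFFFFC00
    return (c & mask) == lo;      // mask out the low bits, compare against lo
}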
If you do not have 2^x boundaries, you might use the following trick:
if x >= 0 and x < N you can check both by:
if Longword( x ) < Longword( N ) then ...
This works due to the fact that negative numbers in signed numbers correspond to the largest numbers in unsigned datatypes.
You could extend this (when range checking is DISABLED) to:
if Longword( x - A ) < Longword ( ( B - A ) ) then ...
Now you have both tests (range [ A, B >) in a SUB and a CMP plus a single Jcc, assuming (B - A ) is precalculated.
I only use this kind of optimization when really needed; such tricks tend to make your code less readable, and they only shave off a few clock cycles per test.
Note to C like language readers: Longword is Delphi's unsigned 32bit datatype.
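For C-family readers, a rough equivalent of the trick (my own translation sketch, using uint32_t in place of Longword):

#include <cstdint>

// Checks A <= x < B with a single unsigned compare; a negative difference
// wraps around to a large unsigned value and fails the test.
static inline bool in_half_open_range(int32_t x, int32_t a, int32_t b) {
    return (static_cast<uint32_t>(x) - static_cast<uint32_t>(a)) <
           (static_cast<uint32_t>(b) - static_cast<uint32_t>(a));
}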
The formula works whenever the range you are looking for starts at a multiple of a power of 2 (that is, the low end of the binary form of the number ends in one or more 0 bits) and the size of the range is 2^n - 1 (that is, low & high == low and low | high == high).
I'm using GMP's low-level interface (mpn_, see https://gmplib.org/manual/Low_002dlevel-Functions) to do some fixed-size 192 bit (three limb) integer calculations.
Currently I am trying to divide one random uint192 by another random uint192 and am failing to select the right function. There are multiple candidates:
mpn_tdiv_qr - Documentation says "most significant limb of the divisor must be non-zero"
mpn_divrem - Documentation says "obsolete"
mpn_divrem_1 - Takes a single limb as a divisor
mpn_divexact_1 - Documentation says "expecting ... to divide exactly"
So there is an obsolete function (mpn_divrem), a function that takes small (<= 64bit) divisors only (mpn_divrem_1), one that has undefined behavior with remainders (mpn_divexact_1) and one that takes large (>= 129bit) divisors only (mpn_tdiv_qr).
I tried mpn_tdiv_qr, since I thought I was misunderstanding the documentation and because it replaces the obsolete function. But it actually crashes when the most significant limb of the divisor is zero.
Did I miss something? How can I divide two random 3-limb-numbers? Or numbers with a divisor of 65..128 non-zero bits (e.g. 0xffffffff'ffffffffffffffff'ffffffffffffffff / 0xff'ffffffffffffffff)?
OK, the solution is simple: instead of passing the number of allocated limbs to mpn_tdiv_qr, pass the number of limbs minus the number of leading zero limbs of the divisor. E.g. when using 3 limbs, if a concrete divisor has limb[2] == 0, limb[1] != 0 and limb[0] == 0, pass 2 for the divisor length. This way it is guaranteed that the most significant limb (in this case limb[1]) is non-zero.
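A minimal sketch of that fix (function name and buffer layout are mine; the mpn_tdiv_qr call follows the documented signature with qxn = 0):

#include <gmp.h>

// Divide a 3-limb numerator by a 3-limb divisor, trimming the divisor's
// leading zero limbs first so its most significant limb is non-zero.
// Assumes d is not zero. Quotient gets nn - dn + 1 limbs, remainder dn limbs.
void divide_uint192(mp_limb_t q[3], mp_limb_t r[3],
                    const mp_limb_t n[3], const mp_limb_t d[3]) {
    mp_size_t nn = 3, dn = 3;
    while (dn > 1 && d[dn - 1] == 0)   // drop leading zero limbs of the divisor
        --dn;
    mpn_tdiv_qr(q, r, 0, n, nn, d, dn);
}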
I know from articles like "Why you should never cast floats to ints" and many others like it that casting a float to a signed int is expensive. I'm also aware that certain conversion instructions or SIMD vector instructions on some architectures can speed the process. I'm curious if converting an integer to floating point is also expensive, as all the material I've found on the subject only talks about how expensive it is to convert from floating point to integer.
Before anyone says "Why don't you just test it?" I'm not talking about performance on a particular architecture, I'm interested in the algorithmic behavior of the conversion across multiple platforms adhering to the IEEE 754-2008 standard. Is there something inherent to the algorithm for conversion that affects performance in general?
Intuitively, I would think that conversion from integer to floating point would be easier in general for the following reasons:
Rounding is only necessary if the precision of the integer exceeds the precision of the binary floating point number, e.g. 32-bit integer to 32-bit float might require rounding, but 32-bit integer to 64-bit float won't, and neither will a 32-bit integer that only uses 24-bits of precision.
There is no need to check for NAN or +/- INF or +/- 0.
There is no danger of overflow or underflow.
What are reasons that conversion from int to float could result in poor cross-platform performance, if any (other than a platform emulating floating point numbers in software)? Is conversion from int to float generally cheaper than float to int?
Intel specifies in its "Architectures Optimization Reference Manual" that CVTSI2SD has 3-4 cycles latency (and 1 cycle throughput) on the basic desktop/server line since Core2. This can be accepted as a good example.
From the hardware point of view, such a conversion requires some assistance to fit in a reasonable number of cycles; otherwise it gets too expensive. A naive but rather good explanation follows. Throughout, I assume that a single CPU clock cycle is enough for an operation like a full-width integer add (but not radically longer!), and that all results of the previous cycle are applied at the cycle boundary.
The first clock cycle, with appropriate hardware assistance (a priority encoder), gives the Count Leading Zeros (CLZ) result along with detecting two special cases: 0 and INT_MIN (MSB set and all other bits clear). 0 and INT_MIN are better processed separately (load a constant into the destination register and finish). Otherwise, if the input integer was negative, it shall be negated; this usually requires one more cycle (because negation is a combination of inversion and adding a carry bit). So, 1-2 cycles are spent.
At the same time, it can calculate the biased exponent prediction, based on the CLZ result. Notice we needn't take care of denormalized values or infinity. (Can we predict CLZ(-x) based on CLZ(x), if x < 0? If we can, this saves us a cycle.)
Then, a shift is applied (1 cycle again, with a barrel shifter) to place the integer value so its highest 1 is at a fixed position (e.g. with the standard 3 extension bits and a 24-bit mantissa, this is bit number 26). This use of the barrel shifter shall combine all the low bits into the sticky bit (a separate custom barrel shifter instance may be needed; but this is waaaay cheaper than megabytes of cache or an OoO dispatcher). Now, up to 3 cycles.
Then, rounding is applied. Rounding, in our case, analyzes the 4 lowest bits of the current value (mantissa LSB, guard, round and sticky) and, OTOH, the current rounding mode and target sign (extracted at cycle 1). Rounding to zero (RZ) results in ignoring the guard/round/sticky bits. Rounding to -∞ (RMI) for a positive value and to +∞ (RPI) for a negative one is the same as rounding to zero. Rounding to ∞ of the opposite sign results in adding 1 to the main mantissa. Finally, rounding-to-nearest-ties-to-even (RNE): x000...x011 -> discard; x101...x111 -> add 1; 0100 -> discard; 1100 -> add 1. If the hardware is fast enough to add this result in the same cycle (I guess it's likely), we have up to 4 cycles now.
The addition in the previous step can produce a carry (like 1111 -> 10000), so the exponent can increase. The final cycle is to pack the sign (from cycle 1), the mantissa (into the "significand") and the biased exponent (calculated on cycle 2 from the CLZ result and possibly adjusted with the carry from cycle 4). So, 5 cycles now.
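To make the sequence concrete, here is a software sketch of the same steps (CLZ, shift, round-to-nearest-even, pack) for uint32_t -> binary32. This is my own illustration, not hardware: it assumes a compiler with __builtin_clz and handles neither negative inputs nor other rounding modes.

#include <cstdint>
#include <cstring>

float u32_to_f32(uint32_t x) {
    if (x == 0) return 0.0f;                  // special case, handled up front
    int msb = 31 - __builtin_clz(x);          // position of the highest set bit
    uint32_t exponent = 127 + msb;            // biased exponent prediction
    uint32_t mantissa;
    if (msb <= 23) {
        mantissa = x << (23 - msb);           // fits exactly, no rounding needed
    } else {
        int shift = msb - 23;                 // low bits that must be dropped
        uint32_t dropped = x & ((1u << shift) - 1);
        uint32_t halfway = 1u << (shift - 1);
        mantissa = x >> shift;
        // round to nearest, ties to even
        if (dropped > halfway || (dropped == halfway && (mantissa & 1)))
            ++mantissa;
        if (mantissa >> 24) {                 // rounding carried out of the mantissa
            mantissa >>= 1;
            ++exponent;
        }
    }
    uint32_t bits = (exponent << 23) | (mantissa & 0x7fffffu);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}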
Is conversion from int to float generally cheaper than float to int?
We can estimate the reverse conversion, e.g. from binary32 to int32 (signed), the same way. Let's assume that conversion of NaN, INF or a too-big value results in a fixed value, say, INT_MIN (-2147483648). In that case:
Split and analyze the input value: S - sign; BE - biased exponent; M - mantissa (significand); also apply rounding mode. A "conversion impossible" (overflow or invalid) signal is generated if: BE >= 158 (this includes NaN and INF). A "zero" signal is generated if BE < 127 (abs(x) < 1) and {RZ, or (x > 0 and RMI), or (x < 0 and RPI)}; or, if BE < 126 (abs(x) < 0.5) with RNE; or, BE = 126, significand = 0 (without hidden bit) and RNE. Otherwise, signals for final +1 or -1 can be generated for cases: BE < 127 and: x < 0 and RMI; x > 0 and RPI; BE = 126 and RNE. All these signals can be calculated during one cycle using boolean logic circuitry, and lead to finalize result at the first cycle. In parallel and independently, calculate 157-BE using a separate adder for using at cycle 2.
If not finalized yet, we have abs(x) >= 1, so BE >= 127, but BE <= 157 (so abs(x) < 2**31). Get 157-BE from cycle 1; this is the needed shift amount. Apply the right shift by this amount, using the same barrel shifter as in the int -> float algorithm, to a value with (again) 3 additional bits and sticky bit gathering. Here, 2 cycles are spent.
Apply rounding (see above). 3 cycles spent, and a carry can be produced. Here, we can again detect integer overflow and produce the respective result value. Forget the additional bits; only 31 bits matter now.
Finally, negate the resulting value, if x was negative (sign=1). Up to 4 cycles spent.
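And a matching software sketch for binary32 -> int32 (again my own illustration, truncating toward zero only, with NaN, INF and out-of-range values mapped to INT_MIN as assumed above):

#include <cstdint>
#include <cstring>

int32_t f32_to_i32(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t sign = bits >> 31;
    int32_t  be   = (bits >> 23) & 0xff;            // biased exponent
    uint32_t m    = (bits & 0x7fffffu) | 0x800000u; // significand with hidden bit
    if (be < 127)  return 0;                        // |x| < 1 truncates to 0
    if (be >= 158) return INT32_MIN;                // overflow, NaN, INF
    int e = be - 127;                               // unbiased exponent, 0..30 here
    uint32_t mag = (e >= 23) ? (m << (e - 23))      // whole significand is integer
                             : (m >> (23 - e));     // drop the fractional bits
    return sign ? -static_cast<int32_t>(mag) : static_cast<int32_t>(mag);
}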
I'm not an experienced binary logic developer, so I could miss some chance to compact this sequence, but it looks rather close to Intel's values. So, the conversions themselves are quite cheap, provided hardware assistance is present (again, it amounts to no more than a few thousand gates, so is tiny for contemporary chip production).
You can also take a look at the Berkeley SoftFloat library - it implements virtually the same approach with minor modifications. Start with the ui32_to_f32.c source file. They use more additional bits for intermediate values, but this isn't essential.
See @Netch's excellent answer regarding the algorithm, but it's not just the algorithm. The FPU runs asynchronously, so the int->FP operation can start and the CPU can then execute the next instruction. But when storing FP to an integer, there has to be an FWAIT (Intel).
For a Monte Carlo integration process, I need to pull a lot of random samples from a histogram that has N buckets, where N is arbitrary (i.e. not a power of two) but doesn't change at all during the course of the computation.
By a lot, I mean something on the order of 10^10, i.e. 10 billion, so pretty much any kind of lengthy precomputation is likely worth it in the face of the sheer number of samples.
I have at my disposal a very fast uniform pseudo-random number generator that typically produces unsigned 64-bit integers (all the ints in the discussion below are unsigned).
The naive way to pull a sample: histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation uses an integer division (IDIV), which is terribly expensive, and the compiler, not knowing the value of histogram.size() at compile time, can't apply its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html). As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/), which is capable of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do better; here's why:
First intuition: libdivide computes a division. What I need is a modulo, and to get there I have to do an additional multiply and a subtract: mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type techniques that produce the modulo directly.
Second intuition: I am actually not interested in the modulo itself. What I truly want is to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
B) mod(prng(), N) is guaranteed to belong to [0, N[
But the modulo is/does much more than just satisfy the two constraints above, and in fact it probably does too much work.
All I need is a function, any function, that obeys constraints A) and B) and is fast.
So, long intro, but here come my two questions:
Is there something out there equivalent to libdivide that computes integer modulos directly?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:
If X is a random variable uniformly distributed, then F(X, N) is also uniformly distributed
F(X, N) is guaranteed to be in [0, N[
(PS: I know that if N is small, I do not need to consume all the 64 bits coming out of the PRNG. As a matter of fact, I already do that. But like I said, even that optimization is a minor win compared to the big fat loss of having to compute a modulo.)
Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it?)
Edit 2: prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this: http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out if your PRNG is fast enough would be to pre-compute one less than the next larger power of 2 than your N to use as a mask. I.e., given some number that looks like 0001xxxxxxxx in binary (where x means we don't care if it's a 1 or a 0) we want a mask like 000111111111.
From there, we generate numbers as follows:
1. Generate a number
2. AND it with your mask
3. If the result >= N, go back to step 1
The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and our test in step 3 always passes. We've added only a mask and a comparison to the time taken for the PRNG itself.
In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.
On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
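A sketch of that loop (my own code, not from the answer; it computes the mask as one less than the next power of two at or above N, so when N is itself a power of two nothing is rejected, unlike the worst case described above):

#include <cstdint>

// prng is assumed to return uniformly distributed uint64_t values; n >= 1.
uint64_t bounded_random(uint64_t (*prng)(), uint64_t n) {
    uint64_t mask = n - 1;                    // smear the highest set bit downward
    mask |= mask >> 1;  mask |= mask >> 2;  mask |= mask >> 4;
    mask |= mask >> 8;  mask |= mask >> 16; mask |= mask >> 32;
    uint64_t r;
    do {
        r = prng() & mask;                    // keep only the low bits
    } while (r >= n);                         // reject and redraw out-of-range values
    return r;                                 // uniform in [0, n)
}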
If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).
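A sketch of that multiply-and-take-the-high-word idea (my own code, assuming a compiler with the __uint128_t extension):

#include <cstdint>

// Maps a uniform 64-bit value into [0, n) by scaling; the bias is on the
// order of n / 2^64, i.e. negligible for the histogram sizes discussed here.
uint64_t bounded_random_mul(uint64_t (*prng)(), uint64_t n) {
    __uint128_t product = static_cast<__uint128_t>(prng()) * n;
    return static_cast<uint64_t>(product >> 64);
}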
Another possibility to explore would be to use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
Libdivide, or any other complex way to optimize that modulo, is simply overkill. In a situation like yours, the only sensible approach is to:
ensure that your table size is a power of two (add padding if you must!)
replace the modulo operation with a bitmask operation, like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation takes a single cycle on any CPU worth its salt; you can't beat its speed.
--
Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use its low bits. Some RNGs produce poor randomness in the low bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to get the most significant bits:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
This is just as fast as the bitmask, but it uses different bits.
You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram
[10, 5, 6]
extend it to length 16 like so (assuming -1 is an appropriate sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Then sampling can be done via a bitmask, histogram[prng() & mask], where mask = new_length - 1 (new_length being a power of two), with a check for the sentinel value to retry, that is,
int value;
do {
value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
This is very much a memory vs. speed trade-off.
Just a few ideas to complement the other good answers:
What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)
When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution.
That way, the compiler will know the number of buckets.
Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.
Could the number of buckets be reduced without hurting the accuracy of the integration too much?
The non-uniformity dbaupp cautions about can be side-stepped by rejecting & redrawing values no less than M*(2^64/M) (integer division) before taking the modulus.
If M can be represented in no more than 32 bits, you can get more than one value less than M by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit patterns long enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
To fill the buckets, you may use std::binomial_distribution to fill each bucket directly, instead of feeding the buckets one sample at a time.
The following may help:
#include <cstddef>
#include <ctime>
#include <random>

int main() {
    int nrolls = 60; // number of experiments
    const std::size_t N = 6;
    unsigned int bucket[N] = {};
    std::mt19937 generator(std::time(nullptr));
    for (std::size_t i = 0; i != N; ++i) {
        // probability that one of the remaining rolls lands in bucket i
        double proba = 1. / static_cast<double>(N - i);
        std::binomial_distribution<int> distribution(nrolls, proba);
        bucket[i] = distribution(generator);
        nrolls -= bucket[i];
    }
}
Instead of integer division you can use fixed-point math, i.e. integer multiplication and a bitshift. Say your prng() returns values in the range 0-65535 and you want them quantized to the range 0-99; then you do (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right. Note that this mapping is better than modulo since it retains the uniform distribution.
Thanks everyone for your suggestions.
First, I am now thoroughly convinced that modulo is really evil. It is both very slow and yields incorrect results in most cases.
After implementing and testing quite a few of the suggestions, what seems to be the best speed/quality compromise is the solution proposed by @Gene:
pre-compute a normalizer as:
auto normalizer = histogram.size() / (1.0 + urng.max());
draw samples with:
return histogram[(uint32_t)floor(urng() * normalizer)];
It is the fastest of all the methods I've tried so far, and as far as I can tell it yields a distribution that's much better, even if it may not be as perfect as the rejection method.
Edit: I implemented David Eisenstat's method, which is more or less the same as Jarkkol's suggestion: index = (rng() * N) >> 32. It works as well as the floating-point normalization and it is a little faster (9% faster, in fact). So it is my preferred way now.
I am currently developing a utility that handles all arithmetic operations on bitsets.
The bitset can auto-resize to fit any number, so it can perform addition / subtraction / division / multiplication and modulo on very big bitsets (I've gone as far as loading a 700 MB movie into one to treat it just like a primitive integer).
I'm facing one problem though: I need my addition to resize the bitset to fit the exact number of bits needed after an addition, but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that both numbers are handling (whether the representation is positive or negative doesn't matter).
I have the whole code that I can share with you to point out the problem if my question is not clear enough.
Thanks in advance.
jav974
but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that both numbers are handling (whether the representation is positive or negative doesn't matter)
Nor will you: there's no way given "only the number of bits that both numbers are handling".
In the case of same-signed numbers, you may need one extra bit - you can start at the most significant bit of the smaller number, and scan for 0s that would absorb the impact of a carry. For example:
1010111011101 +
..10111010101
..^ start here
As both numbers have a 1 here you need to scan left until you hit a 0 (in which case the result has the same number of digits as the larger input), or until you reach the most significant bit of the larger number (in which case there's one more digit in the result).
1001111011101 +
..10111010101
..^ start here
In this case where the longer input has a 0 at the starting location, you first need to do a right-moving scan to establish whether there'll be a carry from the right of that starting position before launching into the left-moving scan above.
When signs differ:
if one value has 2 or more digits less than the other, then the number of digits required in the result will be either the same or one less than the digits in the larger input
otherwise, you'll have to do more of the work for an addition just to work out how many digits the result needs.
This is assuming the sign bit is separate from the count of magnitude bits.
Finally, the number of bits needed after an addition is at most the number of bits of the larger operand, plus 1.
Here is an explanation, using an unsigned char:
For max unsigned char :
11111111 (255)
+ 11111111 (255)
= 111111110 (510)
Naturally, if max + max needs (bits of max) + 1 bits, then for any x and y between 0 and max the result needs at most (bits of max) + 1 bits (the very maximum).
This works the same way with signed integers.
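As a trivial sketch of that rule (my own helper, with the sign bit handled separately as in the other answer):

#include <algorithm>
#include <cstddef>

// Upper bound on the magnitude bits needed for a + b, knowing only the
// operands' bit counts; the exact width may be one less.
std::size_t max_bits_after_add(std::size_t bits_a, std::size_t bits_b) {
    return std::max(bits_a, bits_b) + 1;
}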
I'm trying to implement BigInt and have read some threads and articles about it; most of them suggested using higher bases (256, 2^32 or even 2^64).
Why are higher bases good for this purpose?
The other question I have is how I am supposed to convert a string into a higher base (> 16). I read there is no standard way, except for base64. And the last question: how do I use those higher bases? Some examples would be great.
The CPU cycles spent multiplying or adding a number that fits in a register tend to be identical. So you will get the least number of iterations, and best performance, by using up the whole register. That is, on a 32-bit architecture, make your base unit 32 bits, and on a 64-bit architecture, make it 64 bits. Otherwise--say, if you only fill up 8 bits of your 32 bit register--you are wasting cycles.
The first answer stated this best. I personally use base 2^16 to keep from overflowing in multiplication. This allows any two digits to be multiplied together without ever overflowing.
Converting to a higher base requires a fast division method as well as packing the numbers as much as possible (assuming your BigInt lib can handle multiple bases).
Consider base 10 -> base 2. The actual conversions would be 10 -> 10000 -> 32768 -> 2. This may seem slower, but converting from base 10 to 10000 is super fast. The number of iterations for converting between 10000 and 32768 is small, as there are very few digits to iterate over. Unpacking 32768 to 2 is also extremely fast.
So first pack the base to the largest base it can go to. To do this, just combine the digits together. This base should be <= 2^16 to prevent overflow.
Next, combine the digits together until they are >= the target base. From here, divide by the target base using a fast division algorithm that would normally overflow in any other scenario.
Some quick code
if (!packed) pack()
from = remake() //moves all of the digits on the current BigInt to a new one, O(1)
loop
addNode()
loop
remainder = 0
remainder = remainder*fromBase + from.digit
exitwhen remainder >= toBase
set from = from.prev
exitwhen from.head
if (from.head) node.digit = remainder
else node.digit = from.fastdiv(fromBase, toBase, remainder)
exitwhen from.head
A look at fast division
loop
digit = remainder/divide
remainder = remainder%divide
//gather digits to divide again
loop
this = prev
if (head) return remainder
remainder = remainder*base + digit
exitwhen remainder >= divide
digit = 0
return remainder
Finally, unpack if you should unpack
Packing is just combining the digits together.
Example of base 10 to base 10000
4th*10 + 3rd
*10 + 2nd
*10 + 1st
???
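A small sketch of that packing step (my own code, not the vjass library; it assumes digits are stored least-significant first):

#include <algorithm>
#include <cstdint>
#include <vector>

// Packs base-10 digits into base-10000 limbs, four decimal digits per limb,
// combining each group with Horner's rule as in the example above.
std::vector<uint32_t> pack_base10_to_base10000(const std::vector<uint8_t>& digits) {
    std::vector<uint32_t> limbs;
    for (std::size_t i = 0; i < digits.size(); i += 4) {
        uint32_t limb = 0;
        for (std::size_t j = std::min(digits.size(), i + 4); j > i; --j)
            limb = limb * 10 + digits[j - 1];   // most significant digit of the group first
        limbs.push_back(limb);
    }
    return limbs;   // limbs[0] holds the least significant base-10000 digit
}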
You should have a Base class that stores alphabet + size for toString. If the Base is invalid, then just display the digits in a comma separated list.
All of your methods should be using the BigInt's current base, not some constant.
That's all?
From there, you'll be able to do things like
BigInt i = BigInt.convertString("1234567", Base["0123456789"])
Where the [] is overloaded and will either create a new base or return the already created base.
You'll also be able to do things like
i.toString()
i.base = Base["0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
i.base = 256
i.base = 45000
etc ^_^.
Also, if you are using integers and you want to be able to return BigInt remainders, division can get a bit tricky =P, but it's still pretty easy ^_^.
This is a BigInt library I coded in vjass (yes, for Warcraft 3, lol, don't judge me)
Things like TriggerEvaluate(evalBase) are just to keep the threads from crashing (evil operation limit).
Good luck :D