How can I safely downcast this? - c++

Edit: I got this to work (see in answers below) in VS2012, but it still doesn't properly downcast in Xcode.
I am trying to downcast from an unsigned long to an int in C++, but data loss seems inevitable:
unsigned long bigInt= randomBigNumber;
int x = 2;
I then pass a big number to a function that accepts signed ints:
myFunc(bigInt / (ULONG_MAX / x));
If randomBigNumber is a repeated random number -- such as in a for-loop -- I figure I should get a relatively evenly distributed number of ones and zeros, but I am only getting zeros.
How can I downcast this so as to get some ones?
Thanks.

Re "I am only getting zeros":
That’s because you’re dividing by a pretty big number. If the range of the random number generator is less than that big number, you can only get zeroes.
Re "How can I downcast this so as to get some ones?":
You can do
myFunc(bigInt % 2);
There is, however, an impact on randomness. Historically, the least significant bits of a pseudo-random number were often far less random than the high bits, due to imperfect generators. So, efficiency aside, you might be better off doing e.g.
myFunc((bigInt / 8) % 2);
if you are interested in nice randomness. It all depends on the generator of course. And there are other more elaborate techniques for improving randomness.
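For concreteness, here is a minimal sketch of both extraction styles; std::mt19937_64 is just a stand-in for wherever randomBigNumber actually comes from:
#include <iostream>
#include <random>
int main() {
    std::mt19937_64 rng(12345);  // stand-in generator, not the asker's actual source
    for (int i = 0; i < 10; ++i) {
        unsigned long long bigInt = rng();
        int lowBit = static_cast<int>(bigInt % 2);        // the last bit
        int midBit = static_cast<int>((bigInt / 8) % 2);  // skip the lowest bits
        std::cout << lowBit << ' ' << midBit << '\n';     // a mix of zeros and ones
    }
}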

I figured one way:
unsigned long bigInt = randomBigNumber;
unsigned long x = ULONG_MAX / 2;
myFunc(bigInt / x);
That worked for me. I was able to generate zeros and ones.
Note: I can only get this to work in VS2012, but not Xcode. Why?

Related

Is using an unsigned rather than signed int more likely to cause bugs? Why?

In the Google C++ Style Guide, on the topic of "Unsigned Integers", it is suggested that
Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler.
What is wrong with modular arithmetic? Isn't that the expected behaviour of an unsigned int?
What kind of bugs (a significant class) does the guide refer to? Overflowing bugs?
Do not use an unsigned type merely to assert that a variable is non-negative.
One reason that I can think of using signed int over unsigned int, is that if it does overflow (to negative), it is easier to detect.
Some of the answers here mention the surprising promotion rules between signed and unsigned values, but that seems more like a problem relating to mixing signed and unsigned values, and doesn't necessarily explain why signed variables would be preferred over unsigned outside of mixing scenarios.
In my experience, outside of mixed comparisons and promotion rules, there are two primary reasons why unsigned values are bug magnets as follows.
Unsigned values have a discontinuity at zero, the most common value in programming
Both unsigned and signed integers have discontinuities at their minimum and maximum values, where they wrap around (unsigned) or cause undefined behavior (signed). For unsigned these points are at zero and UINT_MAX. For int they are at INT_MIN and INT_MAX. Typical values of INT_MIN and INT_MAX on a system with 4-byte int values are -2^31 and 2^31-1, and on such a system UINT_MAX is typically 2^32-1.
The primary bug-inducing problem with unsigned that doesn't apply to int is that it has a discontinuity at zero. Zero, of course, is a very common value in programs, along with other small values like 1,2,3. It is common to add and subtract small values, especially 1, in various constructs, and if you subtract anything from an unsigned value and it happens to be zero, you just got a massive positive value and an almost certain bug.
Consider code that iterates over all values in a vector by index except the last [0.5]:
for (size_t i = 0; i < v.size() - 1; i++) { // do something }
This works fine until one day you pass in an empty vector. Instead of doing zero iterations, you get v.size() - 1 == a giant number [1], and you'll do 4 billion iterations with a near-certain buffer overflow vulnerability.
You need to write it like this:
for (size_t i = 0; i + 1 < v.size(); i++) { // do something }
So it can be "fixed" in this case, but only by carefully thinking about the unsigned nature of size_t. Sometimes you can't apply the fix above because instead of a constant one you have some variable offset you want to apply, which may be positive or negative: so which "side" of the comparison you need to put it on depends on the signedness - now the code gets really messy.
There is a similar issue with code that tries to iterate down to and including zero. Something like while (index-- > 0) works fine, but the apparently equivalent while (--index >= 0) will never terminate for an unsigned value. Your compiler might warn you when the right hand side is literal zero, but certainly not if it is a value determined at runtime.
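As a small sketch, assuming a std::vector<int> named v, the working countdown idiom looks like this:
#include <cstddef>
#include <vector>
void process_in_reverse(const std::vector<int>& v) {
    // OK: tests the old value, then decrements; handles index 0 and empty vectors.
    for (std::size_t i = v.size(); i-- > 0; ) {
        // use v[i]
    }
    // Broken with unsigned: --i >= 0 is always true, so the loop never terminates.
    // for (std::size_t i = v.size() - 1; --i >= 0; ) { /* ... */ }
}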
Counterpoint
Some might argue that signed values also have two discontinuities, so why pick on unsigned? The difference is that both discontinuities are very (maximally) far away from zero. I really consider this a separate problem of "overflow"; both signed and unsigned values may overflow at very large values. In many cases overflow is impossible due to constraints on the possible range of the values, and overflow of many 64-bit values may be physically impossible. Even if possible, the chance of an overflow-related bug is often minuscule compared to an "at zero" bug, and overflow occurs for unsigned values too. So unsigned combines the worst of both worlds: potential overflow with very large magnitude values, and a discontinuity at zero. Signed only has the former.
Many will argue "you lose a bit" with unsigned. This is often true - but not always (if you need to represent differences between unsigned values you'll lose that bit anyways: so many 32-bit things are limited to 2 GiB anyways, or you'll have a weird grey area where say a file can be 4 GiB, but you can't use certain APIs on the second 2 GiB half).
Even in the cases where unsigned buys you a bit, it doesn't buy you much: if you had to support more than 2 billion "things", you'll probably soon have to support more than 4 billion.
Logically, unsigned values are a subset of signed values
Mathematically, unsigned values (non-negative integers) are a subset of signed integers (just called integers) [2]. Yet signed values naturally pop out of operations solely on unsigned values, such as subtraction. We might say that unsigned values aren't closed under subtraction. The same isn't true of signed values.
Want to find the "delta" between two unsigned indexes into a file? Well, you had better do the subtraction in the right order, or else you'll get the wrong answer. Of course, you often need a runtime check to determine the right order! When dealing with unsigned values as numbers, you'll often find that (logically) signed values keep appearing anyway, so you might as well start off with signed.
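A tiny sketch of the subtraction-order trap (the values here are arbitrary):
#include <cstdint>
#include <cstdio>
int main() {
    uint64_t a = 100, b = 140;
    uint64_t wrapped = a - b;                    // wraps to 18446744073709551576
    int64_t  delta   = (int64_t)a - (int64_t)b;  // -40, regardless of order
    std::printf("%llu %lld\n", (unsigned long long)wrapped, (long long)delta);
}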
Counterpoint
As mentioned in footnote [2] above, unsigned values in C++ aren't actually a subset of signed values of the same size, so unsigned values can represent just as many results as signed values can.
True, but the range is less useful. Consider subtraction, and unsigned numbers with a range of 0 to 2N, and signed numbers with a range of -N to N. Arbitrary subtractions produce results in the range -2N to 2N in both cases, and either type of integer can only represent half of it. It turns out that the region centered around zero, -N to N, is usually way more useful (contains more actual results in real-world code) than the range 0 to 2N. Consider any typical distribution other than uniform (log, zipfian, normal, whatever) and consider subtracting randomly selected values from that distribution: way more values end up in [-N, N] than in [0, 2N] (indeed, the resulting distribution is always centered at zero).
64-bit closes the door on many of the reasons to use unsigned values as numbers
I think the arguments above were already compelling for 32-bit values, but the overflow cases, which affect both signed and unsigned at different thresholds, do occur for 32-bit values, since "2 billion" is a number that can be exceeded by many abstract and physical quantities (billions of dollars, billions of nanoseconds, arrays with billions of elements). So if someone is convinced enough by the doubling of the positive range for unsigned values, they can make the case that overflow does matter and that it slightly favors unsigned.
Outside of specialized domains 64-bit values largely remove this concern. Signed 64-bit values have an upper range of 9,223,372,036,854,775,807 - more than nine quintillion. That's a lot of nanoseconds (about 292 years worth), and a lot of money. It's also a larger array than any computer is likely to have RAM in a coherent address space for a long time. So maybe 9 quintillion is enough for everybody (for now)?
When to use unsigned values
Note that the style guide doesn't forbid or even necessarily discourage use of unsigned numbers. It concludes with:
Do not use an unsigned type merely to assert that a variable is non-negative.
Indeed, there are good uses for unsigned variables:
When you want to treat an N-bit quantity not as an integer, but simply as a "bag of bits". For example, as a bitmask or bitmap, or N boolean values or whatever. This use often goes hand-in-hand with the fixed-width types like uint32_t and uint64_t since you often want to know the exact size of the variable. A hint that a particular variable deserves this treatment is that you only operate on it with the bitwise operators such as ~, |, &, ^, >> and so on, and not with the arithmetic operations such as +, -, *, / etc.
Unsigned is ideal here because the behavior of the bitwise operators is well-defined and standardized. Signed values have several problems, such as undefined and unspecified behavior when shifting, and an unspecified representation.
When you actually want modular arithmetic. Sometimes you actually want 2^N modular arithmetic. In these cases "overflow" is a feature, not a bug. Unsigned values give you what you want here since they are defined to use modular arithmetic. Signed values cannot be (easily, efficiently) used at all since they have an unspecified representation and overflow is undefined.
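To make both uses concrete, here is a small sketch; the flag names and the toy FNV-1a-style hash are illustrative, not from any particular codebase:
#include <cstdint>
// Bag-of-bits use: only bitwise operators ever touch this value.
enum : uint32_t { FLAG_READ = 1u << 0, FLAG_WRITE = 1u << 1, FLAG_EXEC = 1u << 2 };
bool can_write(uint32_t flags) { return (flags & FLAG_WRITE) != 0; }
// Modular-arithmetic use: the multiply is *meant* to wrap mod 2^64.
uint64_t toy_hash(const char* s) {
    uint64_t h = 14695981039346656037ull;      // FNV offset basis
    while (*s) {
        h ^= static_cast<unsigned char>(*s++);
        h *= 1099511628211ull;                 // overflow wraps by design
    }
    return h;
}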
[0.5] After I wrote this I realized it is nearly identical to Jarod's example, which I hadn't seen - and for good reason, it's a good example!
[1] We're talking about size_t here, so usually 2^32-1 on a 32-bit system or 2^64-1 on a 64-bit one.
[2] In C++ this isn't exactly the case because unsigned values contain more values at the upper end than the corresponding signed type, but the basic problem exists that manipulating unsigned values can result in (logically) signed values, while there is no corresponding issue with signed values (since signed values already include the unsigned ones, logically speaking).
As stated, mixing unsigned and signed might lead to unexpected behaviour (even if well defined).
Suppose you want to iterate over all elements of vector except for the last five, you might wrongly write:
for (int i = 0; i < v.size() - 5; ++i) { foo(v[i]); } // Incorrect
// for (int i = 0; i + 5 < v.size(); ++i) { foo(v[i]); } // Correct
Suppose v.size() < 5; then, as v.size() is unsigned, v.size() - 5 would be a very large number, and so i < v.size() - 5 would be true for a far larger range of i than expected. UB then happens quickly (out-of-bounds access once i >= v.size()).
If v.size() returned a signed value, then v.size() - 5 would have been negative, and in the above case the condition would be false immediately.
On the other side, an index should be in the range [0; v.size()[, so unsigned makes sense.
Signed also has its own issues, such as UB on overflow or implementation-defined behaviour for right-shifting a negative number, but these are a less frequent source of bugs for iteration.
One of the most hair-raising examples of an error is when you MIX signed and unsigned values:
#include <iostream>
int main() {
    auto qualifier = -1 < 1u ? "makes" : "does not make";
    std::cout << "The world " << qualifier << " sense" << std::endl;
}
The output:
The world does not make sense
Unless you have a trivial application, it's inevitable you'll end up with either dangerous mixes between signed and unsigned values (resulting in runtime errors) or, if you crank up warnings and make them compile-time errors, a lot of static_casts in your code. That's why it's best to strictly use signed integers for types used in math or logical comparison. Only use unsigned for bitmasks and types representing bits.
Modeling a type to be unsigned based on the expected domain of the values of your numbers is a Bad Idea. Most numbers are closer to 0 than they are to 2 billion, so with unsigned types, a lot of your values are closer to the edge of the valid range. To make things worse, the final value may be in a known positive range, but while evaluating expressions, intermediate values may underflow, and if they are used in that intermediate form they may be VERY wrong values. Finally, even if your values are expected to always be positive, that doesn't mean that they won't interact with other variables that can be negative, and so you end up forced into mixing signed and unsigned types, which is the worst place to be.
Why is using an unsigned int more likely to cause bugs than using a signed int?
For certain classes of tasks, using an unsigned type is not more likely to cause bugs than using a signed type.
Use the right tool for the job.
What is wrong with modular arithmetic? Isn't that the expected behaviour of an unsigned int?
Why is using an unsigned int more likely to cause bugs than using a signed int?
If the task is well-matched: nothing wrong. No, not more likely.
Security, encryption, and authentication algorithms count on unsigned modular math.
Compression/decompression algorithms, as well as various graphics formats, also benefit from and are less buggy with unsigned math.
Any time bit-wise operators and shifts are used, the unsigned operations do not get messed up with the sign-extension issues of signed math.
Signed integer math has an intuitive look and feel readily understood by all, including learners to coding. But C/C++ was not originally targeted as, and should not now be, an intro language. For rapid coding that employs safety nets concerning overflow, other languages are better suited. For lean fast code, C assumes that coders know what they are doing (that they are experienced).
A pitfall of signed math today is the ubiquitous 32-bit int, which for so many problems is well wide enough for the common tasks without range checking. This leads to complacency, so that overflow is not coded against. Instead, for (int i=0; i < n; i++) or int len = strlen(s); is viewed as OK because n is assumed < INT_MAX and strings will never be too long, rather than being fully range-protected in the first case or using size_t, unsigned or even long long in the second.
C/C++ developed in an era that included 16-bit as well as 32-bit int, and the extra bit an unsigned 16-bit size_t affords was significant. Attention was needed with regard to overflow issues, be it int or unsigned.
With Google's 32-bit (or wider) applications on platforms where int/unsigned are not 16-bit, the ample range of int affords a lack of attention to +/- overflow. This makes sense for such applications, encouraging int over unsigned. Yet int math is not well protected.
The narrow 16-bit int/unsigned concerns apply today with select embedded applications.
Google's guidelines apply well for code they write today. It is not a definitive guideline for the larger wide scope range of C/C++ code.
One reason that I can think of using signed int over unsigned int, is that if it does overflow (to negative), it is easier to detect.
In C/C++, signed int math overflow is undefined behavior and so not certainly easier to detect than defined behavior of unsigned math.
As @Chris Uzdavinis well commented, mixing signed and unsigned is best avoided by all (especially beginners) and otherwise coded carefully when needed.
I have some experience with Google's style guide, AKA the Hitchhiker's Guide to Insane Directives from Bad Programmers Who Got into the Company a Long Long Time Ago. This particular guideline is just one example of the dozens of nutty rules in that book.
Errors only occur with unsigned types if you try to do arithmetic with them (see Chris Uzdavinis's example above), in other words if you use them as numbers. Unsigned types are not intended to be used to store numeric quantities; they are intended to store counts such as the size of containers, which can never be negative, and they can and should be used for that purpose.
The idea of using arithmetical types (like signed integers) to store container sizes is idiotic. Would you use a double to store the size of a list, too? That there are people at Google storing container sizes using arithmetical types and requiring others to do the same thing says something about the company. One thing I notice about such dictates is that the dumber they are, the more they need to be strict do-it-or-you-are-fired rules because otherwise people with common sense would ignore the rule.
Using unsigned types to represent non-negative values...
is more likely to cause bugs involving type promotion, when using signed and unsigned values, as other answers demonstrate and discuss in depth, but
is less likely to cause bugs involving choice of types with domains capable of representing undesirable/disallowed values. In some places you'll assume the value is in the domain, and may get unexpected and potentially hazardous behavior when other values sneak in somehow.
The Google Coding Guidelines puts emphasis on the first kind of consideration. Other guideline sets, such as the C++ Core Guidelines, put more emphasis on the second point. For example, consider Core Guideline I.12:
I.12: Declare a pointer that must not be null as not_null
Reason
To help avoid dereferencing nullptr errors. To improve performance by
avoiding redundant checks for nullptr.
Example
int length(const char* p); // it is not clear whether length(nullptr) is valid
length(nullptr); // OK?
int length(not_null<const char*> p); // better: we can assume that p cannot be nullptr
int length(const char* p); // we must assume that p can be nullptr
By stating the intent in source, implementers and tools can provide
better diagnostics, such as finding some classes of errors through
static analysis, and perform optimizations, such as removing branches
and null tests.
Of course, you could argue for a non_negative wrapper for integers, which avoids both categories of errors, but that would have its own issues...
The Google statement is about using unsigned as a size type for containers. In contrast, the question appears to be more general. Please keep that in mind while you read on.
Since most answers so far reacted to the Google statement, and less so to the bigger question, I will start my answer with negative container sizes and subsequently try to convince anyone (hopeless, I know...) that unsigned is good.
Signed container sizes
Let's assume someone coded a bug which results in a negative container index. The result is either undefined behavior or an exception / access violation. Is that really better than getting undefined behavior or an exception / access violation when the index type is unsigned? I think not.
Now, there is a class of people who love to talk about mathematics and what is "natural" in this context. How can an integral type with negative numbers be natural for describing something which is inherently >= 0? Using arrays with negative sizes much? IMHO, especially mathematically inclined people would find this mismatch of semantics (the size/index type says negative is possible, while a negative-sized array is hard to imagine) irritating.
So, the only question remaining on this matter is whether - as stated in the Google comment - a compiler could actually actively assist in finding such bugs. And do so even better than the alternative, which would be underflow-protected unsigned integers (x86-64 assembly and probably other architectures have means to achieve that; only C/C++ does not use those means). The only way I can fathom is if the compiler automatically added runtime checks (if (index < 0) throwOrWhatever), or, in the case of compile-time analysis, produced a lot of potentially false-positive warnings/errors of the form "The index for this array access could be negative." I have my doubts this would be helpful.
Also, for people who actually write runtime checks for their array/container indices, it is more work dealing with signed integers. Instead of writing if (index < container.size()) { ... } you now have to write if (index >= 0 && index < container.size()) { ... }. Looks like forced labor to me and not like an improvement...
Languages without unsigned types suck...
Yes, this is a stab at Java. Now, I come from an embedded programming background, and we worked a lot with field buses, where binary operations (and, or, xor, ...) and bit-wise composition of values are literally the bread and butter. For one of our products, we - or rather a customer - wanted a Java port... and I sat opposite the luckily very competent guy who did the port (I refused...). He tried to stay composed... and suffer in silence... but the pain was there; he could not stop cursing after a few days of constantly dealing with signed integral values which SHOULD be unsigned... Even writing unit tests for those scenarios is painful, and personally I think Java would have been better off if it had omitted signed integers and just offered unsigned... at least then you do not have to care about sign extension etc., and you can still interpret numbers as 2's complement.
Those are my 5 cents on the matter.

Fast, unbiased, integer pseudo random generator with arbitrary bounds

For a monte carlo integration process, I need to pull a lot of random samples from
a histogram that has N buckets, and where N is arbitrary (i.e. not a power of two) but
doesn't change at all during the course of the computation.
By a lot, I mean something on the order of 10^10 (10 billion), so pretty much any
kind of lengthy precomputation is likely worth it in the face of the sheer number of
samples.
I have at my disposal a very fast uniform pseudo random number generator that
typically produces unsigned 64-bit integers (all the ints in the discussion
below are unsigned).
The naive way to pull a sample : histogram[ prng() % histogram.size() ]
The naive way is very slow: the modulo operation is using an integer division (IDIV)
which is terribly expensive and the compiler, not knowing the value of histogram.size()
at compile time, can't be up to its usual magic (i.e. http://www.azillionmonkeys.com/qed/adiv.html)
As a matter of fact, the bulk of my computation time is spent extracting that darn modulo.
The slightly less naive way: I use libdivide (http://libdivide.com/) which is capable
of pulling off a very fast "divide by a constant not known at compile time".
That gives me a very nice win (25% or so), but I have a nagging feeling that I can do
better, here's why:
First intuition: libdivide computes a division. What I need is a modulo, and to get there
I have to do an additional mult and a sub : mod = dividend - divisor*(uint64_t)(dividend/divisor). I suspect there might be a small win there, using libdivide-type
techniques that produce the modulo directly.
Second intuition: I am actually not interested in the modulo itself. What I truly want is
to efficiently produce a uniformly distributed integer value that is guaranteed to be strictly smaller than N.
The modulo is a fairly standard way of getting there, because of two of its properties:
A) mod(prng(), N) is guaranteed to be uniformly distributed if prng() is
B) mod(prgn(), N) is guaranteed to belong to [0,N[
But the modulo is/does much more than just satisfy the two constraints above, and in fact
it does probably too much work.
All I need is a function, any function that obeys constraints A) and B) and is fast.
So, long intro, but here comes my two questions:
Is there something out there equivalent to libdivide that computes integer modulos directly ?
Is there some function F(X, N) of integers X and N which obeys the following two constraints:
If X is a random variable uniformly distributed then F(X,N) is also uniformly distributed
F(X, N) is guaranteed to be in [0, N[
(PS: I know that if N is small, I do not need to consume all the 64 bits coming out of
the PRNG. As a matter of fact, I already do that. But like I said, even that optimization
is a minor win when compared to the big fat loss of having to compute a modulo).
Edit: prng() % N is indeed not exactly uniformly distributed. But for N large enough, I don't think it's much of a problem (or is it?)
Edit 2 : prng() % N is indeed potentially very badly distributed. I had never realized how bad it could get. Ouch. I found a good article on this : http://ericlippert.com/2013/12/16/how-much-bias-is-introduced-by-the-remainder-technique
Under the circumstances, the simplest approach may work the best. One extremely simple approach that might work out if your PRNG is fast enough would be to pre-compute one less than the next larger power of 2 than your N to use as a mask. I.e., given some number that looks like 0001xxxxxxxx in binary (where x means we don't care if it's a 1 or a 0) we want a mask like 000111111111.
From there, we generate numbers as follows:
1. Generate a number.
2. AND it with your mask.
3. If the result > N, go to step 1.
The exact effectiveness of this will depend on how close N is to a power of 2. Each successive power of 2 is (obviously enough) double its predecessor. So, in the best case N is exactly one less than a power of 2, and our test in step 3 always passes. We've added only a mask and a comparison to the time taken for the PRNG itself.
In the worst case, N is exactly equal to a power of 2. In this case, we expect to throw away roughly half the numbers we generated.
On average, N ends up roughly halfway between powers of 2. That means, on average, we throw away about one out of four inputs. We can nearly ignore the mask and comparison themselves, so our speed loss compared to the "raw" generator is basically equal to the number of its outputs that we discard, or 25% on average.
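A minimal sketch of that mask-and-reject loop, written for indices strictly below N as the question requires; prng64 here is a placeholder for the asker's generator:
#include <cstdint>
// Smallest mask of the form 2^k - 1 covering [0, n-1].
uint64_t make_mask(uint64_t n) {
    uint64_t mask = 0;
    while (mask < n - 1) mask = (mask << 1) | 1;
    return mask;
}
// Uniform index in [0, n), assuming prng64() is uniform over all of uint64_t.
uint64_t bounded(uint64_t (*prng64)(), uint64_t n, uint64_t mask) {
    for (;;) {
        uint64_t r = prng64() & mask;  // uniform over [0, mask]
        if (r < n) return r;           // accept; otherwise redraw
    }
}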
If you have fast access to the needed instruction, you could 64-bit multiply prng() by N and return the high 64 bits of the 128-bit result. This is sort of like multiplying a uniform real in [0, 1) by N and truncating, with bias on the order of the modulo version (i.e., practically negligible; a 32-bit version of this answer would have small but perhaps noticeable bias).
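A minimal sketch of that multiply-high mapping, assuming the unsigned __int128 extension available in GCC/Clang (on MSVC you would reach for _umul128 instead):
#include <cstdint>
// Map a uniform 64-bit draw r into [0, n) by taking the high half of r * n.
uint64_t map_to_range(uint64_t r, uint64_t n) {
    return (uint64_t)(((unsigned __int128)r * n) >> 64);
}
// usage: histogram[map_to_range(prng(), histogram.size())]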
Another possibility to explore would be use word parallelism on a branchless modulo algorithm operating on single bits, to get random numbers in batches.
Libdivide, or any other complex ways to optimize that modulo are simply overkill. In a situation as yours, the only sensible approach is to
ensure that your table size is a power of two (add padding if you must!)
replace the modulo operation with a bitmask operation. Like this:
size_t tableSize = 1 << 16;
size_t tableMask = tableSize - 1;
...
histogram[prng() & tableMask]
A bitmask operation is a single cycle on any CPU that is worth its money, you can't beat its speed.
--
Note:
I don't know about the quality of your random number generator, but it may not be a good idea to use its lowest bits. Some RNGs produce poor randomness in the low bits and better randomness in the upper bits. If that is the case with your RNG, use a bitshift to get the most significant bits:
size_t bitCount = 16;
...
histogram[prng() >> (64 - bitCount)]
This is just as fast as the bitmask, but it uses different bits.
You could extend your histogram to a "large" power of two by cycling it, filling in the trailing spaces with some dummy value (guaranteed to never occur in the real data). E.g. given a histogram
[10, 5, 6]
extend it to length 16 like so (assuming -1 is an appropriate sentinel):
[10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, 10, 5, 6, -1]
Then sampling can be done via a binary mask histogram[prng() & mask] where mask = new_length - 1 (new_length being a power of two, 16 here), with a check for the sentinel value to retry, that is,
int value;
do {
    value = histogram[prng() & mask];
} while (value == SENTINEL);
// use `value` here
The extension is longer than necessary to make retries unlikely by ensuring that the vast majority of the elements are valid (e.g. in the example above only 1/16 lookups will "fail", and this rate can be reduced further by extending it to e.g. 64). You could even use a "branch prediction" hint (e.g. __builtin_expect in GCC) on the check so that the compiler orders code to be optimal for the case when value != SENTINEL, which is hopefully the common case.
This is very much a memory vs. speed trade-off.
Just a few ideas to complement the other good answers:
What percent of time is spent in the modulo operation, and how do you know what that percent is? I only ask because sometimes people say something is terribly slow when in fact it is less than 10% of the time and they only think it's big because they're using a silly self-time-only profiler. (I have a hard time envisioning a modulo operation taking a lot of time compared to a random number generator.)
When does the number of buckets become known? If it doesn't change too frequently, you can write a program-generator. When the number of buckets changes, automatically print out a new program, compile, link, and use it for your massive execution.
That way, the compiler will know the number of buckets.
Have you considered using a quasi-random number generator, as opposed to a pseudo-random generator? It can give you higher precision of integration in much fewer samples.
Could the number of buckets be reduced without hurting the accuracy of the integration too much?
The non-uniformity dbaupp cautions about can be side-stepped by rejecting & redrawing values no less than M*(2^64/M) (before taking the modulus).
If M can be represented in no more than 32 bits, you can get more than one value less than M by repeated multiplication (see David Eisenstat's answer) or divmod; alternatively, you can use bit operations to single out bit patterns long enough for M, again rejecting values no less than M.
(I'd be surprised at modulus not being dwarfed in time/cycle/energy consumption by random number generation.)
To fill the buckets, you may use std::binomial_distribution to feed each bucket directly instead of drawing samples one by one:
Following may help:
#include <cstddef>
#include <ctime>
#include <random>
int nrolls = 60; // number of experiments
const std::size_t N = 6;
unsigned int bucket[N] = {};
std::mt19937 generator(time(nullptr));
for (std::size_t i = 0; i != N; ++i) {
    // probability that one of the remaining samples lands in bucket i
    double proba = 1. / static_cast<double>(N - i);
    std::binomial_distribution<int> distribution(nrolls, proba);
    bucket[i] = distribution(generator);
    nrolls -= bucket[i];
}
Instead of integer division you can use fixed-point math, i.e. integer multiplication and a bitshift. Say your prng() returns values in the range 0-65535 and you want this quantized to the range 0-99; then you do (prng()*100)>>16. Just make sure that the multiplication doesn't overflow your integer type, so you may have to shift the result of prng() right first. Note that this mapping avoids the division entirely, and its bias is on the same order as the modulo approach (negligible when the generator's range is much larger than the target range).
Thanks everyone for your suggestions.
First, I am now thoroughly convinced that modulo is really evil.
It is both very slow and yields incorrect results in most cases.
After implementing and testing quite a few of the suggestions, what
seems to be the best speed/quality compromise is the solution proposed
by @Gene:
pre-compute normalizer as:
auto normalizer = histogram.size() / (1.0+urng.max());
draw samples with:
return histogram[ (uint32_t)floor(urng() * normalizer) ];
It is the fastest of all methods I've tried so far, and as far as I can tell,
it yields a distribution that's much better, even if it may not be as perfect
as the rejection method.
Edit: I implemented David Eisenstat's method, which is more or less the same as Jarkkol's suggestion : index = (rng() * N) >> 32. It works as well as the floating point normalization and it is a little faster (9% faster in fact). So it is my preferred way now.

Integer division algorithm

I was thinking about an algorithm for division of large numbers: dividing with remainder bigint C by bigint D, where we know the representation of C in base b, and D is of the form b^k-1. It's probably easiest to show it with an example. Let's try dividing C=21979182173 by D=999.
We write the number as sets of three digits: 21 979 182 173
We take sums (modulo 999) of consecutive sets, starting from the left: 21 001 183 356
We add 1 to those sets preceding the ones where we "went over 999": 22 001 183 356
Indeed, 21979182173/999=22001183 and remainder 356.
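For concreteness, the same steps can be written as ordinary long division by 999, exploiting the fact that 1000 = 999 + 1, so each group needs only an addition and a comparison; this is a minimal sketch for illustration, not the poster's code:
#include <cstdint>
#include <iostream>
#include <vector>
// groups holds base-1000 digits, most significant first, e.g. {21, 979, 182, 173}.
void divide_by_999(const std::vector<uint32_t>& groups,
                   std::vector<uint32_t>& quotient, uint32_t& remainder) {
    quotient.clear();
    uint32_t r = 0;                          // running remainder, always < 999
    for (uint32_t g : groups) {              // g is in [0, 999]
        uint32_t s = r + g;                  // < 2 * 999, so it wraps at most once
        quotient.push_back(r + (s >= 999));  // quotient digit in [0, 999]
        r = (s >= 999) ? s - 999 : s;
    }
    remainder = r;                           // leading zero quotient digits can be dropped
}
int main() {
    std::vector<uint32_t> q;
    uint32_t rem = 0;
    divide_by_999({21, 979, 182, 173}, q, rem);
    for (uint32_t d : q) std::cout << d << ' ';   // prints: 0 22 1 183
    std::cout << "remainder " << rem << '\n';     // prints: remainder 356
}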
I've calculated the complexity and, if I'm not mistaken, the algorithm should work in O(n), n being the number of digits of C in base b representation. I've also done a very crude and unoptimized version of the algorithm (only for b=10) in C++, tested it against GMP's general integer division algorithm and it really does seem to fare better than GMP. I couldn't find anything like this implemented anywhere I looked, so I had to resort to testing it against general division.
I found several articles which discuss what seem to be quite similar matters, but none of them concentrate on actual implementations, especially in bases different than 2. I suppose that's because of the way numbers are internally stored, although the mentioned algorithm seems useful for, say, b=10, even taking that into account. I also tried contacting some other people, but, again, to no avail.
Thus, my question would be: is there an article or a book or something where the aforementioned algorithm is described, possibly discussing the implementations? If not, would it make sense for me to try and implement and test such an algorithm in, say, C/C++ or is this algorithm somehow inherently bad?
Also, I'm not a programmer and while I'm reasonably OK at programming, I admittedly don't have much knowledge of computer "internals". Thus, pardon my ignorance - it's highly possible there are one or more very stupid things in this post. Sorry once again.
Thanks a lot!
Further clarification of points raised in the comments/answers:
Thanks, everyone - as I didn't want to comment on all the great answers and advice with the same thing, I'd just like to address one point a lot of you touched on.
I am fully aware that working in bases 2^n is, generally speaking, clearly the most efficient way of doing things. Pretty much all bigint libraries use 2^32 or whatever. However, what if (and, I emphasize, it would be useful only for this particular algorithm!) we implement bigints as an array of digits in base b? Of course, we require b here to be "reasonable": b=10, the most natural case, seems reasonable enough. I know it's more or less inefficient both considering memory and time, taking into account how numbers are internally stored, but I have been able, if my (basic and possibly somehow flawed) tests are correct, to produce results faster than GMP's general division, which would make implementing such an algorithm worthwhile.
Ninefingers notes that, in that case, I'd have to use an expensive modulo operation. I hope not: I can see if old+new crossed, say, 999, just by looking at the number of digits of old+new+1. If it has 4 digits, we're done. Even more, since old<999 and new<=999, we know that if old+new+1 has 4 digits (it can't have more), then (old+new)%999 equals deleting the leftmost digit of (old+new+1), which I presume we can do cheaply.
Of course, I'm not disputing obvious limitations of this algorithm nor I claim it can't be improved - it can only divide with a certain class of numbers and we have to a priori know the representation of dividend in base b. However, for b=10, for instance, the latter seems natural.
Now, say we have implemented bignums as I outlined above. Say C=(a_1a_2...a_n) in base b and D=b^k-1. The algorithm (which could probably be optimized much further) would go like this. I hope there aren't many typos.
if k > n, we're obviously done
add a zero (i.e. a_0 = 0) at the beginning of C (just in case we try to divide, say, 9999 by 99)
l = n % k (mod for "regular" integers - shouldn't be too expensive)
old = (a_0...a_l) (the first set of digits, possibly with fewer than k digits)
for (i = l+1; i < n; i = i+k) (we will have floor(n/k) or so iterations)
    new = (a_i...a_(i+k-1))
    new = new + old (this is bigint addition, thus O(k))
    aux = new + 1 (again, bigint addition - O(k) - which I'm not happy about)
    if aux has more than k digits
        delete the first digit of aux
        old = old + 1 (bigint addition once again)
        fill old with zeroes at the beginning so it has as many digits as it should
        (a_(i-k)...a_(i-1)) = old (if i = l+1, (a_0...a_l) = old)
        new = aux
    fill new with zeroes at the beginning so it has as many digits as it should
    (a_i...a_(i+k-1)) = new
    old = new (carry the reduced value forward for the next group)
quot = (a_0...a_(n-k+1))
rem = new
There, thanks for discussing this with me - as I said, this does seem to me to be an interesting "special case" algorithm to try to implement, test and discuss, if nobody sees any fatal flaws in it. If it's something not widely discussed so far, even better. Please, let me know what you think. Sorry about the long post.
Also, just a few more personal comments:
@Ninefingers: I actually have some (very basic!) knowledge of how GMP works, what it does and of general bigint division algorithms, so I was able to understand much of your argument. I'm also aware GMP is highly optimized and in a way customizes itself for different platforms, so I'm certainly not trying to "beat it" in general - that seems about as fruitful as attacking a tank with a pointed stick. However, that's not the idea of this algorithm - it works in very special cases (which GMP does not appear to cover). On an unrelated note, are you sure general divisions are done in O(n)? The most I've seen done is M(n). (And that can, if I understand correctly, in practice (Schönhage–Strassen etc.) not reach O(n). Fürer's algorithm, which still doesn't reach O(n), is, if I'm correct, almost purely theoretical.)
@Avi Berger: This doesn't actually seem to be exactly the same as "casting out nines", although the idea is similar. However, the aforementioned algorithm should work all the time, if I'm not mistaken.
Your algorithm is a variation of a base 10 algorithm known as "casting out nines". Your example is using base 1000 and "casting out" 999's (one less than the base). This used to be taught in elementary school as a way to do a quick check on hand calculations. I had a high school math teacher who was horrified to learn that it wasn't being taught anymore and filled us in on it.
Casting out 999's in base 1000 won't work as a general division algorithm. It will generate values that are congruent modulo 999 to the actual quotient and remainder - not the actual values. Your algorithm is a bit different and I haven't checked if it works, but it is based on effectively using base 1000 and the divisor being 1 less than the base. If you wanted to try it for dividing by 47, you would have to convert to a base 48 number system first.
Google "casting out nines" for more information.
Edit: I originally read your post a bit too quickly, and you do know of this as a working algorithm. As @Ninefingers and @Karl Bielefeldt have stated more clearly than I have in their comments, what you aren't including in your performance estimate is the conversion into a base appropriate for the particular divisor at hand.
I feel the need to add to this based on my comment. This isn't an answer, but an explanation as to the background.
A bignum library uses what are called limbs - search for mp_limb_t in the GMP source - which are usually a fixed-size integer field.
When you do something like addition, one way (albeit inefficient) to approach it is to do this:
doublelimb r = limb_a + limb_b + carryfrompreviousiteration
This double-sized limb catches the overflow of limb_a + limb_b in the case that the sum is bigger than the limb size. So if the total is bigger than 2^32 (when we're using uint32_t as our limb size), the overflow can be caught.
Why do we need this? Well, what you typically do is loop through all the limbs - you've done this yourself in dividing your integer up and going through each one - but we do it LSL first (least significant limb first), just as you'd do arithmetic by hand.
This might seem inefficient, but this is just the C way of doing things. To really break out the big guns, x86 has adc as an instruction - add with carry. What this does is an arithmetic add on your fields and sets the carry bit if the arithmetic overflows the size of the register. The next time you do add or adc, the processor factors in the carry bit too. In subtraction it's called the borrow flag.
This also applies to shift operations. As such, this feature of the processor is crucial to what makes bignums fast. So the fact is, there's electronic circuitry in the chip for doing this stuff - doing it in software is always going to be slower.
Without going into too much detail, operations are built up from this ability to add, shift, subtract etc. They're crucial. Oh and you use the full width of your processor's register per limb if you're doing it right.
Second point - conversion between bases. You cannot take a value in the middle of a number and change its base, because you can't account for the overflow from the digit beneath it in your original base, and that number can't account for the overflow from the digit beneath it... and so on. In short, every time you want to change base, you need to convert the entire bignum from the original base to your new base and back again. So you have to walk the bignum (all the limbs) three times at least. Or, alternatively, detect overflows expensively in all other operations... remember, now you need to do modulo operations to work out if you overflowed, whereas before the processor was doing it for us.
I should also like to add that whilst what you've got is probably quick for this case, bear in mind that as a bignum library gmp does a fair bit of work for you, like memory management. If you're using mpz_ you're using an abstraction above what I've described here, for starters. Finally, gmp uses hand optimised assembly with unrolled loops for just about every platform you've ever heard of, plus more. There's a very good reason it ships with Mathematica, Maple et al.
Now, just for reference, some reading material.
Modern Computer Arithmetic is a Knuth-like work for arbitrary precision libraries.
Donald Knuth, Seminumerical Algorithms (The Art of Computer Programming Volume II).
William Hart's blog on implementing algorithms for bsdnt, in which he discusses various division algorithms. If you're interested in bignum libraries, this is an excellent resource. I considered myself a good programmer until I started following this sort of stuff...
To sum it up for you: division assembly instructions suck, so people generally compute inverses and multiply instead, as you do when defining division in modular arithmetic. The various techniques that exist (see MCA) are mostly O(n).
Edit: Ok, not all of the techniques are O(n). Most of the techniques called div1 (dividing by something not bigger than a limb) are O(n). When you go bigger you end up with O(n^2) complexity; this is hard to avoid.
Now, could you implement bigints as an array of digits? Well yes, of course you could. However, consider just addition under that scheme:
/* you wouldn't do this just before add, it's just to
show you the declaration.
*/
uint32_t* x = malloc(num_limbs*sizeof(uint32_t));
uint32_t* y = malloc(num_limbs*sizeof(uint32_t));
uint32_t* a = malloc(num_limbs*sizeof(uint32_t));
uint64_t m = 0; /* carry into the current limb */
for ( i = 0; i < num_limbs; i++ )
{
    uint64_t t = (uint64_t)x[i] + y[i] + m;
    /* now we need to work out if that overflowed the base at all */
    if ( (t/somebase) >= 1 ) /* expensive division */
    {
        m = t / somebase; /* the carry out */
        t = t % somebase; /* expensive modulo to keep the limb in range */
    }
    else
    {
        m = 0;
    }
    a[i] = (uint32_t)t;
}
/* frees somewhere */
/* frees somewhere */
That's a rough sketch of what you're looking at for addition via your scheme. So you have to run the conversion between bases: you're going to need a conversion into your representation for that base, then back again when you're done, because this form is just really slow everywhere else. We're not talking about the difference between O(n) and O(n^2) here, but we are talking about an expensive division instruction per limb, or an expensive conversion every time you want to divide.
Next up, how do you expand your division for general case division? By that, I mean when you want to divide those two numbers x and y from the above code. You can't, is the answer, without resorting to bignum-based facilities, which are expensive. See Knuth. Taking modulo a number greater than your size doesn't work.
Let me explain. Try 21979182173 mod 1099. Let's assume here for simplicity's sake that the biggest size field we can have is three digits. This is a contrived example (the biggest field size I know of uses 128 bits, via gcc extensions), but anyway, the point is, you:
21 979 182 173
Split your number into limbs. Then you take modulo and sum:
21 1000 1182 1355
It doesn't work. This is where Avi is correct, because this is a form of casting out nines, or an adaption thereof, but it doesn't work here because our fields have overflowed for a start - you're using the modulo to ensure each field stays within its limb/field size.
So what's the solution? Split your number up into a series of appropriately sized bignums? And start using bignum functions to calculate everything you need to? This is going to be much slower than any existing way of manipulating the fields directly.
Now perhaps you're only proposing this case for dividing by a limb, not a bignum, in which case it can work, but Hensel division and precomputed inverses etc. do so without the conversion requirement. I have no idea if this algorithm would be faster than, say, Hensel division; it would be an interesting comparison; the problem comes with a common representation across the bignum library. The representation chosen in existing bignum libraries is for the reasons I've expanded on - it makes sense at the assembly level, where it was first done.
As a side note, you don't have to use uint32_t to represent your limbs. Ideally you use a size equal to the registers of the system (say uint64_t) so that you can take advantage of assembly-optimised versions. So on a 64-bit system adc rax, rbx only sets the overflow flag (CF) if the result overspills 2^64.
tl;dr version: the problem isn't your algorithm or idea; it's the problem of converting between bases, since the representation you need for your algorithm isn't the most efficient way to do it in add/sub/mul etc. To paraphrase Knuth: this shows you the difference between mathematical elegance and computational efficiency.
If you need to frequently divide by the same divisor, using it (or a power of it) as your base makes division as cheap as bit-shifting is for base 2 binary integers.
You could use base 999 if you want; there's nothing special about using a power-of-10 base except that it makes conversion to decimal integer very cheap. (You can work one limb at a time instead of having to do a full division over the whole integer. It's like the difference between converting a binary integer to decimal vs. turning every 4 bits into a hex digit. Binary -> hex can start with the most significant bits, but converting to non-power-of-2 bases has to be LSB-first using division.)
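As a small illustrative sketch (limb order and names are assumptions, not the answer's actual code), printing a base-10^9 number needs no bignum division at all:
#include <cstdint>
#include <cstdio>
#include <vector>
// limbs are base-10^9 digits, stored least significant first.
void print_base1e9(const std::vector<uint32_t>& limbs) {
    std::printf("%u", (unsigned)limbs.back());       // top limb: no zero padding
    for (std::size_t i = limbs.size() - 1; i-- > 0; )
        std::printf("%09u", (unsigned)limbs[i]);     // lower limbs: exactly nine digits each
    std::printf("\n");
}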
For example, to compute the first 1000 decimal digits of Fibonacci(10^9) for a code-golf question with a performance requirement, my 105 bytes of x86 machine code answer used the same algorithm as this Python answer: the usual a+=b; b+=a Fibonacci iteration, but divide by (a power of) 10 every time a gets too large.
Fibonacci grows faster than carry propagates, so discarding the low decimal digits occasionally doesn't change the high digits long-term. (You keep a few extra beyond the precision you want).
Dividing by a power of 2 doesn't work, unless you keep track of how many powers of 2 you've discarded, because the eventual binary -> decimal conversion at the end would depend on that.
So for this algorithm, you have to do extended-precision addition, and division by 10 (or whatever power of 10 you want).
I stored base-10^9 limbs in 32-bit integer elements. Dividing by 10^9 is trivially cheap: just a pointer increment to skip the low limb. Instead of actually doing a memmove, I just offset the pointer used by the next add iteration.
I think division by a power of 10 other than 10^9 would be somewhat cheap, but would require an actual division on each limb, and propagating the remainder to the next limb.
Extended-precision addition is somewhat more expensive this way than with binary limbs, because I have to generate the carry-out manually with a compare: sum[i] = a[i] + b[i]; carry = sum < a; (unsigned comparison). And also manually wrap to 10^9 based on that compare, with a conditional-move instruction. But I was able to use that carry-out as an input to adc (x86 add-with-carry instruction).
You don't need a full modulo to handle the wrapping on addition, because you know you've wrapped at most once.
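A minimal portable sketch of that addition step (plain C++ with an explicit compare, rather than the hand-tuned cmov/adc sequence described above):
#include <cstdint>
#include <vector>
constexpr uint32_t BASE = 1000000000;  // 10^9
// a += b, limbs least significant first, both the same length; the final
// carry-out is ignored here, real code would grow the number.
void add_base1e9(std::vector<uint32_t>& a, const std::vector<uint32_t>& b) {
    uint32_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        uint32_t sum = a[i] + b[i] + carry;  // fits in 32 bits: < 2 * 10^9 < 2^32
        carry = (sum >= BASE);               // wrapped at most once
        a[i] = carry ? sum - BASE : sum;
    }
}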
This wastes just over 2 bits of each 32-bit limb: 10^9 instead of 2^32 = 4.29... * 10^9. Storing base-10 digits one per byte would be significantly less space efficient, and very much worse for performance, because an 8-bit binary addition costs the same as a 64-bit binary addition on a modern 64-bit CPU.
I was aiming for code-size: for pure performance I would have used 64-bit limbs holding base-10^19 "digits". (2^64 = 1.84... * 10^19, so this wastes less than 1 bit per 64.) This lets you get twice as much work done with each hardware add instruction. Hmm, actually this might be a problem: the sum of two limbs might wrap the 64-bit integer, so just checking for > 10^19 isn't sufficient anymore. You could work in base 5*10^18, or in base 10^18, or do more complicated carry-out detection that checks for binary carry as well as manual carry.
Storing packed BCD with one digit per 4 bit nibble would be even worse for performance, because there isn't hardware support for blocking carry from one nibble to the next within a byte.
Overall, my version ran about 10x faster than the Python extended-precision version on the same hardware (but it had room for significant optimization for speed, by dividing less often). (70 seconds or 80 seconds vs. 12 minutes)
Still, I think for this particular implementation of that algorithm (where I only needed addition and division, and division happened after every few additions), the choice of base-10^9 limbs was very good. There are much more efficient algorithms for the Nth Fibonacci number that don't need to do 1 billion extended-precision additions.

Just how random is std::random_shuffle?

I'd like to generate a random number of reasonably arbitrary length in C++. By "reasonably arbitrary" I mean limited by the speed and memory of the host computer.
Let's assume:
I want to sample a decimal number (base 10) of length ceil(log10(MY_CUSTOM_RAND_MAX)) from 0 to 10^(ceil(log10(MY_CUSTOM_RAND_MAX))+1)-1
I have a vector<char>
The length of vector<char> is ceil(log10(MY_CUSTOM_RAND_MAX))
Each char is really an integer, a random number between 0 and 9, picked with rand() or similar methods
If I use std::random_shuffle to shuffle the vector, I could iterate through each element from the end, multiplying by incremented powers of ten to convert it to unsigned long long or whatever that gets mapped to my final range.
I don't know if there are problems with std::random_shuffle in terms of how random it is or isn't, particularly when also picking a sequence of rand() results to populate the vector<char>.
How sketchy is std::random_shuffle for generating a random number of arbitrary length in this manner, in a quantifiable sense?
(I realize that there is a library in Boost for making random int numbers. It's not clear what the range limitations are, but it looks like INT_MAX. That said, I realize that said library exists. This is more of a general question about this part of the STL in the generation of an arbitrarily large random number. Thanks in advance for focusing your answers on this part.)
I'm slightly unclear as to the focus of this question, but I'll try to answer it from a few different angles:
The quality of the standard library rand() function is typically poor. However, it is very easy to find replacement random number generators which are of a higher quality (you mentioned Boost.Random yourself, so clearly you're aware of other RNGs). It is also possible to boost (no pun intended) the quality of rand() output by combining the results of multiple calls, as long as you're careful about it: http://www.azillionmonkeys.com/qed/random.html
If you don't want the decimal representation in the end, there's little to no point in generating it and then converting to binary. You can just as easily stick multiple 32-bit random numbers (from rand() or elsewhere) together to make an arbitrary bit-width random number.
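A hedged sketch of sticking two 32-bit draws together, with std::mt19937 standing in for rand() or whatever generator you pick:
#include <cstdint>
#include <random>
// Build a 64-bit random value from two 32-bit draws; wider values just chain more draws.
uint64_t random64(std::mt19937& rng) {
    uint64_t hi = rng();   // 32 random bits
    uint64_t lo = rng();   // 32 more
    return (hi << 32) | lo;
}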
If you're generating the individual digits (binary or decimal) randomly, there is little to no point in shuffling them afterwards.

handling large number

This is Problem 3 from Project Euler site
I'm not after the solution, but I guess you probably know what my approach is. To my question now: how do I handle numbers exceeding unsigned int?
Is there a mathematical approach for this? If so, where can I read about it?
Have you tried unsigned long long, or better/more specifically uint64_t?
If you want to work with numbers bigger than the range of uint64_t (2^64-1, a 64-bit unsigned integer), then you should look into bignum: http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic.
600,851,475,143 is the number given by the question, and 2^64-1 is equal to 18,446,744,073,709,551,615. It is definitely big enough.
Having recently taught a kid I know prime factorization, the algorithm is trivial provided you have a list of primes.
Starting with 2, divide it into the target as many times as it divides evenly (zero remainder).
Take the next prime (3) and divide that into the target as in step one.
Write down each factor you find and repeat until the remaining quotient is 1.
Added, per request, algorithmic pseudo-code:
def factor(n):
    """returns a list of the prime factors of n"""
    factors = []
    p = primes.generator()
    while n > 1:
        x = p.next()
        while n % x == 0:
            n = n / x
            factors.append(x)
    return factors
Where successive calls to p.next() yields the next value in the series of primes {2, 3, 5, 7, 11, ...}
Any resemblance of that pseudo-code to actual working Python code is purely coincidental. I probably shouldn't mention that the definition of primes.generator() is one line shorter (but one line is 50 characters long). I originally wrote this "code" because the GNU factor program wouldn't accept inputs where log2(n) >= 40.
Use long long in GCC and __int64 in VC.
Use long long. This is supported in both GCC and newer versions of Visual Studio (2008 and later, I believe).
Perhaps the easiest way to handle your problem is to use Python. Python versions above 2.5 support built-in arbitrary-precision integer arithmetic; the precision is limited only by your computer's memory.
long long will do it for that problem. For other Project Euler problems that exceed long long, I'd probably use libgmp (specifically its C++ wrapper classes).
In Windows, if your compiler doesn't support 64 bit integers, you can use LARGE_INTEGER and ULARGE_INTEGER.