I have a class BigNum:
struct BigNum{
    vector<int> digits;
    BigNum(vector<int> data){
        for(int item : data){ digits.push_back(item); }
    }
    int get_digit(size_t index){
        return (index >= digits.size() ? 0 : digits[index]);
    }
};
and I'm trying to write code to multiply two BigNums. Currently, I've been using the traditional method of multiplication, which is multiplying the first number by each digit of the other and adding it to a running total. Here's my code:
BigNum add(BigNum a, BigNum b){ // traditional adding: goes digit by digit and keeps a "carry" variable
    vector<int> ret;
    int carry = 0;
    for(size_t i = 0; i < max(a.digits.size(), b.digits.size()); ++i){
        int curr = a.get_digit(i) + b.get_digit(i) + carry;
        ret.push_back(curr%10);
        carry = curr/10;
    }
    // leftover from carrying values
    while(carry != 0){
        ret.push_back(carry%10);
        carry /= 10;
    }
    return BigNum(ret);
}
BigNum mult(BigNum a, BigNum b){
    BigNum ret({0});
    for(size_t i = 0; i < a.digits.size(); ++i){
        vector<int> row(i, 0); // account for the zeroes at the end of each row
        int carry = 0;
        for(size_t j = 0; j < b.digits.size(); ++j){
            int curr = a.digits[i] * b.digits[j] + carry;
            row.push_back(curr%10);
            carry = curr/10;
        }
        while(carry != 0){ // leftover from carrying
            row.push_back(carry%10);
            carry /= 10;
        }
        ret = add(ret, BigNum(row)); // add the current row to our running sum
    }
    return ret;
}
This code still works pretty slowly; it takes around a minute to calculate the factorial of 1000. Is there a better way to multiply two BigNums? If not, is there a better way to represent large numbers that will speed up this code?
If you use a different base, say 2^16 instead of 10, the multiplication will be much faster.
But printing the result in decimal will take longer.
Get a ready made bignum library. Those tend to be optimized to death, all the way down to specific CPU models, with assembly where necessary.
GMP and MPIR are two popular ones. The latter is more Windows friendly.
One way is to use a larger base than ten. It's a huge waste, in both time and space, to take an int, able to hold values up to about four billion (in its unsigned variant), and use it to store single digits.
What you can do is use unsigned int/long values for a start, then choose a base such that the square of that base will fit into the value. So, for example, the square root of the largest 32-bit unsigned int is a touch over 65,000 so you choose 10,000 as the base.
So a "bigdigit" (I'll use that term for a digit in the base-10,000 scheme) is effectively equal to four decimal digits (just "digits" from here on), and this has several effects:
much less space taken up (about a quarter of the space);
still no chance of overflow when you multiply four-digit groups;
faster multiplications, doing four digits at a time rather than one; and
still easy printing, since it's in a base-ten-to-the-power-of-something format.
Those last two points warrant some explanation.
On the second last one, it should be something like sixteen times faster since, to multiply 1234 and 5678, each digit in the first has to be multiplied with every digit in the second. For a normal digit, that's sixteen multiplications, while it's only one for a bigdigit.
Since the bigdigits are exactly four digits, the output is still relatively easy, something like:
printf("%d", node[0]);
for (int i = 1; i < node_count; ++i) {
    printf("%04d", node[i]);
}
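To make that concrete, here's a minimal sketch of schoolbook multiplication in base 10,000. This is my own illustration rather than the asker's BigNum class: it stores "bigdigits" little-endian (least significant group first) and accumulates the product in place.

#include <cstdint>
#include <cstdio>
#include <vector>

// Little-endian vector of "bigdigits", each in 0..9999.
using BigDigits = std::vector<uint32_t>;
const uint32_t BASE = 10000;

BigDigits mult(const BigDigits& a, const BigDigits& b)
{
    BigDigits ret(a.size() + b.size(), 0);  // product has at most this many bigdigits
    for (size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b.size(); ++j) {
            // Worst case is 9999*9999 plus two values below BASE: fits easily in 64 bits.
            uint64_t curr = ret[i + j] + (uint64_t)a[i] * b[j] + carry;
            ret[i + j] = (uint32_t)(curr % BASE);
            carry = curr / BASE;
        }
        ret[i + b.size()] += (uint32_t)carry;
    }
    while (ret.size() > 1 && ret.back() == 0) ret.pop_back();  // trim leading zeros
    return ret;
}

void print(const BigDigits& n)
{
    printf("%u", n.back());                  // top group without padding
    for (size_t i = n.size() - 1; i-- > 0; )
        printf("%04u", n[i]);                // inner groups zero-padded to 4 digits
}

int main()
{
    BigDigits a{5678, 1234};   // 12,345,678 stored little-endian
    BigDigits b{4321, 8765};   // 87,654,321
    print(mult(a, b));         // prints 1082152022374638
    printf("\n");
}

Accumulating directly into ret also avoids the repeated add() calls of the asker's row-by-row version.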
Beyond that, and the normal C++ optimisations like passing const references rather than copying all objects, you can examine the same tricks used by MPIR and GMP. I tend to avoid them myself since they have (or did have at some point) a rather nasty habit of just violently exiting programs when they ran out of memory, something I find inexcusable in a general purpose library. In any case, I have routines built up over time that do, while nowhere near as much as GMP, certainly more than I need (and that use the same algorithms in many cases).
One of the tricks for multiplication is the Karatsuba algorithm (to be honest, I'm not sure if GMP/MPIR use this but, unless they've got something much better, I suspect they would).
It basically involves splitting each number into two parts, so that a = a1a0 (the two halves written side by side) and b = b1b0. In other words:
a = a1 x B^p + a0
b = b1 x B^p + b0
B^p is just some integral power of the actual base you're using, and p can generally be chosen so that B^p is close to the square root of the larger number (each part then has about half as many digits).
You then work out:
c2 = a1 x b1
c0 = a0 x b0
c1 = (a1 + a0) x (b1 + b0) - c2 - c0
That last point is tricky but it has been proven mathematically. I suggest, if you want to go into that level of depth, I'm not the best person for the job. At some point, even I, the consummate "don't believe anything you can't prove yourself" type, have to take expert opinions as fact :-)
Then you work some add/shift magic (multiplication looks to be involved but, since it's multiplication by a power of the base, it's really just a matter of shifting values left).
c = c2 x B^(2p) + c1 x B^p + c0
Now you may be wondering why three multiplications is a better approach than one, but you need to take into account that these multiplications are using far fewer digits than the original. If you remember the comment I made above about doing one multiplication rather than sixteen when switching from base-10 to base-10,000, you'll realise the number of digit multiplications is proportional to the square of the number of digits.
That means it can be better to perform three smaller multiplications even with some extra shifting and adding. And the beauty of this solution is that you can recursively apply it to the smaller numbers until you get down to the point where you're just multiplying two unsigned int values.
I probably haven't done the concept justice, and you do need to watch for and adjust the case where c1 becomes negative but, if you want raw speed, this is the sort of thing you'll have to look into.
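To illustrate just the three-multiplication identity (not a production implementation), here's a hedged sketch on machine words, splitting around B^p = 2^16; a real bignum version would recurse on halves of the digit array instead, and the function name is my own.

#include <cstdint>
#include <iostream>

// Karatsuba on 32-bit inputs, splitting around B^p = 2^16.
// Purely illustrative: the recursion bottoms out at 16-bit halves.
uint64_t karatsuba(uint32_t a, uint32_t b)
{
    if (a < 0x10000 && b < 0x10000)
        return (uint64_t)a * b;                  // small enough: multiply directly
    uint32_t a1 = a >> 16, a0 = a & 0xFFFF;      // a = a1 * 2^16 + a0
    uint32_t b1 = b >> 16, b0 = b & 0xFFFF;      // b = b1 * 2^16 + b0
    uint64_t c2 = karatsuba(a1, b1);
    uint64_t c0 = karatsuba(a0, b0);
    uint64_t c1 = karatsuba(a1 + a0, b1 + b0) - c2 - c0;  // equals a1*b0 + a0*b1
    return (c2 << 32) + (c1 << 16) + c0;         // shifts stand in for the B^p multiplies
}

int main()
{
    std::cout << karatsuba(123456789u, 987654321u) << "\n";  // 121932631112635269
    std::cout << (uint64_t)123456789u * 987654321u << "\n";  // same value
}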
And, as my more advanced math buddies will tell me (quite often), if you're not willing to have your entire head explode, you probably shouldn't be doing math :-)
Related
I am attempting to vectorize this fairly expensive function (scalar version now working!):
#include <cmath>

template<typename N, typename POW>
inline constexpr bool isPower(const N n, const POW p) noexcept
{
    double x = std::log(static_cast<double>(n)) / std::log(static_cast<double>(p));
    return (x - std::trunc(x)) < 0.000001;
}//End of isPower
Here's what I have so far (for 32-bit int only):
#include <immintrin.h>
#include <cstdint>
#include <vector>

template<typename RETURN_T>
inline RETURN_T count_powers_of(const std::vector<int32_t>& arr, const int32_t power)
{
    RETURN_T cnt = 0;
    const __m256 _MAGIC = _mm256_set1_ps(0.000001f);
    const __m256 _POWER_D = _mm256_set1_ps(static_cast<float>(power));
    const __m256 LOG_OF_POWER = _mm256_log_ps(_POWER_D);
    __m256i _count = _mm256_setzero_si256();
    __m256i _N_INT = _mm256_setzero_si256();
    __m256 _N_DBL = _mm256_setzero_ps();
    __m256 LOG_OF_N = _mm256_setzero_ps();
    __m256 DIVIDE_LOG = _mm256_setzero_ps();
    __m256 TRUNCATED = _mm256_setzero_ps();
    __m256 CMP_MASK = _mm256_setzero_ps();
    for (size_t i = 0uz; (i + 8uz) <= arr.size(); i += 8uz)
    {
        //Set Values
        _N_INT = _mm256_loadu_si256((__m256i*) &arr[i]); // unaligned load: vector data isn't 32-byte aligned
        _N_DBL = _mm256_cvtepi32_ps(_N_INT);
        LOG_OF_N = _mm256_log_ps(_N_DBL);
        DIVIDE_LOG = _mm256_div_ps(LOG_OF_N, LOG_OF_POWER);
        TRUNCATED = _mm256_sub_ps(DIVIDE_LOG, _mm256_trunc_ps(DIVIDE_LOG));
        CMP_MASK = _mm256_cmp_ps(TRUNCATED, _MAGIC, _CMP_LT_OQ);
        _count = _mm256_sub_epi32(_count, _mm256_castps_si256(CMP_MASK));
    }//End for (tail elements past a multiple of 8 are not handled here)
    cnt = static_cast<RETURN_T>(util::_mm256_sum_epi32(_count));
    return cnt;
}//End of count_powers_of
The scalar version runs in about 14.1 seconds.
The scalar version called from std::count_if with par_unseq runs in 4.5 seconds.
The vectorized version runs in just 155 milliseconds but produces the wrong result, albeit vastly closer now.
Testing:
int64_t count = 0;
for (size_t i = 0; i < vec.size(); ++i)
{
    if (isPower(vec[i], 4))
    {
        ++count;
    }//End if
}//End for
std::cout << "Counted " << count << " powers of 4.\n";//produces 4,996,215 powers of 4 in a vector of 1 billion 32-bit ints consisting of a uniform distribution of 0 to 1000
std::cout << "Counted " << count_powers_of<int32_t>(vec, 4) << " powers of 4.\n";//produces 4,996,865 powers of 4 on the same array
This new, vastly simplified code often produces results that are slightly off the correct number of powers found (usually higher). I think the problem is my reinterpret cast from __m256 to __m256i, but when I try to use a conversion (with floor) instead, I get a number that's way off (in the billions again).
It could also be this sum function (based off of code by @PeterCordes):
inline uint32_t _mm_sum_epi32(__m128i& x)
{
    __m128i hi64 = _mm_unpackhi_epi64(x, x);
    __m128i sum64 = _mm_add_epi32(hi64, x);
    __m128i hi32 = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i sum32 = _mm_add_epi32(sum64, hi32);
    return _mm_cvtsi128_si32(sum32);
}

inline uint32_t _mm256_sum_epi32(__m256i& v)
{
    __m128i sum128 = _mm_add_epi32(
        _mm256_castsi256_si128(v),
        _mm256_extracti128_si256(v, 1));
    return _mm_sum_epi32(sum128);
}
I know this has got to be a floating-point precision/comparison issue; Is there a better way to approach this?
Thanks for all your insights and suggestions thus far.
A more sensible unit test would be non-random: check all powers in a loop to make sure they're all true, like x *= base;, and count how many powers there are <= n. Then check all numbers from 0..n in a loop, once each, to verify the right total. If both those checks succeed, the function must have returned false in all the cases it should have; otherwise the count would be wrong.
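As a hedged sketch of that test (assuming some isPower(n, base) under test; the helper names are mine, and base^0 == 1 is deliberately not counted here):

#include <cassert>
#include <cstdint>

// Deterministic test: every exact power up to `limit` must pass, and the
// total count over 0..limit must match, which catches false positives too.
template <typename F>
void testIsPower(F isPower, int64_t base, int64_t limit)
{
    int64_t expected = 0;
    for (int64_t x = base; x <= limit; x *= base) {
        assert(isPower(x, base));      // all true powers accepted
        ++expected;
    }
    int64_t counted = 0;
    for (int64_t n = 0; n <= limit; ++n)
        if (isPower(n, base)) ++counted;
    assert(counted == expected);       // and nothing else slipped through
}

int main() {
    // Example: test a known-exact integer implementation.
    auto exactIsPower = [](int64_t n, int64_t base) {
        if (n < base) return false;
        while (n % base == 0) n /= base;
        return n == 1;
    };
    testIsPower(exactIsPower, 4, 1000000);
}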
Re: the original version:
This seems to depend on there being no floating-point rounding error. You do d == (N)d which (if N is an integral type) checks that the ratio of two logs is an exact integer; even 1 bit in the mantissa will make it unequal. Hardly surprising that a different log implementation would give different results, if one has different rounding error.
Except your scalar code at least is even more broken because it takes d = floor(log ratio) so it's already always an exact integer.
I just tried your scalar version for a testcase like return isPower(5, 4) to ask if 5 is a power of 4. It returns true: https://godbolt.org/z/aMT94ro6o . So yeah, your code is super broken, and is in fact only checking that n>0 or something. That would explain why 999 of 1000 of your "random" inputs from 0..999 were counted as powers of 4, which is obviously super broken.
I think it's impossible to achieve correctness with your FP log ratio idea: FP rounding error means you can't expect exact equality, but allowing a range would probably let in non-exact powers.
You might want to special-case integral N, power-of-2 pow. That can go vastly faster by checking that n has a single bit set ((n & (n-1)) == 0) and that it's at a valid position (e.g. for pow=4, n & 0b...10101010 == 0, i.e. no bit at an odd position). You can construct the constant by multiplying and adding until overflow or something. Or 32/pow times? Anyway, one psubd/pand/pcmpeqd, pand/pcmpeqd, and pand/psubd per 8 elements, with maybe some room to optimize that further.
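In scalar form, that special case might look like this (my own function and constants; whether 4^0 == 1 should count is up to you):

#include <cstdint>
#include <iostream>

// A power of 4 has exactly one set bit, at an even position
// (1, 4, 16, ... are bits 0, 2, 4, ...), hence the 0x55555555 mask.
// Note this accepts 1 == 4^0; drop that case if it shouldn't count.
bool isPowerOf4(uint32_t n)
{
    return n != 0
        && (n & (n - 1)) == 0          // exactly one bit set
        && (n & 0x55555555u) != 0;     // and it's at an even position
}

int main() {
    std::cout << isPowerOf4(64) << isPowerOf4(32) << "\n";  // prints 10
}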
Otherwise, in the general case, you can brute-force check 32-bit integers one at a time against the 32 or fewer possible powers that fit in an int32_t. e.g. broadcast-load, 4x vpcmpeqd / vpsubd into multiple accumulators. (The smallest possible base, 2, can have exponents up to 31, and 2^31 still fits in an unsigned int.) log_3(2^31) is 19, so you'd only need three AVX2 vectors of powers. Or log_4(2^31) is 15.5, so you'd only need 2 vectors to hold every non-overflowing power.
That only handles 1 input element per vector instead of 4 doubles, but it's probably faster than your current FP attempt, as well as fixing the correctness problems. I could see that running more than 4x the throughput per iteration of what you're doing now, or even 8x, so it should be good for speed. And of course has the advantage that correctness is possible!!
Speed gets even better for bases of 4 or greater, only 2x compare/sub per input element, or 1x for bases of 16 or greater. (<= 8 elements to compare against can fit in one vector).
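Here's a hedged sketch of that exact approach (my own function name and padding choices, assuming AVX2): broadcast each element and compare for equality against vectors of precomputed powers. No floating point anywhere, so it can't mis-count.

#include <immintrin.h>
#include <cstdint>
#include <iostream>
#include <vector>

// Exact counting: compare each input element against every power of
// `base` that fits in int32_t. Assumes base >= 2.
// (base^0 == 1 is not included as a power here; add it if you want it.)
int64_t count_powers_exact(const std::vector<int32_t>& arr, int32_t base)
{
    std::vector<int32_t> pows;
    for (int64_t p = base; p <= INT32_MAX; p *= base)
        pows.push_back((int32_t)p);
    while (pows.size() % 8 != 0)
        pows.push_back(INT32_MIN);           // sentinel no element can match

    std::vector<__m256i> powvecs;
    for (size_t i = 0; i < pows.size(); i += 8)
        powvecs.push_back(_mm256_loadu_si256((const __m256i*)&pows[i]));

    __m256i count = _mm256_setzero_si256();
    for (size_t i = 0; i < arr.size(); ++i) {
        __m256i x = _mm256_set1_epi32(arr[i]);        // broadcast one element
        for (const __m256i& pv : powvecs) {
            __m256i eq = _mm256_cmpeq_epi32(x, pv);   // -1 in a matching lane
            count = _mm256_sub_epi32(count, eq);      // subtracting -1 adds 1
        }
    }
    // Horizontal sum of the 8 lanes.
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(count),
                              _mm256_extracti128_si256(count, 1));
    s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

int main() {
    std::vector<int32_t> v{1, 4, 5, 16, 64, 100, 256};
    std::cout << count_powers_exact(v, 4) << "\n";   // prints 4 (4, 16, 64, 256)
}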
Implementation mistakes in the attempt to vectorize this probably-unfixable algorithm:
_mm256_rem_epi32 is a slow library function, but you're using it with a constant divisor of 2! Integer mod 2 is just n & 1 for non-negative numbers. If you need to handle negative remainders, you can use the tricks compilers use to implement int % 2: https://godbolt.org/z/b89eWqEzK where it shifts down the sign bit as a correction to do signed division.
Updated version using (x - std::trunc(x)) < 0.000001;
This might work, especially if you limit it to small n. I'd worry that with large n, the difference between an exact power and off-by-1 would be a small ratio. (I haven't really looked at the details, though.)
Your vectorization with __m256 vectors of single-precision float is doomed for large n, but could be ok for small n: float32 can't represent every int32_t, so large odd integers (above 2^24) get rounded to multiples of 2, or multiples of 4 above 2^25, etc.
float has less relative precision in general, so it might not have enough to spare for this algorithm. Or maybe there's something that could be fixed, IDK, I haven't looked closely since the update.
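A quick demonstration of that float rounding (standard IEEE-754 float32 behaviour):

#include <iostream>

int main()
{
    float a = 16777216.0f;           // 2^24, exactly representable in float32
    float b = 16777217.0f;           // 2^24 + 1 rounds back down to 2^24
    std::cout << (a == b) << "\n";   // prints 1: the two are indistinguishable
}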
I'd still recommend trying a simple compare-for-equality against all possible powers in the range, broadcast-loading each element. That will definitely work exactly, and if it's as fast then there's no need to try to fix this version using FP logs.
__m256 _N_DBL = _mm256_setzero_ps(); is a confusing name; it's a vector of float, not double. (And it's not part of a standard library header so it shouldn't be using a leading underscore.)
Also, there's zero point initializing it with zero there, since it gets written unconditionally inside the loop. In fact it's only ever used inside the loop, so it could just be declared at that scope, when you're ready to give it a value. Only declare variables in outer scopes if you need them after a loop.
Consider the problem:
It can be shown that some powers of two, written in decimal, end in a string consisting only of 1s and 2s, for example:
2^9 = 512
2^89 = 618,970,019,642,690,137,449,562,112
In fact, it can be proven that for every integer R there exists a power of 2, say 2^K with K > 0, whose last R digits consist only of 1s and 2s.
This can be seen clearly in the table below:

R    Smallest K    2^K
1    1             2
2    9             512
3    89            ...112
4    89            ...2112
Using this technique, what then is the sum of all the smallest K values for 1 <= R <= 10?
Proposed solution:
Now this problem ain't that difficult to solve. You can simply do
int temp = power(2, k)
and then, if you can get the length of temp, multiply it with
(100^len)-i or (10^len)-i
// where i would determine how many last digits you want.
The trouble is that temp = power(2, k) grows so quickly with increasing k that you can't even store it in an int type, or even in a long int...
So what can be done? And is there any other solution, based on bit strings? I guess that might make this problem easy.
Thanks in advance.
No, I doubt there are any solutions based on "strings of bits". That would be quite inefficient. But there are bignum libraries, like GMP, which provide number types that are either fixed-size but much bigger than the built-in int types, or of arbitrary size limited only by memory capacity, plus matching sets of math operations, working similarly to software FPU emulation.
Quoting from the reference, with a minor paraphrase:
#include <iostream>
#include <gmpxx.h>
using namespace std;

int main(void)
{
    mpz_class a, b, c;

    a = 1234;
    b = "-5676739826856836954375492356569366529629568926519085610160816539856926459237598";
    c = a + b;

    cout << "sum is " << c << "\n";
    cout << "absolute value is " << abs(c) << "\n";

    return 0;
}
Thanks to C++ operator overloading, it is much easier to use than the plain C interface.
Since you are only interested in the n least significant digits of your result, you could try to devise an algorithm that only calculates those. Based on the standard algorithm for written multiplication, you can see that the n least significant digits of the product are entirely determined by the n least significant digits of the multiplicands. Based on this, it should be possible to create an algorithm that calculates only as many digits of 2^K as fit into a long int.
The only problem you might run into is that there may be numbers that end in a matching sequence longer than a long int can hold. In that case you can still resort to calculating additional digits using your own algorithm or a library.
Note that this is basically the same thing that big-number libraries do, only your approach might be more efficient, because you are not calculating digits you don't need.
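A hedged sketch of that idea for this particular problem (helper names are mine): keep 2^K reduced modulo 10^R, so nothing ever outgrows 64 bits for R <= 10.

#include <cstdint>
#include <cstdio>

// Check whether the last r digits of x (already reduced mod 10^r) are
// all 1s and 2s; shorter values fail because their leading digits are 0.
bool lastDigitsAllOnesAndTwos(uint64_t x, int r) {
    for (int i = 0; i < r; ++i) {
        int d = (int)(x % 10);
        if (d != 1 && d != 2) return false;
        x /= 10;
    }
    return true;
}

int main() {
    uint64_t sum = 0;
    for (int r = 1; r <= 10; ++r) {
        uint64_t mod = 1;
        for (int i = 0; i < r; ++i) mod *= 10;   // 10^r
        uint64_t p = 1;                          // will hold 2^k mod 10^r
        for (uint64_t k = 1; ; ++k) {            // the problem guarantees such a K exists
            p = (p * 2) % mod;
            if (lastDigitsAllOnesAndTwos(p, r)) {
                printf("R=%d smallest K=%llu\n", r, (unsigned long long)k);
                sum += k;
                break;
            }
        }
    }
    printf("sum of smallest K values: %llu\n", (unsigned long long)sum);
    return 0;
}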
Try GMP, http://gmplib.org/
It can store a number with any size if it fits in the memory.
Although you might be better off with a less brute-force approach.
You can store binary strings in std::bitset or in std::vector
www.cplusplus.com/reference/bitset/bitset/
I think bitset is your choice.
Using big-number arithmetic for operations on powers of 2 is not, though.
I'm doing a BigInt implementation in C++ and I'm having a hard time figuring out how to create a converter from (and to) string (C string would suffice for now).
I implement the number as an array of unsigned int (so basically putting blocks of bits next to each other). I just can't figure out how to convert a string to this representation.
For example, if unsigned int were 32 bits and I'd get a string of "4294967296", or "5000000000", or basically anything larger than what a 32-bit int can hold, how would I properly convert it to the appropriate binary representation?
I know I'm missing something obvious, and I'm only asking for a push to the right direction. Thanks for help and sorry for asking such a silly question!
Well one way (not necessarily the most efficient) is to implement the usual arithmetic operators and then just do the following:
// (pseudo-code)
// String to BigInt
String s = ...;
BigInt x = 0;
while (!s.empty())
{
    x *= 10;
    x += s[0] - '0';
    s.pop_front();
}
Output(x);

// (pseudo-code)
// BigInt to String
BigInt x = ...;
String s;
while (x > 0)
{
    s += '0' + x % 10;
    x /= 10;
}
Reverse(s);
Output(s);
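For the asker's actual representation (blocks of unsigned int), the string-to-number loop above looks like this in concrete form; a hedged sketch with little-endian 32-bit blocks and my own function name:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// String -> blocks, least significant 32-bit block first.
// For each decimal digit we compute x = x*10 + digit across all blocks,
// using 64-bit intermediates for the carry.
std::vector<uint32_t> fromDecimal(const std::string& s)
{
    std::vector<uint32_t> blocks{0};
    for (char ch : s) {
        uint64_t carry = (uint64_t)(ch - '0');
        for (uint32_t& b : blocks) {
            uint64_t v = (uint64_t)b * 10 + carry;
            b = (uint32_t)v;        // low 32 bits stay in this block
            carry = v >> 32;        // high bits ripple into the next block
        }
        if (carry != 0)
            blocks.push_back((uint32_t)carry);
    }
    return blocks;
}

int main() {
    // 5000000000 = 0x1'2A05F200, so this prints: 2a05f200 1
    for (uint32_t b : fromDecimal("5000000000"))
        std::cout << std::hex << b << ' ';
}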
If you wanted to do something trickier, then you could try the following:
If the input I is < 100, use the above method.
Estimate the number of digits D of I by bit length * 3 / 10.
Mod and divide by the factor F = 10 ^ (D/2), to get I = X*F + Y.
Execute recursively with I=X and I=Y.
Implement and test the string-to-number algorithm using a builtin type such as int.
Implement a bignum class with operator+, operator*, and whatever else the above algorithm uses.
Now the algorithm should work unchanged with the bignum class.
Use the string conversion algo to debug the class, not the other way around.
Also, I'd encourage you to try and write at a high level, and not fall back on C constructs. C may be simpler, but usually does not make things easier.
Take a look at, for instance, mp_toradix and mp_read_radix in Michael Bromberger's MPI.
Note that repeated division by 10 (used in the above) performs very poorly, which shows up when you have very big integers. It's not the "be all and end all", but it's more than good enough for homework.
A divide and conquer approach is possible. Here is the gist. For instance, given the number 123456789, we can break it into the pieces 1234 and 56789 by dividing it by a power of 10. (You can think of these pieces as two large digits in base 100,000.) Performing the repeated division by 10 is now cheaper on the two pieces: dividing 1234 by 10 three times and 56789 by 10 four times is cheaper than dividing 123456789 by 10 eight times.
Of course, a really large number can be recursively broken into more than two pieces.
Bruno Haible's CLN (used in CLISP) does something like that, and it is blazingly fast compared to MPI in converting numbers with thousands of digits to numeric text.
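The gist, sketched on a machine word so it stays self-contained (my own function name; a real bignum would do the same split on its digit array, where dividing the smaller halves is much cheaper):

#include <cstdint>
#include <cstdio>

// Split around a power of 10 near the middle and recurse; the low piece
// must be zero-padded to its full width.
void printDecimal(uint64_t n, int width)
{
    if (width <= 4) {
        printf("%0*llu", width, (unsigned long long)n);  // small enough: print directly
        return;
    }
    int half = width / 2;
    uint64_t pow10 = 1;
    for (int i = 0; i < half; ++i) pow10 *= 10;
    printDecimal(n / pow10, width - half);   // high piece
    printDecimal(n % pow10, half);           // low piece, zero-padded
}

int main() {
    printDecimal(123456789, 9);   // prints 123456789
    printf("\n");
}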
I know there is a way of finding the sum of the digits of 100! (or any other big number's factorial) using Python. But I find it really tough in C++, as even the size of long long is not enough.
I just want to know if there is some other way.
I get that it may not be possible, as our processors are generally 32-bit. What I am referring to is some other tricky technique or algorithm that can accomplish the same with the same resources.
Use a digit array with the standard, on-paper method of multiplication. For example, in C :
#include <stdio.h>

#define DIGIT_COUNT 256

void multiply(int* digits, int factor) {
    int carry = 0;
    for (int i = 0; i < DIGIT_COUNT; i++) {
        int digit = digits[i];
        digit *= factor;
        digit += carry;
        digits[i] = digit % 10;
        carry = digit / 10;
    }
}

int main(int argc, char** argv) {
    int n = 100;
    int digits[DIGIT_COUNT];
    digits[0] = 1;
    for (int i = 1; i < DIGIT_COUNT; i++) { digits[i] = 0; }
    for (int i = 2; i <= n; i++) { multiply(digits, i); } // multiply by every value from 2 to n inclusive
    int digitSum = 0;
    for (int i = 0; i < DIGIT_COUNT; i++) { digitSum += digits[i]; }
    printf("Sum of digits in %d! is %d.\n", n, digitSum);
    return 0;
}
How are you going to find the sum of the digits of 100!? If you calculate 100! first and then find the sum, then what is the point? You will have to use some intelligent logic to find it without actually calculating 100!. Remove all the factors of five, because they are only going to add zeros. Think in this direction rather than thinking about the big number. Also, I am sure the final answer, i.e. the sum of the digits, will fit within a long long.
There are C++ big int libraries, but I think the emphasis here is on algorithm rather than library.
long long was not part of standard C++ until C++11; g++ provided it as an extension.
Arbitrary-precision arithmetic is what you are looking for. Check out the pseudocode given in the wiki page.
Furthermore, long long cannot store such large values. So you can either create your own BigInteger class or use a third-party library like GMP or C++ BigInteger.
If you're referring to the Project Euler problem, my reading of that is that it wants you to write your own arbitrary-precision integer library or class that can multiply numbers.
My suggestion is to store the base-10 digits of a number, in reverse order to the way you'd normally write them, because you'll need to convert the number to base 10 in the end, anyway. Storing the digits in reverse order makes writing the addition and multiplication routines slightly easier, in my opinion. Then write addition and multiplication routines that emulate how you would add or multiply numbers manually.
Observe that multiplying any number by 10 or 100 does not change the sum of the digits.
Once you recognize that, see that multiplying by 2 and 5, or by 20 and 50, also does not change the sum, since 2x5 = 10 and 20x50 = 1000.
Then notice that anytime your current computation ends in a 0, you can simply divide by 10, and keep calculating your factorial.
Make a few more observations about shortcuts to eliminate numbers from 1 to 100, and I think you might be able to fit the answer into standard ints.
There are a number of BigInteger libraries available in C++. Just Google "C++ BigInteger". But if this is a programming contest problem then you should better try to implement your own BigInteger library.
Nothing in project Euler requires more than __int64.
I would suggest trying to do it using base 10000.
You could take the easy road and use perl/python/lisp/scheme/erlang/etc to calculate 100! using one of their built-in bignum libraries or the fact some languages use exact integer arithmetic. Then take that number, store it into a string, and find the sum of the characters (accounting for '0' = 48 etc).
Or, you could consider that in 100!, you will get a really large number with many many zeros. If you calculate 100! iteratively, consider dividing by 10 every time the current factorial is divisible by 10. I believe this will yield a result within the range of long long or something.
Or, probably a better exercise is to write your own big int library. You will need it for some later problems if you do not determine the clever tricks.
I'm working on Project Euler to brush up on my C++ coding skills in preparation for the programming challenge(s) we'll be having this next semester (since they don't let us use Python, boo!).
I'm on #16, and I'm trying to find a way to keep real precision for 2^1000.
For instance:
#include <cstdio>
#include <cmath>

int main(){
    double num = pow(2, 1000);
    printf("%.0f", num);
    return 0;
}
prints
10715086071862673209484250490600018105614050000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Which is missing most of the numbers (from python):
>>> 2**1000
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376L
Granted, I can write the program with a Python 1 liner
sum(int(_) for _ in str(2**1000))
that gives me the result immediately, but I'm trying to find a way to do it in C++. Any pointers? (haha...)
Edit:
Something outside the standard libs is worthless to me - only dead-tree code is allowed in those contests, and I'm probably not going to print out 10,000 lines of external code...
If you just keep track of each digit in a char array, this is easy. Doubling a digit is trivial, and if the result is 10 or greater you just subtract 10 and add a carry to the next digit. Start with a value of 1, loop over the doubling function 1000 times, and you're done. You can predict the number of digits you'll need with ceil(1000*log(2)/log(10)), or just add them dynamically.
Spoiler alert: it appears I have to show the code before anyone will believe me. This is a simple implementation of a bignum with two functions, Double and Display. I didn't make it a class in the interest of simplicity. The digits are stored in a little-endian format, with the least significant digit first.
#include <iostream>
#include <vector>

typedef std::vector<char> bignum;

void Double(bignum & num)
{
    int carry = 0;
    for (bignum::iterator p = num.begin(); p != num.end(); ++p)
    {
        *p *= 2;
        *p += carry;
        carry = (*p >= 10);
        *p -= carry * 10;
    }
    if (carry != 0)
        num.push_back(carry);
}

void Display(bignum & num)
{
    for (bignum::reverse_iterator p = num.rbegin(); p != num.rend(); ++p)
        std::cout << static_cast<int>(*p);
}

int main(int argc, char* argv[])
{
    bignum num;
    num.push_back(1);
    for (int i = 0; i < 1000; ++i)
        Double(num);
    Display(num);
    std::cout << std::endl;
    return 0;
}
You need a bignum library, such as this one.
You probably need a pointer here (pun intended)
In C++ you would need to create your own bigint lib in order to do the same as in python.
C/C++ operates on fundamental data types. You are using a double, which has only 64 bits to store a 1000-bit number. A double uses 52 bits for the significand and 11 bits for the exponent.
The only solution for you is to either use a library like the bignum mentioned elsewhere or to roll your own.
UPDATE: I just browsed to the Euler Problem site and found that Problem 13 is about summing large integers. The iterated method can become very tricky after a short while, so I'd suggest using the code from Problem #13, which you should already have, to solve this, because 2^N = 2^(N-1) + 2^(N-1).
Using bignums is cheating and not a solution. Also, you don't need to compute 2**1000 or anything like that to get to the result. I'll give you a hint:
Take the first few values of 2**N:
1 2 4 8 16 32 64 128 256 ...
Now write down for each number the sum of its digits:
1 2 4 8 7 5 10 11 13 ...
You should notice that (x~=y means x and y have the same sum of digits)
1+1=2, 1+(1+2)=4, 1+(1+2+4)=8, 1+(1+2+4+8)=16~=7, 1+(1+2+4+8+7)=23~=5
Now write a loop.
Project Euler = Think before Compute!
If you want to do this sort of thing on a practical basis, you're looking for an arbitrary precision arithmetic package. There are a number around, including NTL, lip, GMP, and MIRACL.
If you're just after something for Project Euler, you can write your own code for raising to a power. The basic idea is to store your large number in quite a few small pieces, and implement your own carries, borrows, etc., between the pieces.
Isn't pow(2, 1000) just 2 left-shifted 1000 times, essentially? It should have an exact binary representation in a double float. It shouldn't require a bignum library.