Hey, my friends and I are trying to beat each other's runtimes for generating "Self Numbers" between 1 and a million. I've written mine in C++ and I'm still trying to shave off precious time.
Here's what I have so far,
#include <iostream>
using namespace std;

bool v[1000064]; // i + its digit sum can land just past 1,000,000, so leave a little headroom

int main(void) {
    long non_self = 0;
    for (long i = 1; i < 1000000; ++i) {
        if (!(v[i])) std::cout << i << '\n';
        non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 + (i/100000)%10;
        v[non_self] = 1;
    }
    std::cout << "1000000" << '\n';
    return 0;
}
The code works fine now, I just want to optimize it.
Any tips? Thanks.
I built an alternate C solution that doesn't require any modulo or division operations:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int v[1100000];
    int j1, j2, j3, j4, j5, j6, s, n=0;
    memset(v, 0, sizeof(v));
    for (j6=0; j6<10; j6++) {
        for (j5=0; j5<10; j5++) {
            for (j4=0; j4<10; j4++) {
                for (j3=0; j3<10; j3++) {
                    for (j2=0; j2<10; j2++) {
                        for (j1=0; j1<10; j1++) {
                            s = j6 + j5 + j4 + j3 + j2 + j1;
                            v[n + s] = 1;
                            n++;
                        }
                    }
                }
            }
        }
    }
    for (n=1; n<=1000000; n++) {
        if (!v[n]) printf("%6d\n", n);
    }
}
It generates 97786 self numbers including 1 and 1000000.
With output, it takes
real 0m1.419s
user 0m0.060s
sys 0m0.152s
When I redirect output to /dev/null, it takes
real 0m0.030s
user 0m0.024s
sys 0m0.004s
on my 3 GHz quad-core rig.
For comparison, your version produces the same number of numbers, so I assume we're either both correct or equally wrong; but your version chews up
real 0m0.064s
user 0m0.060s
sys 0m0.000s
under the same conditions, or about 2x as much.
That, or the difference comes from your use of longs, which is unnecessary on my machine: here, int goes up to 2 billion. Maybe you should check INT_MAX on yours?
Update
I had a hunch that it may be better to calculate the sum piecewise. Here's my new code:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    char v[1100000];
    int j1, j2, j3, j4, j5, j6, s, n=0;
    int s1, s2, s3, s4, s5;
    memset(v, 0, sizeof(v));
    for (j6=0; j6<10; j6++) {
        for (j5=0; j5<10; j5++) {
            s5 = j6 + j5;
            for (j4=0; j4<10; j4++) {
                s4 = s5 + j4;
                for (j3=0; j3<10; j3++) {
                    s3 = s4 + j3;
                    for (j2=0; j2<10; j2++) {
                        s2 = s3 + j2;
                        for (j1=0; j1<10; j1++) {
                            v[s2 + j1 + n++] = 1;
                        }
                    }
                }
            }
        }
    }
    for (n=1; n<=1000000; n++) {
        if (!v[n]) printf("%d\n", n);
    }
}
...and what do you know, that brought down the time for the top loop from 12 ms to 4 ms. Or maybe 8, my clock seems to be getting a bit jittery way down there.
State of affairs, Summary
The actual finding of self numbers up to 1M is now taking roughly 4 ms, and I'm having trouble measuring any further improvements. On the other hand, as long as output is to the console, it will continue to take about 1.4 seconds, my best efforts to leverage buffering notwithstanding. The I/O time so drastically dwarfs computation time that any further optimization would be essentially futile. Thus, although inspired by further comments, I've decided to leave well enough alone.
All times cited are on my (pretty fast) machine and are for comparison purposes with each other only. Your mileage may vary.
Generate the numbers once, copy the output into your code as a gigantic string. Print the string.
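For what it's worth, here is one way that trick could look (a minimal sketch; kSelfNumbers is just an illustrative name, and the literal is truncated here rather than containing the full precomputed list):

#include <cstdio>

static const char kSelfNumbers[] =
    "1\n3\n5\n7\n9\n20\n31\n42\n53\n64\n75\n86\n97\n108\n"
    /* ... the rest of the precomputed list goes here ... */
    "1000000\n";

int main() {
    std::fputs(kSelfNumbers, stdout);  // one write, no computation at run time
    return 0;
}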
Those mods (%) look expensive. If you are allowed to move to base 16 (or even base 2), then you can probably code this a lot faster. If you have to stay in decimal, try creating an array of digits for each place (units, tens, hundreds) and build some rollover code. That will make summing the numbers far easier.
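For illustration, here is a rough sketch of what that digit-array/rollover idea could look like, combined with the usual sieve that marks n + digitsum(n); the loop bounds mirror the original program, and the sizes and names are just assumptions for the sake of the example:

#include <cstdio>

int main() {
    static char sieve[1000000 + 64] = {0};  // +64: n + digit sum can overshoot 1,000,000
    int digit[7] = {0};                     // digit[0] = units, digit[1] = tens, ...
    int digit_sum = 0;

    for (int n = 1; n < 1000000; ++n) {
        int pos = 0;
        while (digit[pos] == 9) {           // cascade trailing 9s back to 0
            digit[pos] = 0;
            digit_sum -= 9;
            ++pos;
        }
        ++digit[pos];                       // tick the odometer
        ++digit_sum;
        sieve[n + digit_sum] = 1;           // n generates n + digitsum(n)
        if (!sieve[n]) std::printf("%d\n", n);
    }
    return 0;
}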
Alternatively, you could recognise the behaviour of the core self function (let's call it s):
s = n + f(b,n)
where f(b,n) is the sum of the digits of the number n in base b.
For base 10, it's clear that as the ones (also known as least significant) digit moves through 0, 1, 2, ..., 9, n and f(b,n) proceed in lockstep as you move from n to n+1; it's only in the 10% of cases where a 9 rolls over to 0 that they don't, so:
f(b,n+1) = f(b,n) + 1 // 90% of the time
thus the core self function s advances as
n+1 + f(b,n+1) = n + 1 + f(b,n) + 1 = n + f(b,n) + 2
s(n+1) = s(n) + 2 // again, 90% of the time
In the remaining (and easily identifiable) 10% of the time, the 9 rolls back to zero and adds one to the next digit, which in the simplest case subtracts (9-1) from the running total, but might cascade up through a series of 9s, to subtract 99-1, 999-1 etc.
So the first optimisation can remove most of the work from 90% of your cycles!
if ((n % 10) != 0)
{
    n + f(b,n) = n-1 + f(b,n-1) + 2;
}
or
if ((n % 10) != 0)
{
    s = old_s + 2;
}
That should be enough to substantially increase your performance without really changing your algorithm.
If you want more, then work out a simple algorithm for the change between iterations for the remaining 10%.
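As a concrete (if unpolished) sketch of that observation: the loop below keeps a running digit sum, bumps it by one in the cheap 90% case (so s advances by 2), and simply recomputes the sum when the last digit rolls over to 0, rather than doing the cascade arithmetic described above. The sieve marking is the same as in the other answers; sizes and names are assumptions for illustration only.

#include <cstdio>

int main() {
    static char v[1000000 + 64] = {0};   // sieve, with headroom for n + digit sum
    int n, digit_sum = 0;

    for (n = 1; n < 1000000; ++n) {
        if (n % 10 != 0) {
            ++digit_sum;                 // the cheap 90% case: last digit just went up by one
        } else {
            int t = n;                   // one or more 9s rolled over: recompute from scratch
            digit_sum = 0;
            while (t) { digit_sum += t % 10; t /= 10; }
        }
        v[n + digit_sum] = 1;            // n + digitsum(n) is not a self number
        if (!v[n]) std::printf("%d\n", n);
    }
    return 0;
}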
If you want your output to be fast, it may be worth investigating replacing iostream output with plain old printf() - whether this is important depends on the rules for winning the competition.
Multithread (use different arrays/ranges for every thread). Also, don't use more threads than your number of CPU cores =)
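To make that per-thread arrays/ranges suggestion a bit more concrete, here is a rough C++11 sketch (the splitting scheme and all names are illustrative, not a tuned implementation): each thread marks generators from its own sub-range into its own sieve, and the sieves are merged at the end, so no locking is needed.

#include <algorithm>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

static const int LIMIT = 1000000;

// Mark n + digitsum(n) for every generator n in [lo, hi) into this thread's own sieve.
static void mark_range(int lo, int hi, std::vector<char>& sieve) {
    for (int g = lo; g < hi; ++g) {
        int s = g, t = g;
        while (t) { s += t % 10; t /= 10; }
        if (s < (int)sieve.size()) sieve[s] = 1;
    }
}

int main() {
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<char>> sieves(nthreads, std::vector<char>(LIMIT + 64, 0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        int lo = 1 + (int)(t * (LIMIT / nthreads));
        int hi = (t + 1 == nthreads) ? LIMIT + 1 : 1 + (int)((t + 1) * (LIMIT / nthreads));
        workers.emplace_back(mark_range, lo, hi, std::ref(sieves[t]));
    }
    for (auto& w : workers) w.join();
    for (int n = 1; n <= LIMIT; ++n) {
        bool marked = false;
        for (unsigned t = 0; t < nthreads && !marked; ++t) marked = sieves[t][n] != 0;
        if (!marked) std::cout << n << '\n';
    }
    return 0;
}

Whether this actually wins depends on the machine; with the marking loop already down to a few milliseconds, thread start-up overhead may eat most of the gain.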
cout or printf within a loop will be slow. If you can remove any prints from a loop you will see a significant performance increase.
Since the range is limited (1 to 1000000) the maximum sum of the digits does not exceed 9*6 = 54. This means that to implement the sieve a circular buffer of 54 elements should be perfectly sufficient (and the size of the sieve grows very slowly as the range increases).
You already have a sieve-based solution, but it is based on pre-building the full-length buffer (sieve of 1000000 elements), which is rather inelegant (if not completely unacceptable). The performance of your solution also suffers from non-locality of memory access.
For example, this is a possible very simple implementation
#define N 1000000U

void print_self_numbers(void)
{
#define NMARKS 64U /* make it 64 just in case (and to make division work faster :) */
    unsigned char marks[NMARKS] = { 0 };
    unsigned i, imark;

    for (i = 1, imark = i; i <= N; ++i, imark = (imark + 1) % NMARKS)
    {
        unsigned digits, sum;

        if (!marks[imark])
            printf("%u ", i);
        else
            marks[imark] = 0;

        sum = i;
        for (digits = i; digits > 0; digits /= 10)
            sum += digits % 10;

        marks[sum % NMARKS] = 1;
    }
}
(I'm not going for the best possible performance in terms of CPU clocks here, just illustrating the key idea with the circular buffer.)
Of course, the range can easily be turned into a parameter of the function, while the size of the circular buffer can easily be calculated at run-time from the range.
As for "optimizations"... There's no point in trying to optimize the code that contains I/O operations. You won't achieve anything by such optimizations. If you want to analyze the performance of the algorithm itself, you'll have to put the generated numbers into an output array and print them later.
For such a simple task, the best option would be to think of alternative algorithms to produce the same result. %10 is not usually considered a fast operation.
Why not use the recurrence relation given on the Wikipedia page instead?
That should be blazingly fast.
EDIT: Ignore this... the recurrence relation generates some but not all of the self numbers.
In fact, only very few of them. That's not particularly clear from the Wikipedia page though :(
This may help speed up C++ iostreams output:
cin.tie(0);
ios::sync_with_stdio(false);
Put them in main before you start writing to cout.
I created a CUDA-based solution based on Carl Smotricz's second algorithm. The code to identify Self Numbers itself is extremely fast -- on my machine it executes in ~45 nanoseconds; this is about 150 x faster than Carl Smotricz's algorithm, which ran in 7 milliseconds on my machine.
There is a bottleneck, however, and that seems to be the PCIe interface. It took my code a whopping 43 milliseconds to move the computed data from the graphics card back to RAM. This might be optimizable, and I will look in to this.
Still, 45 nanoseconds is pretty darn fast. Scary fast, actually, and I added code to my program which runs Carl Smotricz's algorithm and compares the results for accuracy. The results are accurate. Here is the program output (compiled in VS2008 64-bit, Windows 7):
UPDATE
I recompiled this code in release mode with full optimization and static runtime libraries, with significant results. The optimizer seems to have done very well with Carl's algorithm, reducing the runtime from 7 ms to 1 ms. The CUDA implementation sped up as well, from 35 us to 20 us. The memory copy from video card to RAM was unaffected.
Program Output:
Running on device: 'Quadro NVS 295'
Reference Implementation Ran In 15603 ticks (7 ms)
Kernel Executed in 40 ms -- Breakdown:
[kernel] : 35 us (0.09%)
[memcpy] : 40 ms (99.91%)
CUDA Implementation Ran In 111889 ticks (51 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The code is as follows:
file : main.h
#pragma once
#include <cstdlib>
#include <functional>
typedef std::pair<int*, size_t> sized_ptr;
static sized_ptr make_sized_ptr(int* ptr, size_t size)
{
return make_pair<int*, size_t>(ptr, size);
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMemory, unsigned const blocks, unsigned const threads);
inline std::string format_elapsed(double d)
{
char buf[256] = {0};
if( d < 0.00000001 )
{
// show in ps with 4 digits
sprintf(buf, "%0.4f ps", d * 1000000000000.0);
}
else if( d < 0.00001 )
{
// show in ns
sprintf(buf, "%0.0f ns", d * 1000000000.0);
}
else if( d < 0.001 )
{
// show in us
sprintf(buf, "%0.0f us", d * 1000000.0);
}
else if( d < 0.1 )
{
// show in ms
sprintf(buf, "%0.0f ms", d * 1000.0);
}
else if( d <= 60.0 )
{
// show in seconds
sprintf(buf, "%0.2f s", d);
}
else if( d < 3600.0 )
{
// show in min:sec
sprintf(buf, "%01.0f:%02.2f", floor(d/60.0), fmod(d,60.0));
}
// show in h:min:sec
else
sprintf(buf, "%01.0f:%02.0f:%02.2f", floor(d/3600.0), floor(fmod(d,3600.0)/60.0), fmod(d,60.0));
return buf;
}
inline std::string format_pct(double d)
{
char buf[256] = {0};
sprintf(buf, "%.2f", 100.0 * d);
return buf;
}
file: main.cpp
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include "C:\CUDA\include\cuda_runtime.h"
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;
#include <cmath>
#include <map>
#include <algorithm>
#include <list>
#include "main.h"
int main()
{
unsigned numVals = 1000000;
int* gold = new int[numVals];
memset(gold, 0, sizeof(int)*numVals);
LARGE_INTEGER li = {0}, li2 = {0};
QueryPerformanceFrequency(&li);
__int64 freq = li.QuadPart;
// get cuda properties...
cudaDeviceProp cdp = {0};
cudaError_t err = cudaGetDeviceProperties(&cdp, 0);
cout << "Running on device: '" << cdp.name << "'" << endl;
// first run the reference implementation
QueryPerformanceCounter(&li);
for( int j6=0, n = 0; j6<10; j6++ )
{
for( int j5=0; j5<10; j5++ )
{
for( int j4=0; j4<10; j4++ )
{
for( int j3=0; j3<10; j3++ )
{
for( int j2=0; j2<10; j2++ )
{
for( int j1=0; j1<10; j1++ )
{
int s = j6 + j5 + j4 + j3 + j2 + j1;
gold[n + s] = 1;
n++;
}
}
}
}
}
}
QueryPerformanceCounter(&li2);
__int64 ticks = li2.QuadPart-li.QuadPart;
cout << "Reference Implementation Ran In " << ticks << " ticks" << " (" << format_elapsed((double)ticks/(double)freq) << ")" << endl;
// now run the cuda version...
unsigned threads = cdp.maxThreadsPerBlock;
unsigned blocks = numVals/threads;
if( numVals%threads ) ++blocks;
unsigned computeSlots = blocks * threads; // this may be != the number of vals since we want 32-thread warps
// allocate device memory for test
int* deviceTest = 0;
err = cudaMalloc(&deviceTest, sizeof(int)*computeSlots);
err = cudaMemset(deviceTest, 0, sizeof(int)*computeSlots);
int* hostTest = new int[numVals]; // the repository for the resulting data on the host
memset(hostTest, 0, sizeof(int)*numVals);
// run the CUDA code...
LARGE_INTEGER li3 = {0}, li4={0};
QueryPerformanceCounter(&li3);
ComputeSelfNumbers(make_sized_ptr(hostTest, numVals), make_sized_ptr(deviceTest, computeSlots), blocks, threads);
QueryPerformanceCounter(&li4);
__int64 ticksCuda = li4.QuadPart-li3.QuadPart;
cout << "CUDA Implementation Ran In " << ticksCuda << " ticks" << " (" << format_elapsed((double)ticksCuda/(double)freq) << ")" << endl;
cout << "Compute Slots: " << computeSlots << " (" << blocks << " blocks X " << threads << " threads)" << endl;
unsigned errorCount = 0;
for( size_t i = 0; i < numVals; ++i )
{
if( gold[i] != hostTest[i] )
{
++errorCount;
}
}
cout << "Number of Errors: " << errorCount << endl;
return 0;
}
file: self.cu
#pragma warning( disable : 4231)
#include <windows.h>
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <iomanip>
using namespace std;
#include "main.h"
__global__ void SelfNum(int * slots)
{
__shared__ int N;
N = (blockIdx.x * blockDim.x) + threadIdx.x;
const int numDigits = 10;
__shared__ int digits[numDigits];
for( int i = 0, temp = N; i < numDigits; ++i, temp /= 10 )
{
digits[numDigits-i-1] = temp - 10 * (temp/10) /*temp % 10*/;
}
__shared__ int s;
s = 0;
for( int i = 0; i < numDigits; ++i )
s += digits[i];
slots[N+s] = 1;
}
__host__ void ComputeSelfNumbers(sized_ptr hostMem, sized_ptr deviceMem, const unsigned blocks, const unsigned threads)
{
LARGE_INTEGER li = {0};
QueryPerformanceFrequency(&li);
double freq = (double)li.QuadPart;
LARGE_INTEGER liStart = {0};
QueryPerformanceCounter(&liStart);
// run the kernel
SelfNum<<<blocks, threads>>>(deviceMem.first);
LARGE_INTEGER liKernel = {0};
QueryPerformanceCounter(&liKernel);
cudaMemcpy(hostMem.first, deviceMem.first, hostMem.second*sizeof(int), cudaMemcpyDeviceToHost); // dont copy the overflow - just throw it away
LARGE_INTEGER liMemcpy = {0};
QueryPerformanceCounter(&liMemcpy);
// display performance stats
double e = double(liMemcpy.QuadPart - liStart.QuadPart)/freq,
eKernel = double(liKernel.QuadPart - liStart.QuadPart)/freq,
eMemcpy = double(liMemcpy.QuadPart - liKernel.QuadPart)/freq;
double pKernel = eKernel/e,
pMemcpy = eMemcpy/e;
cout << "Kernel Executed in " << format_elapsed(e) << " -- Breakdown: " << endl
<< " [kernel] : " << format_elapsed(eKernel) << " (" << format_pct(pKernel) << "%)" << endl
<< " [memcpy] : " << format_elapsed(eMemcpy) << " (" << format_pct(pMemcpy) << "%)" << endl;
}
UPDATE2:
I refactored my CUDA implementation to try to speed it up a bit. I did this by unrolling loops manually, fixing some questionable use of __shared__ memory which might have been an error, and getting rid of some redundancy.
The output of my new kernel is:
Reference Implementation Ran In 69610 ticks (5 ms)
Kernel Executed in 2 ms -- Breakdown:
[kernel] : 39 us (1.57%)
[memcpy] : 2 ms (98.43%)
CUDA Implementation Ran In 62970 ticks (4 ms)
Compute Slots: 1000448 (1954 blocks X 512 threads)
Number of Errors: 0
The only code I changed is the kernel itself, so that's all I will post here:
__global__ void SelfNum(int * slots)
{
int N = (blockIdx.x * blockDim.x) + threadIdx.x;
int s = 0;
int temp = N;
s += temp - 10 * (temp/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
s += temp - 10 * ((temp/=10)/10) /*temp % 10*/;
slots[N+s] = 1;
}
I wonder if multi-threading would help. This algorithm looks like it would lend itself well to multi-threading. (Poor-man's test of this: Create two copies of the program and run them at the same time. If it runs in less than 200% of the time, multi-threading may help).
I was actually surprised that the code below was faster than any other posted here. I probably measured it wrong, but maybe it helps; or at least it is interesting.
#include <iostream>
#include <algorithm> // for std::for_each
#include <cstring>   // for memset
#include <boost/progress.hpp>
class SelfCalc
{
private:
bool array[1000000];
int non_self;
public:
SelfCalc()
{
memset(&array, 0, sizeof(array));
}
void operator()(const int i)
{
if (!(array[i]))
std::cout << i << '\n';
non_self = i + (i%10) + (i/10)%10 + (i/100)%10 + (i/1000)%10 + (i/10000)%10 +(i/100000)%10;
array[non_self] = true;
}
};
class IntIterator
{
private:
int value;
public:
IntIterator(const int _value):value(_value){}
int operator*(){ return value; }
bool operator!=(const IntIterator &v){ return value != v.value; }
int operator++(){ return ++value; }
};
int main()
{
boost::progress_timer t;
SelfCalc selfCalc;
IntIterator i(1), end(100000);
std::for_each(i, end, selfCalc);
std::cout << 100000 << std::endl;
return 0;
}
Fun problem. The problem as stated does not specify what base it must be in. I fiddled around with it some and wrote a base-2 version. It generates an extra few thousand entries because the termination point of 1,000,000 is not as natural with base-2. This pre-counts the number of bits in a byte for a table lookup. The generation of the result set (without the I/O) took 2.4 ms.
One interesting thing (assuming I wrote it correctly) is that the base-2 version has about 250,000 "self numbers" up to 1,000,000 while there are just under 100,000 base-10 self numbers in that range.
#include <windows.h>
#include <stdio.h>
#include <string.h>
void StartTimer( _int64 *pt1 )
{
QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}
double StopTimer( _int64 t1 )
{
_int64 t2, ldFreq;
QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
QueryPerformanceFrequency( (LARGE_INTEGER*)&ldFreq );
return ((double)( t2 - t1 ) / (double)ldFreq) * 1000.0;
}
#define RANGE 1000000
char sn[0x100000 + 32];
int bitCount[256];
// precompute bitcounts for each byte
void PreCountBits()
{
int i;
// generate count of bits in each byte
memset( bitCount, 0, sizeof( bitCount ));
for ( i = 0; i < 256; i++ )
{
int tmp = i;
while ( tmp )
{
if ( tmp & 0x01 )
bitCount[i]++;
tmp >>= 1;
}
}
}
void GenBase2( )
{
int i;
int *b1, *b2, *b3;
int b1sum, b2sum, b3sum;
i = 0;
for ( b1 = bitCount; b1 < bitCount + 256; b1++ )
{
b1sum = *b1;
for ( b2 = bitCount; b2 < bitCount + 256; b2++ )
{
b2sum = b1sum + *b2;
for ( b3 = bitCount; b3 < bitCount + 256; b3++ )
{
sn[i++ + *b3 + b2sum] = 1;
}
}
// 1000000 does not provide a great termination number for base 2. So check
// here. Overshoots the target some but avoids repeated checks
if ( i > RANGE )
return;
}
}
int main( int argc, char* argv[] )
{
int i = 0;
__int64 t1;
memset( sn, 0, sizeof( sn ));
StartTimer( &t1 );
PreCountBits();
GenBase2();
printf( "Generation time = %.3f\n", StopTimer( t1 ));
#if 1
for ( i = 1; i <= RANGE; i++ )
if ( !sn[i] ) printf( "%d\n", i );
#endif
return 0;
}
Maybe try just computing the recurrence relation defined below?
http://en.wikipedia.org/wiki/Self_number
Related
Intro
I'm doing performance tests with the following array sizes: 1,000,000, 10,000,000 and 100,000,000. My IDE (CLion) shows me the following warning message:
Index may have a value of '1000000' which is out of bounds
The warning appears only in the first for-loop, on each line where bodies[i] is accessed:
void doPerformanceTestBodiesAsAoS(const size_t n) {
auto *bodies = new Body<long double, double[3], double[3]>[n];
for (size_t i = 0; i < n; ++i) {
bodies[i].mass = static_cast<double>(i);
bodies[i].position[0] = static_cast<double>(i);
bodies[i].position[1] = static_cast<double>(i + 1);
bodies[i].position[2] = static_cast<double>(i + 2);
bodies[i].velocity[0] = static_cast<double>(i);
bodies[i].velocity[1] = static_cast<double>(i + 1);
bodies[i].velocity[2] = static_cast<double>(i + 2);
}
const auto startTimeInNanoSeconds = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < (n - 1); ++i) {
const size_t nextBodyIndex = i + 1;
std::sqrt(
PhysicsEngine::math::pow2(bodies[i].position[0] - bodies[nextBodyIndex].position[0]) +
PhysicsEngine::math::pow2(bodies[i].position[1] - bodies[nextBodyIndex].position[1]) +
PhysicsEngine::math::pow2(bodies[i].position[2] - bodies[nextBodyIndex].position[2])
);
}
auto endTimeInNanoSeconds = std::chrono::high_resolution_clock::now();
std::cout.precision(17);
std::cout << "Processing of AoS took: "
<< static_cast<long double>(std::chrono::duration_cast<std::chrono::nanoseconds>(endTimeInNanoSeconds - startTimeInNanoSeconds).count()) / 1e+9 << " seconds for N = " << n << std::endl;
delete[] bodies;
}
// ...
TEST(PerformanceTest, PerformanceTest1MioBodiesAsAoS) {
doPerformanceTestBodiesAsAoS(1'000'000);
}
TEST(PerformanceTest, PerformanceTest10MioBodiesAsAoS) {
doPerformanceTestBodiesAsAoS(10'000'000);
}
TEST(PerformanceTest, PerformanceTest100MioBodiesAsAoS) {
doPerformanceTestBodiesAsAoS(100'000'000);
}
Problem
I'm now unsure whether this warning is a false positive (?!) or whether I'm missing something and thus possibly falsifying my performance tests.
What I have already done
I have checked that on my system the sizes 1,000,000 - 100,000,000 do not exceed INT_MAX, to ensure indexing does not arithmetically overflow. This shouldn't happen because int is 4 bytes and size_t is 8 bytes on my system.
I restarted my IDE several times, cleaned up caches etc.
I changed every size_t to int because I assumed the index operator [] expects an int instead of a size_t, but that didn't resolve the warning.
I searched for similar questions but couldn't find any.
So I'm asking for a code review: is this a false positive, or do I have a bug?
Additional information
I am using Windows 10 64-bit, MinGW and an Intel processor.
The typedef of size_t is:
#ifdef _WIN64
__MINGW_EXTENSION typedef unsigned __int64 size_t;
#else
typedef unsigned int size_t;
#endif /* _WIN64 */
This is my first time posting a question. I was hoping to get some help on a very old computer science assignment that I never got around to finishing. I'm no longer taking the class; I just want to see how to solve this.
Read in an integer (any valid 64-bit integer = long long type) and output the same number but with commas inserted. If the user entered -1234567890, your program should output -1,234,567,890. Commas should appear after every three significant digits (provided more digits remain) starting from the decimal point and working left toward more significant digits. If the number entered does not require commas, do not add any. For example, if the input is 234 you should output 234. The input 0 should produce output 0. Note in the example above that the number can be positive or negative. Your output must maintain the case of the input.
I'm relatively new to programming, and this was all I could come up with:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
long long n;
cout << "Enter an integer:" << endl;
cin >> n;
int ones = n % 10;
int tens = n / 10 % 10;
int hund = n / 100 % 10;
int thous = n / 1000 % 10;
int tthous = n / 10000 % 10;
cout << tthous << thous << "," << hund << tens << ones << endl;
return 0;
}
The original assignment prohibited the use of strings, arrays, and vectors, so please refrain from giving suggestions/solutions that involve these.
I'm aware that some sort of for-loop would probably be required to properly insert the commas in the necessary places, but I just do not know how to go about implementing this.
Thank you in advance to anyone who offers their help!
Just to give you an idea of how to solve this, I've made a simple implementation. Just keep in mind that it is only a simple example:
#include <iostream>
#include <iomanip>
#include <cstdlib>
using namespace std;

int main()
{
    long long n = -1234567890;
    if (n < 0)
        cout << '-';
    n = llabs(n);
    bool started = false;                // has the first (unpadded) group been printed?
    for (long long i = 1000000000000000000LL; i > 0; i /= 1000) {
        long long group = n / i;
        n %= i;
        if (!started) {
            if (group == 0) continue;    // skip leading empty groups
            cout << group;
            started = true;
        } else {
            // later groups keep their leading zeros, e.g. 1,000,345
            cout << ',' << setw(3) << setfill('0') << group;
        }
    }
    if (!started)                        // the whole number was 0
        cout << 0;
    return 0;
}
http://coliru.stacked-crooked.com/a/150f75db89c46e99
The easy solution would be to use ios::imbue to set a locale that would do all the work for you:
std::cout.imbue(std::locale(""));
std::cout << n << std::endl;
However, if the constraints don't allow for strings or vectors, I doubt that this would be a valid solution. Instead you could use recursion:
void print(long long n, int counter) {
    if (n > 0) {
        print(n / 10, ++counter);
        // only insert a comma if more digits were printed above this one
        if (counter % 3 == 0 && n > 9) {
            std::cout << ",";
        }
        std::cout << n % 10;
    }
}

void print(long long n) {
    if (n == 0) {          // the recursive helper prints nothing for 0
        std::cout << "0";
        return;
    }
    if (n < 0) {
        std::cout << "-";
        n *= -1;
    }
    print(n, 0);
}
And then in the main simply call print(n);
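For completeness, a possible main() around it might look like the following; it assumes the two print() overloads above are in scope and reads the value the same way as the original attempt:

#include <iostream>

int main() {
    long long n;
    std::cout << "Enter an integer:" << std::endl;
    std::cin >> n;
    print(n);                 // the recursive helpers defined above
    std::cout << std::endl;
    return 0;
}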
A small template class comma_sep may be a solution; the usage can be as simple as:
cout << comma_sep<long long>(7497592752850).sep() << endl;
Which outputs:
7,497,592,752,850
Picked from here:
https://github.com/arloan/libimsux/blob/main/comma_sep.hxx
template <class I = int, int maxdigits = 32>
class comma_sep
{
char buff[maxdigits + maxdigits / 3 + 2];
char * p;
I i;
char sc;
public:
comma_sep(I i, char c = ',') : p(buff), i(i), sc(c) {
if (i < 0) {
buff[0] = '-';
*++p = '\0';
}
}
const char * sep() {
return _sep(std::abs(i));
}
private:
const char * _sep(I i) {
I r = i % 1000;
I n = i / 1000;
if (n > 0) {
_sep(n);
p += sprintf(p, "%c%03d", sc, (int)r);
*p = '\0';
} else {
p += sprintf(p, "%d", (int)r);
*p = '\0';
}
return buff;
}
};
The above class handles only integral numbers; float/double numbers need to use a partially specialized version:
template<int maxd>
class comma_sep<double, maxd> {
comma_sep<int64_t, maxd> _cs;
char fs[64];
double f;
public:
const int max_frac = 12;
comma_sep(double d, char c = ',') : _cs((int64_t)d, c) {
double np;
f = std::abs(modf(d, &np));
}
const char * sep(int frac = 3) {
if (frac < 1 || frac > max_frac) {
throw std::invalid_argument("factional part too too long or invalid");
}
auto p = _cs.sep();
strcpy(fs, p);
char fmt[8], tmp[max_frac+3];
sprintf(fmt, "%%.%dlf", frac);
sprintf(tmp, fmt, f);
return strcat(fs, tmp + 1);
}
};
The two above classes can be improved by adding type-traits like std::is_integral and/or std::is_floating_point, though.
When I run the program, it crashes with a segmentation fault. Also, when I try to debug the code in the Code::Blocks IDE, I am unable to: the program crashes even before debugging begins. I am not able to understand the problem. Any help would be appreciated. Thanks!
#include <iostream>
#include <math.h>
#include <string>
using namespace std;
// Method to make strings of equal length
int makeEqualLength(string& fnum,string& snum){
int l1 = fnum.length();
int l2 = snum.length();
if(l1>l2){
int d = l1-l2;
while(d>0){
snum = '0' + snum;
d--;
}
return l1;
}
else if(l2>l1){
int d = l2-l1;
while(d>0){
fnum = '0' + fnum;
d--;
}
return l2;
}
else
return l1;
}
int singleDigitMultiplication(string& fnum,string& snum){
return ((fnum[0] -'0')*(snum[0] -'0'));
}
string addStrings(string& s1,string& s2){
int length = makeEqualLength(s1,s2);
int carry = 0;
string result;
for(int i=length-1;i>=0;i--){
int fd = s1[i]-'0';
int sd = s2[i]-'0';
int sum = (fd+sd+carry)%10+'0';
carry = (fd+sd+carry)/10;
result = (char)sum + result;
}
result = (char)carry + result;
return result;
}
long int multiplyByKaratsubaMethod(string fnum,string snum){
int length = makeEqualLength(fnum,snum);
if(length==0) return 0;
if(length==1) return singleDigitMultiplication(fnum,snum);
int fh = length/2;
int sh = length - fh;
string Xl = fnum.substr(0,fh);
string Xr = fnum.substr(fh,sh);
string Yl = snum.substr(0,fh);
string Yr = snum.substr(fh,sh);
long int P1 = multiplyByKaratsubaMethod(Xl,Yl);
long int P3 = multiplyByKaratsubaMethod(Xr,Yr);
long int P2 = multiplyByKaratsubaMethod(addStrings(Xl,Xr),addStrings(Yl,Yr)) - P1-P3;
return (P1*pow(10,length) + P2*pow(10,length/2) + P3);
}
int main()
{
string firstNum = "62";
string secondNum = "465";
long int result = multiplyByKaratsubaMethod(firstNum,secondNum);
cout << result << endl;
return 0;
}
There are three serious issues in your code:
result = (char)carry + result; does not work. The carry has a value between 0 (0 * 0) and 8 (9 * 9). It has to be converted to the corresponding ASCII value: result = (char)(carry + '0') + result;.
This leads to the next issue: the carry is inserted even if it is 0. There is an if statement missing: if (carry/* != 0*/) result = (char)(carry + '0') + result;.
After fixing the first two issues and testing again, the stack overflow still occurs. So, I compared your algorithm with another one I found by Google: Divide and Conquer | Set 4 (Karatsuba algorithm for fast multiplication) (which possibly was your origin, because it looks very similar). Without digging deeper, I fixed what looked like a simple transfer mistake: return P1 * pow(10, 2 * sh) + P2 * pow(10, sh) + P3; (I replaced length by 2 * sh and length/2 by sh, as I saw it in the googled code.) This became obvious to me after seeing in the debugger that length can have odd values, so that sh and length/2 are distinct values.
Afterwards, your program worked.
I changed the main() function to test it a little bit harder:
#include <cmath>
#include <iostream>
#include <string>
using namespace std;
string intToStr(int i)
{
string text;
do {
text.insert(0, 1, i % 10 + '0');
i /= 10;
} while (i);
return text;
}
// Method to make strings of equal length
int makeEqualLength(string &fnum, string &snum)
{
int l1 = (int)fnum.length();
int l2 = (int)snum.length();
return l1 < l2
? (fnum.insert(0, l2 - l1, '0'), l2)
: (snum.insert(0, l1 - l2, '0'), l1);
}
int singleDigitMultiplication(const string& fnum, const string& snum)
{
return ((fnum[0] - '0') * (snum[0] - '0'));
}
string addStrings(string& s1, string& s2)
{
int length = makeEqualLength(s1, s2);
int carry = 0;
string result;
for (int i = length - 1; i >= 0; --i) {
int fd = s1[i] - '0';
int sd = s2[i] - '0';
int sum = (fd + sd + carry) % 10 + '0';
carry = (fd + sd + carry) / 10;
result.insert(0, 1, (char)sum);
}
if (carry) result.insert(0, 1, (char)(carry + '0'));
return result;
}
long int multiplyByKaratsubaMethod(string fnum, string snum)
{
int length = makeEqualLength(fnum, snum);
if (length == 0) return 0;
if (length == 1) return singleDigitMultiplication(fnum, snum);
int fh = length / 2;
int sh = length - fh;
string Xl = fnum.substr(0, fh);
string Xr = fnum.substr(fh, sh);
string Yl = snum.substr(0, fh);
string Yr = snum.substr(fh, sh);
long int P1 = multiplyByKaratsubaMethod(Xl, Yl);
long int P3 = multiplyByKaratsubaMethod(Xr, Yr);
long int P2
= multiplyByKaratsubaMethod(addStrings(Xl, Xr), addStrings(Yl, Yr))
- P1 - P3;
return P1 * pow(10, 2 * sh) + P2 * pow(10, sh) + P3;
}
int main()
{
int nErrors = 0;
for (int i = 0; i < 1000; i += 3) {
for (int j = 0; j < 1000; j += 3) {
long int result
= multiplyByKaratsubaMethod(intToStr(i), intToStr(j));
bool ok = result == i * j;
cout << i << " * " << j << " = " << result
<< (ok ? " OK." : " ERROR!") << endl;
nErrors += !ok;
}
}
cout << nErrors << " error(s)." << endl;
return 0;
}
Notes about changes I've made:
Concerning the std library: please don't mix headers with ".h" and without. Every header of the std library is available in a "non-suffix" flavor. (The headers with ".h" are either C headers or old-fashioned.) Headers of the C library have been adapted to C++; they have the old name with a "c" prefix and without the ".h" suffix.
Thus, I replaced #include <math.h> by #include <cmath>.
I couldn't resist making makeEqualLength() a little bit shorter.
Please note that a lot of methods in std use std::size_t instead of int or unsigned. std::size_t has the appropriate width for array subscripting and pointer arithmetic, i.e. it has "machine word width". I believed for a long time that int and unsigned should have "machine word width" as well and didn't care about size_t. When we switched in Visual Studio from x86 (32 bits) to x64 (64 bits), I learnt the hard way that I had been very wrong: std::size_t is 64 bits now, but int and unsigned are still 32 bits. (MS VC++ is not an exception; other compiler vendors (but not all) do it the same way.) I inserted some C type casts to remove the warnings from the compiler output. Such casts to remove warnings (regardless of whether you use C casts or, better, the C++ casts) should always be used with care and should be understood as a confirmation: "Dear compiler, I see you have concerns, but I (believe to) know and assure you that it should work fine."
I'm not sure about your intention to use long int in some places. (Probably you transferred this from the original source without thinking about it.) As you surely know, the actual size of all int types may differ to match the best performance of the target platform. I'm working on an Intel PC with Windows 10, using Visual Studio, where sizeof (int) == sizeof (long int) (32 bits). This is independent of whether I compile x86 code (32 bits) or x64 code (64 bits). The same is true for gcc (on Cygwin in my case) as well as on any Intel PC with Linux (AFAIK). For a type guaranteed to be larger than int you have to choose long long int.
I did the sample session in cygwin on Windows 10 (64 bit):
$ g++ -std=c++11 -o karatsuba karatsuba.cc
$ ./karatsuba
0 * 0 = 0 OK.
0 * 3 = 0 OK.
0 * 6 = 0 OK.
etc. etc.
999 * 993 = 992007 OK.
999 * 996 = 995004 OK.
999 * 999 = 998001 OK.
0 error(s).
$
I'm sure this may seem like a trivial problem, but I'm having issues with the clock() function in my program (please note I have checked similar issues, but they didn't seem to relate to the same context). My clock outputs are almost always 0; however, there seem to be a few 10s as well (which is why I'm baffled). I considered that maybe the function call is processed too quickly, but judging by the sorting algorithms, surely there should be some time taken.
Thank you all in advance for any help! :)
P.S. I'm really sorry about the mess regarding the correlation of variables between functions (it's group code I've merged together, and I'm focusing on correct output before beautifying it) :D
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;
int bruteForce(int array[], const int& sizeofArray);
int maxSubArray (int array[], int arraySize);
int kadane_Algorithm(int array[], int User_size);
int main()
{
int maxBF, maxDC, maxKD;
clock_t t1, t2, t3;
int arraySize1 = 1;
double t1InMSEC, t2InMSEC, t3InMSEC;
while (arraySize1 <= 30000)
{
int* array = new int [arraySize1];
for (int i = 0; i < arraySize1; i++)
{
array[i] = rand()% 100 + (-50);
}
t1 = clock();
maxBF = bruteForce(array, arraySize1);
t1 = clock() - t1;
t1InMSEC = (static_cast <double>(t1))/CLOCKS_PER_SEC * 1000.00;
t2 = clock();
maxDC = maxSubArray(array, arraySize1);
t2 = clock() - t2;
t2InMSEC = (static_cast <double>(t2))/CLOCKS_PER_SEC * 1000.00;
t3 = clock();
maxKD = kadane_Algorithm(array, arraySize1);
t3 = clock() - t3;
t3InMSEC = (static_cast <double>(t3))/CLOCKS_PER_SEC * 1000.00;
cout << arraySize1 << '\t' << t1InMSEC << '\t' << t2InMSEC << '\t' << t3InMSEC << '\t' << endl;
arraySize1 = arraySize1 + 100;
delete [] array;
}
return 0;
}
int bruteForce(int array[], const int& sizeofArray)
{
int maxSumOfArray = 0;
int runningSum = 0;
int subArrayIndex = sizeofArray;
while(subArrayIndex >= 0)
{
runningSum += array[subArrayIndex];
if (runningSum >= maxSumOfArray)
{
maxSumOfArray = runningSum;
}
subArrayIndex--;
}
return maxSumOfArray;
}
int maxSubArray (int array[], int arraySize)
{
int leftSubArray = 0;
int rightSubArray = 0;
int leftSubArraySum = -50;
int rightSubArraySum = -50;
int sum = 0;
if (arraySize == 1) return array[0];
else
{
int midPosition = arraySize/2;
leftSubArray = maxSubArray(array, midPosition);
rightSubArray = maxSubArray(array, (arraySize - midPosition));
for (int j = midPosition; j < arraySize; j++)
{
sum = sum + array[j];
if (sum > rightSubArraySum)
rightSubArraySum = sum;
}
sum = 0;
for (int k = (midPosition - 1); k >= 0; k--)
{
sum = sum + array[k];
if (sum > leftSubArraySum)
leftSubArraySum = sum;
}
}
int maxSubArraySum = 0;
if (leftSubArraySum > rightSubArraySum)
{
maxSubArraySum = leftSubArraySum;
}
else maxSubArraySum = rightSubArraySum;
return max(maxSubArraySum, (leftSubArraySum + rightSubArraySum));
}
int kadane_Algorithm(int array[], int User_size)
{
int maxsofar=0, maxending=0, i;
for (i=0; i < User_size; i++)
{
maxending += array[i];
if (maxending < 0)
{
maxending = 0 ;
}
if (maxsofar < maxending)
{
maxsofar = maxending;
}
}
return maxending;
}
The output is as follows (just a snippet for visualization):
29001 0 0 0
29101 0 10 0
29201 0 0 0
29301 0 0 0
29401 0 10 0
29501 0 0 0
29601 0 0 0
29701 0 0 0
29801 0 10 0
29901 0 0 0
Your problem is probably that clock() doesn't have enough resolution: you truly are taking zero time to within the precision allowed. And if not, then it's almost surely that the compiler is optimizing away the computation since it's computing something that isn't being used.
By default, you really should be using the chrono header rather than the antiquated C facilities for timing. In particular, high_resolution_clock tends to be good for measuring relatively quick things should you find yourself really needing that.
Accurately benchmarking things is a nontrivial exercise, and you really should read up on how to do it properly. There are a surprising number of issues involving things like cache or surprising compiler optimizations or variable CPU frequency that many programmers have never even thought about before, and ignoring them can lead to incorrect conclusions.
One particular aspect is that you should generally arrange to time things that actually take some time; e.g. run whatever it is you're testing on thousands of different inputs, so that the duration is, e.g., on the order of a whole second or more. This tends to improve both the precision and the accuracy of your benchmarks.
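As a minimal sketch of that advice (std::chrono plus repetition), something like the following could replace the clock() calls; work() is just a stand-in for whatever you want to measure, and the volatile sink keeps the optimizer from discarding the result:

#include <chrono>
#include <iostream>

volatile long long sink;   // prevents the work below from being optimized away

static void work() {
    long long s = 0;
    for (int i = 0; i < 1000000; ++i) s += i;
    sink = s;
}

int main() {
    using clock_type = std::chrono::high_resolution_clock;
    const int repeats = 1000;                 // repeat so the total is comfortably measurable
    auto start = clock_type::now();
    for (int r = 0; r < repeats; ++r) work();
    auto stop = clock_type::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::cout << "total: " << ms << " ms, per call: " << ms / repeats << " ms\n";
    return 0;
}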
Thanks for all the help guys! It seems to be some kind of issue with any windows/windows emulation software.
I decided to boot up ubuntu and rather give it a shot there, and voila! I get a perfect output :)
I guess Open Source really is the best way to go :D
I am trying to solve a problem, part of which requires me to calculate (2^n)%1000000007, where n<=10^9. But the following code gives me the output "0" even for input like n=99.
Is there any way other than having a loop which multiplies the output by 2 every time and takes the modulo every time? (That is not what I am looking for, as it will be very slow for large numbers.)
#include<stdio.h>
#include<math.h>
#include<iostream>
using namespace std;
int main()
{
unsigned long long gaps,total;
while(1)
{
cin>>gaps;
total=(unsigned long long)powf(2,gaps)%1000000007;
cout<<total<<endl;
}
}
You need a "big num" library, it is not clear what platform you are on, but start here:
http://gmplib.org/
this is not I am looking for as this will be very slow for large numbers
Using a bigint library will be considerably slower than pretty much any other solution.
Don't take the modulo every pass through the loop: rather, only take it when the output grows bigger than the modulus, as follows:
#include <iostream>
int main() {
int modulus = 1000000007;
int n = 88888888;
long res = 1;
for(long i=0; i < n; ++i) {
res *= 2;
if(res > modulus)
res %= modulus;
}
std::cout << res << std::endl;
}
This is actually pretty quick:
$ time ./t
./t 1.19s user 0.00s system 99% cpu 1.197 total
I should mention that the reason this works is that if a and b are equivalent mod m (that is, a % m = b % m), then this equality also holds for any multiple k of a and b (that is, the foregoing equality implies (a*k)%m = (b*k)%m).
Chris proposed GMP, but if you need just that and want to do things The C++ Way, not The C Way, and without unnecessary complexity, you may just want to check this out - it generates few warnings when compiling, but is quite simple and Just Works™.
You can split your 2^n into chunks of 2^m. You need to find:
2^m * 2^m * ... * 2^(n mod m)
The number m should be 31 for a 32-bit CPU. Then your answer is:
(chunk1 % k) * (chunk2 % k) * ... (mod k), where k = 1000000007
You are still O(N), but you can then utilize the fact that all the chunks % k are equal except the last one, and make it O(1).
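A possible sketch of that chunking idea (still O(n/m) multiplications, as noted above; everything stays well within 64 bits, and the names are illustrative):

#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t k = 1000000007ULL;
    std::uint64_t n = 0;
    std::cin >> n;

    const unsigned m = 31;                           // chunk size in bits
    const std::uint64_t chunk = (1ULL << m) % k;     // 2^m mod k, identical for every full chunk
    std::uint64_t result = 1;
    for (std::uint64_t i = 0; i < n / m; ++i)
        result = (result * chunk) % k;               // fold in the full chunks one by one
    result = (result * ((1ULL << (n % m)) % k)) % k; // and finally the leftover 2^(n mod m)

    std::cout << result << '\n';
    return 0;
}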
I wrote this function. It is very inefficient, but it works with very large numbers. It uses my self-made algorithm to store big numbers in arrays using a decimal-like system.
mpfr2.cpp
#include "mpfr2.h"
void mpfr2::mpfr::setNumber(std::string a) {
for (int i = a.length() - 1, j = 0; i >= 0; ++j, --i) {
_a[j] = a[i] - '0';
}
res_size = a.length();
}
int mpfr2::mpfr::multiply(mpfr& a, mpfr b)
{
mpfr ans = mpfr();
// One by one multiply n with individual digits of res[]
int i = 0;
for (i = 0; i < b.res_size; ++i)
{
for (int j = 0; j < a.res_size; ++j) {
ans._a[i + j] += b._a[i] * a._a[j];
}
}
for (i = 0; i < a.res_size + b.res_size; i++)
{
int tmp = ans._a[i] / 10;
ans._a[i] = ans._a[i] % 10;
ans._a[i + 1] = ans._a[i + 1] + tmp;
}
for (i = a.res_size + b.res_size; i >= 0; i--)
{
if (ans._a[i] > 0) break;
}
ans.res_size = i+1;
a = ans;
return a.res_size;
}
mpfr2::mpfr mpfr2::mpfr::pow(mpfr a, mpfr b) {
mpfr t = a;
std::string bStr = "";
for (int i = b.res_size - 1; i >= 0; --i) {
bStr += std::to_string(b._a[i]);
}
int i = 1;
while (!0) {
if (bStr == std::to_string(i)) break;
a.res_size = multiply(a, t);
// Debugging
std::cout << "\npow() iteration " << i << std::endl;
++i;
}
return a;
}
mpfr2.h
#pragma once
//#infdef MPFR2_H
//#define MPFR2_H
// C standard includes
#include <iostream>
#include <string>
#define MAX 0x7fffffff/32/4 // 2147483647
namespace mpfr2 {
class mpfr
{
public:
int _a[MAX];
int res_size;
void setNumber(std::string);
static int multiply(mpfr&, mpfr);
static mpfr pow(mpfr, mpfr);
};
}
//#endif
main.cpp
#include <iostream>
#include <fstream>
// Local headers
#include "mpfr2.h" // Defines local mpfr algorithm library
// Namespaces
namespace m = mpfr2; // Reduce the typing a bit later...
m::mpfr tetration(m::mpfr, int);
int main() {
// Hardcoded tests
int x = 7;
std::ofstream f("out.txt");
m::mpfr t;
for(int b=1; b<x;b++) {
std::cout << "2^^" << b << std::endl; // Hardcoded message
t.setNumber("2");
m::mpfr res = tetration(t, b);
for (int i = res.res_size - 1; i >= 0; i--) {
std::cout << res._a[i];
f << res._a[i];
}
f << std::endl << std::endl;
std::cout << std::endl << std::endl;
}
char c; std::cin.ignore(); std::cin >> c;
return 0;
}
m::mpfr tetration(m::mpfr a, int b)
{
m::mpfr tmp = a;
if (b <= 0) return m::mpfr();
for (; b > 1; b--) tmp = m::mpfr::pow(a, tmp);
return tmp;
}
I created this for tetration and eventually hyperoperations. When the numbers get really big it can take ages to calculate and a lot of memory. The #define MAX 0x7fffffff/32/4 is the number of decimals one number can have. I might make another algorithm later to combine multiple of these arrays into one number. On my system the max array length is 0x7fffffff aka 2147483647 aka 2^31-1 aka INT32_MAX (which is usually the standard int size), so I had to divide it by 32 to make the creation of this array possible. I also divided it by 4 to reduce memory usage in the multiply() function.
- Jubiman