Refactoring C to C++ for Bernoulli Number calculation - c++

Rosetta Code has an article on the calculation of Bernoulli numbers. Unfortunately, it does not provide an example in C++, only one in C (as of December 27, 2016).
I am not familiar with C, but a lot of it is recognizable. How could this program be adapted for C++?
#include <stdlib.h>
#include <gmp.h>
#define mpq_for(buf, op, n)\
do {\
size_t i;\
for (i = 0; i < (n); ++i)\
mpq_##op(buf[i]);\
} while (0)
void bernoulli(mpq_t rop, unsigned int n)
{
unsigned int m, j;
mpq_t *a = malloc(sizeof(mpq_t) * (n + 1));
mpq_for(a, init, n + 1);
for (m = 0; m <= n; ++m) {
mpq_set_ui(a[m], 1, m + 1);
for (j = m; j > 0; --j) {
mpq_sub(a[j-1], a[j], a[j-1]);
mpq_set_ui(rop, j, 1);
mpq_mul(a[j-1], a[j-1], rop);
}
}
mpq_set(rop, a[0]);
mpq_for(a, clear, n + 1);
free(a);
}
int main(void)
{
mpq_t rop;
mpz_t n, d;
mpq_init(rop);
mpz_inits(n, d, NULL);
unsigned int i;
for (i = 0; i <= 60; ++i) {
bernoulli(rop, i);
if (mpq_cmp_ui(rop, 0, 1)) {
mpq_get_num(n, rop);
mpq_get_den(d, rop);
gmp_printf("B(%-2u) = %44Zd / %Zd\n", i, n, d);
}
}
mpz_clears(n, d, NULL);
mpq_clear(rop);
return 0;
}
Thanks! Even general recommendations are helpful!

It should work in C++ with almost no changes, I guess. Still, there are a few things you could change:
malloc for new: mpq_t *a = new mpq_t[n + 1]; (released with delete[] a; instead of free(a);)
or keep malloc and add the cast that C++ requires: mpq_t *a = (mpq_t *) malloc(sizeof(mpq_t) * (n + 1));
NULL for nullptr
Most C standard headers have C++ counterparts (the .h forms are deprecated in C++): something.h becomes csomething
#include <stdlib.h> is now #include <cstdlib>
You could write (uint32_t is declared in the cstdint header):
for (uint32_t i = 0; i <= 60; ++i) { /* ... */ }
instead of:
unsigned int i;
for (i = 0; i <= 60; ++i) { /* ... */ }

Most C code compiles as C++ with only small changes like these.
For a minimal refactoring to OOP, just create a class called Bernoulli (for example) that contains the bernoulli function; the macro can sit next to it, although macros are not scoped to a class.
That's it!
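As a rough illustration of what a more C++-styled version could look like, here is a sketch that keeps the same recurrence and loop structure as the C code but replaces malloc/free with std::vector. To stay self-contained it uses a hypothetical Fraction struct on 64-bit integers instead of GMP, so it will overflow long before n = 60. With GMP installed, its C++ interface (gmpxx.h, mpq_class) would be the natural arbitrary-precision substitute. The names Fraction, make_fraction, subtract and times are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical minimal rational number on 64-bit integers (not GMP).
// Always stored in lowest terms with a positive denominator.
struct Fraction {
    std::int64_t num;
    std::int64_t den;
};

static std::int64_t gcd64(std::int64_t a, std::int64_t b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { std::int64_t t = a % b; a = b; b = t; }
    return a;
}

static Fraction make_fraction(std::int64_t num, std::int64_t den) {
    assert(den != 0);
    if (den < 0) { num = -num; den = -den; }
    const std::int64_t g = gcd64(num, den);
    if (g > 1) { num /= g; den /= g; }
    return Fraction{num, den};
}

static Fraction subtract(Fraction a, Fraction b) {
    return make_fraction(a.num * b.den - b.num * a.den, a.den * b.den);
}

static Fraction times(Fraction a, std::int64_t k) {
    return make_fraction(a.num * k, a.den);
}

// Same recurrence as the C code: a[j-1] = j * (a[j] - a[j-1]).
// The vector releases its storage automatically, so no free() is needed.
Fraction bernoulli(unsigned n) {
    std::vector<Fraction> a(n + 1);
    for (unsigned m = 0; m <= n; ++m) {
        a[m] = make_fraction(1, static_cast<std::int64_t>(m) + 1);
        for (unsigned j = m; j > 0; --j)
            a[j - 1] = times(subtract(a[j], a[j - 1]), j);
    }
    return a[0];
}
```

Calling bernoulli(4) yields the fraction -1/30, matching the fourth Bernoulli number. The point of the sketch is the ownership model (a container instead of manual allocation), not the arithmetic type.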

Related

A Program to Find Absolute Euler Pseudoprimes

I am trying to write a program to find the absolute Euler pseudoprimes ('AEPSP' for short, not Euler-Jacobi pseudoprimes), defined as follows: n is an AEPSP if
a^((n-1)/2) ≡ ±1 (mod n)
for all positive integers a with gcd(a, n) = 1.
I used C++ code to generate AEPSPs, based on code that generates Carmichael numbers:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <algorithm>
#include <numeric>
using namespace std;
unsigned int bm(unsigned int a, unsigned int n, unsigned int p){
unsigned long long b;
switch (n) {
case 0:
return 1;
case 1:
return a % p;
default:
b = bm(a,n/2,p);
b = (b*b) % p;
if (n % 2 == 1) b = (b*a) % p;
return b;
}
}
int numc(unsigned int n){
int a, s;
int found = 0;
if (n % 2 == 0) return 0;
s = sqrt(n);
a = 2;
while (a < n) {
if (a > s && !found) {
return 0;
}
if (gcd(a, n) > 1) {
found = 1;
}
else {
if (bm(a, (n-1)/2, n) != 1) {
return 0;
}
}
a++;
}
return 1;
}
int main(void) {
unsigned int n;
for (n = 3; n < 1e9; n += 2){
if (numc(n)) printf("%u\n",n);
}
return 0;
}
Yet, the program is very slow. It generates AEPSPs up to 1.5e6 in 20 minutes. Does anyone have any ideas on optimizing the program?
Any help is most appreciated. :)
I've come up with a different algorithm, based on sieving for primes upfront while simultaneously marking off non-squarefree numbers. I've applied a few optimizations to pack the information into memory a bit tighter, to help with cache-friendliness as well. Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#define PRIME_BIT (1UL << 31)
#define SQUARE_BIT (1UL << 30)
#define FACTOR_MASK (SQUARE_BIT - 1)
void sieve(uint64_t size, uint32_t *buffer) {
for (uint64_t i = 3; i * i < size; i += 2) {
if (buffer[i/2] & PRIME_BIT) {
for (uint64_t j = i * i; j < size; j += 2 * i) {
buffer[j/2] &= SQUARE_BIT;
buffer[j/2] |= i;
}
for (uint64_t j = i * i; j < size; j += 2 * i * i) {
buffer[j/2] |= SQUARE_BIT;
}
}
}
}
int main(int argc, char **argv) {
if (argc < 2) {
printf("Usage: prog LIMIT\n");
return 1;
}
uint64_t size = atoi(argv[1]);
uint32_t *buffer = malloc(size * sizeof(uint32_t) / 2);
memset(buffer, 0x80, size * sizeof(uint32_t) / 2);
sieve(size, buffer);
for (uint64_t i = 5; i < size; i += 4) {
if (buffer[i/2] & PRIME_BIT)
continue;
if (buffer[i/2] & SQUARE_BIT)
continue;
uint64_t num = i;
uint64_t factor;
while (num > 1) {
if (buffer[num/2] & PRIME_BIT)
factor = num;
else
factor = buffer[num/2] & FACTOR_MASK;
if ((i / 2) % (factor - 1) != 0) {
break;
}
num /= factor;
}
if (num == 1)
printf("Found pseudo-prime: %llu\n", (unsigned long long)i);
}
}
This produces the pseudo-primes up to 1.5e6 in about 8ms on my machine, and for 2e9 it takes 1.8sec.
The time complexity of the solution is O(n log n): the sieve is O(n log n), and for each number we either do constant-time checks or a divisibility test for each of its prime factors, of which there are at most log n. So the main loop is also O(n log n), giving O(n log n) overall.
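Reading the main loop of the sieve version, the test it applies seems to be: n is odd, composite and squarefree, and (p - 1) divides (n - 1) / 2 for every prime factor p of n. A slow reference check by trial division can help validate the sieve on small inputs. The function name below is an assumption made for this sketch, and the code only restates that criterion; it does not prove that the criterion matches the definition in the question.

```cpp
#include <cstdint>

// Returns true when n is an odd squarefree composite and (p - 1) divides
// (n - 1) / 2 for every prime p dividing n. Trial division, so only meant
// for cross-checking small values.
bool passes_half_divisibility_test(std::uint64_t n) {
    if (n < 3 || n % 2 == 0) return false;
    const std::uint64_t half = (n - 1) / 2;
    std::uint64_t rest = n;
    int prime_factors = 0;
    for (std::uint64_t p = 3; p * p <= rest; p += 2) {
        if (rest % p != 0) continue;
        rest /= p;
        if (rest % p == 0) return false;        // p * p divides n: not squarefree
        if (half % (p - 1) != 0) return false;  // divisibility condition fails
        ++prime_factors;
    }
    if (rest > 1) {                             // one prime factor is left over
        if (half % (rest - 1) != 0) return false;
        ++prime_factors;
    }
    return prime_factors >= 2;                  // a single factor means n is prime
}
```

For example, 1729 = 7 * 13 * 19 passes because 6, 12 and 18 all divide 864, while the Carmichael number 561 = 3 * 11 * 17 fails because 16 does not divide 280.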

Simple and Efficient FFT C or C++ code for HLS implementation

I'm working on a project related to speech processing. I have to implement parts of it on an Intel FPGA board using the Intel HLS Compiler, which converts C code to RTL code for FPGA implementation.
I need to convert time-domain signals to the frequency domain using an FFT in order to process the input speech signals. I don't know how the FFT algorithm works; I just know that it is more efficient than a plain DFT for hardware implementation.
I need simple radix-2 code that doesn't use malloc() or the complex.h library. If the code uses them, I have to port it, and since I don't have much experience with C, that would be hard and time-consuming for what is only a small part of the project. The math.h library is allowed and can be implemented in HLS.
I have searched and found many C and C++ implementations of the FFT algorithm, but most of them need porting to be suitable for HLS.
I found the following code, and I think it's the best match since it uses neither the complex library nor malloc: https://gist.github.com/Determinant/db7889995f08fe982418
I can't understand some parts of that code:
-What is the following definition used for?
#define comp_mul_self(c, c2) \
-What are the inputs of the fft function? The parameter names are not clear enough for me.
void fft(const Comp *sig, Comp *f, int s, int n, int inv)
I would really appreciate it if someone with enough knowledge of the FFT algorithm could comment the important parts of the code. If there is better C or C++ code for my use case, please point me in the right direction.
Here is the full code from the link above:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
typedef struct Comp {
/* comp of the form: a + bi */
double a, b;
} Comp;
Comp comp_create(double a, double b) {
Comp res;
res.a = a;
res.b = b;
return res;
}
Comp comp_add(Comp c1, Comp c2) {
Comp res = c1;
res.a += c2.a;
res.b += c2.b;
return res;
}
Comp comp_sub(Comp c1, Comp c2) {
Comp res = c1;
res.a -= c2.a;
res.b -= c2.b;
return res;
}
Comp comp_mul(Comp c1, Comp c2) {
Comp res;
res.a = c1.a * c2.a - c1.b * c2.b;
res.b = c1.b * c2.a + c1.a * c2.b;
return res;
}
void comp_print(Comp comp) {
printf("%.6f + %.6f i\n", comp.a, comp.b);
}
/* const double PI = acos(-1); */
#define PI 3.141592653589793
#define SQR(x) ((x) * (x))
/* Calculate e^(ix) */
Comp comp_euler(double x) {
Comp res;
res.a = cos(x);
res.b = sin(x);
return res;
}
#define comp_mul_self(c, c2) \
do { \
double ca = c->a; \
c->a = ca * c2->a - c->b * c2->b; \
c->b = c->b * c2->a + ca * c2->b; \
} while (0)
void dft(const Comp *sig, Comp *f, int n, int inv) {
Comp ep = comp_euler(2 * (inv ? -PI : PI) / (double)n);
Comp ei, ej, *pi = &ei, *pj = &ej, *pp = &ep;
int i, j;
pi->a = pj->a = 1;
pi->b = pj->b = 0;
for (i = 0; i < n; i++)
{
f[i].a = f[i].b = 0;
for (j = 0; j < n; j++)
{
f[i] = comp_add(f[i], comp_mul(sig[j], *pj));
comp_mul_self(pj, pi);
}
comp_mul_self(pi, pp);
}
}
void fft(const Comp *sig, Comp *f, int s, int n, int inv) {
int i, hn = n >> 1;
Comp ep = comp_euler((inv ? PI : -PI) / (double)hn), ei;
Comp *pi = &ei, *pp = &ep;
if (!hn) *f = *sig;
else
{
fft(sig, f, s << 1, hn, inv);
fft(sig + s, f + hn, s << 1, hn, inv);
pi->a = 1;
pi->b = 0;
for (i = 0; i < hn; i++)
{
Comp even = f[i], *pe = f + i, *po = pe + hn;
comp_mul_self(po, pi);
pe->a += po->a;
pe->b += po->b;
po->a = even.a - po->a;
po->b = even.b - po->b;
comp_mul_self(pi, pp);
}
}
}
void print_result(const Comp *sig, const Comp *sig0, int n) {
int i;
double err = 0;
for (i = 0; i < n; i++)
{
Comp t = sig0[i];
t.a /= n;
t.b /= n;
comp_print(t);
t = comp_sub(t, sig[i]);
err += t.a * t.a + t.b * t.b;
}
printf("Error Squared = %.6f\n", err);
}
void test_dft(const Comp *sig, Comp *f, Comp *sig0, int n) {
int i;
puts("## Direct DFT ##");
dft(sig, f, n, 0);
for (i = 0; i < n; i++)
comp_print(f[i]);
puts("----------------");
dft(f, sig0, n, 1);
print_result(sig, sig0, n);
puts("################");
}
void test_fft(const Comp *sig, Comp *f, Comp *sig0, int n) {
int i;
puts("## Cooley–Tukey FFT ##");
fft(sig, f, 1, n, 0);
for (i = 0; i < n; i++)
comp_print(f[i]);
puts("----------------------");
fft(f, sig0, 1, n, 1);
print_result(sig, sig0, n);
puts("######################");
}
int main() {
int n, i, k;
Comp *sig, *f, *sig0;
scanf("%d", &k);
n = 1 << k;
sig = (Comp *)malloc(sizeof(Comp) * (size_t)n);
sig0 = (Comp *)malloc(sizeof(Comp) * (size_t)n);
f = (Comp *)malloc(sizeof(Comp) * (size_t)n);
for (i = 0; i < n; i++)
{
sig[i].a = rand() % 10;
sig[i].b = 0;
}
puts("## Original Signal ##");
for (i = 0; i < n; i++)
comp_print(sig[i]);
puts("#####################");
test_dft(sig, f, sig0, n);
test_fft(sig, f, sig0, n);
return 0;
}
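For comparison, here is a sketch of an iterative, in-place radix-2 FFT that avoids recursion, malloc and complex.h, which may be easier to map to hardware than the recursive fft above. It is not taken from the linked gist; Cplx and fft_inplace are names assumed for this example, and whether a particular HLS tool accepts it as written is untested. The in-place twiddle update plays the same role as comp_mul_self in the code above, which multiplies the complex number pointed to by c by the one pointed to by c2. In the recursive fft, s appears to be the stride between input samples and n the transform length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical complex type standing in for the Comp struct above.
struct Cplx {
    double re;
    double im;
};

// In-place iterative radix-2 FFT; n must be a power of two.
// With inverse = true the result is the inverse transform without the 1/n scaling.
void fft_inplace(Cplx *x, std::size_t n, bool inverse) {
    assert(n != 0 && (n & (n - 1)) == 0);
    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { Cplx t = x[i]; x[i] = x[j]; x[j] = t; }
    }
    const double pi = 3.141592653589793;
    // Combine blocks of length 2, 4, 8, ... using butterflies.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double angle = (inverse ? 2.0 : -2.0) * pi / static_cast<double>(len);
        const Cplx step = { std::cos(angle), std::sin(angle) };
        for (std::size_t start = 0; start < n; start += len) {
            Cplx w = { 1.0, 0.0 };  // running twiddle factor
            for (std::size_t k = 0; k < len / 2; ++k) {
                Cplx *even = &x[start + k];
                Cplx *odd = &x[start + k + len / 2];
                Cplx v;  // v = odd * w
                v.re = odd->re * w.re - odd->im * w.im;
                v.im = odd->re * w.im + odd->im * w.re;
                odd->re = even->re - v.re;
                odd->im = even->im - v.im;
                even->re += v.re;
                even->im += v.im;
                // w *= step, updated in place
                const double wr = w.re * step.re - w.im * step.im;
                w.im = w.re * step.im + w.im * step.re;
                w.re = wr;
            }
        }
    }
}
```

The caller owns the buffer, so a fixed-size array works and no dynamic allocation is involved. The forward transform uses the same sign convention as the fft function in the question.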

C++ simple trick with an array pointer improves performance

I have found a strange behavior in my heap sort routine (see below).
void hpsort(unsigned long n, double *data)
{
unsigned long i, ir, j, l;
double rra;
if (n < 2) return;
l = (n - 2) / 2 + 1;
ir = n - 1;
for (;;)
{
if (l > 0) rra = data[--l];
else
{
rra = data[ir];
data[ir] = data[0];
if (--ir == 0) { data[0] = rra; break; }
}
i = l;
j = l + l + 1;
while (j <= ir)
{
if (j < ir && data[j] < data[j+1]) ++j;
if (rra < data[j])
{
data[i] = data[j];
i = j;
j += j + 1;
}
else break;
}
data[i] = rra;
}
return;
}
If I do a benchmark calling this routine like this
double* array = (double*)malloc(sizeof(double) * N);
... fill in the array ...
hpsort(N, array);
it takes X seconds. But if I add just a single line
void hpsort(unsigned int n, double *data)
{
++data;
and do benchmark as
double* array = (double*)malloc(sizeof(double) * N);
... fill in the array ...
hpsort(N, array-1);
it takes about 0.96X seconds (i.e. 4% faster). This performance difference is stable from run to run.
It feels like g++ compiler does bounds checking in the first case, while in the second case I can cheat it somehow. But I never heard that bounds checking is done for C arrays...
Any ideas why I get this strange difference in performance?
p.s. compilation is done with g++ -O2. by the way, changing unsigned long to long int also decreases performance by about 3 to 4%.
p.p.s. the "Defined Behavior" version also shows performance improvement
void hpsort(unsigned int n, double *data)
{
--data;
and benchmark as
double* array = (double*)malloc(sizeof(double) * N);
... fill in the array ...
hpsort(N, array+1);
p.p.p.s. Performance comparison (times in seconds)
Size of array    Faster    Slower
           10      1.46      1.60
          100      1.41      1.62
         1000      1.84      1.96
        10000      1.78      1.87
       100000      1.72      1.80
      1000000      1.76      1.83
     10000000      1.98      2.03
here is my code for hpsort.cpp
void hpsort1(unsigned long n, double *data)
{
unsigned long i, ir, j, l;
double rra;
if (n < 2) return;
l = (n - 2) / 2 + 1;
ir = n - 1;
for (;;)
{
if (l > 0) rra = data[--l];
else
{
rra = data[ir];
data[ir] = data[0];
if (--ir == 0)
{
data[0] = rra;
break;
}
}
i = l;
j = l + l + 1;
while (j <= ir)
{
if (j < ir && data[j] < data[j+1]) ++j;
if (rra < data[j])
{
data[i] = data[j];
i = j;
j += j + 1;
}
else break;
}
data[i] = rra;
}
return;
}
void hpsort2(unsigned long n, double *data)
{
unsigned long i, ir, j, l;
double rra;
--data;
if (n < 2) return;
l = (n - 2) / 2 + 1;
ir = n - 1;
for (;;)
{
if (l > 0) rra = data[--l];
else
{
rra = data[ir];
data[ir] = data[0];
if (--ir == 0)
{
data[0] = rra;
break;
}
}
i = l;
j = l + l + 1;
while (j <= ir)
{
if (j < ir && data[j] < data[j+1]) ++j;
if (rra < data[j])
{
data[i] = data[j];
i = j;
j += j + 1;
}
else break;
}
data[i] = rra;
}
return;
}
and here is my benchmarking code heapsort-benchmark.cpp
#include <vector>
#include <alloca.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
using namespace std;
void hpsort1(unsigned long n, double *data);
void hpsort2(unsigned long n, double *data);
typedef double element_t;
typedef void(*Test)(element_t*, element_t*, int);
const int sizes [] = {10, 100, 1000, 10000, 100000, 1000000, 10000000};
const int largest_size = sizes[sizeof(sizes)/sizeof(int)-1];
vector<double> result_times; // results are pushed into this vector
clock_t start_time;
void do_row(int size) // print results for given size of processed array
{
printf("%10d \t", size);
for (int i=0; i<result_times.size(); ++i) printf("%.2f\t", result_times[i]);
printf("\n");
result_times.clear();
}
inline void start_timer() { start_time = clock(); }
inline double timer()
{
clock_t end_time = clock();
return (end_time - start_time)/double(CLOCKS_PER_SEC);
}
void run(Test f, element_t* first, element_t* last, int number_of_times)
{
start_timer();
while (number_of_times-- > 0) f(first,last,number_of_times);
result_times.push_back(timer());
}
void random_shuffle(double *first, double *last)
{
size_t i, j, n;
double tmp;
n = last-first;
srand((unsigned int)0);
for (i=n-1; i>0; --i)
{
j = rand() % (i+1);
tmp = first[i];
first[i] = first[j];
first[j] = tmp;
}
return;
}
void hpsort1_test(element_t* first, element_t* last, int number_of_times)
{
size_t num_elements = (last-first);
element_t* array = (element_t*)malloc(sizeof(element_t)*num_elements);
memcpy(array, first, sizeof(element_t)*num_elements);
hpsort1(num_elements, array);
free(array);
}
void hpsort2_test(element_t* first, element_t* last, int number_of_times)
{
size_t num_elements = (last-first);
element_t* array = (element_t*)malloc(sizeof(element_t)*num_elements);
memcpy(array, first, sizeof(element_t)*num_elements);
hpsort2(num_elements, array+1);
free(array);
}
void initialize(element_t* first, element_t* last)
{
element_t x = 0.;
while (first != last) { *first++ = x; x += 1.; }
}
double logtwo(double x) { return log(x)/log((double) 2.0); }
int number_of_tests(int size)
{
double n = (double)size;
double largest_n = (double)largest_size;
return int(floor((largest_n * logtwo(largest_n)) / (n * logtwo(n))));
}
void run_tests(int size)
{
const int n = number_of_tests(size);
element_t *buffer = (element_t *)malloc(size * sizeof(element_t));
element_t* buffer_end = &buffer[size];
initialize(buffer, buffer + size); // fill in the elements
for (int i = 0; i < size/2; ++i) buffer[size/2 + i] = buffer[i]; // fill in the second half with values of the first half
//random_shuffle(buffer, buffer_end); // shuffle if you do not want an ordered array
run(hpsort2_test, buffer, buffer_end, n);
run(hpsort1_test, buffer, buffer_end, n);
do_row(size);
free(buffer);
}
int main()
{
const int n = sizeof(sizes)/sizeof(int);
for (int i = 0; i < n; ++i) run_tests(sizes[i]);
}
I compile and run it as
g++ -O2 -c heapsort-benchmark.cpp
g++ -O2 -c hpsort.cpp
g++ -O2 -o heapsort-benchmark heapsort-benchmark.o hpsort.o
./heapsort-benchmark
The first column is the faster version.
I am unable to get consistent results like OP's.
IMO OP's small differences do not come from the difference in code, but are an artifact of testing.
void hpsort(unsigned long n, double *data) {
unsigned long i, ir, j, l;
double rra;
...
}
void hpsort1(unsigned long n, double *data) {
--data;
unsigned long i, ir, j, l;
double rra;
...
}
Test code
#include <time.h>
#include <stdlib.h>
void test(const char *s, int code, size_t n) {
srand(0);
double* array = (double*) malloc(sizeof(double) * n * 2);
// make 2 copies of same random data
for (size_t i = 0; i < n; i++) {
array[i] = rand();
array[i+n] = array[i];
}
double dt0;
double dt1;
clock_t c0 = clock();
clock_t c1,c2;
if (code) {
hpsort1(n, array + 1);
c1 = clock();
hpsort(n, &array[n]);
c2 = clock();
dt0 = (double) (c2 - c1)/CLOCKS_PER_SEC;
dt1 = (double) (c1 - c0)/CLOCKS_PER_SEC;
} else {
hpsort(n, array);
c1 = clock();
hpsort1(n, &array[n]+1);
c2 = clock();
dt0 = (double) (c1 - c0)/CLOCKS_PER_SEC;
dt1 = (double) (c2 - c1)/CLOCKS_PER_SEC;
}
free(array);
const char *cmp = dt0==dt1 ? "==" : (dt0<dt1 ? "<" : ">");
printf("%s %f %2s %f Diff:% f%%\n", s, dt0, cmp, dt1, 100*(dt1-dt0)/dt0);
}
int main() {
//srand((unsigned) time(0));
size_t n = 3000000;
for (int i=0; i<10; i++) {
test("heap first", 0, n);
test("heap1 first", 1, n);
fflush(stdout);
}
}
Output
heap first 1.263000 > 1.201000 Diff:-4.908947%
heap1 first 1.295000 < 1.326000 Diff: 2.393822%
heap first 1.342000 > 1.295000 Diff:-3.502235%
heap1 first 1.279000 < 1.295000 Diff: 1.250977%
heap first 1.279000 == 1.279000 Diff: 0.000000%
heap1 first 1.280000 > 1.279000 Diff:-0.078125%
heap first 1.295000 > 1.294000 Diff:-0.077220%
heap1 first 1.280000 > 1.279000 Diff:-0.078125%
heap first 1.279000 == 1.279000 Diff: 0.000000%
heap1 first 1.295000 > 1.279000 Diff:-1.235521%
heap first 1.263000 < 1.295000 Diff: 2.533650%
heap1 first 1.280000 > 1.279000 Diff:-0.078125%
heap first 1.295000 > 1.263000 Diff:-2.471042%
heap1 first 1.295000 < 1.310000 Diff: 1.158301%
heap first 1.310000 < 1.326000 Diff: 1.221374%
heap1 first 1.326000 < 1.342000 Diff: 1.206637%
heap first 1.279000 == 1.279000 Diff: 0.000000%
heap1 first 1.264000 < 1.295000 Diff: 2.452532%
heap first 1.279000 > 1.264000 Diff:-1.172791%
heap1 first 1.279000 > 1.264000 Diff:-1.172791%
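The run-to-run spread in the output above is about as large as the effect being measured. One common way to reduce that noise, sketched here under the assumption that a C++11 compiler is available, is to repeat each measurement and keep the fastest run. The helper name fastest_run_seconds is hypothetical and is not part of either benchmark above.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `repeats` times and returns the fastest wall-clock time in
// seconds. The minimum is less sensitive to one-off disturbances such as
// cache warm-up or scheduling than a single measurement.
template <typename Work>
double fastest_run_seconds(Work work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        work();
        const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        const double dt = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || dt < best) best = dt;
    }
    return best;
}
```

The callable should restore its input on every call (for example by copying the unsorted data into a scratch buffer first), otherwise later repetitions would sort already sorted data.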

CUDA: How to fill a vector of dynamic size on device and return its contents to another device function?

I want to know the proper technique to fill a dynamically sized array on the device (int *row in the code below) and then return its contents for use by another device function.
For context, the code below attempts to expand an arbitrary function in a basis of Legendre polynomials using Gauss-Legendre quadrature running on the GPU.
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
__device__ double *d_droot, *d_dweight;
/*How could be returned the array or the pointer to the array int *row, on the device, that is filled by this function? */
__device__
void Pascal_Triangle(int n_row, int * row) {
int a[100][100];
int i, j;
//first row and first coloumn has the same value=1
for (i = 1; i <= n_row; i++) {
a[i][1] = a[1][i] = 1;
}
//Generate the full Triangle
for (i = 2; i <= n_row; i++) {
for (j = 2; j <= n_row - i; j++) {
if (a[i - 1][j] == 0 || a[i][j - 1] == 0) {
break;
}
a[i][j] = a[i - 1][j] + a[i][j - 1];
}
}
row = new int[n_row];
for (i = 1; i <= n_row; i++) {
row[i] = a[i][n_row-1];
}
}
__device__
double Legendre_poly(int order, double x)
{
int n,k;
double val=0;
int *binomials;
for(n=order; n>=0; n--)
{
Pascal_Triangle(n, binomials); /*Here are the problems*/
for(k=0; k<=n; k++)
val += binomials[k]*pow(x-1,n-k)*pow(x-1,k);
}
return val;
}
__device__ __host__
double f(double alpha,double x)
{
/*function expanded on a basis of Legendre palynomials. */
return exp(-alpha*x*x);
}
/*Kernel that computes the expansion by quadratures*/
__global__ void Span(int n, double alpha, double a, double b, double *coefficients)
{
/*
Parameters:
n: Total number of expansion coeficients
a: Lower integration limit
b: Upper integration limit
d_droots[]: roots for the quadrature
d_dweight[]: weights for the quadrature
coefficients[]: allocate N expansion coefficients.
*/
double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
int dummy;
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n)
{
coefficients[i] = 0.0;
for (dummy = 0; dummy < 5; dummy++)
coefficients[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*Legendre_poly(dummy,c1 * d_droot[dummy] + c2)*c1;
}
}
int main(void)
{
int N = 1<<23;
int N_nodes = 5;
double *droot, *dweight, *dresult, *d_dresult, *d_droot_temp, *d_dweight_temp;
/*double version in host*/
droot =(double*)malloc(N_nodes*sizeof(double));
dweight =(double*)malloc(N_nodes*sizeof(double));
dresult =(double*)malloc(N*sizeof(double)); /*will recibe the results of N quadratures!*/
/*double version in device*/
cudaMalloc(&d_droot_temp, N_nodes*sizeof(double));
cudaMalloc(&d_dweight_temp, N_nodes*sizeof(double));
cudaMalloc(&d_dresult, N*sizeof(double)); /*results for N quadratures will be contained here*/
/*double version of the roots and weights*/
droot[0] = 0.90618;
droot[1] = 0.538469;
droot[2] = 0.0;
droot[3] = -0.538469;
droot[4] = -0.90618;
dweight[0] = 0.236927;
dweight[1] = 0.478629;
dweight[2] = 0.568889;
dweight[3] = 0.478629;
dweight[4] = 0.236927;
/*double copy host-> device*/
cudaMemcpy(d_droot_temp, droot, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_dweight_temp, dweight, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(d_droot, &d_droot_temp, sizeof(double *));
cudaMemcpyToSymbol(d_dweight, &d_dweight_temp, sizeof(double *));
// Perform the expansion
Span<<<(N+255)/256, 256>>>(N,1.0, -3.0, 3.0, d_dresult); /*This kerlnel works OK*/
cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_dresult);
cudaFree(d_droot_temp);
cudaFree(d_dweight_temp);
}
and here is the makefile to compile the code above:
objects = main.o
all: $(objects)
nvcc -arch=sm_20 $(objects) -o span
%.o: %.cpp
nvcc -x cu -arch=sm_20 -I. -dc $< -o $@
clean:
rm -f *.o span
Thanks in advance for any suggestions.
(sorry my previous answer was off-base)
You are passing a row pointer to this function:
void Pascal_Triangle(int n_row, int * row) {
You are then attempting to overwrite this pointer with a new value:
row = new int[n_row];
Once you return from this function, row in the calling environment will be unmodified. (This is an ordinary C/C++ issue, not specific to CUDA.)
This is perhaps a confusing issue, but the pointer value of row is passed by value to the function Pascal_Triangle. You cannot modify the pointer value in the function, and expect the modified value to show up in the calling environment. (You can modify the contents of the locations that the pointer points to, which would be the usual reason to pass row by pointer.)
There are a few ways to fix this issue. The simplest might be just to pass the pointer by reference:
void Pascal_Triangle(int n_row, int * &row) {
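Since this is a plain C++ rule, the difference can be seen in a small host-only sketch without any CUDA. The function names below are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Pointer passed by value: the assignment only changes the local copy.
void allocate_by_value(std::size_t n, int *row) {
    row = new int[n];
    delete[] row;  // freed here, because the caller never sees this pointer
}

// Pointer passed by reference: the caller's pointer itself is updated.
void allocate_by_reference(std::size_t n, int *&row) {
    row = new int[n];
    for (std::size_t i = 0; i < n; ++i) row[i] = static_cast<int>(i);
}
```

After allocate_by_value(4, p) a null p is still null, whereas after allocate_by_reference(4, p) it points at the new array, which the caller must later release with delete[].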
Your code seems to have other defects in it. I would suggest that you employ proper cuda error checking and also run your code with cuda-memcheck.
In particular, the in-kernel new operator behaves in a similar fashion to in-kernel malloc, and it has similar limitations.
You are running out of device heap space, so many of your new operations are failing, and returning a NULL pointer.
As a test for this, it's good debug practice to put a line like this after your new operation:
if (row == NULL) assert(0);
(you'll also need to include assert.h)
If you do that, you'll find that this assert is being hit.
I haven't calculated how much device heap space your code actually needs, but it appears to be using quite a bit. In C++, it's customary to delete an allocation made by new once you're done with it. You might want to investigate freeing the allocations done with new, or else (even better) re-use the allocation (i.e. allocate it once per thread), and avoid the reallocation altogether.
here's a modification to your code that demonstrates the above (one allocation per thread) and compiles and runs without error for me:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
__device__ double *d_droot, *d_dweight;
/*How could be returned the array or the pointer to the array int *row, on the device, that is filled by this function? */
__device__
void Pascal_Triangle(int n_row, int *row) {
int a[100][100];
int i, j;
//first row and first coloumn has the same value=1
for (i = 1; i <= n_row; i++) {
a[i][1] = a[1][i] = 1;
}
//Generate the full Triangle
for (i = 2; i <= n_row; i++) {
for (j = 2; j <= n_row - i; j++) {
if (a[i - 1][j] == 0 || a[i][j - 1] == 0) {
break;
}
a[i][j] = a[i - 1][j] + a[i][j - 1];
}
}
for (i = 1; i <= n_row; i++) {
row[i] = a[i][n_row-1];
}
}
__device__
double Legendre_poly(int order, double x, int *my_storage)
{
int n,k;
double val=0;
int *binomials = my_storage;
if (binomials == NULL) assert(0);
for(n=order; n>=0; n--)
{
Pascal_Triangle(n, binomials); /*Here are the problems*/
for(k=0; k<=n; k++)
val += binomials[k]*pow(x-1,n-k)*pow(x-1,k);
}
return val;
}
__device__ __host__
double f(double alpha,double x)
{
/*function expanded on a basis of Legendre palynomials. */
return exp(-alpha*x*x);
}
/*Kernel that computes the expansion by quadratures*/
__global__ void Span(int n, double alpha, double a, double b, double *coefficients)
{
/*
Parameters:
n: Total number of expansion coeficients
a: Lower integration limit
b: Upper integration limit
d_droots[]: roots for the quadrature
d_dweight[]: weights for the quadrature
coefficients[]: allocate N expansion coefficients.
*/
double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
int dummy;
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n)
{
#define MY_LIM 5
int *thr_storage = new int[MY_LIM];
if (thr_storage == NULL) assert(0);
coefficients[i] = 0.0;
for (dummy = 0; dummy < MY_LIM; dummy++)
coefficients[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*Legendre_poly(dummy,c1 * d_droot[dummy] + c2, thr_storage)*c1;
delete[] thr_storage;
}
}
int main(void)
{
cudaDeviceSetLimit(cudaLimitMallocHeapSize, (1048576ULL*1024));
int N = 1<<23;
int N_nodes = 5;
double *droot, *dweight, *dresult, *d_dresult, *d_droot_temp, *d_dweight_temp;
/*double version in host*/
droot =(double*)malloc(N_nodes*sizeof(double));
dweight =(double*)malloc(N_nodes*sizeof(double));
dresult =(double*)malloc(N*sizeof(double)); /*will recibe the results of N quadratures!*/
/*double version in device*/
cudaMalloc(&d_droot_temp, N_nodes*sizeof(double));
cudaMalloc(&d_dweight_temp, N_nodes*sizeof(double));
cudaMalloc(&d_dresult, N*sizeof(double)); /*results for N quadratures will be contained here*/
/*double version of the roots and weights*/
droot[0] = 0.90618;
droot[1] = 0.538469;
droot[2] = 0.0;
droot[3] = -0.538469;
droot[4] = -0.90618;
dweight[0] = 0.236927;
dweight[1] = 0.478629;
dweight[2] = 0.568889;
dweight[3] = 0.478629;
dweight[4] = 0.236927;
/*double copy host-> device*/
cudaMemcpy(d_droot_temp, droot, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_dweight_temp, dweight, N_nodes*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(d_droot, &d_droot_temp, sizeof(double *));
cudaMemcpyToSymbol(d_dweight, &d_dweight_temp, sizeof(double *));
// Perform the expansion
Span<<<(N+255)/256, 256>>>(N,1.0, -3.0, 3.0, d_dresult); /*This kerlnel works OK*/
cudaMemcpy(dresult, d_dresult, N*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_dresult);
cudaFree(d_droot_temp);
cudaFree(d_dweight_temp);
}
This code has a couple advantages:
it can run with a much smaller reservation on the device heap
it's considerably quicker than the vast number of allocations that your code was trying to do.
EDIT:
instead of the assert you could do something like this:
/*Kernel that computes the expansion by quadratures*/
__global__ void Span(int n, double alpha, double a, double b, double *coefficients)
{
/*
Parameters:
n: Total number of expansion coeficients
a: Lower integration limit
b: Upper integration limit
d_droots[]: roots for the quadrature
d_dweight[]: weights for the quadrature
coefficients[]: allocate N expansion coefficients.
*/
double c1 = (b - a) / 2, c2 = (b + a) / 2, sum = 0;
int dummy;
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n)
{
#define MY_LIM 5
int *thr_storage = new int[MY_LIM];
if (thr_storage == NULL) printf("allocation failure!\n");
else {
coefficients[i] = 0.0;
for (dummy = 0; dummy < MY_LIM; dummy++)
coefficients[i] += d_dweight[dummy] * f(alpha,c1 * d_droot[dummy] + c2)*Legendre_poly(dummy,c1 * d_droot[dummy] + c2, thr_storage)*c1;
delete[] thr_storage;
}
}
}

How can I solve this using BIT?

I found a nice math problem, but I still can't solve it. Searching online, I found that it can be solved using the Binary Indexed Tree data structure, but the solution is not clear to me.
The problem is called Finding Magic Triplets and can be found on the UVa Online Judge:
(a + b^2) mod k = c^3 mod k, where a<=b<=c and 1 <= a, b, c <= n.
Given n and k (1 <= n, k <= 10^5), how many different magic triplets exist? A triplet is different from another if any of the three values differs between them.
Here is the solution that I found:
#include <cstdio>
#include <cstring>
using namespace std;
typedef long long int64;
const int MAX_K = (int)(1e5);
int N, K;
struct BinaryIndexedTree{
typedef int64 bit_t;
static const int MAX_BIT = 3*MAX_K + 1;
bit_t data[MAX_BIT+1];
int SIZE;
void init(int size){
memset(data, 0, sizeof(data));
SIZE = size;
}
bit_t sum(int n){
bit_t ret = 0;
for(;n;n-=n&-n){
ret += data[n];
}
return ret;
}
bit_t sum(int from, int to){
return sum(to)-sum(from);
}
void add(int n, bit_t x){
for(n++;n<=SIZE;n+=n&-n){
data[n]+=x;
}
}
};
BinaryIndexedTree bitree;
void init(){
scanf("%d%d", &N, &K);
}
int64 solve(){
bitree.init(2*K+1);
int64 ans = 0;
for(int64 i=N; i>=1; i--){
int64 b = i * i % K, c = i * i * i % K;
bitree.add(c, 1);
bitree.add(c+K, 1);
bitree.add(c+2*K, 1);
int64 len = i;
if(len >= K){
ans += (len / K) * bitree.sum(K);
len %= K;
}
if(len > 0){
ans += bitree.sum(b + 1, b + len + 1);
}
}
return ans;
}
int main(){
int T;
scanf("%d", &T);
for(int i=0; i<T; i++){
init();
printf("Case %d: %lld\n", i+1, solve());
}
return 0;
}
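For readers unfamiliar with the data structure itself, here is a minimal standalone Fenwick (binary indexed) tree sketch. It follows the same conventions as the struct above as far as I can tell: add takes a 0-based position, and the prefix sum covers the first count positions. The class and member names are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal Fenwick (binary indexed) tree over positions 0..size-1.
// Both operations take O(log size) steps.
class Fenwick {
public:
    explicit Fenwick(std::size_t size) : tree_(size + 1, 0) {}

    // Add delta to the value stored at position pos (0-based).
    void add(std::size_t pos, long long delta) {
        assert(pos + 1 < tree_.size());
        // i & (~i + 1) isolates the lowest set bit of i.
        for (std::size_t i = pos + 1; i < tree_.size(); i += i & (~i + 1))
            tree_[i] += delta;
    }

    // Sum of the values at positions 0..count-1.
    long long prefix_sum(std::size_t count) const {
        assert(count < tree_.size());
        long long total = 0;
        for (std::size_t i = count; i > 0; i -= i & (~i + 1))
            total += tree_[i];
        return total;
    }

    // Sum of the values at positions from..to-1.
    long long range_sum(std::size_t from, std::size_t to) const {
        return prefix_sum(to) - prefix_sum(from);
    }

private:
    std::vector<long long> tree_;  // 1-based internal storage
};
```

In the solution above, each add records one residue of c^3, and each range sum counts how many recorded residues fall in a window of consecutive residues.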
Are you determined to use BITs? I would have thought ordinary arrays would do. I would start by creating three arrays of size k, where arrayA[i] = the number of values of a in range equal to i mod k, arrayB[i] = the number of values of b in range where b^2 = i mod k, and arrayC[i] = the number of values of c in range where c^3 = i mod k. n and k are both <= 10^5, so you could just consider each value of a in turn, b in turn, and c in turn. You can be cleverer if k is much smaller than n, because there will be some sort of fiddly fence-post counting expression that lets you work out how many numbers in the range 0..n are equal to i mod k for each i.
Given those three arrays, consider each possible pair of numbers i, j where 0 <= i, j < k, and note that there are arrayA[i] * arrayB[j] pairs which have those values mod k. Sum these up in arrayAB[(i + j) mod k] to find the number of ways that you can choose a and b with a + b^2 mod k = x, for 0 <= x < k. Now you have two arrays, arrayAB and arrayC, where arrayAB[i] * arrayC[i] is the number of ways of finding a triple where a + b^2 = c^3 = i (mod k), so sum this over all 0 <= i < k to get your answer.
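The array approach described above can be sketched as follows. Note the assumption, stated in the code as well: as described, it counts every triple with 1 <= a, b, c <= n and does not enforce the a <= b <= c ordering from the problem statement, so it is a starting point rather than a full solution. The pair loop is O(k^2). The function names are hypothetical, and the brute-force routine is included only to cross-check small inputs under the same counting rule.

```cpp
#include <cassert>
#include <vector>

// Counts triples (a, b, c) with 1 <= a, b, c <= n and
// (a + b^2) mod k == c^3 mod k, using residue-count arrays.
// Assumption: the ordering constraint a <= b <= c is NOT enforced here.
long long count_triples_by_residue(int n, int k) {
    assert(n >= 1 && k >= 1);
    std::vector<long long> countA(k, 0), countB(k, 0), countC(k, 0);
    for (long long v = 1; v <= n; ++v) {
        countA[v % k] += 1;              // residues of a
        countB[v * v % k] += 1;          // residues of b^2
        countC[v * v % k * v % k] += 1;  // residues of c^3
    }
    // countAB[x] = number of (a, b) pairs with (a + b^2) mod k == x.
    std::vector<long long> countAB(k, 0);
    for (int i = 0; i < k; ++i) {
        if (countA[i] == 0) continue;
        for (int j = 0; j < k; ++j)
            countAB[(i + j) % k] += countA[i] * countB[j];
    }
    long long total = 0;
    for (int x = 0; x < k; ++x) total += countAB[x] * countC[x];
    return total;
}

// Direct O(n^3) enumeration with the same counting rule, for small inputs only.
long long count_triples_brute_force(int n, int k) {
    long long count = 0;
    for (long long a = 1; a <= n; ++a)
        for (long long b = 1; b <= n; ++b)
            for (long long c = 1; c <= n; ++c)
                if ((a + b * b) % k == (c * c * c) % k) ++count;
    return count;
}
```

For n = 2 and k = 2 both functions return 4, since the condition reduces to a + b and c having the same parity.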