I am trying to implement a DFT using the FFTW library (link to FFTW documentation).
All the libraries have been correctly linked, and the project builds just fine. However, the program stops running the moment any function from the FFTW library is called.
#include <iostream>
#include <fftw3.h>
using namespace std;
int main() {
    int vectorSize = 100;
    cout << vectorSize << endl;
    fftw_complex vec[vectorSize], vecOut[vectorSize];
    for(int i = 0; i < vectorSize; i++) {
        vec[i][0] = i;
        vec[i][1] = 1;
    }
    // Call to function to create an FFT plan
    fftw_plan plan = fftw_plan_dft_1d(vectorSize, vec, vecOut, FFTW_FORWARD, FFTW_ESTIMATE);
    cout << "test" << endl;
    return 0;
}
If I comment out the line where the fftw_plan is instantiated, the code outputs 100 and "test" as expected. There are no issues in the build, as far as I can tell. I haven't been able to find any post that describes a similar problem.
I am running this in Eclipse, using MinGW and the 32-bit version of the pre-compiled binary available for Windows (download link).
Any help would be really appreciated :)
FFTW expects its input and output arrays to be 16-byte aligned. When you declare the arrays on the stack, this can't be guaranteed, so you need to call fftw_malloc or a similar function to allocate them. Also, your code only creates the plan but never executes it, so no FFT is carried out on the input data.
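The alignment point can be checked directly. The sketch below is plain standard C++ and does not call FFTW at all; `is_aligned` and `AlignedBuffer` are hypothetical names introduced for illustration, and the 16-byte figure is simply the one quoted in the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the address of p is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two.
bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical buffer type: alignas asks the compiler for 16-byte alignment
// explicitly instead of relying on wherever a plain array happens to land.
struct AlignedBuffer {
    alignas(16) double data[8];
};
```

Calling `is_aligned(buf.data, 16)` on an `AlignedBuffer buf` should report true, while the same check on an address 8 bytes further in should report false.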
I am trying to compile and run the following program called test.cu:
#include <iostream>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
// Kernel function to add the elements of two arrays
__global__
void add(int n, float* x, float* y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
int main(void)
{
    int N = 1 << 20;
    float* x, * y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }
    // Run kernel on 1M elements on the GPU
    add <<<1, 256>>> (N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    // Check for errors (all values should be 3.0f)
    for (int i = 0; i < 10; i++)
        std::cout << y[i] << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
I am using Visual Studio Community 2019, and it flags the "add <<<1, 256>>> (N, x, y);" line with an "expected an expression" error. I tried compiling it anyway and it compiles without errors, but when I run the .exe file it outputs a bunch of "1" instead of the expected "3".
I also tried compiling with "nvcc test.cu". Initially it said "nvcc fatal : Cannot find compiler 'cl.exe' in PATH", so I added "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\Hostx64\x64" to PATH, and now compiling with nvcc gives the same result as compiling with Visual Studio.
In both cases the program never enters the "add" function.
I am pretty sure the code is right and that the problem has something to do with the installation. I already tried reinstalling the CUDA Toolkit and repairing MSVC, but it didn't work.
The kernel.cu example that appears when starting a new CUDA project in Visual Studio also didn't work. When run, it output "No kernel image available for execution on the device".
How can I solve this?
nvcc version if that helps:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:35_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28845127_0
Visual Studio provides IntelliSense for C++. In the C++ language, parsing angle brackets properly is troublesome: < is both less-than and the template delimiter, and << is a shift. So the guys at NVIDIA chose about the worst possible delimiter, <<<>>>, which makes it hard for IntelliSense to work properly. The way to get full IntelliSense in CUDA is to switch from the Runtime API to the Driver API. The C++ is just C++, and the CUDA is still (sort of) C++, with no <<<>>> badness for the language parser to work around.
You could take a look at the difference between matrixMul and matrixMulDrv. The <<<>>> syntax is handled by the compiler, which essentially just emits code that makes the Driver API calls. You'll link to cuda.lib instead of cudart.lib, and you may have to deal with a "mixed mode" program if you use CUDA-RT-only libraries. You could refer to this link for more information.
Also, this link tells how to add Intellisense for CUDA in VS.
I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows that it can see the 8 GPUs, but when I try to offload it reports that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();
    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;
    for (int iter = 0; iter < iter_max; iter++){
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j){
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++){
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }
    if (target[0]){
        cout << "not on GPU" << endl;
    } else{
        cout << "On GPU" << endl;}
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find that PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
I have an auxiliary function in the R package I'm currently building named rbinom01. Note that it calls random(3).
int rbinom01(int size) {
    if (!size) {
        return 0;
    }
    int64_t result = 0;
    while (size >= 32) {
        result += __builtin_popcount(random());
        size -= 32;
    }
    result += __builtin_popcount(random() & ~(LONG_MAX << size));
    return result;
}
When I ran R CMD check my_package, I got the following warning:
* checking compiled code ... NOTE
File ‘ my_package/libs/my_package.so’:
Found ‘_random’, possibly from ‘random’ (C)
Object: ‘ my_function.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor use Fortran I/O
nor system RNGs.
See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
I went to the documentation, and it says I can use one of the *_rand functions, along with a family of distribution functions. That's cool, but my package simply needs a stream of random bits rather than a random double. The easiest way to get it is by using random(3) or maybe reading from /dev/urandom, but that makes my package "unportable".
This post suggests using sample, but unfortunately it doesn't fit my use case. For my application, generating random bits is critical to performance, so I don't want it to waste any time calling unif_rand, multiplying the result by N and rounding it. In any case, the reason I'm using C++ is to exploit bit-level parallelism.
Surely I can hand-roll my own PRNG or copy and paste the code of a state-of-the-art PRNG like xoshiro256**, but before doing that I would like to see if there are any easier alternatives.
Incidentally, could someone please link a nice short tutorial of Rcpp to me? Writing R Extensions is comprehensive and awesome but it would take me weeks to finish. I'm looking for a more concise version, but preferably it should be more informative than a call to Rcpp.package.skeleton.
As suggested in @Ralf Stubner's answer, I have rewritten the original code as follows. However, I'm getting the same result every time. How can I seed it properly and at the same time keep my code "portable"?
int rbinom01(int size) {
    dqrng::xoshiro256plus rng;
    if (!size) {
        return 0;
    }
    int result = 0;
    while (size >= 64) {
        result += __builtin_popcountll(rng());
        Rcout << sizeof(rng()) << std::endl;
        size -= 64;
    }
    result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
    return result;
}
There are different R packages that make PRNGs available as C++ header only libraries:
BH: Everything from boost.random
sitmo: Various Threefry versions
dqrng: PCG family, xoshiro256+ and xoroshiro128+
You can make use of any of these by adding LinkingTo to your package's DESCRIPTION. Typically these PRNGs are modeled after the C++11 random header, which means you have to control their life-cycle and seeding yourself. In a single-threaded environment I like to use anonymous namespaces for life-cycle control, e.g.:
#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
// [[Rcpp::plugins(cpp11)]]
namespace {
dqrng::xoshiro256plus rng{};
}
// [[Rcpp::export]]
void set_seed(int seed) {
    rng.seed(seed);
}
// [[Rcpp::export]]
int rbinom01(int size) {
    if (!size) {
        return 0;
    }
    int result = 0;
    while (size >= 64) {
        result += __builtin_popcountll(rng());
        size -= 64;
    }
    result += __builtin_popcountll(rng() & ((1LLU << size) - 1));
    return result;
}
/*** R
However, using runif isn't all bad and certainly faster than accessing /dev/urandom. In dqrng there is a convenient wrapper for this.
As for tutorials: Besides WRE the Rcpp package vignette is a must read. R Packages by Hadley Wickham also has a chapter on "compiled code" if you want to go the devtools-way.
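The life-cycle and seeding pattern from this answer can also be tried outside R. The following is only a sketch under stated assumptions: `std::mt19937_64` from the standard `<random>` header stands in for `dqrng::xoshiro256plus`, and `std::bitset::count` replaces the compiler-specific `__builtin_popcountll`, so it compiles without Rcpp or dqrng. It is not the package code itself.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <random>

namespace {
// File-local generator, created once and reused across calls.
// Assumption: std::mt19937_64 stands in for dqrng::xoshiro256plus.
std::mt19937_64 rng{};
}

// Reseed the file-local generator.
void set_seed(std::uint64_t seed) {
    rng.seed(seed);
}

// Number of set bits among `size` random bits, drawn 64 at a time.
int rbinom01(int size) {
    assert(size >= 0);
    int result = 0;
    while (size >= 64) {
        result += static_cast<int>(std::bitset<64>(rng()).count());
        size -= 64;
    }
    // size is now in [0, 63], so the shift below is well defined.
    const std::uint64_t mask = (std::uint64_t{1} << size) - 1;
    result += static_cast<int>(std::bitset<64>(rng() & mask).count());
    return result;
}
```

Because the generator lives at file scope instead of inside `rbinom01`, consecutive calls continue the same stream, and calling `set_seed` with the same value before two runs should reproduce the same results.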
I am new to C++ programming and StackOverflow, but I have some experience with core Java. I want to participate in programming Olympiads, and I chose C++ because C++ code is generally faster than equivalent Java code.
I was solving some zonal-level problems involving recursion and DP, and I came across this question called Sequence game.
Unfortunately, my code doesn't seem to work. It exits with exit code 3221225477, but I can't make anything of it. I remember Java did a much better job of pointing out my mistakes, but here in c++ I don't have a clue of what's happening. Here's the code:
#include <iostream>
#include <fstream>
#include <cstdio>
#include <algorithm>
#include <vector>
#include <set>
using namespace std;
int N, minimum, maximum;
set <unsigned int> result;
vector <unsigned int> integers;
bool status = true;
void score(unsigned int b, unsigned int step)
{
    if(step < N)
    {
        unsigned int subtracted;
        unsigned int added = b + integers[step];
        bool add_gate = (added <= maximum);
        bool subtract_gate = (b <= integers[step]);
        if (subtract_gate)
            subtracted = b - integers[step];
        subtract_gate = subtract_gate && (subtracted >= minimum);
        if(add_gate && subtract_gate)
        {
            score(added, step++);
            score(subtracted, step++);
        }
        else if(!(add_gate) && !(subtract_gate))
            status = false;
        else if(add_gate)
            score(added, step++);
        else if(subtract_gate)
            score(subtracted, step++);
    }
    else return;
}
int main()
{
    ifstream input("input.txt"); // attach to input file
    streambuf *cinbuf = cin.rdbuf(); // save old cin buffer
    cin.rdbuf(input.rdbuf()); // redirect cin to input.txt
    ofstream output("output.txt"); // attach to output file
    streambuf *coutbuf = cout.rdbuf(); // save old cout buffer
    cout.rdbuf(output.rdbuf()); // redirect cout to output.txt
    unsigned int b;
    for(unsigned int i = 0; i < N; ++i)
        cin >> integers[i];
    score(b, 0);
    set<unsigned int>::iterator iter = result.begin();
    return 0;
}
(Note: I intentionally did not use typedef).
I compiled this code with mingw-w64 in a windows machine and here is the Output:
[Finished in 19.8s with exit code 3221225477] ...
Although I have an Intel i5-8600, compilation took a long time. Much of that time was taken by the antivirus scanning my exe file, and sometimes it keeps compiling for a long time even without any intervention from the antivirus.
(Note: I did not use the command line; instead I used Sublime Text to compile it.)
I even tried tdm-gcc, and again some other peculiar exit code came up. I also tried to run it on an Ubuntu machine, but unfortunately it couldn't find the output file. When I ran it in the CodeChef online IDE, it did not run properly either, but the error message was less scary than MinGW's.
It said that there was a run-time error and "SIGSEGV" was displayed as an error code. Codechef states that
A SIGSEGV is an error(signal) caused by an invalid memory reference or
a segmentation fault. You are probably trying to access an array
element out of bounds or trying to use too much memory. Some of the
other causes of a segmentation fault are : Using uninitialized
pointers, dereference of NULL pointers, accessing memory that the
program doesn’t own.
I have been trying to solve this for a few days, and I am really frustrated by now. When I first started solving this problem I used C arrays, then changed to vectors and finally to std::set, hoping that it would solve the problem, but nothing worked. I tried another DP problem, and the same thing happened.
It would be great if someone could help me figure out what's wrong with my code.
Thanks in advance.
3221225477 converted to hex is 0xC0000005, which stands for STATUS_ACCESS_VIOLATION, which means you tried to access (read, write or execute) invalid memory.
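The decimal-to-hex conversion is easy to reproduce when another unfamiliar exit code shows up. This is a minimal sketch; `exit_code_to_hex` is a hypothetical helper name, not part of any Windows or standard API.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Format a 32-bit process exit code as an upper-case hex string with a
// "0x" prefix, the form in which Windows status codes are usually listed.
std::string exit_code_to_hex(unsigned long long code) {
    assert(code <= 0xFFFFFFFFull);  // exit codes are assumed to fit in 32 bits
    std::ostringstream out;
    out << "0x" << std::hex << std::uppercase << code;
    return out.str();
}
```

Passing 3221225477 gives the string `0xC0000005`, which can then be looked up in a list of status codes.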
I remember Java did a much better job of pointing out my mistakes, but here in c++ I don't have a clue of what's happening.
When your program crashes, you should run it under a debugger. Since you're running your code on Windows, I highly recommend Visual Studio 2017 Community Edition. If you ran your code under it, it would point to the exact line where the crash happens.
As for your crash itself, as PaulMcKenzie points out in the comment, you're indexing an empty vector, which makes std::cin write into out of bounds memory.
integers is a vector which is a dynamic contiguous array whose size is not known at compile time here. So when it is defined initially, it is empty. You need to insert into the vector. Change the following:
for(unsigned int i = 0; i < N; ++i)
    cin >> integers[i];
to this:
int j;
for(unsigned int i = 0; i < N; ++i) {
    cin >> j;
    integers.push_back(j);
}
P.W's answer is correct, but an alternative to using push_back is to pre-allocate the vector after N is known. Then you can read from cin straight into the vector elements as before.
integers = vector<unsigned int>(N);
for (unsigned int i = 0; i < N; i++)
cin >> integers[i];
This method has the added advantage of only allocating memory for the vector once. The push_back method will reallocate if the underlying buffer fills up.
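The two approaches from these answers can be put side by side. The sketch below reads from any `std::istream` so it can be tried without `input.txt`; the function names `read_with_push_back` and `read_presized` are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <vector>

// Variant 1: start with an empty vector and append each value.
std::vector<unsigned int> read_with_push_back(std::istream& in, std::size_t n) {
    std::vector<unsigned int> values;  // size() == 0, so values[i] is invalid here
    unsigned int v;
    for (std::size_t i = 0; i < n && (in >> v); ++i)
        values.push_back(v);
    return values;
}

// Variant 2: size the vector first, then read straight into its elements.
std::vector<unsigned int> read_presized(std::istream& in, std::size_t n) {
    std::vector<unsigned int> values(n);  // n zero-initialised elements
    assert(values.size() == n);           // indexing 0..n-1 is now valid
    for (std::size_t i = 0; i < n; ++i)
        in >> values[i];
    return values;
}
```

Feeding either function a `std::istringstream` holding `3 1 4 1 5` with `n` set to 5 should yield the same five values; the difference is only in when the storage is created.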
I am trying to generate 5000 by 5000 random number matrix. Here is what I do with MATLAB:
for i = 1:100
And here is what I do in C++:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
using namespace std;
int main(){
    int N = 5000;
    double ** A = new double*[N];
    for (int i=0;i<N;i++)
        A[i] = new double[N];
    clock_t start = clock();
    for (int k=0;k<100;k++){
        for (int i=0;i<N;i++){
            for (int j=0;j<N;j++){
                A[i][j] = rand();
            }
        }
    }
    cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << endl;
}
MATLAB takes around 38 seconds while C++ takes around 90 seconds.
In another question, people executed the same code and got same speeds for both C++ and MATLAB.
I am using Visual C++ with the following optimizations.
I would like to learn what I am missing here. Thank you for all the help.
EDIT: Here is the key thing though...
Why is MATLAB faster than C++ at creating random numbers?
In this question, people gave me answers where their C++ speeds are same as MATLAB. When I use the same code I get way worse speeds and I am trying to understand why.
Your test is flawed, as others have noted, and does not even address the statement made by the title. You are comparing a built-in Matlab function to C++, not Matlab code itself, which in fact executes 100x more slowly than C++. Matlab is just a wrapper around the BLAS/LAPACK libraries in C/Fortran, so one would expect a Matlab script and a competently written C++ program to be approximately equivalent, and indeed they are. This code in Matlab 2007b
tic; A = rand(5000); toc
executes in 810ms on my machine and this
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <ctime>
#define N 5000
int main()
{
    clock_t start = clock();
    int num_rows = N,
        num_cols = N;
    double * A = new double[N*N];
    for (int i=0; i<N*N; ++i)
        A[i] = rand();
    std::cout << "T="<< (clock()-start)/(double)(CLOCKS_PER_SEC/1000)<< "ms " << std::endl;
    return 0;
}
executes in 830ms. A slight advantage for Matlab's in-house RNG over rand() is not too surprising. Note also the single indexing. This is how Matlab does it, internally. It then uses a clever indexing system (developed by others) to give you a matrix-like interface to the data.
In your C++ code, you are doing 5000 allocations of double[5000] on the heap. You would probably get much better speed if you did a single allocation of a double[25000000], and then do your own arithmetic to convert your 2 indices to a single one.
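A single allocation with hand-written index arithmetic might look like the sketch below. `FlatMatrix` is a hypothetical class name, and row-major order (`i * cols + j`) is an assumption made for this example; MATLAB itself stores matrices column-major.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix stored in one contiguous block instead of one block per row.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Convert the two indices (i, j) into a single offset, row-major.
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[i * cols_ + j];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    const std::vector<double>& data() const { return data_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // the single allocation
};
```

With this layout `FlatMatrix A(5000, 5000);` performs one allocation of 25,000,000 doubles, and `A(i, j) = rand();` keeps the familiar two-index interface.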
I believe MATLAB utilizes multiple CPU cores on your machine. Have you tried writing a multi-threaded version and measuring the difference?
Also, the quality of the (pseudo) random number generator would make a slight difference (but not that much).
In my experience,
First, check that you are running your C++ code in Release mode rather than Debug mode. (Although I see in the picture that you are in Release mode.)
Consider MPI parallelization.
Bear in mind that MATLAB is highly optimized and compiled with the Intel compiler, which produces faster executables. You can also try more advanced compilers if you can afford them.
Lastly, you can aggregate the loops by using a function that generates the combinations of i, j in a single loop. (In Python this is common practice, using the product function from the itertools library; see this.)
I hope it helps.
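The loop-aggregation idea in the last point can be sketched in C++ as one counter that is mapped back to an (i, j) pair. `unflatten` is a hypothetical helper name, and whether this is any faster than nested loops is not something this sketch measures.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Map a single loop counter k in [0, rows * cols) back to (row, column),
// assuming row-major order.
std::pair<std::size_t, std::size_t> unflatten(std::size_t k, std::size_t cols) {
    assert(cols > 0);
    return {k / cols, k % cols};
}
```

A loop such as `for (std::size_t k = 0; k < N * N; ++k)` then visits every pair once, with `unflatten(k, N)` supplying the indices where they are needed.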