I am running the following program in VS 2012 to try out the Thrust function find:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/find.h>
#include <thrust/device_vector.h>
#include <iostream>
int main() {
thrust::device_vector<char> input(4);
input[0] = 'a';
input[1] = 'b';
input[2] = 'c';
input[3] = 'd';
thrust::device_vector<char>::iterator iter;
iter = thrust::find(input.begin(), input.end(), 'a');
std::cout << "Index of a = " << iter - input.begin() << std::endl;
return 0;
}
This is a modified version of a code example taken from http://docs.thrust.googlecode.com/hg/group__searching.html#ga99c7a59cef5b9f4cdbc70f37b2e221be
When I run this in Debug mode, my program crashes and I get the error Debug Error! ... R6010 - abort() has been called. However, running this in Release mode I just get my expected output Index of a = 0.
I know that the crash happens because of the line that includes the find function.
What might cause this to happen?
There are a few similar questions on this topic.
To quote a comment : "Thrust is known to not compile and run correctly when built for debugging"
And from the docs: "nvcc does not support device debugging Thrust code. Thrust functions compiled with (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely crash."
I've discovered an issue affecting several unit tests at my work that only appears when the tests are run under valgrind: the values returned from std::cos and std::sin differ for identical inputs depending on whether the unit test is run on its own or under valgrind.
This issue only seems to happen for some specific inputs, because many unit tests pass which run through the same code.
Here's a minimally reproducible example (slightly worsened so that my compiler wouldn't optimize away any of the logic):
#include <complex>
#include <iomanip>
#include <iostream>
int main()
{
std::complex<long double> input(0,0), output(0,0);
input = std::complex<long double>(39.21460183660255L, -40);
std::cout << "input: " << std::setprecision(20) << input << std::endl;
output = std::cos(input);
std::cout << "output: " << std::setprecision(20) << output << std::endl;
if (std::abs(output) < 5.0)
{
std::cout << "TEST FAIL" << std::endl;
return 1;
}
std::cout << "TEST PASS" << std::endl;
return 0;
}
Output when run normally:
input: (39.21460183660254728,-40)
output: (6505830161375283.1118,117512680740825220.91)
TEST PASS
Output when run under valgrind:
input: (39.21460183660254728,-40)
output: (0.18053126362312540976,3.2608771240037195405)
TEST FAIL
Notes:
OS: Red Hat Enterprise Linux 7
Compiler: Intel oneAPI 2022 next-generation DPC++/C++ Compiler
Valgrind: 3.20 (built with same compiler), also occurred on official distribution of 3.17
Issue did not manifest when unit tests were built with GCC-7 (cannot go back to that compiler) or GCC-11 (another larger bug with boost prevents us from using this with valgrind)
-O0/1/2/3 make no difference on this issue
only compiler flag I have set is "-fp-speculation=safe", which otherwise if unset causes numerical precision issues in other unit tests
Are there any better ways I can figure out what's going on to resolve this situation, or should I submit a bug report to valgrind? I hope this issue is benign, but I want to be able to trust my valgrind output.
I am currently writing a hashtable, and one of my tests failed after I changed part of the implementation to use vector extensions. It turns out that when I have a std::array (I do not know whether the problem is std::array itself or something else) and insert elements in a certain order, my code does not work. Once I swap the positions of two elements, it suddenly works. The element in question is 33; once it is swapped with 17, the code works as expected.
I have tried looking at the compiled assembly in godbolt, but my assembly is really not good enough to deduce any useful information from that.
I have written a minimal-reproducible code sample here:
#include <iostream>
#include <array>
#include <cstdint>
#include <immintrin.h>
int main() {
// Dysfunctional order
std::array<std::uint32_t, 8> keys_not_functional = {{1,17,33,49,0,0,0,0}};
__m256i key_vector = _mm256_set1_epi32(33);
__m256i cmp_vector = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys_not_functional.data()));
__m256i cmp = _mm256_cmpeq_epi32(key_vector, cmp_vector);
std::uint8_t mask = _mm256_movemask_epi8(cmp);
if (mask != 0) {
std::uint8_t index = __builtin_ctz(mask) / 4;
std::cout << "Found at in dysfunctional: " << unsigned(index) << std::endl;
}
// Changing 17 and 33 makes this work without a problem
std::array<std::uint32_t, 8> keys_functional = {{1,33,17,49,0,0,0,0}};
key_vector = _mm256_set1_epi32(33);
cmp_vector = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys_functional.data()));
cmp = _mm256_cmpeq_epi32(key_vector, cmp_vector);
mask = _mm256_movemask_epi8(cmp);
if (mask != 0) {
std::uint8_t index = __builtin_ctz(mask) / 4;
std::cout << "Found at in functional: " << unsigned(index) << std::endl;
}
return 0;
}
I compiled it with: g++ -std=c++17 -mavx2 -march=native -O0. Running on gcc 7.4.0 in WSL1, Ubuntu 18.04.4 LTS, kernel 4.4.0-18362-Microsoft. My Processor is a Kaby Lake R i5-8250.
I have not tried compiling it on a different system yet. Is this a problem with my configuration/system or even WSL? Can someone point me to the reason for this?
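One thing worth checking, offered as a sketch rather than a confirmed diagnosis: _mm256_movemask_epi8 returns a 32-bit int with one bit per byte, so a match in 32-bit lane k sets bits 4k to 4k+3. Storing that result in a std::uint8_t keeps only bits 0 to 7, which cover lanes 0 and 1. With 33 in lane 2 the mask is 0x00000F00, and that truncates to 0. The snippet below emulates this in portable C++ so it does not need AVX2; byte_mask_for_matches, count_trailing_zeros and first_match_index are hypothetical helper names used for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for _mm256_cmpeq_epi32 followed by _mm256_movemask_epi8:
// every matching 32-bit lane contributes four set bits (one per byte).
std::uint32_t byte_mask_for_matches(const std::array<std::uint32_t, 8>& keys,
                                    std::uint32_t key) {
    std::uint32_t mask = 0;
    for (std::size_t lane = 0; lane < keys.size(); ++lane) {
        if (keys[lane] == key) {
            mask |= 0xFu << (4 * lane);
        }
    }
    return mask;
}

// Portable replacement for __builtin_ctz; the argument must be non-zero.
int count_trailing_zeros(std::uint32_t value) {
    assert(value != 0);
    int bits = 0;
    while ((value & 1u) == 0) {
        value >>= 1;
        ++bits;
    }
    return bits;
}

// Returns the index of the first matching lane, or -1 if the stored mask is 0.
// Mask is the integer type the mask is stored in (std::uint8_t in the question).
template <typename Mask>
int first_match_index(const std::array<std::uint32_t, 8>& keys, std::uint32_t key) {
    const Mask mask = static_cast<Mask>(byte_mask_for_matches(keys, key));
    if (mask == 0) {
        return -1;  // any match in lanes 2..7 is lost when Mask is 8 bits wide
    }
    return count_trailing_zeros(mask) / 4;
}
```

With keys {1,17,33,49,0,0,0,0} and key 33, first_match_index<std::uint8_t> returns -1 while first_match_index<std::uint32_t> returns 2. With 17 and 33 swapped, both return 1, which matches the behaviour described in the question.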
I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload, it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below, it shows me that it can see the 8 GPUs, but when I try to offload it says that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
double tol = 1E-10;
double err = 1;
size_t iter_max = 10;
size_t iter = 0;
bool notGPU[1] = {true};
double Anew[n][m];
double A[n][m];
int target[1];
target[0] = omp_get_initial_device();
cout << "Total Devices: " << omp_get_num_devices() << endl;
cout << "Target: " << target[0] << endl;
for (int iter = 0; iter < iter_max; iter++){
#pragma omp target
{
err = 0.0;
#pragma omp parallel for reduction(max:err)
for (int j = 1; j < n-1; ++j){
target[0] = omp_is_initial_device();
for (int i = 1; i < m-1; i++){
Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
err = fmax(err, fabs(Anew[j][i] - A[j][i]));
}
}
}
}
if (target[0]){
cout << "not on GPU" << endl;
} else{
cout << "On GPU" << endl;}
return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find PGI to do a worse job of compiling C/C++ than the others (seems inefficient, non-standard flags), but a Community Edition is available for free and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
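Without that, the region runs on the default device (which can be set with OMP_DEFAULT_DEVICE or omp_set_default_device()), so you haven't explicitly asked OpenMP to run on a particular GPU.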
I am writing a simple code where I try to get the device count.
#include <cuda.h>
#include <iostream>
int main(){
CUcontext cudaContext;
int deviceCount = 0;
CUresult result = cuDeviceGetCount(&deviceCount);
std::cout << "device count = " << deviceCount << std::endl;
}
Compile command: g++ test.cpp -lcuda
When I try to get the device count I get zero, even though I have a GPU.
Or is it supposed to be zero?
You are using the CUDA driver API here.
Driver API code should start with a call to cuInit(0);. If you don't do that, your driver API calls will probably return error codes such as an initialization error.
You may want to study some CUDA driver API sample codes such as vectorAddDrv.
I've encountered the phenomenon that my code gives me different results depending on whether I use debug mode or release mode. I've stripped the problem down to the code below. I am using Microsoft Visual Studio Professional 2013 and the library Boost 1.62.
#include "stdafx.h"
#include <iostream>
#include <math.h>
#include <boost/numeric/interval.hpp>
#include <boost/numeric/interval/rounded_arith.hpp>
using namespace std;
using namespace boost::numeric::interval_lib;
using namespace boost::numeric;
typedef interval<double, policies<save_state<rounded_transc_std<double> >,
checking_base<double> > > Interval;
int _tmain(int argc, _TCHAR* argv[])
{
Interval result = (Interval(3.15, 4.6) - Interval(-0.6, 2.1))*sqrt(Interval(2, 2) + Interval(-2, -2)*Interval(10.022631612535406, 10.031726559552226));
cout << "result: " << result.lower() << " " << result.upper();
return 0;
}
The result while in debug mode is 1.#QNAN 1.#QNAN
The result while in release mode is 0 0
I would like to know what causes this problem and how to fix it, since not being able to rely on the results causes serious problems in my project.
The square root of a negative number is a tough proposition. The problem is Interval(-2, -2), which makes the argument of sqrt negative. How Visual Studio manages to produce 0 0 in release mode remains a mystery. :) NaN is the most appropriate result for sqrt(-x). You may want to take the sqrt of a std::complex<T> instead.
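To make the point concrete: the argument of sqrt in the question is 2 + (-2) * [10.0226..., 10.0317...], which is roughly the interval [-18.06, -18.05], so it is entirely negative. The sketch below uses plain double and std::complex<double> rather than Boost.Interval, purely to illustrate the NaN versus complex-root behaviour; real_sqrt, complex_sqrt and check_negative_sqrt are hypothetical names, and -18.05 is just a representative value from that range.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Real-valued square root: a negative argument yields NaN.
double real_sqrt(double x) {
    return std::sqrt(x);
}

// Complex square root: a negative real argument yields a purely imaginary root.
std::complex<double> complex_sqrt(double x) {
    return std::sqrt(std::complex<double>(x, 0.0));
}

// Checks both behaviours for a representative negative value.
void check_negative_sqrt() {
    const double x = -18.05;
    assert(std::isnan(real_sqrt(x)));

    const std::complex<double> z = complex_sqrt(x);
    assert(std::abs(z.real()) < 1e-12);                      // no real part
    assert(std::abs(z.imag() - std::sqrt(18.05)) < 1e-12);   // imaginary part is sqrt(18.05)
}
```

Calling check_negative_sqrt() from main passes silently when assertions are enabled. Whether a complex result is meaningful for the interval computation in the question is a separate design decision; the sketch only shows why NaN is the expected outcome in the real-valued case.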