Running OpenCL under Windows 10 - c++

I want to run OpenCL application under Windows 10 using my GTX970 graphics card. But the following code doesn't work =(
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <CL/cl.h>
#include <vector>
#include <fstream>
#include <iostream>
#include <iomanip>
int main() {
    std::vector<cl::Platform> platforms;
    std::vector<cl::Device> devices;
    try
    {
        cl::Platform::get(&platforms);
        std::cout << platforms.size() << std::endl;
        for (cl_uint i = 0; i < platforms.size(); ++i)
        {
            platforms[i].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        }
        std::cout << devices.size() << std::endl;
    }
    catch (cl::Error e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }
    return 0;
}
It gives me error code -1. I am using Visual Studio 2015 Community Edition to build and run it, with the NVIDIA CUDA SDK v8.0 installed and the paths configured, so the compiler and linker know about the SDK.
Can someone please explain what's wrong with this snippet?
Thanks in advance!
EDIT: Can someone also explain why, when I try to debug this code, it fails when getting the platform IDs, yet when I run it without debugging it prints that I have 2 platforms (my GPU card and the integrated GPU)?

Your iGPU is probably Intel (I assume you paired the GTX 970 with an Intel CPU for gaming), which also ships some experimental OpenCL 2.1 platform support that can make an OpenCL 1.2 app fail at device picking or platform picking (I had a similar problem).
You should check the error codes returned by OpenCL API calls. They give better information about what happened.
For example, my system has two Intel platforms, one being an experimental 2.1 platform for the CPU only and one being a normal 1.2 platform for both the GPU and the CPU.
To check that, query the platform version string and compare its 7th and 9th characters against '1' and '2' for 1.2, or '2' and '0' for 2.0. This should eliminate the experimental 2.1 platform, which has '2' at the 7th character and '1' at the 9th (where indexing starts at 0, of course).
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetPlatformInfo.html
CL_PLATFORM_VERSION
Remember that the version is returned as a character string, so compare against the digit characters ('1', '2'), not their integer values.
Nvidia should already have 1.2 support.
If I'm right, querying for CPU devices may return 2 from Intel and 1 from Nvidia (if it exposes any).
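A minimal sketch of that check, reusing the cl.hpp wrapper from the question (the version-string parsing assumes the standard "OpenCL <major>.<minor> ..." format, so treat it as an illustration rather than something tested on your machine):

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>
int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    for (cl::Platform& p : platforms) {
        // CL_PLATFORM_VERSION looks like "OpenCL <major>.<minor> <vendor info>",
        // so the characters at indices 7 and 9 are the major and minor digits.
        std::string ver = p.getInfo<CL_PLATFORM_VERSION>();
        std::cout << ver << std::endl;
        if (ver.size() > 9 && ver[7] == '2' && ver[9] == '1')
            continue; // skip the experimental 2.1 platform
        // Query each platform separately: a platform with no GPU devices throws
        // cl::Error with CL_DEVICE_NOT_FOUND (-1) when exceptions are enabled.
        try {
            std::vector<cl::Device> gpus;
            p.getDevices(CL_DEVICE_TYPE_GPU, &gpus);
            devices.insert(devices.end(), gpus.begin(), gpus.end());
        } catch (const cl::Error& e) {
            std::cout << e.what() << " : " << e.err() << std::endl;
        }
    }
    std::cout << devices.size() << " GPU device(s) found" << std::endl;
    return 0;
}

For what it's worth, the -1 you are seeing is CL_DEVICE_NOT_FOUND, which is exactly what getDevices reports for a platform that has no GPU devices.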

Related

std::cos gives different result when run with valgrind

I've discovered an issue impacting several unit tests at my work, which only happens when the unit tests are run with valgrind: the values returned from std::cos and std::sin are different for identical inputs depending on whether the unit test is run in isolation or under valgrind.
This issue only seems to happen for some specific inputs, because many unit tests pass which run through the same code.
Here's a minimal reproducible example (made slightly more convoluted so that my compiler wouldn't optimize away any of the logic):
#include <complex>
#include <iomanip>
#include <iostream>
int main()
{
    std::complex<long double> input(0,0), output(0,0);
    input = std::complex<long double>(39.21460183660255L, -40);
    std::cout << "input: " << std::setprecision(20) << input << std::endl;
    output = std::cos(input);
    std::cout << "output: " << std::setprecision(20) << output << std::endl;
    if (std::abs(output) < 5.0)
    {
        std::cout << "TEST FAIL" << std::endl;
        return 1;
    }
    std::cout << "TEST PASS" << std::endl;
    return 0;
}
Output when run normally:
input: (39.21460183660254728,-40)
output: (6505830161375283.1118,117512680740825220.91)
TEST PASS
Output when run under valgrind:
input: (39.21460183660254728,-40)
output: (0.18053126362312540976,3.2608771240037195405)
TEST FAIL
Notes:
OS: Red Hat Enterprise Linux 7
Compiler: Intel OneAPI 2022 Next generation DPP/C++ Compiler
Valgrind: 3.20 (built with same compiler), also occurred on official distribution of 3.17
Issue did not manifest when unit tests were built with GCC-7 (cannot go back to that compiler) or GCC-11 (another larger bug with boost prevents us from using this with valgrind)
-O0/1/2/3 make no difference on this issue
The only compiler flag I have set is "-fp-speculation=safe", which, if unset, causes numerical precision issues in other unit tests.
Are there any better ways I can figure out what's going on to resolve this situation, or should I submit a bug report to valgrind? I hope this issue is benign, but I want to be able to trust my valgrind output.

MPI stopped working on multiple cores suddenly

This piece of code was working fine before with mpi
#include <mpi.h>
#include <iostream>
using namespace std;
int id, p;
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    cout << "Processor " << id << " of " << p << endl;
    cout.flush();
    MPI_Barrier(MPI_COMM_WORLD);
    if (id == 0) cout << "Every process has got to this point now!" << endl;
    MPI_Finalize();
}
Giving the output:
Processor 0 of 4
Processor 1 of 4
Processor 2 of 4
Processor 3 of 4
Every process has got to this point now!
When run on 4 cores with the command mpiexec -n 4 ${executable filename}.
I restarted my laptop (I'm not sure if this is the cause) and ran the same code, and now it outputs on one core:
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
I'm using Microsoft MPI and the project configuration hasn't changed.
I'm not really sure what to do about this.
I also installed Intel Parallel Studio and integrated it with Visual Studio before restarting.
But I'm still compiling with Visual C++ (same configuration as when it was working fine).
The easy fix was to uninstall Intel Parallel Studio; its MPI runtime had most likely replaced Microsoft MPI's mpiexec on the PATH, so each of the 4 launched processes started as its own single-rank job.

OpenMP 4.5 won't offload to GPU with target directive

I am trying to make a simple GPU offloading program using OpenMP. However, when I try to offload it still runs on the default device, i.e. my CPU.
I have installed a compiler, g++ 7.2.0, that has CUDA support (it is on a cluster that I use). When I run the code below it shows that it can see the 8 GPUs, but when I try to offload it says that it is still on the CPU.
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <math.h>
#include <algorithm>
#define n 10000
#define m 10000
using namespace std;
int main()
{
    double tol = 1E-10;
    double err = 1;
    size_t iter_max = 10;
    size_t iter = 0;
    bool notGPU[1] = {true};
    double Anew[n][m];
    double A[n][m];
    int target[1];
    target[0] = omp_get_initial_device();
    cout << "Total Devices: " << omp_get_num_devices() << endl;
    cout << "Target: " << target[0] << endl;
    for (int iter = 0; iter < iter_max; iter++){
        #pragma omp target
        {
            err = 0.0;
            #pragma omp parallel for reduction(max:err)
            for (int j = 1; j < n-1; ++j){
                target[0] = omp_is_initial_device();
                for (int i = 1; i < m-1; i++){
                    Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
                    err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                }
            }
        }
    }
    if (target[0]){
        cout << "not on GPU" << endl;
    } else {
        cout << "On GPU" << endl;
    }
    return 0;
}
When I run this I always get that it is not on the GPU, but that there are 8 devices available.
This is not a well documented process!
You have to install some packages which look a little like:
sudo apt install gcc-offload-nvptx
You also need to add additional flags to your compilation string. I've globbed together a number of them below. Mix and match until something works, or use them as the basis for further Googling.
gcc -fopenmp -foffload=x86_64-intelmicemul-linux-gnu="-mavx2" -foffload=nvptx-none -foffload="-O3" -O2 test.c -fopenmp-targets=nvptx64-nvidia-cuda
When I last tried this with GCC in 2018 it just didn't work. At that time target offloading for OpenMP only worked with the IBM XL compiler, and OpenACC (a similar set of directives to OpenMP) only worked with Nvidia's PGI compiler. I find PGI does a worse job of compiling C/C++ than the others (it seems inefficient and uses non-standard flags), but a Community Edition is available for free and a little translating will get you running in OpenACC quickly.
IBM XL seems to do a fine job compiling, but I don't know if it's available for free.
The situation may have changed with GCC. If you find a way to get it working, I'd appreciate you leaving a comment here. My strong recommendation is that you stop trying with GCC7 and get ahold of GCC8 or GCC9. GPU offloading is a fast-moving area and you'll want the latest compilers to take best advantage of it.
Looks like you're missing a device(id) in your #pragma omp target line:
#pragma omp target device(/*your device id here*/)
Without that, you haven't explicitly asked OpenMP to run anywhere but your CPU.
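If it helps, here is a minimal sketch along those lines (the device id 0 and the map clause are my assumptions, so adjust them to your setup):

#include <omp.h>
#include <cstdio>
int main() {
    int on_host = 1;
    // Ask for device 0 explicitly and map the flag back to the host.
    #pragma omp target device(0) map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }
    std::puts(on_host ? "not on GPU" : "On GPU");
    return 0;
}

If this still prints "not on GPU", the problem is more likely the offloading flags at compile time than the directive itself.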

Why is cuda device count zero?

I am writing a simple code where I try to get the device count.
#include <cuda.h>
#include <iostream>
int main(){
    CUcontext cudaContext;
    int deviceCount = 0;
    CUresult result = cuDeviceGetCount(&deviceCount);
    std::cout << "device count = " << deviceCount << std::endl;
}
Compile command: g++ test.cpp -lcuda
When I try to get the device count I get zero, even though I have a GPU.
Or is it supposed to be zero?
You are using the CUDA driver API here.
A driver API code should start with cuInit(0);. If you don't do that, your usage of the driver API will probably return error codes such as initialization error.
You may want to study some CUDA driver API sample codes such as vectorAddDrv.
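A minimal sketch of the same query with cuInit() added and the return codes checked (the error handling is deliberately simple and only illustrative):

#include <cuda.h>
#include <iostream>
int main() {
    // The driver API must be initialized before any other driver call.
    CUresult result = cuInit(0);
    if (result != CUDA_SUCCESS) {
        std::cout << "cuInit failed with error " << result << std::endl;
        return 1;
    }
    int deviceCount = 0;
    result = cuDeviceGetCount(&deviceCount);
    if (result != CUDA_SUCCESS) {
        std::cout << "cuDeviceGetCount failed with error " << result << std::endl;
        return 1;
    }
    std::cout << "device count = " << deviceCount << std::endl;
    return 0;
}

It builds with the same g++ test.cpp -lcuda command from the question.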

Visual C++ 2013 -- _tzname and _timezone

I have a C++ Date/Time library that I have literally used for decades. It's been rock solid without any issues. But today, as I was making some small enhancements, my test code started complaining violently. The following program demonstrates the problem:
#include <iostream>
#include <time.h>
void main(void) {
    _tzset();
    std::cout << "_tzname[ 0 ]=" << _tzname[ 0 ] << std::endl;
    std::cout << "_tzname[ 1 ]=" << _tzname[ 1 ] << std::endl;
    std::cout << "_timezone=" << _timezone << std::endl;
    size_t ret;
    char buf[ 64 ];
    _get_tzname(&ret,buf,64,0);
    std::cout << "_get_tzname[ 0 ]=" << buf << std::endl;
    _get_tzname(&ret,buf,64,1);
    std::cout << "_get_tzname[ 1 ]=" << buf << std::endl;
}
If I run this in the Visual Studio debugger I get the following output:
_tzname[ 0 ]=SE Asia Standard Time
_tzname[ 1 ]=SE Asia Daylight Time
_timezone=-25200
_get_tzname[ 0 ]=SE Asia Standard Time
_get_tzname[ 1 ]=SE Asia Daylight Time
This is correct.
But if I run the program from the command line I get the following output:
_tzname[ 0 ]=Asi
_tzname[ 1 ]=a/B
_timezone=0
_get_tzname[ 0 ]=Asi
_get_tzname[ 1 ]=a/B
Note that the TZ environment variable is set to Asia/Bangkok, which is a synonym for SE Asia Standard Time, or UTC+7. You will notice in the command-line output that the tzname[ 0 ] value is the first 3 characters of Asia/Bangkok and tzname[ 1 ] is the next 3 characters. I have some thoughts on this, but I cannot make sense of it, so I'll just stick to the facts.
Note that I included the calls to _get_tzname(...) to demonstrate that I am not getting caught in some kind of deprecation trap, given that _tzname and _timezone are deprecated.
I'm on Windows 7 Professional and I am linking statically to the runtime library (Multi-threaded Debug (/MTd)). I recently installed Visual Studio 2015 and while I am not using it yet, I compiled this program there and the results are the same. I thought there was a chance that I was somehow linking with the VS2015 libraries but I cannot verify this. The Platform Toolset setting in both projects reflects what I would expect.
Thank you for taking the time to look at this...