I have a problem with a simple CUDA program that just adds two numbers. I run it on a laptop with a GeForce GT 320M GPU on Windows 7, and I compile it with Visual Studio 2013 (I don't know if that matters). The problem is that I always get 0 as a result. I tried checking the given parameters (just returning all the parameters passed to the kernel in an array) and they all seemed to be 0. I ran the same program on another computer (at university) and there it runs completely fine and returns the correct result. So I think it must be some configuration problem, but I am not sure.
#include <cuda.h>
#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void add(int a, int b, int* c)
{
    *c = a + b;
}

int main(int argc, char** argv)
{
    int c;
    int* dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1, 1>>>(1, 2, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("a + b = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
I also ran this code snippet that I found somewhere:
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaThreadSynchronize();
This didn't report anything either.
If you are using the typical CUDA template to create a new Visual Studio project, you have to take care to set the compute capability for which to compile correctly, changing the default values if needed. This can be done by setting, for example,
compute_12,sm_12
in the CUDA C/C++ Configuration Properties. In your case, the default compute capability was 2.0, while your card is of an earlier architecture (the GT 320M is compute capability 1.2). This was the source of your mis-computations: the kernel never actually ran, so the output buffer was left untouched.
P.S. As of September 2014, CUDA 6.5 is the only version of CUDA that supports Visual Studio 2013; see Is Cuda 6 supported with Visual Studio 2013?.
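One way to see this kind of failure immediately, rather than reading back zeros, is to check the runtime's error codes after the launch. A minimal sketch using the standard CUDA runtime error API (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString); this requires a CUDA-capable machine to run:

```cuda
#include <cstdio>
#include "cuda_runtime.h"

__global__ void add(int a, int b, int* c) { *c = a + b; }

int main()
{
    int c = -1;
    int* dev_c = nullptr;
    cudaMalloc((void**)&dev_c, sizeof(int));

    add<<<1, 1>>>(1, 2, dev_c);

    // A kernel built for the wrong compute capability fails here
    // (typically cudaErrorInvalidDeviceFunction) instead of silently
    // leaving the output untouched.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();  // catches errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("a + b = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
```

With the compute-capability mismatch described above, this variant prints a launch error instead of a wrong sum, which points directly at the project settings.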
I want to try out the Blaze linear algebra library and started with a simple test program. It looks like this:
#include <iostream>
#include <blaze/Blaze.h>

typedef blaze::DynamicVector<int, blaze::columnVector> bdVector;

int main(int argc, char** argv) {
    bdVector a{ 1, 2, 3 };
    std::cout << a + a << std::endl;
    return 0;
}
This works fine. But when I move the code into a separate function, like this
#include <iostream>
#include <blaze/Blaze.h>

typedef blaze::DynamicVector<int, blaze::columnVector> bdVector;

bdVector func(bdVector a, bdVector b) { return a + b; }

int main(int argc, char** argv) {
    bdVector a{ 1, 2, 3 };
    std::cout << func(a, a) << std::endl;
    return 0;
}
I get the error:
The Ordinal 968 was not found in DLL "PATH/blaze-test.exe"
It's apparently related to ntdll.dll:
Exception at 0x00007FFA7128EB78 (ntdll.dll) in blaze-test.exe: 0xC0000138: Ordinal Not Found.
I didn't find anything related when I googled this error, so I hope someone here has an idea.
Best regards
PS: Just in case: I used CMake and Visual Studio 2019 and built in both debug and release configurations (in release it's just ordinal 900 instead).
Edit: It seems this is not caused by the function. I added this line to the working code
bdVector b = a;
and this assignment alone causes the same error.
I found the problem. It seems MKL ships its own OpenMP library, so there were two OpenMP runtimes in the process due to the find_package(OpenMP) in CMake. It worked with Clang because Clang didn't find OpenMP.
However, I don't understand why this happened on a plain assignment. I would expect it to happen either always, or only when I use an operation that actually uses OpenMP.
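For anyone hitting the same thing: the usual fix is to make sure only one OpenMP runtime gets linked. A CMake sketch along those lines; the target name blaze-test and the MKL library names below are assumptions based on this post and the standard MKL link lines, so adjust them for your setup:

```cmake
# Sketch: avoid mixing two OpenMP runtimes when linking against MKL.
option(USE_MKL_SEQUENTIAL "Link sequential MKL to sidestep OpenMP clashes" ON)

if(USE_MKL_SEQUENTIAL)
  # Sequential MKL pulls in no OpenMP runtime at all.
  target_link_libraries(blaze-test PRIVATE mkl_intel_lp64 mkl_sequential mkl_core)
else()
  # Threaded MKL uses iomp5; do NOT also link OpenMP::OpenMP_CXX from
  # find_package(OpenMP), or two OpenMP runtimes end up in the process.
  target_link_libraries(blaze-test PRIVATE mkl_intel_lp64 mkl_intel_thread mkl_core iomp5)
endif()
```

Linking the sequential MKL variant is the simplest way to rule the clash out; the threaded variant works too, as long as the compiler's own OpenMP runtime is not linked alongside iomp5.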
I have code which uses Eigen::Vector3d, and I want to confirm whether Eigen has optimized this code for SSE or not.
I am using Visual Studio 2012 Express, in which I can set the command-line option "/Qvec-report:2", which reports the auto-vectorization details for C++ code. Is there any option in Visual Studio or Eigen which can tell me whether the code has been vectorized or not?
My code is as below:
#include <iostream>
#include <vector>
#include <time.h>
#include <Eigen/StdVector>

int main(int argc, char* argv[])
{
    int tempSize = 100;
    /* I am aligning these vectors as specified on
       http://eigen.tuxfamily.org/dox/group__TopicStlContainers.html */
    std::vector<Eigen::Vector3d, Eigen::aligned_allocator<Eigen::Vector3d>> eiVec(tempSize);
    std::vector<Eigen::Vector3d, Eigen::aligned_allocator<Eigen::Vector3d>> eiVec1(tempSize);
    std::vector<Eigen::Vector3d, Eigen::aligned_allocator<Eigen::Vector3d>> eiVec2(tempSize);

    for (int i = 0; i < tempSize; i++)
    {
        eiVec1[i] = Eigen::Vector3d::Zero();
        eiVec2[i] = Eigen::Vector3d::Zero();
    }

    Eigen::Vector3d* eV = &eiVec.front();
    const Eigen::Vector3d* eV1 = &eiVec1.front();
    const Eigen::Vector3d* eV2 = &eiVec2.front();

    /* The loop below is not auto-vectorized by Visual Studio (reason code 1304)
       because the operations happen at the level of Eigen. I want to know
       whether Eigen has vectorized this operation itself. */
    for (int i = 0; i < tempSize; i++)
    {
        eV[i] = eV1[i] - eV2[i];
    }
    return 0;
}
Look at the asm output.
If you see SUBPD (subtract packed double) inside the inner loop, it vectorized. If you only see SUBSD (subtract scalar double) and no SUBPD anywhere, it didn't.
When running this code
#include <cstdlib>
#include <cstdio>

int main() {
    char b;
    char c;
    printf("%d\n", &b - &c);
    return 0;
}
I got 12 using Microsoft Visual Studio 2013, and -1 using g++ -std=c++11 under Ubuntu 14.04. What is the reason for this difference?
Or did I make a mistake when testing the variables' memory addresses?
It is actually implementation-specific; strictly speaking, subtracting pointers to two unrelated objects is undefined behavior in C++, so any result is possible. For the given code:
#include <cstdlib>
#include <cstdio>

int main() {
    char b;
    char c;
    // %td is the correct conversion specifier for std::ptrdiff_t
    printf("%p %p %td\n", (void*)&b, (void*)&c, &b - &c);
    return 0;
}
GCC and MinGW gave an output of 1 for me. Judging by the values of &b and &c, memory happened to be allocated contiguously for the two chars b and c. However, other compilers like VS 2013 and the Intel C++ compiler will give other values depending on how they lay out the stack.
I am currently running the CUDA 5.0 Toolkit with Visual Studio 2012 Express.
I attempted to run the following code, and I have searched high and low for methods of compiling .cu files in Visual Studio, but to no avail.
Code I have attempted to compile:
Code I have attempted to compile:
//CUDA.cu
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

using namespace std;

__global__ void Add(int* a, int* b)
{
    a[0] += b[0];
}

int main()
{
    int a = 5, b = 9;
    int *d_a, *d_b;
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);
    Add<<<1, 1>>>(d_a, d_b);
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cout << a << endl;
    return 0;
}
The compiler shows an error in the line
Add<<<1, 1>>>(d_a, d_b);
where it says "Error: expected an expression". Any attempt to compile this code reports success, but no .exe is to be found, hence I cannot debug whatsoever:
Unable to start program 'C:\Users\...\CUDATest3.exe'
The system cannot find the file specified
Any help whatsoever is very much appreciated. Thanks
CK
Although I would love to understand why, on my computer, CUDA decided that VS is a deplorable program to be married to, I have taken a shortcut by handcuffing both of said programs: copying from the templates provided by the CUDA installation toolkit (the CUDA samples).
Apparently those samples have everything already set up and ready to go; all you need to do is edit the code from inside the solution itself. For some reason the code would not work if I had written all of it from square one.
I still have no idea why I am unable to get it to run from scratch, but editing the sample templates gives a satisfactory result if one does not have the time to dig further.
We are using Visual Studio 2005. We are looking at maybe upgrading to Visual Studio 2012 once it is released. I tried this small program in Visual Studio 2012 RC and was surprised to see it ran more than 2X slower than it does in Visual Studio 2005. In VS2012 I used default release build settings. For me it takes about 20ms in VS2005 and about 50ms in VS2012. Why is it that much slower?
#include <windows.h>  // timeBeginPeriod/timeGetTime; link with winmm.lib
#include <cstdio>     // needed for printf
#include <deque>

using namespace std;

deque<int> d;

int main(int argc, char* argv[])
{
    const int COUNT = 5000000;
    timeBeginPeriod(1);

    for (int i = 0; i < COUNT; ++i)
    {
        d.push_back(i);
    }

    double sum = 0;
    DWORD start = timeGetTime();
    for (int i = 0; i < COUNT; ++i)
    {
        sum += d[i];
    }
    printf("time=%lums\n", timeGetTime() - start);
    printf("sum=%f\n", sum);
    return 0;
}
So we reposted this question to the Microsoft forum:
http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/72234b06-7622-445e-b359-88f572b4de52
The short answer is that the implementation of std::deque::operator[] in VS2012 RC is simply slower than in VS2005; other common STL containers tested equal or faster. It will be interesting to retest when VS2012 ships to see whether the operator[] performance is fixed.
My suspicion is that you're running into thread-safety code and that 2012 configures your libraries for multi-threaded code by default, meaning there are a bunch of lock and unlock operations built into your deque accesses.
Try comparing the compiler and linker options of the two builds to see how they differ.
(I'd try this myself but I don't have a Windows system with the relevant software on it handy. Sorry.)
I bet the issue is that the STL container implementation is slower in the new compiler. Try timing something that doesn't use the STL to confirm.