cudaOccupancyMaxPotentialBlockSizeVariableSMem Unary Function - c++

I am trying to automate grid and block size choices in my CUDA code. In my case, the amount of shared memory needed depends on the number of threads. The function has the following signature.
__host__ cudaError_t cudaOccupancyMaxPotentialBlockSizeVariableSMem ( int* minGridSize, int* blockSize, T func, UnaryFunction blockSizeToDynamicSMemSize, int blockSizeLimit = 0 )
I tried defining a unary function as follows.
struct unaryfn : std::unary_function<int, int> {
    int operator()(int i) const { return 12 * i; }
};
Then, I call the CUDA API function as follows.
int blockSize;   // The launch configurator returned block size
int minGridSize; // The minimum grid size needed to achieve the
                 // maximum occupancy for a full device launch
int gridSize;    // The actual grid size needed, based on input size

unaryfn::argument_type blk;
unaryfn::result_type result;
unaryfn ufn;

cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize,
    CUDAExclVolRepulsionenergy, ufn(), 0);
std::cout << (nint + blockSize - 1) / blockSize << " " << blockSize << std::endl;
When I compile, I get an error
error: function "unaryfn::operator()" cannot be called with the given argument list
object type is: unaryfn
How do I fix this issue?

Solved! Removing the parentheses after the unary function object in the call fixed it:
cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize, CUDAExclVolRepulsionenergy, ufn, 0);
Writing ufn() attempts to invoke operator() with no arguments, which does not match operator()(int); passing the functor object ufn itself lets the API invoke it internally with each candidate block size.
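As a side note, std::unary_function was deprecated in C++11 and removed in C++17, and the UnaryFunction template parameter only needs to be callable with an int block size and return the dynamic shared-memory size in bytes, so a plain lambda should work just as well. A minimal sketch under those assumptions, reusing the question's 12 bytes per thread and kernel name:

// Lambda mapping a candidate block size to its dynamic shared-memory need.
auto smemForBlock = [](int threads) { return 12 * threads; };
cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize,
    CUDAExclVolRepulsionenergy, smemForBlock, 0);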

Related

Dynamic nested arrays in C++ resulting in cannot convert brace-enclosed initializer list to int

I have written a function that takes a dynamic-length array with a fixed inner array size; the second parameter of the function is the length of the parent array. When I try to access the nested values, however, I get the issue mentioned above.
void myFunc(int arrOfArr, int arrOfArrLen) {
    // try to access
    arrOfArr[0][1]; // expect val2
}
Example usage:
myFunc(
    {
        {val1, val2},
        {val3, val4}
    },
    2
);
edit: I realize that, contextually, an integer obviously has no indexes, but that seems to be how you declare an array... (truthfully this is in an Arduino context, but apparently it's still C++)
Here's a runnable demo of the above from the first sandbox Google returned:
http://cpp.sh/5sp3o
update
I did find a solution; it's ugly but it works:
Instead of passing a "raw" nested array as a param, I set it as a variable first, e.g.:
int arrOfArr[][3] = {
    {val1, val2},
    {val3, val4}
};
Then in the function I do the same thing
void myFunc(int arrOfArr[][3], int arrOfArrLen) {
    // access
}
Call it
myFunc(arrOfArr, 2);
As I said, it's ugly but it works for me. This is a passing project thing, not low-level dev work; maybe I'll learn it fully later on, but it's not needed in my day job.
edit: apparently the thing I was trying to do initially, i.e. embedding an initializer list as a parameter, does not work.
If you want to pass a nested array, the declaration may be:
template<size_t N>
void myFunc(int const arrOfArr[][N], int arrOfArrLen) {
    // ...
}
and you can remove the template if N is already decided:
const size_t N = 3;
void myFunc(int const arrOfArr[][N], int arrOfArrLen) {
    // ...
}
But this doesn't work if you pass a brace-enclosed initializer; for that case you can add an overloaded function:
template<size_t M, size_t N>
void myFunc(int const (&arrOfArr)[M][N], int arrOfArrLen) {
    // attention: the parameter is a reference to a const 2D array,
    // not a pointer, so a temporary array can bind to it
    // ...
}
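For illustration, a minimal sketch of calling that overload with a braced initializer; the explicit template arguments sidestep deduction quirks, and the integer values are placeholders for the asker's val1..val4:

int main() {
    // A temporary const int[2][2] is materialized and bound to the reference.
    myFunc<2, 2>({{1, 2}, {3, 4}}, 2);
}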

Tensorflow GPU new op memory allocation

I am trying to create a new TensorFlow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"

using namespace tensorflow;

using GPUDevice = Eigen::GpuDevice;

// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);
  }
}

// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  //
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}

// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif  // GOOGLE_CUDA
When you look at this code on https://www.tensorflow.org/extend/adding_an_op, you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);

  // Create an output tensor
  Tensor* output_tensor = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                   &output_tensor));

  // Do the computation.
  OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in tensor"));
  ExampleFunctor<Device, T>()(
      context->eigen_device<Device>(),
      static_cast<int>(input_tensor.NumElements()),
      input_tensor.flat<T>().data(),
      output_tensor->flat<T>().data());
}
In context->allocate_output(....) they hand over the address of the output Tensor pointer, and the tensor is then allocated. The context knows whether it is running on GPU or CPU and allocates the tensor on the host or the device accordingly. The pointer handed to the CUDA kernel therefore already points to the actual data within the Tensor object, so no extra allocation or copy is needed.
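For contrast, a rough sketch of the manual steps a standalone CUDA program would need, which context->allocate_output() and the framework take care of here (runExample, h_in, h_out and the launch configuration are illustrative, not part of the TensorFlow code):

void runExample(const float* h_in, float* h_out, int size) {
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, size * sizeof(float));   // explicit device allocation...
  cudaMalloc(&d_out, size * sizeof(float));  // ...which the context does for you
  cudaMemcpy(d_in, h_in, size * sizeof(float), cudaMemcpyHostToDevice);
  ExampleCudaKernel<float><<<1024, 20>>>(size, d_in, d_out);
  cudaMemcpy(h_out, d_out, size * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
}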

C++, different function output when called multiple times

I have the following code:
int countLatticePoints(const double radius, const int dimension) {
    static std::vector<int> point {};
    static int R = static_cast<int>(std::floor(radius));
    static int latticePointCount = 0;

    for (int i = -R; i <= R; i++) {
        point.push_back(i);
        if (point.size() == dimension) {
            if (PointIsWithinSphere(point, R)) latticePointCount++;
        } else {
            countLatticePoints(R, dimension);
        }
        point.pop_back();
    }
    return latticePointCount;
}
When I make the call countLatticePoints(2.05, 3) I get the result 13, which is correct. If I instead call countLatticePoints(25.5, 1) I get 51, which is also correct.
But when I call countLatticePoints(2.05, 3) and countLatticePoints(25.5, 1) right after each other in the main program, I get 13 and then 18 (instead of 51). I really don't understand what I'm doing wrong: when I call each one individually I get the correct result, but when I call the functions together one after the other my results change.
You're misusing static.
The second time you call the function, R and latticePointCount keep their values from the first call, and you push additional values into point.
Edit: I hadn't spotted the recursion. That makes things more complex, but static is still the wrong answer.
I'd create a 'state' object and split the function into two: one that recurses and takes a reference to the 'state' object, and a second one that initialises the state object and calls the first.
struct RecurState
{
    std::vector<int> point;
    int latticePointCount;

    RecurState() : latticePointCount(0)
    {
    }
};
Outer function:
int countLatticePoints(const double radius, const int dimension)
{
    RecurState state;
    return countLatticeRecurse(radius, dimension, state);
}
Recursive function
int countLatticeRecurse(const double radius, const int dimension, RecurState &state)
{
    ...
}
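For completeness, a sketch of how the '...' might be filled in, moving the former statics into the state object (PointIsWithinSphere is the asker's function; note that R is now recomputed from the actual radius on every call instead of being frozen by static initialization):

int countLatticeRecurse(const double radius, const int dimension, RecurState &state)
{
    const int R = static_cast<int>(std::floor(radius));
    for (int i = -R; i <= R; i++) {
        state.point.push_back(i);
        if (state.point.size() == static_cast<size_t>(dimension)) {
            if (PointIsWithinSphere(state.point, R)) state.latticePointCount++;
        } else {
            countLatticeRecurse(radius, dimension, state);
        }
        state.point.pop_back();
    }
    return state.latticePointCount;
}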
Local static variables only get initialized once, on the first function call.

Data types not matching in function call

I am having trouble matching up the data types for a function I have written.
The function is:
void generate_all_paths(int size, char *maze[][size], int x, int y) {
    ...
}
The parameters size, x, and y are all super simple. I believe it is the maze that is throwing me off. It is intended to be a size x size multidimensional array containing characters of the alphabet, which acts like a maze.
When I try to call the function in main as such:
int main() {
    char *exmaze[][6] = { {"#","#","#","#","#","#"},
                          {"S","a","#","h","l","n"},
                          {"#","b","d","p","#","#"},
                          {"#","#","e","#","k","o"},
                          {"#","g","f","i","j","#"},
                          {"#","#","#","#","#","#"}
    };

    generate_all_paths(6, *exmaze, 1, 0);
    return 0;
}
My IDE complains that there is no generate_all_paths function with matching data types for its parameters.
I am fairly certain that my problem is in main where I defined exmaze but my tweaks were unable to fix it.
Does anybody have any suggestions? Thank you!
*exmaze - why the dereference? generate_all_paths(6, exmaze, 1, 0) will pass the pointer by value, which I suppose is what you want in this case.
You haven't shown what size is, but make sure it's a compile-time constant.
Also, questions like this almost always get recommendations to use standard containers like std::vector, so I won't omit that here either; see the sketch after the code below.
In my opinion, using a template is the most elegant way for this:
template<int size>
void generate_all_paths(const char *maze[][size], int x, int y) {
    ...
}

int main() {
    const char *exmaze[][6] = { {"#","#","#","#","#","#"},
                                {"S","a","#","h","l","n"},
                                {"#","b","d","p","#","#"},
                                {"#","#","e","#","k","o"},
                                {"#","g","f","i","j","#"},
                                {"#","#","#","#","#","#"}
    };
    generate_all_paths(exmaze, 1, 0);
    return 0;
}
Please also note the const char * in the array type: string literals should not be bound to a non-const char*!
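And for completeness, a rough sketch of the std::vector alternative mentioned above (my own assumption about the data layout, not the asker's code), which carries its own size:

#include <string>
#include <vector>

void generate_all_paths(const std::vector<std::vector<std::string>>& maze, int x, int y) {
    // maze.size() and maze[x].size() replace the separate size parameter
}

int main() {
    std::vector<std::vector<std::string>> exmaze = {
        {"#","#","#","#","#","#"},
        {"S","a","#","h","l","n"},
        {"#","b","d","p","#","#"},
        {"#","#","e","#","k","o"},
        {"#","g","f","i","j","#"},
        {"#","#","#","#","#","#"}
    };
    generate_all_paths(exmaze, 1, 0);
    return 0;
}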

C++ How to pass an element of an object array by reference?

I'm trying to pass an array of Coin objects to a Change function that is called from within a user interface function.
I have tried many combinations of * and & in different places and have found no success.
The user interface function header
void UserInterface (Coin Roll[])
The Change function header (I tried Coin * coin and got "expression must have class type" errors):
bool Change ( Coin & coin, long int & mulah )
How I am trying to call Change (Roll is an array of Coins) within UserInterface
Change(Roll[j], mulah)
The whole program is here http://pastebin.com/6bsuyEvF
You can pass the array by reference like below (note that a reference to an array of unknown bound only became a valid parameter type with C++17):
void UserInterface (Coin (&Roll)[])
You have several possibilities
void UserInterface(Coin roll[], int size);
void UserInterface(Coin* roll, int size);
void UserInterface(Coin (&roll)[42]); // size should be 42
template <std::size_t N> void UserInterface(Coin (&roll)[N]);
Change the accepted type:
void UserInterface(std::vector<Coin>& roll);
void UserInterface(std::array<Coin, 42>& roll); // size should be 42
template <std::size_t N> void UserInterface(std::array<Coin, N>& roll);
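For instance, the template form deduces the size from the argument at the call site (a quick sketch, assuming a Coin type is in scope):

template <std::size_t N>
void UserInterface(Coin (&roll)[N]) {
    // N is deduced as the array length
}

Coin roll[8];
UserInterface(roll);  // instantiates UserInterface<8>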
There was no problem with the code!
I forgot to add the '&' to the function prototype above int main():

bool Change ( Coin & coin, long int & mulah ) ;

int main()
{

The missing piece was the '&' (the reference declarator) in the prototype, so that it matched the definition.
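For reference, a minimal sketch of the pattern; Coin's member and the loop are illustrative stand-ins, not the asker's actual code from the pastebin:

struct Coin { long int value; };

// Declared with '&' so changes to coin and mulah are visible to the caller.
bool Change(Coin &coin, long int &mulah) {
    mulah += coin.value;
    return true;
}

void UserInterface(Coin roll[], int size, long int &mulah) {
    for (int j = 0; j < size; j++)
        Change(roll[j], mulah);  // roll[j] is an lvalue, so it binds to Coin&
}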