C++ data structures and CUDA

I have a structure that can be either
struct type1 { double a, b, c; };
or
struct type2 { double a, b, c, d, e; };
In the host function of my CUDA code I have something like
void compute() {
    // some code
    // data on devices (up to 10)
    type *xxx[10]; // this is where I want either type1 or type2 structures
    // The "type" is not known at compile time; I want to determine it
    // at runtime based on some input variable. This part is not real
    // code, rather it is what I want to achieve.
    int DevUsed; // some code to give a value to DevUsed
    for (int iDev = 0; iDev < DevUsed; iDev++) {
        // set cuda device
        if (cudaMalloc(&xxx[iDev], sizeof(type)) != cudaSuccess) {
            // print error message
        }
        cudaMemcpy(xxx[iDev], pIF1, sizeof(type), cudaMemcpyHostToDevice);
        function2<<<grid, block>>>(xxx[iDev]); // where function2 is the kernel
    }
}
My question is: what is a way to select between the type1 and type2 data structures with generic code like "type *xxx[10];"?

C++ templates are designed for this situation. Template arguments are fixed at compile time, so the runtime input has to select between the two instantiations, for example by branching on the input variable and calling compute<type1>() or compute<type2>().
template <class T>
void compute() {
    // some code
    // data on devices (up to 10)
    T *xxx[10]; // device pointers to either type1 or type2 structures
    int DevUsed; // some code to give a value to DevUsed
    for (int iDev = 0; iDev < DevUsed; iDev++) {
        // set cuda device
        if (cudaMalloc(&xxx[iDev], sizeof(T)) != cudaSuccess) {
            // print error message
        }
        cudaMemcpy(xxx[iDev], pIF1, sizeof(T), cudaMemcpyHostToDevice);
        function2<<<grid, block>>>(xxx[iDev]); // where function2 is the kernel
    }
}
Please note that you will also need a kernel template for these two types, like
template <class T>
__global__ void function2(T *x)
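The runtime selection step can be sketched in host-only C++. This is a minimal illustration, not the asker's real code: computeFor and its useType2 flag are hypothetical names, and compute<T> here only adds up the bytes it would allocate per device instead of calling cudaMalloc.

```cpp
#include <cstddef>

struct type1 { double a, b, c; };
struct type2 { double a, b, c, d, e; };

// Hypothetical stand-in for the templated host function: it only reports how
// many bytes it would allocate across devices instead of calling cudaMalloc.
template <class T>
std::size_t compute(int devUsed) {
    std::size_t total = 0;
    for (int iDev = 0; iDev < devUsed; ++iDev) {
        total += sizeof(T);  // real code: cudaMalloc(&xxx[iDev], sizeof(T))
    }
    return total;
}

// Both instantiations exist at compile time; the runtime flag picks one.
std::size_t computeFor(bool useType2, int devUsed) {
    return useType2 ? compute<type2>(devUsed) : compute<type1>(devUsed);
}
```

With this shape, the branch on the input variable lives in one place, and everything inside compute<T> is written once for both structures.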


Tensorflow GPU new op memory allocation

I am trying to create a new tensorflow GPU op following the instructions on their website.
Looking at their example, it seems they feed a C++ pointer directly into the CUDA kernel without allocating device memory and copying the contents of the host pointer to the device pointer.
From what I understand of CUDA you always have to allocate memory on the device and then use device pointers inside the kernels.
What am I missing? I checked that input_tensor.flat<T>().data() should return a regular C++ pointer. Here is a copy of the code I am referring to:
// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"
using namespace tensorflow;
using GPUDevice = Eigen::GpuDevice;
// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);
  }
}
// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}
// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;
#endif // GOOGLE_CUDA
If you look at this code on https://www.tensorflow.org/extend/adding_an_op, you will see that the allocation is done in kernel_example.cc:
void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);
  // Create an output tensor
  Tensor* output_tensor = NULL;
  OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                   &output_tensor));
  // Do the computation.
  OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
              errors::InvalidArgument("Too many elements in tensor"));
  ExampleFunctor<Device, T>()(
      context->eigen_device<Device>(),
      static_cast<int>(input_tensor.NumElements()),
      input_tensor.flat<T>().data(),
      output_tensor->flat<T>().data());
}
In context->allocate_output(...) they pass a pointer to the output Tensor, which is then allocated. The context knows whether it is running on the GPU or the CPU and allocates the tensor on the device or on the host accordingly. For an op registered on the GPU, the input tensor's buffer is likewise already in device memory, so input_tensor.flat<T>().data() is a device pointer. The pointer handed to the CUDA kernel therefore simply points to the actual data within the Tensor object.

cudaOccupancyMaxPotentialBlockSizeVariableSMem Unary Function

I am trying to automate grid and block size choices in my CUDA code. In my case, the amount of shared memory needed depends on the number of threads. The function has the following signature.
__host__ cudaError_t cudaOccupancyMaxPotentialBlockSizeVariableSMem ( int* minGridSize, int* blockSize, T func, UnaryFunction blockSizeToDynamicSMemSize, int blockSizeLimit = 0 )
I tried defining a unary function as follows.
struct unaryfn: std::unary_function<int, int> {
int operator()(int i) const { return 12* i; }
};
Then I call the CUDA API function as follows.
int blockSize; // The launch configurator returned block size
int minGridSize; // The minimum grid size needed to achieve the
// maximum occupancy for a full device launch
int gridSize; // The actual grid size needed, based on input size
unaryfn::argument_type blk;
unaryfn::result_type result;
unaryfn ufn;
cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize,
CUDAExclVolRepulsionenergy, ufn(), 0);
std::cout<<(nint +blockSize -1) / blockSize<<" "<<blockSize<<endl;
When I compile, I get an error
error: function "unaryfn::operator()" cannot be called with the given argument list
object type is: unaryfn
How do I fix this issue?
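Solved! Removing the parentheses after the function object in the call helped: ufn() tries to invoke operator() with no arguments, whereas the API expects the function object itself. cudaOccupancyMaxPotentialBlockSizeVariableSMem(&minGridSize, &blockSize, CUDAExclVolRepulsionenergy, ufn, 0);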
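The same mistake can be reproduced without CUDA. In this sketch, pickBlockSize is a hypothetical stand-in for the occupancy helper, not the real API: it takes the functor by value and calls it for each candidate block size, which is why the caller has to pass the object and not the result of calling it. The std::unary_function base is left out because it was removed in C++17 and is not needed for the call to work.

```cpp
// Same shape as the functor in the question: bytes of dynamic shared memory
// needed for a given block size.
struct unaryfn {
    int operator()(int blockSize) const { return 12 * blockSize; }
};

// Hypothetical stand-in for the CUDA occupancy helper (not the real API): it
// returns the largest multiple of 32 threads, up to 1024, whose shared-memory
// demand fits in smemLimit bytes, or 0 if none fits.
template <typename UnaryFunction>
int pickBlockSize(UnaryFunction blockSizeToSMem, int smemLimit) {
    int best = 0;
    for (int b = 32; b <= 1024; b += 32) {
        if (blockSizeToSMem(b) <= smemLimit) best = b;
    }
    return best;
}

// Pass the functor object itself; writing ufn() here would try to invoke
// operator() with no arguments and fail to compile.
inline int example() {
    unaryfn ufn;
    return pickBlockSize(ufn, 6144);
}
```

Note that pickBlockSize(unaryfn(), 6144) would also be fine, because unaryfn() on the type constructs a temporary object, while ufn() on an existing object is a call.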

Simultaneous use of overloading of parameter and overloading of return type

I was trying to make a program which automatically detects the data type of the input given by the user.
My approach:
int input(istream& i)
{
    int k;
    i >> k;
    return k;
}
float input(istream& i)
{
    float k;
    i >> k;
    return k;
}
void showval(int h){cout<<h;}
void showval(float h){cout<<h;}
int main()
{
    showval(input(cin));
    return 0;
}
As you can see, I overloaded one function by parameter type and the other by return type at the same time. However, the program gives the error
"new declaration float input(istream& i) ambiguates the old
declaration int input(istream& i)".
I don't understand how this creates ambiguity. Is it because the two functions (showval and input) depend on each other?
Also, after going through a few articles on overloading, what I realised is that in C++, functions can be overloaded only if they differ in their parameters.
However, this link has a trick by which the author was able to overload functions by return type. Is it possible to use the same trick in my program? Also, is there any way I can tell the compiler that the function input has a parameter which is user dependent, and that its data type may or may not differ? Does C++ forbid such a possibility?
Let's say that types such as int and float are specific, and types such as the proxy object shown in the linked question are generic. Our options are to be specific to begin with, in which case we just coast through the rest, or we give rise to a generic type and handle all the various specific types we may support.
The proxy object shown in the linked question is an example of a variant type, and boost::variant is a generic implementation of this. For example, boost::variant<int, float> allows us to hold either int or float.
My recommendation really depends on what you want. Do you
want to specify the type you expect to get from the user and throw on unexpected input? (Be specific to begin with and coast.) OR
want to give rise to a different type depending on what the user entered, and specify a set of types you can handle? (Give rise to a generic type and handle the various specific types.)
Specifying the type you expect from the user
In this case we can simply make the function templated and we specify the type we expect through the template parameter.
The example shown is kept totally generic but you can restrain template parameters using various techniques. Check out my answer regarding this topic.
#include <iostream>
#include <stdexcept>
/* Read data of type T from an input stream. */
template <typename T>
T read(std::istream &strm) {
  T val;
  strm >> val;
  if (!strm) {
    throw std::runtime_error("unexpected input");  // or any exception you prefer
  }  // if
  return val;
}
/* Print data of type T. */
template <typename T>
void print(const T &val) {
  std::cout << val;
}
int main() {
  print(read<int>(std::cin));
}
This will give rise to an int for input such as 1 and even for input such as 1., 1.0 and 1.2.
Handling different types you may get from the user
In this case we're actually lexing the input stream from the user. Our read function will give rise to a generic type, boost::variant<int, float>.
#include <iostream>
#include <stdexcept>
#include <string>
#include <boost/variant.hpp>
/* Naive implementation of a lexer. */
boost::variant<int, float> read(std::istream &strm) {
  std::string lexeme;
  strm >> lexeme;
  try {
    std::size_t idx;
    auto val = std::stoi(lexeme, &idx);
    if (idx == lexeme.size()) {  // Make sure we converted the entire lexeme.
      return val;
    }  // if
  } catch (const std::exception &) {
    // Do nothing. We'll try to lex it as float instead.
  }  // try
  std::size_t idx;
  auto val = std::stof(lexeme, &idx);
  if (idx == lexeme.size()) {  // Make sure we converted the entire lexeme.
    return val;
  }  // if
  throw std::runtime_error("unexpected input");  // or any exception you prefer
}
/* Print the type and the value, to check that we have the correct type. */
void print(const boost::variant<int, float> &val) {
  class visitor : public boost::static_visitor<void> {
   public:
    void operator()(int that) const {
      std::cout << "int: " << that << std::endl;
    }
    void operator()(float that) const {
      std::cout << "float: " << that << std::endl;
    }
  };  // visitor
  boost::apply_visitor(visitor(), val);
}
int main() {
  print(read(std::cin));
}
This approach will give rise to int for input such as 1, and give rise to float for input such as 1., 1.0 and 1.2.
As you can see, we give rise to a generic type, boost::variant<int, float>, and handle the various specific types, int and float, in the visitor.
The problem is that the compiler cannot possibly know which version of input to call. It is only within input that you actually attempt to extract from the stream, and only at that point can you know what the user has inputted. And even then, there's no reason the user can't enter 1.5 and then you extract into an int, or they enter 5 and you extract into a float.
Types are compile-time constructs. The compiler uses the type information to produce the program executable, so it must know what types are being used at compile time (way before the user inputs anything).
So no, you can't do this quite like this. You could extract a line from the input, parse it to determine whether it's a floating point value or an integer (does it have a .?), and then have a separate execution path for each case. However, instead I recommend deciding what the input that you expect from the user is (an int or a float?) and just extract that.
And also no, the trick with the proxy won't work for you. Firstly, as I mentioned, the format of the input is not known at compile time anyway. But secondly, in that code, the type that was required was known by the type of the variable being declared. In one line they did int v = ... and in the other they did double u = .... In your case, you're passing the result to showval which could take either an int or double and the compiler has no idea which.

Calling templated function with type unknown until runtime

I have this function to read 1D arrays from an unformatted Fortran file:
template <typename T>
void Read1DArray(T* arr)
{
    unsigned pre, post;
    file.read((char*)&pre, PREPOST_DATA);
    for(unsigned n = 0; n < (pre/sizeof(T)); n++)
        file.read((char*)&arr[n], sizeof(T));
    file.read((char*)&post, PREPOST_DATA);
    if(pre != post)
        std::cout << "Failed read fortran 1d array."<< std::endl;
}
I call this like so:
float* new_array = new float[sizeof_fortran_array];
Read1DArray(new_array);
Assume Read1DArray is part of a class, which contains an ifstream named 'file', and sizeof_fortran_array is already known. (And for those not quite so familiar with fortran unformatted writes, the 'pre' data indicates how long the array is in bytes, and the 'post' data is the same)
My issue is that I have a scenario where I may want to call this function with either a float* or a double*, but this will not be known until runtime.
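Currently what I do is simply have a flag for which data type to read, and when reading the array I duplicate the code something like this, where datatype is a string set at runtime:
if(datatype == "float")
{
    float* new_array = new float[sizeof_fortran_array];
    Read1DArray(new_array);
}
else
{
    double* new_array = new double[sizeof_fortran_array];
    Read1DArray(new_array);
}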
Can someone suggest a method of rewriting this so that I don't have to duplicate the function call with the two types? These are the only two types it would be necessary to call it with, but I have to call it a fair few times and I would rather not have this duplication all over the place.
In response to the suggestion to wrap it in a call_any_of function, this wouldn't be enough because at times I do things like this:
// More stuff happening in between
If you think about the title you will realize that there is a contradiction in that the template instantiation is performed at compile time but you want to dispatch based on information available only at runtime. At runtime you cannot instantiate a template, so that is impossible.
The approach you have taken is actually the right one: instantiate both options at compile time, and decide which one to use at runtime with the available information. That being said, you might want to rethink your design.
I imagine that not only reading but also processing will be different based on that runtime value, so you might want to bind all the processing in a (possibly template) function for each one of the types and move the if further up the call hierarchy.
Another approach to avoid having to dispatch based on type to different instantiations of the template would be to lose some of the type safety and implement a single function that takes a void* to the allocated memory and a size argument with the size of the type in the array. Note that this will be more fragile, and it does not solve the overall problem of having to act on the different arrays after the data is read, so I would not suggest following this path.
Because you don't know which code path to take until runtime, you'll need to set up some kind of dynamic dispatch. Your current solution does this using an if-else which must be copied and pasted everywhere it is used.
An improvement would be to generate a function that performs the dispatch. One way to achieve this is by wrapping each code path in a member function template, and using an array of member function pointers that point to specialisations of that member function template. [Note: This is functionally equivalent to dynamic dispatch using virtual functions.]
class MyClass
{
public:
    template <typename T>
    T* AllocateAndRead1DArray(int sizeof_fortran_array)
    {
        T* ptr = new T[sizeof_fortran_array];
        Read1DArray(ptr);
        return ptr;
    }
    template <typename T>
    void Read1DArrayAndDoStuff(int sizeof_fortran_array)
    {
        // read the array with AllocateAndRead1DArray<T>, then process it
    }
    template <typename T>
    void Read1DArrayAndDoOtherStuff(int sizeof_fortran_array)
    {
        // read the array with AllocateAndRead1DArray<T>, then process it differently
    }
    // map a datatype to a member function that takes an integer parameter
    typedef std::pair<std::string, void(MyClass::*)(int)> Action;
    static const int DATATYPE_COUNT = 2;
    // find the action to perform for the given datatype
    void Dispatch(const Action* actions, const std::string& datatype, int size)
    {
        for(const Action* i = actions; i != actions + DATATYPE_COUNT; ++i)
        {
            if((*i).first == datatype)
            {
                // perform the action for the given size
                return (this->*(*i).second)(size);
            }
        }
    }
};
// map each datatype to an instantiation of Read1DArrayAndDoStuff
MyClass::Action ReadArrayAndDoStuffMap[MyClass::DATATYPE_COUNT] = {
    MyClass::Action("float", &MyClass::Read1DArrayAndDoStuff<float>),
    MyClass::Action("double", &MyClass::Read1DArrayAndDoStuff<double>),
};
// map each datatype to an instantiation of Read1DArrayAndDoOtherStuff
MyClass::Action ReadArrayAndDoOtherStuffMap[MyClass::DATATYPE_COUNT] = {
    MyClass::Action("float", &MyClass::Read1DArrayAndDoOtherStuff<float>),
    MyClass::Action("double", &MyClass::Read1DArrayAndDoOtherStuff<double>),
};
int main()
{
    MyClass object;
    // call MyClass::Read1DArrayAndDoStuff<float>(33)
    object.Dispatch(ReadArrayAndDoStuffMap, "float", 33);
    // call MyClass::Read1DArrayAndDoOtherStuff<double>(542)
    object.Dispatch(ReadArrayAndDoOtherStuffMap, "double", 542);
}
If performance is important, and the possible set of types is known at compile time, there are a few further optimisations that could be performed:
Change the string to an enumeration that represents all the possible data types and index the array of actions by that enumeration.
Give the Dispatch function template parameters that allow it to generate a switch statement to call the appropriate function.
For example, this can be inlined by the compiler to produce code that is (generally) more optimal than both the above example and the original if-else version in your question.
class MyClass
{
public:
    // Read1DArrayAndDoStuff and Read1DArrayAndDoOtherStuff as in the previous example
    enum DataType { DATATYPE_FLOAT, DATATYPE_DOUBLE };
    static MyClass::DataType getDataType(const std::string& datatype)
    {
        if(datatype == "float")
            return MyClass::DATATYPE_FLOAT;
        return MyClass::DATATYPE_DOUBLE;
    }
    // find the action to perform for the given datatype
    template<typename Actions>
    void Dispatch(const std::string& datatype, int size)
    {
        switch(getDataType(datatype))
        {
        case DATATYPE_FLOAT: return Actions::FloatAction::apply(*this, size);
        case DATATYPE_DOUBLE: return Actions::DoubleAction::apply(*this, size);
        }
    }
};
// wraps a pointer to a member function so it can be named as a type
template<void(MyClass::*member)(int)>
struct Action
{
    static void apply(MyClass& object, int size)
    {
        (object.*member)(size);
    }
};
struct ReadArrayAndDoStuff
{
    typedef Action<&MyClass::Read1DArrayAndDoStuff<float>> FloatAction;
    typedef Action<&MyClass::Read1DArrayAndDoStuff<double>> DoubleAction;
};
struct ReadArrayAndDoOtherStuff
{
    typedef Action<&MyClass::Read1DArrayAndDoOtherStuff<float>> FloatAction;
    typedef Action<&MyClass::Read1DArrayAndDoOtherStuff<double>> DoubleAction;
};
int main()
{
    MyClass object;
    // call MyClass::Read1DArrayAndDoStuff<float>(33)
    object.Dispatch<ReadArrayAndDoStuff>("float", 33);
    // call MyClass::Read1DArrayAndDoOtherStuff<double>(542)
    object.Dispatch<ReadArrayAndDoOtherStuff>("double", 542);
}

On what platforms will this crash, and how can I improve it?

I've written the rudiments of a class for creating dynamic structures in C++. Dynamic structure members are stored contiguously with (as far as my tests indicate) the same padding that the compiler would insert in the equivalent static structure. Dynamic structures can thus be implicitly converted to static structures for interoperability with existing APIs.
Foremost, I don't trust myself to be able to write Boost-quality code that can compile and work on more or less any platform. What parts of this code are dangerously in need of modification?
I have one other design-related question: Is a templated get accessor the only way of providing the compiler with the requisite static type information for type-safe code? As it is, the user of dynamic_struct must specify the type of the member they are accessing, whenever they access it. If that type should change, all of the accesses become invalid, and will either cause spectacular crashes—or worse, fail silently. And it can't be caught at compile time. That's a huge risk, and one I'd like to remedy.
Example of usage:
struct Test {
    char a, b, c;
    int i;
    Foo object;
};
void bar(const Test&);
int main(int argc, char** argv) {
    dynamic_struct<std::string> ds(sizeof(Test));
    ds.append<char>("a") = 'A';
    ds.append<char>("b") = '2';
    ds.append<char>("c") = 'D';
    ds.append<int>("i") = 123;
    ds.append<Foo>("object");
    bar(ds); // implicit conversion to Test&
}
And the code follows:
// dynamic_struct.h
// Much omitted for brevity.
/**
 * For any type, determines the alignment imposed by the compiler.
 */
template<class T>
class alignment_of {
public:
struct alignment {
char a;
T b;
}; // struct alignment
enum { value = sizeof(alignment) - sizeof(T) };
}; // class alignment_of
/**
 * A dynamically-created structure, whose fields are indexed by keys of
 * some type K, which can be substituted at runtime for any structure
 * with identical members and packing.
 */
template<class K>
class dynamic_struct {
public:
// Default maximum structure size.
static const int DEFAULT_SIZE = 32;
/**
 * Create a structure with normal inter-element padding.
 */
dynamic_struct(int size = DEFAULT_SIZE) : max(size) {
} // dynamic_struct()
/**
 * Copy structure from another structure with the same key type.
 */
dynamic_struct(const dynamic_struct& structure) :
members(structure.members), max(structure.max) {
for (iterator i = members.begin(); i != members.end(); ++i)
i->second.copy(&data[0] + i->second.offset,
&structure.data[0] + i->second.offset);
} // dynamic_struct()
/**
 * Destroy all members of the structure.
 */
~dynamic_struct() {
for (iterator i = members.begin(); i != members.end(); ++i)
i->second.destroy(&data[0] + i->second.offset);
} // ~dynamic_struct()
/**
 * Get a value from the structure by its key.
 */
template<class T>
T& get(const K& key) {
iterator i = members.find(key);
if (i == members.end()) {
std::ostringstream message;
message << "Read of nonexistent member \"" << key << "\".";
throw dynamic_struct_access_error(message.str());
} // if
return *reinterpret_cast<T*>(&data[0] + i->second.offset);
} // get()
/**
 * Append a member to the structure.
 */
template<class T>
T& append(const K& key, int alignment = alignment_of<T>::value) {
iterator i = members.find(key);
if (i != members.end()) {
std::ostringstream message;
message << "Add of already existing member \"" << key << "\".";
throw dynamic_struct_access_error(message.str());
} // if
const int modulus = data.size() % alignment;
// Pad up to the next multiple of the alignment (not of sizeof(T)).
const int delta = modulus == 0 ? 0 : alignment - modulus;
if (data.size() + delta + sizeof(T) > max) {
std::ostringstream message;
message << "Attempt to add " << delta + sizeof(T)
<< " bytes to struct, exceeding maximum size of "
<< max << ".";
throw dynamic_struct_size_error(message.str());
} // if
data.resize(data.size() + delta + sizeof(T));
new (static_cast<void*>(&data[0] + data.size() - sizeof(T))) T;
std::pair<iterator, bool> j = members.insert
({key, member(data.size() - sizeof(T), destroy<T>, copy<T>)});
if (j.second) {
return *reinterpret_cast<T*>(&data[0] + j.first->second.offset);
} else {
std::ostringstream message;
message << "Unable to add member \"" << key << "\".";
throw dynamic_struct_access_error(message.str());
} // if
} // append()
/**
 * Implicit checked conversion operator.
 */
template<class T>
operator T&() { return as<T>(); }
/**
 * Convert from structure to real structure.
 */
template<class T>
T& as() {
// This naturally fails more frequently if changed to "!=".
if (sizeof(T) < data.size()) {
std::ostringstream message;
message << "Attempt to cast dynamic struct of size "
<< data.size() << " to type of size " << sizeof(T) << ".";
throw dynamic_struct_size_error(message.str());
} // if
return *reinterpret_cast<T*>(&data[0]);
} // as()
// Map from keys to member offsets.
map_type members;
// Data buffer.
std::vector<unsigned char> data;
// Maximum allowed size.
const unsigned int max;
}; // class dynamic_struct
There's nothing inherently wrong with this kind of code. Delaying type-checking until runtime is perfectly valid, although you will have to work hard to defeat the compile-time type system. I wrote a heterogeneous stack class, into which you could insert any type, which functioned in a similar fashion.
However, you have to ask yourself: what are you actually going to be using this for? I wrote that heterogeneous stack to replace the C++ stack for an interpreted language, which is a pretty tall order for any particular class. If you're not doing something drastic, this probably isn't the right thing to do.
In short, you can do it, and it's not illegal or bad or undefined and you can make it work - but you only should if you have a very desperate need to do things outside the normal language scope. Also, your code will die horrendously when C++0x becomes Standard and now you need to move and all the rest of it.
The easiest way to think of your code is as a miniature managed heap. You place various types of object on it, they're stored contiguously, and so on.
Edit: Wait, you didn't manage to enforce type safety at runtime either? You just blew compile-time type safety but didn't replace it? Let me post some far superior code (that is somewhat slower, probably).
Edit: Oh wait. You want to convert your dynamic_struct, as the whole thing, to arbitrary unknown other structs, at runtime? Oh. Oh, man. Oh, seriously. What. Just no. Just don't. Really, really, don't. That's so wrong, it's unbelievable. If you had reflection, you could make this work, but C++ doesn't offer that. You can enforce type safety at runtime per each individual member using dynamic_cast and type erasure with inheritance. Not for the whole struct, because given a type T you can't tell what the types or binary layout is.
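A minimal sketch of per-member runtime type checking follows. It swaps in std::type_index and typeid for the dynamic_cast and inheritance approach mentioned above, purely because that is shorter to show. checked_struct is a hypothetical, simplified relative of dynamic_struct: keys are strings, members must be trivially copyable, and the buffer is reserved up front so that returned references stay valid.

```cpp
#include <cstddef>
#include <map>
#include <new>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

// Hypothetical sketch: records each member's type on append and checks it on
// every access, so a wrong get<T>() throws instead of reinterpreting bytes.
class checked_struct {
    struct member {
        std::size_t offset;
        std::type_index type;
    };
    std::map<std::string, member> members;
    std::vector<unsigned char> data;

public:
    explicit checked_struct(std::size_t max_size) { data.reserve(max_size); }

    template <class T>
    T& append(const std::string& key) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "this sketch handles trivially copyable types only");
        const std::size_t a = alignof(T);
        // Round the current size up to the next multiple of T's alignment.
        const std::size_t offset = (data.size() + a - 1) / a * a;
        if (members.count(key)) throw std::invalid_argument("duplicate member: " + key);
        // Staying within the reserved capacity means no reallocation happens.
        if (offset + sizeof(T) > data.capacity()) throw std::length_error("struct too large");
        data.resize(offset + sizeof(T));
        members.insert(std::make_pair(key, member{offset, std::type_index(typeid(T))}));
        return *new (&data[offset]) T();
    }

    template <class T>
    T& get(const std::string& key) {
        std::map<std::string, member>::iterator it = members.find(key);
        if (it == members.end()) throw std::out_of_range("no member: " + key);
        if (it->second.type != std::type_index(typeid(T)))
            throw std::runtime_error("type mismatch for member: " + key);
        return *reinterpret_cast<T*>(&data[it->second.offset]);
    }
};
```

This catches a wrong member type at the point of access, but as noted above it does nothing for the whole-struct conversion: given only a type T, there is still no way to check its member types or layout at runtime.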
I think the type-checking could be improved. Right now it will reinterpret_cast itself to any type with the same size.
Maybe create an interface to register client structures at program startup, so they may be verified member-by-member — or even rearranged on the fly, or constructed more intelligently in the first place.
do dynamic_struct::registry< STRUCT >() // one registry obj per client type \
.add( # MEMBER, &STRUCT::MEMBER, offsetof( STRUCT, MEMBER ) ) while(0)
// ^ name as str ^ ptr to memb ^ check against dynamic offset
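The member-by-member verification idea can be sketched without the macro. Here layout and field are hypothetical names: the class computes offsets by rounding up to each member's alignment, which is how compilers typically lay out a struct but is not guaranteed everywhere, and the result is compared against offsetof for a client struct.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one appended member: its name and byte offset.
struct field {
    std::string name;
    std::size_t offset;
};

// Computes member offsets the way a compiler typically lays out a struct:
// each member is placed at the next multiple of its alignment.
class layout {
    std::vector<field> fields_;
    std::size_t size_ = 0;

public:
    template <class T>
    layout& add(const std::string& name) {
        const std::size_t a = alignof(T);
        size_ = (size_ + a - 1) / a * a;  // round up to T's alignment
        fields_.push_back(field{name, size_});
        size_ += sizeof(T);
        return *this;
    }

    // True if the member exists and sits at the expected offset.
    bool matches(const std::string& name, std::size_t expected) const {
        for (const field& f : fields_)
            if (f.name == name) return f.offset == expected;
        return false;
    }
};

// Client struct to verify against (the Foo member is left out of the sketch).
struct Test {
    char a, b, c;
    int i;
};
```

A registration step at startup could then call something like l.matches("i", offsetof(Test, i)) for each member and refuse the conversion if any offset disagrees, instead of trusting the total size alone.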
I have one question: what do you get out of it ?
I mean it's a clever piece of code but:
you're fiddling with memory, so the chances of a blow-up are huge
it's quite complicated too; I didn't get everything and I would certainly have to ponder longer...
What I am really wondering is what you actually want...
For example, using Boost.Fusion
struct a_key { typedef char type; };
struct object_key { typedef Foo type; };
typedef boost::fusion::map<
    boost::fusion::pair<a_key, a_key::type>,
    boost::fusion::pair<object_key, object_key::type>
> data_type;
int main(int argc, char* argv[])
{
    data_type data;
    boost::fusion::at_key<a_key>(data) = 'a'; // compile time checked
}
Using Boost.Fusion you get compile-time reflection as well as correct packing.
I don't really see the need for "runtime" selection here (using a value as key instead of a type) when you need to pass the right type to the assignment anyway (char vs Foo).
Finally, note that this can be automated, thanks to preprocessor programming:
(char, a)
(char, b)
(char, c)
(int, i)
(Foo, object)
Not much wordier than a typical declaration, though a, b, etc. will be inner types rather than attribute names.
This has several advantages over your solution:
compile-time checking
perfect compliance with default generated constructors / copy constructors / etc...
much more compact representation
no runtime lookup of the "right" member