I am using the CUDA API / cuFFT API. To move data from host to GPU I am using the cudaMemcpy functions, like below. len is the number of elements in dataReal and dataImag.
void foo(const double* dataReal, const double* dataImag, size_t len)
{
    cufftDoubleComplex* inputData;
    size_t allocSizeInput = sizeof(cufftDoubleComplex) * len;
    cudaError_t allocResult = cudaMalloc((void**)&inputData, allocSizeInput);
    if (allocResult != cudaSuccess) return;

    cudaError_t copyResult;
    copyResult = cudaMemcpy2D(static_cast<void*>(inputData),
                              2 * sizeof(double),
                              static_cast<const void*>(dataReal),
                              sizeof(double),
                              sizeof(double),
                              len,
                              cudaMemcpyHostToDevice);
    copyResult &= cudaMemcpy2D(static_cast<void*>(inputData) + sizeof(double),
                               2 * sizeof(double),
                               static_cast<const void*>(dataImag),
                               sizeof(double),
                               sizeof(double),
                               len,
                               cudaMemcpyHostToDevice);
    //and so on.
}
I am aware that pointer arithmetic on void pointers is not actually allowed. The second cudaMemcpy2D does still work, though. I get a warning from the compiler, but the result is correct.
I tried using static_cast<char*>, but that doesn't work either, as a cufftDoubleComplex* cannot be static_cast to char*.
I am a bit confused why the second cudaMemcpy2D with the pointer arithmetic on void* works, since as I understand it, it shouldn't. Is the compiler implicitly assuming that the type behind void* is one byte long?
Should I change something there? Use reinterpret_cast<char*>(inputData), for example?
Also, during the allocation I am using the old C-style (void**) cast. I do this because I otherwise get an "invalid static_cast from cufftDoubleComplex** to void**" error. Is there another way to do this correctly?
FYI: Link to cudaMemcpy2D Doc
Link to cudaMalloc Doc
You cannot do arithmetic on void*, since pointer arithmetic is based on the size of the pointed-to objects (and sizeof(void) does not really mean anything).
Your code probably compiles thanks to a compiler extension (GCC offers one) that treats arithmetic on void* as arithmetic on char*.
In your case, you do not need pointer arithmetic at all. Since cufftDoubleComplex is simply:
struct __device_builtin__ __builtin_align__(16) double2
{
    double x, y;
};
the following should work (and is more robust):
copyResult &= cudaMemcpy2D(static_cast<void*>(&inputData->y),
                           sizeof(cufftDoubleComplex),
                           static_cast<const void*>(dataImag),
                           sizeof(double),
                           sizeof(double),
                           len,
                           cudaMemcpyHostToDevice);
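For symmetry, the real parts can be copied into the .x members in the same way, and (touching on the last part of the question) the C-style (void**) cast on the allocation can be written as a reinterpret_cast. A sketch of the beginning of foo along those lines (error handling shortened, names as in the question):

cufftDoubleComplex* inputData = nullptr;
if (cudaMalloc(reinterpret_cast<void**>(&inputData),
               sizeof(cufftDoubleComplex) * len) != cudaSuccess)
    return;

// Real parts go into the .x members; no void* arithmetic needed.
cudaError_t copyResult = cudaMemcpy2D(static_cast<void*>(&inputData->x), // destination: first .x member
                                      sizeof(cufftDoubleComplex),        // destination pitch: one complex element per row
                                      static_cast<const void*>(dataReal),
                                      sizeof(double),                    // source pitch
                                      sizeof(double),                    // width of each row in bytes
                                      len,                               // number of rows
                                      cudaMemcpyHostToDevice);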
I want to determine at runtime whether a piece of CUDA memory has been allocated with cudaMalloc or not. Or is there a way to determine whether a CUDA pointer is nullptr or not?
I want to determine whether the CUDA memory is nullptr or not in different situations. I have a function as below.
__global__ void func(unsigned int *a, unsigned char *mask, const int len)
{
    if (mask != nullptr) {
        // do something
    } else {
        // do something else
    }
}
If mask has been allocated with cudaMalloc, the kernel should take the if branch. Otherwise, it should take the else branch.
This snippet could run:
unsigned int* a;
unsigned char* mask;
int len = 1024;
cudaMalloc(&a, sizeof(unsigned int) * len);
cudaMalloc(&mask, sizeof(unsigned char) * len);
func<<<(len + 255) / 256, 256>>>(a, mask, len);
And this snippet could also run (note that mask is never allocated here):
unsigned int* a;
unsigned char* mask;
int len = 1024;
cudaMalloc(&a, sizeof(unsigned int) * len);
func<<<(len + 255) / 256, 256>>>(a, mask, len);
Is there a way to achieve this?
In the general case, pointer introspection in device code is not possible.
In your host code, if you do:
char* mask = nullptr;
and you guarantee both of these conditions:
1. If any cudaMalloc operation is run on mask, you test the return value and do not allow further code progress (or do not allow any of the snippets that use mask to run) if the return value is not cudaSuccess.
2. There is no call to cudaFree on the mask pointer until a point in time after which the code snippets that use it will never run again.
Then it should be possible to do what you are suggesting in device code:
if (mask != nullptr) {
    // do something
} else {
    // do something else
}
On a successful cudaMalloc call, the returned pointer will never be nullptr.
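A minimal host-side sketch of that pattern (the wrapper function, the useMask flag, and the launch configuration are placeholders of mine):

void run(bool useMask)
{
    unsigned int* a = nullptr;
    unsigned char* mask = nullptr;                // stays nullptr unless its allocation succeeds
    const int len = 1024;

    if (cudaMalloc(&a, sizeof(unsigned int) * len) != cudaSuccess)
        return;

    if (useMask &&
        cudaMalloc(&mask, sizeof(unsigned char) * len) != cudaSuccess)
        return;                                   // do not continue with a failed allocation

    // no cudaFree(mask) here while the kernel may still be launched
    func<<<(len + 255) / 256, 256>>>(a, mask, len);
}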
Ok - so I'll preface this by saying I'm not entirely sure how to describe the question and my current confusion, so I'll do my best to provide examples.
Question
Which of the two approaches to using the typedef-ed, fixed-length array in a memcpy call (shown below in "Context") is correct? Or are they equivalent?
(I'm starting to think that they are equivalent; see the experimentation under "Notes" below.)
Context
Consider the typedef typedef uint8_t msgdata[150]; and the library interface const msgdata* IRead_GetMsgData(void);.
In my code, I use IRead_GetMsgData and memcpy the result into another uint8_t buffer (contrived example below).
//Included from library:
//typedef uint8_t msgdata[150];
//const msgdata* IRead_GetMsgData (void);

uint8_t mBuff[2048];

void Foo() {
    const msgdata* myData = IRead_GetMsgData();
    if (myData != nullptr) {
        std::memcpy(mBuff, *myData, sizeof(msgdata));
    }
}
Now, this works and passes our unit tests fine, but it started a discussion in our team about whether we should dereference myData in this case. It turns out that not dereferencing myData also works and passes all our unit tests:
std::memcpy(mBuff, myData, sizeof(msgdata)); //This works fine, too
My thought when writing the memcpy call was that, because myData is of type msgdata*, dereferencing it would return the pointed-to msgdata, which is a uint8_t array.
E.g.
typedef uint8_t msgdata[150];
msgdata mData = {0u};
msgdata* pData = &mData;

memcpy(somePtr, pData, size);  // Would expect this to fail - pData isn't the buffer mData.
memcpy(somePtr, *pData, size); // Would expect this to work - dereferencing pData returns the buffer mData.
memcpy(somePtr, mData, size);  // Would expect this to work - mData is the buffer, mData == &mData[0].
I've tried searching for discussion of similar questions but haven't yet found anything that felt relevant:
Using new with fixed length array typedef - how to use/format a typedef
How to dereference typedef array pointer properly? - how to dereference a typedef-ed array and access its elements
typedef fixed length array - again how to format the typedef.
The last one in that list felt most relevant to me, as the accepted answer nicely states (emphasis mine)
[this form of typedef is] probably a very bad idea
Which, having now tried to understand what's actually going on, I couldn't agree with more! Not least because it hides the type you're actually trying to work with...
Notes
So after we started thinking on this, I did a bit of experimentation:
#include <cstdint>
#include <cstdio>

typedef uint8_t msgdata[150];
msgdata data = {0};
msgdata* pData = &data;

int main() {
    printf("%p\n", (void*)pData);
    printf("%p\n", (void*)*pData);
    printf("%p\n", (void*)&data);
    printf("%p\n", (void*)data);
    return 0;
}
Outputs:
0x6020a0
0x6020a0
0x6020a0
0x6020a0
And if I extend that to include a suitable array, arr, and a defined size value, size, I can use various memcpy calls such as:
std::memcpy(arr, data, size);
std::memcpy(arr, pData, size);
std::memcpy(arr, *pData, size);
Which all behave the same, leading me to believe they are equivalent.
I understand the first and last versions (data and *pData), but I'm still unsure of what is happening regarding the pData version...
This code is, IMO, plain wrong. I'd also accept the alternative view that "the code is very misleading":
//Included from library:
//typedef uint8_t msgdata[150];
//const msgdata* IRead_GetMsgData (void);

uint8_t mBuff[2048];

void Foo() {
    const msgdata* myData = IRead_GetMsgData();
    if (myData != nullptr) {
        std::memcpy(mBuff, *myData, sizeof(msgdata));
    }
}
When you write *myData, you mislead the reader. memcpy takes a pointer, and myData is already a pointer, so the dereferencing star looks unnecessary; normally, introducing an extra dereference would break the code.
But it doesn't... Why?
That's where your specific use case kicks in. With typedef uint8_t msgdata[150];, msgdata is an array type. So *myData is the array, and the array decays into a pointer to its beginning.
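To make the types concrete, here is a small sketch (the assertion is mine, not from the original code):

#include <cstdint>
#include <type_traits>

typedef uint8_t msgdata[150];
msgdata buf = {0};
msgdata* p = &buf;

// Dereferencing p yields the array itself (an lvalue of type msgdata),
// which then decays to uint8_t* when passed to memcpy.
static_assert(std::is_same<decltype(*p), msgdata&>::value, "");
// buf, &buf and p all denote the same address, which is why all three
// memcpy variants end up copying from the same location.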
So, you could argue: no big deal, I can leave my extra * in, right?
No.
Because someday, someone will change the code to:
class msgdata
{
    int something_super_useful;
    uint8_t bytes[150];
};
In this case the compiler will catch it, but in general an indirection-level error can compile and turn into a subtle crash. It could take you hours or days to find the extraneous *.
Is there a portable way to implement a tagged pointer in C/C++, such as documented macros that work across platforms and compilers? Or are you on your own (at your own peril) when you tag your pointers? If such helper functions/macros exist, are they part of any standard, or are they only available as open-source libraries?
Just for those who do not know what a tagged pointer is but are interested: it is a way to store some extra data inside a normal pointer. On most architectures some bits in a pointer are always 0 or 1, so you keep your flags/types/hints in those spare bits and simply clear them right before you dereference the pointer:
const int gc_flag = 1;
const int flag_mask = 7; // aka 0b111, because on some theoretical CPU, under some arbitrary OS, compiled with some random compiler and using some particular malloc, the last three bits of a pointer are always zero.

struct value {
    void *data;
};

int data = 42;
struct value val;
val.data = (void *)((uintptr_t)&data | gc_flag);                    // stash the flag in the low bits
int loaded = *(int *)((uintptr_t)val.data & ~(uintptr_t)flag_mask); // clear the flag bits before dereferencing
https://en.wikipedia.org/wiki/Pointer_tagging
You can get the lowest N bits of an address for your own use by guaranteeing that the objects are aligned to multiples of 1 << N. This can be achieved platform-independently in several ways (alignas and std::aligned_storage for stack-based objects, or std::aligned_alloc for dynamic objects), depending on what you want to achieve:
struct Data { ... };
alignas(1 << 4) Data d; // 4-bits, 16-byte alignment
assert(reinterpret_cast<std::uintptr_t>(&d) % 16 == 0);
// dynamic (preferably with a unique_ptr or alike)
void* ptr = std::aligned_alloc(1 << 4, sizeof(Data));
auto obj = new (ptr) Data;
...
obj->~Data();
std::free(ptr);
You pay by throwing away a lot of memory, growing exponentially with the number of bits required. Also, if you plan to allocate many such objects contiguously, far fewer of them fit into a cache line than before, which can slow the program down considerably. This solution therefore does not scale.
If you're sure that the addresses you are passing around always have certain bits unused, then you can use uintptr_t as a transport type. This is an integer type that maps to pointers in the expected way (and will fail to exist on an obscure platform that offers no such mapping).
There aren't any standard macros but you can roll your own easily enough. The code (sans macros) might look like:
#include <cassert>
#include <cstdint>

struct T { int x; };                     // placeholder type for illustration

void T_func(uintptr_t t)
{
    uint8_t tag = (t & 7);               // the low three bits carry the tag
    T *ptr = (T *)(t & ~(uintptr_t)7);   // clear the tag bits to recover the pointer
    // ...
}

int main()
{
    T *ptr = new T;
    assert(((uintptr_t)ptr % 8) == 0);   // new'd memory is suitably aligned for T
    T_func((uintptr_t)ptr + 3);          // smuggle tag value 3 in the low bits
}
This may defeat compiler optimizations that involve tracking pointer usage.
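If you prefer to hide the masking behind helpers rather than macros, a possible sketch (the names and the assumption of 8-byte alignment are mine):

#include <cstdint>

constexpr std::uintptr_t kTagMask = 7;   // three low bits, assuming at least 8-byte alignment

template <typename T>
std::uintptr_t tag_ptr(T* p, unsigned tag) {
    return reinterpret_cast<std::uintptr_t>(p) | (tag & kTagMask);
}

template <typename T>
T* untag_ptr(std::uintptr_t t) {
    return reinterpret_cast<T*>(t & ~kTagMask);
}

inline unsigned get_tag(std::uintptr_t t) {
    return static_cast<unsigned>(t & kTagMask);
}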
Well, GCC at least can compute the size of bit-fields, so you can get portability across platforms (I don't have an MSVC available to test with). You can use this to pack the pointer and tag into an intptr_t, and intptr_t is guaranteed to be able to hold a pointer.
#include <limits.h>
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <inttypes.h>
struct tagged_ptr
{
    intptr_t ptr : (sizeof(intptr_t) * CHAR_BIT - 3);
    intptr_t tag : 3;
};

int main(int argc, char *argv[])
{
    struct tagged_ptr p;
    p.tag = 3;
    p.ptr = (intptr_t)argv[0];
    printf("sizeof(p): %zu <---WTF MinGW!\n", sizeof p);
    printf("sizeof(p): %lu\n", (unsigned long int)sizeof p);
    printf("sizeof(void *): %u\n", (unsigned int)sizeof (void *));
    printf("argv[0]: %p\n", argv[0]);
    printf("p.tag: %" PRIxPTR "\n", p.tag);
    printf("p.ptr: %" PRIxPTR "\n", p.ptr);
    printf("(void *)*(intptr_t*)&p: %p\n", (void *)*(intptr_t *)&p);
}
Gives:
$ ./tag.exe
sizeof(p): zu <---WTF MinGW!
sizeof(p): 8
sizeof(void *): 8
argv[0]: 00000000007613B0
p.tag: 3
p.ptr: 7613b0
(void *)*(intptr_t*)&p: 60000000007613B0
I've put the tag at the top, but changing the order of the struct would put it at the bottom. Then shifting the pointer-to-be-stored right by 3 would implement the OP's use case. Probably make macros for access to make it easier.
I also kinda like the struct because you can't accidentally dereference it as if it were a plain pointer.
I am trying to modify existing code to make use of the memory-align function below:
void* acl_aligned_malloc(size_t size)
{
    void *result = NULL;
    posix_memalign(&result, ACL_ALIGNMENT, size);
    return result;
}
The function is taken from one of the examples provided by the vendor, and I am trying to optimize the execution of my code by incorporating the recommended function. This is how the function is used in the examples:
static void *X;
X = (void *) acl_aligned_malloc(sizeof(cl_float) * vectorSize);
initializeVector((float*)X, vectorSize);
status = clEnqueueWriteBuffer(queue, kernelX, CL_FALSE, 0, sizeof(cl_float) * vectorSize, X, 0, NULL, NULL);
Observe that the code uses the pointer X to hold the return value of the memory-align function. However, the data type that I am working with is not a pointer, but a vector-of-float type called vec_t:
vec_t input_;
How can I adapt vec_t input_ to use the memory-align function? I have tried the modification below, but I am getting a segmentation fault. Should I change vec_t into a pointer? How can I do that?
void *X;
vec_t input_;
X = (void *) acl_aligned_malloc(batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float));
input_ = *((vec_t*) X);
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), &input_[0]);
"the data type that I am working with is ... of vector float type"
This is going to cause you problems. Vectors manage their own internal storage; there is no way to manipulate that storage manually that I know of (nor would you want to: you'd have to manage the size and capacity bookkeeping too, and that would get messy). Also, as far as I know, there is no way to guarantee memory alignment with vectors.
When I've optimized code on Windows that required aligned memory (SIMD work), I implemented a template function to allocate aligned memory with _aligned_malloc (you could use a void*, but if you give it a type you can pass the intended number of elements instead of the intended size * sizeof()). I also had the function throw std::bad_alloc if the allocation came back with a null pointer. You'll also have to implement a custom deleter around the associated free function (_aligned_free in my case). Then I typedef'd all of the types I'd use ("aligned_complex", "aligned_float", etc.).
With unique_ptrs and throwing std::bad_alloc you can make it easy to use aligned memory in a very C++/RAII fashion, but not with vectors. I hope they change that someday, it'd make life easier.
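A minimal sketch of that idea, using posix_memalign (as in the question's helper) instead of _aligned_malloc; the names are mine:

#include <cstdlib>   // posix_memalign, free
#include <memory>    // unique_ptr
#include <new>       // bad_alloc

struct AlignedFree {
    void operator()(void* p) const { std::free(p); }   // posix_memalign memory is released with free()
};

template <typename T>
using aligned_ptr = std::unique_ptr<T[], AlignedFree>;

template <typename T>
aligned_ptr<T> make_aligned(std::size_t count, std::size_t alignment) {
    void* p = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void*)
    if (posix_memalign(&p, alignment, count * sizeof(T)) != 0)
        throw std::bad_alloc();
    return aligned_ptr<T>(static_cast<T*>(p));
}

// usage, e.g.: aligned_ptr<cl_float> X = make_aligned<cl_float>(vectorSize, ACL_ALIGNMENT);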
input_, as in:
vec_t input_;
will have enough memory to hold only one vec_t object. Now, in the call to acl_aligned_malloc() you seem to want to allocate space for a multidimensional array, which will most likely be larger than a vec_t.
Then, when you do:
input_ = *((vec_t*) X);
you are copying only the size of one vec_t object out of all the space allocated at the address pointed to by X.
Then in this call:
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), &input_[0]);
which might be a wrapper for clEnqueueWriteBuffer, you are saying that you want to operate on the whole size of the memory allocated for X, instead of only the size of vec_t (which is the memory allocated for input_). Hence the memory overrun.
My suggestion would be to make input_ a pointer to the type that you want, and use pointer casting.
What happens if you do this?
vec_t *input_;
input_ = (vec_t *) acl_aligned_malloc(batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float));
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), (void *) &input_[0]);
Lately I've been doing a lot of exercises with file streams. When I use fstream.write(...) to write, e.g., an array of 10 integers (intArr[10]), I write:
fstream.write((char*)intArr,sizeof(int)*10);
Is the (char*)intArr cast safe? I haven't had any problems with it so far, but then I learned about static_cast (the C++ way, right?) and tried static_cast<char*>(intArr), and it failed! Which I cannot understand... Should I change my approach?
A static cast simply isn't the right thing. You can only perform a static cast when the types in question are naturally convertible. However, unrelated pointer types are not implicitly convertible; i.e. T* is not convertible to or from U* in general. What you are really doing is a reinterpreting cast:
int intArr[10];
myfile.write(reinterpret_cast<const char *>(intArr), sizeof(int) * 10);
In C++, the C-style cast (char *) becomes the most appropriate sort of conversion available, the weakest of which is the reinterpreting cast. The benefit of using the explicit C++-style casts is that you demonstrate that you understand the sort of conversion that you want. (Also, there's no C-equivalent to a const_cast.)
Maybe it's instructive to note the differences:
float q = 1.5;
uint32_t n = static_cast<uint32_t>(q); // == 1, type conversion
uint32_t m1 = reinterpret_cast<uint32_t>(q); // ill-formed: reinterpret_cast cannot convert a float to an integer type
uint32_t m2 = *reinterpret_cast<const uint32_t *>(&q); // compiles, but undefined behaviour (strict aliasing violation)
Off-topic: The correct way of writing the last line is a bit more involved, but uses copious amounts of casting:
#include <algorithm>   // std::copy

uint32_t m;
char * const pm = reinterpret_cast<char *>(&m);
const char * const pq = reinterpret_cast<const char *>(&q);
std::copy(pq, pq + sizeof(float), pm);
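For what it's worth (not part of the answer above), the same bit-for-bit copy is commonly written with std::memcpy, which takes void* and therefore needs no explicit casts at all:
#include <cstring>   // std::memcpy

std::memcpy(&m, &q, sizeof q);   // copies the object representation of q into m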