I am trying to modify an existing code to make use of the memory align function below:
void* acl_aligned_malloc (size_t size)
{
void *result = NULL;
posix_memalign (&result, ACL_ALIGNMENT, size);
return result;
}
The function is taken from one of the examples provided by the vendor, and I am trying to optimize the execution of my code by incorporating the recommended function. This is how the function is used in the examples:
static void *X;
X = (void *) acl_aligned_malloc(sizeof(cl_float) * vectorSize);
initializeVector((float*)X, vectorSize);
status = clEnqueueWriteBuffer(queue, kernelX, CL_FALSE, 0, sizeof(cl_float) * vectorSize, X, 0, NULL, NULL);
Observe that the code uses the pointer X to hold the return value of the memory align function. However, the data type that I am working with is not a pointer, but of vector float type, called vec_t.
vec_t input_;
How can I adapt vec_t input_ to use the memory align function? I have tried the modification below, but I am getting a segmentation fault. Should I change vec_t into a pointer? How can I do that?
void *X;
vec_t input_;
X = (void *) acl_aligned_malloc(batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float));
input_ = *((vec_t*) X);
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), &input_[0]);
"the data type that I am working with is ... of vector float type"
This is going to cause you problems. Vectors handle their own internal storage structures. There is no way to manually manipulate those that I know of (nor would you want to, you'd have to manage the size and capacity variables too and that would get messy). Also as far as I know, there is no way to guarantee memory alignment with vectors.
When I've optimized code on Windows that required aligned memory (SIMD stuff), I implemented a template function [you could use a void*, but if you put types on it then you just have to use the intended size of the array instead of intended size * sizeof()] to allocate aligned memory with _aligned_malloc. I also had the function throw a std::bad_alloc if the alloc came back with a null pointer. You'll also have to implement a custom deleter for the associated free function (_aligned_free in my case). Then I type def'ed all of the types I'd use ("aligned_complex", "aligned_float", etc).
With unique_ptrs and throwing std::bad_alloc you can make it easy to use aligned memory in a very C++/RAII fashion, but not with vectors. I hope they change that someday, it'd make life easier.
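A minimal sketch of the approach described above, written for a POSIX system rather than Windows: posix_memalign stands in for _aligned_malloc and free for _aligned_free. The names AlignedFree, aligned_ptr and make_aligned are hypothetical, and the element type is assumed to be trivial (such as float), since no constructors are run on the raw memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

// Custom deleter: memory from posix_memalign is released with free().
struct AlignedFree {
    void operator()(void* p) const noexcept { std::free(p); }
};

template <typename T>
using aligned_ptr = std::unique_ptr<T[], AlignedFree>;

// Allocates `count` elements of T aligned to `alignment` bytes.
// Throws std::bad_alloc if the allocation fails.
template <typename T>
aligned_ptr<T> make_aligned(std::size_t count, std::size_t alignment) {
    static_assert(std::is_trivial<T>::value, "no constructors are run here");
    // posix_memalign requires a power of two that is a multiple of sizeof(void*).
    assert((alignment & (alignment - 1)) == 0 && alignment % sizeof(void*) == 0);
    void* raw = nullptr;
    if (posix_memalign(&raw, alignment, count * sizeof(T)) != 0) {
        throw std::bad_alloc{};
    }
    return aligned_ptr<T>(static_cast<T*>(raw));
}
```

Because the template takes an element count rather than a byte count, a call such as make_aligned<float>(1024, 64) needs no sizeof at the call site, and the buffer is freed automatically when the unique_ptr goes out of scope.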
input_, as in:
vec_t input_;
Will have enough memory to hold only the size of one vec_t element. Now I see that in the call to acl_aligned_malloc() you might want to allocate space for a multidimensional array (which most likely will be larger than vec_t).
Then, when you do:
input_ = *((vec_t*) X);
You are copying only the size of one vec_t element from all the space allocated at the address pointed by X.
Then in this call:
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), &input_[0]);
This call, which might be a wrapper for 'clEnqueueWriteBuffer', declares that you want to operate on the whole size of the memory allocated to 'X', instead of only the size of 'vec_t' (which is the memory allocated to 'input_'). Hence the memory overflow.
My suggestion would be making 'input_' a pointer to the type that you want, and use pointer casting.
What happens if you do this?
vec_t *input_;
input_ = (vec_t *) acl_aligned_malloc(batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float));
queue.enqueueWriteBuffer(input_batch_buf, CL_TRUE, 0, batch_size * in_width_ * in_height_ * in_depth_ * sizeof(cl_float), (void *) &input_[0]);
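A caveat, with a hedged alternative: if vec_t is a std::vector<float> (an assumption, since the question does not show its definition), casting the raw allocation to vec_t* does not yield a usable vector, because no vector object was ever constructed in that memory. One option under that assumption is to keep the vector as it is and copy its elements into an aligned float buffer. The helper name copy_to_aligned and the alignment value 64 are illustrative only; the vendor's ACL_ALIGNMENT is not shown in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

using vec_t = std::vector<float>;        // assumption about the question's vec_t

constexpr std::size_t kAlignment = 64;   // placeholder for ACL_ALIGNMENT

// Returns an aligned copy of the vector's elements, or nullptr on failure.
// The caller releases the buffer with free().
float* copy_to_aligned(const vec_t& input) {
    assert(!input.empty());
    const std::size_t bytes = input.size() * sizeof(float);
    void* raw = nullptr;
    if (posix_memalign(&raw, kAlignment, bytes) != 0) {
        return nullptr;
    }
    std::memcpy(raw, input.data(), bytes);
    return static_cast<float*>(raw);
}
```

The returned pointer would then take the place of &input_[0] in the enqueueWriteBuffer call, with input.size() * sizeof(float) as the byte count.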
In C++ I had:
MallocMetadata *tmp = static_cast<MallocMetadata *> (p);
But now I want tmp to be 5 bytes before in memory so I tried:
MallocMetadata *tmp = static_cast<MallocMetadata *> (p-5);
But that didn't compile. I read some articles which suggested this (which didn't work either):
MallocMetadata *tmp = static_cast<MallocMetadata *> (static_cast<char *> (p) - 5);
How can I fix this problem? Please note: I am sure that this place in memory is legal, and I want tmp to be of type MallocMetadata* so I can use it later.
You can use reinterpret_cast to convert pointers other than void* to other pointer types.
MallocMetadata *tmp = reinterpret_cast<MallocMetadata *> (static_cast<char *> (p) - 5);
Another choice is casting the char* after subtracting something to void* again.
MallocMetadata *tmp = static_cast<MallocMetadata *> (static_cast<void *> (static_cast<char *> (p) - 5));
C++ How to Advance void * pointer?
It is not possible to advance a void*.
Advancing a pointer by one modifies the pointer to point to the next sibling of the previously pointed object within an array of objects. The distance between two elements of an array differs between objects of different types. The distance is exactly the same as the size of the object.
Thus to advance a pointer, it is necessary to know the size of the pointed object. void* can point to an object of any size, and there is no way to get information about that size from the pointer.
What you can do instead is static cast void* to the dynamic type of the pointed object. The size of the pointed object is then known by virtue of knowing the type of the pointer, as long as the type is complete. You can then use pointer arithmetic to advance the converted pointer to a sibling of the pointed object.
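A small sketch of that idea, assuming the caller really knows that the void* points into an array of T. The helper name advance_as is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Advances a void* by n elements of type T.
// Only valid if p actually points into an array of T.
template <typename T>
void* advance_as(void* p, std::ptrdiff_t n) {
    assert(p != nullptr);
    // The cast restores the element size, so + n moves by n * sizeof(T) bytes.
    return static_cast<T*>(p) + n;
}
```

For an int array, advance_as<int>(p, 2) yields the address of the third element, whatever sizeof(int) happens to be on the platform.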
But now I want tmp to be 5 bytes before in memory
Before we proceed any further, I want to make it clear that this is an unsafe thing to attempt, and you must know the language rules in detail to have even a remote chance of doing this correctly. I urge you to consider whether doing this is necessary.
To get a pointer to the memory address 5 bytes before, you can static_cast void* to unsigned char* and do pointer arithmetic on the converted pointer:
static_cast<unsigned char*>(p) - 5
MallocMetadata *tmp = static_cast<MallocMetadata *> (static_cast<char *> (p) - 5);
char* isn't static-castable to arbitrary object pointer types. If the memory address is properly aligned, and either (a) the address contains an object of similar type, or (b) MallocMetadata is a trivial type, the address doesn't contain an object of another type, and you're going to write to the address and not read (thereby creating a new object), then you can use reinterpret_cast instead:
MallocMetadata *tmp = reinterpret_cast<MallocMetadata*>(
static_cast<char*>(p) - 5
);
A full example:
// preparation
std::size_t offset = 5;
// extra bytes so that p (p_mm + offset) stays within the allocation
std::size_t padding = sizeof(MallocMetadata) >= offset
    ? 0
    : offset - sizeof(MallocMetadata);
auto align = static_cast<std::align_val_t>(alignof(MallocMetadata));
void* p_storage = ::operator new(sizeof(MallocMetadata) + padding, align);
MallocMetadata* p_mm = new (p_storage) MallocMetadata{};
void* p = reinterpret_cast<char*>(p_mm) + offset;
// same as above
MallocMetadata *tmp = reinterpret_cast<MallocMetadata*>(
    static_cast<char*>(p) - offset
);
// cleanup
tmp->~MallocMetadata();
// memory from the aligned operator new must go to the aligned operator delete
::operator delete(tmp, align);
I don't know what you'll make of this, but I'll try:
There's a requirement in the standard that void * and character pointers have the same representation and alignment that falls out of C and how historically you had character pointer types where you now have void *.
If you have a void *, actually a void * and not some other type of pointer, and you wanted to advance it a byte at a time, you should be able to create a reference-to-pointer-to-unsigned-character bound to the pointer-to-void, as in:
auto &ucp = reinterpret_cast<unsigned char *&>(void_pointer);
And now it should be possible to manipulate void_pointer through operations on ucp reference.
So ++ucp will advance it, and therefore void_pointer, by one.
I am using the CUDA API / cuFFT API. In order to move data from host to GPU I am using the cudaMemcpy functions. I am using them as shown below. len is the number of elements in dataReal and dataImag.
void foo(const double* dataReal, const double* dataImag, size_t len)
{
cufftDoubleComplex* inputData;
size_t allocSizeInput = sizeof(cufftDoubleComplex)*len;
cudaError_t allocResult = cudaMalloc((void**)&inputData, allocSizeInput);
if (allocResult != cudaSuccess) return;
cudaError_t copyResult;
copyResult = cudaMemcpy2D(static_cast<void*>(inputData),
2 * sizeof (double),
static_cast<const void*>(dataReal),
sizeof(double),
sizeof(double),
len,
cudaMemcpyHostToDevice);
copyResult &= cudaMemcpy2D(static_cast<void*>(inputData) + sizeof(double),
2 * sizeof (double),
static_cast<const void*>(dataImag),
sizeof(double),
sizeof(double),
len,
cudaMemcpyHostToDevice);
//and so on.
}
I am aware that pointer arithmetic on void pointers is actually not possible. The second cudaMemcpy2D still works, though: I get a warning from the compiler, but it works correctly.
I tried using static_cast< char* >, but that doesn't work, as a cufftDoubleComplex* cannot be static_cast to char*.
I am a bit confused why the second cudaMemcpy with the pointer arithmetic on void is working, as I understand it shouldn't. Is the compiler implicitly assuming that the datatype behind void* is one byte long?
Should I change something there? Use a reinterpret_cast< char* >(inputData) for example?
Also, during the allocation I am using the old C-style (void**) cast. I do this because I get an "invalid static_cast from cufftDoubleComplex** to void**" error. Is there another way to do this correctly?
FYI: Link to cudaMemcpy2D Doc
Link to cudaMalloc Doc
You cannot do arithmetic operations on void*, since arithmetic operations on pointers are based on the size of the pointed-to objects (and sizeof(void) does not really mean anything).
Your code probably compiles thanks to a compiler extension that treats arithmetic operations on void* as arithmetic operations on char*.
In your case, you probably do not need arithmetic operations, the following should work (and be more robust):
copyResult &= cudaMemcpy2D(static_cast<void*>(&inputData->y),
sizeof (cufftDoubleComplex),
static_cast<const void*>(dataImag), sizeof(double),
sizeof(double), len, cudaMemcpyHostToDevice);
Since cufftDoubleComplex is simply:
struct __device_builtin__ __builtin_align__(16) double2
{
double x, y;
};
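To illustrate why taking the address of the y member avoids void* arithmetic altogether: for a struct with two double members, the address of y equals the struct's address plus offsetof bytes, computed through a char* (whose element size is 1). The sketch uses a plain C++ stand-in type, Complex2 (hypothetical, mirroring double2 without the CUDA attributes), so it does not need the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for double2 / cufftDoubleComplex (assumption: same member layout).
struct Complex2 {
    double x, y;
};

// Byte-wise route: cast to char* first, then add a byte offset.
inline void* imag_address_bytes(Complex2* c) {
    assert(c != nullptr);
    return reinterpret_cast<char*>(c) + offsetof(Complex2, y);
}

// Member route: let the compiler compute the offset.
inline void* imag_address_member(Complex2* c) {
    assert(c != nullptr);
    return &c->y;
}
```

Both functions return the same address; the member route is the less error-prone of the two because it keeps working if the struct layout changes.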
I would like to know what happens on the device (memory wise) when I allocate a structure and then allocate(?) and copy a pointer element of the same structure.
Do I need cudaMalloc of the element *a again?
Example code:
typedef struct {
int *a;
...
} StructA;
int main()
{
int row, col, numS = 10; // defined at runtime
StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
int *h_A = d_a->a;
cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
cudaMalloc( &(d_A->a), row*col*sizeof(int) ); // no (void**) needed?
cudaMemcpy( d_A->a, h_A, row*col*sizeof(int), cudaMemcpyHostToDevice );
kernel<<<grid, block>>>(d_A); // Passing pointer to StructA in device
...
}
The kernel definition:
__global__ kernel(StructA *d_A)
{
d_A->a = ...;
...
}
This question is another extension of this question and related to this question.
I would suggest that you put some effort into compiling and running your codes with proper cuda error checking. Learning to interpret the compiler output and runtime output will make you a better, smarter, more efficient coder. I also suggest reviewing the writeup I previously pointed you at here. It deals with this exact topic, and includes linked worked examples. This question is a duplicate of that one.
There are various errors:
StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
The above line of code creates an allocation in host memory for a structure of size StructA, and sets the pointer d_A pointing to the start of that allocation. Nothing wrong at the moment.
cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
The above line of code creates an allocation in device memory of the size of StructA, and sets the pointer d_A pointing to the start of that allocation. This has effectively wiped out the previous pointer and allocation. (The previous host allocation is still somewhere, but you can't access it. It's basically lost.) Surely that was not your intent.
int *h_A = d_a->a;
Now that d_A (I assume you meant d_A, not d_a) has been assigned as a device memory pointer, the -> operation will dereference that pointer to locate the element a. This is illegal in host code and will throw an error (seg fault).
cudaMalloc( &(d_A->a), row*col*sizeof(int) );
This line of code has a similar issue. We cannot cudaMalloc a pointer that lives in device memory. cudaMalloc creates pointers that live in host memory but reference a location in device memory. This operation &(d_A->a) is dereferencing a device pointer, which is illegal in host code.
A proper code would be something like this:
$ cat t363.cu
#include <stdio.h>
typedef struct {
int *a;
int foo;
} StructA;
__global__ void kernel(StructA *data){
printf("The value is %d\n", *(data->a + 2));
}
int main()
{
int numS = 1; // defined at runtime
//allocate host memory for the structure storage
StructA *h_A = (StructA*)malloc(numS * sizeof(StructA));
//allocate host memory for the storage pointed to by the embedded pointer
h_A->a = (int *)malloc(10*sizeof(int));
// initialize data pointed to by the embedded pointer
for (int i = 0; i <10; i++) *(h_A->a+i) = i;
StructA *d_A; // pointer for device structure storage
//allocate device memory for the structure storage
cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
// create a pointer for cudaMalloc to use for embedded pointer device storage
int *temp;
//allocate device storage for the embedded pointer storage
cudaMalloc((void **)&temp, 10*sizeof(int));
//copy this newly created *pointer* to its proper location in the device copy of the structure
cudaMemcpy(&(d_A->a), &temp, sizeof(int *), cudaMemcpyHostToDevice);
//copy the data pointed to by the embedded pointer from the host to the device
cudaMemcpy(temp, h_A->a, 10*sizeof(int), cudaMemcpyHostToDevice);
kernel<<<1, 1>>>(d_A); // Passing pointer to StructA in device
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t363 t363.cu
$ cuda-memcheck ./t363
========= CUDA-MEMCHECK
The value is 2
========= ERROR SUMMARY: 0 errors
$
You'll note that I haven't worked out the case where you are dealing with an array of StructA (i.e. numS > 1); that will require a loop. I'll leave it to you to work through the logic I've presented here and in my previous linked answer to see if you can work out the details of that loop. Furthermore, for the sake of clarity/brevity I've dispensed with the usual cuda error checking, but please use it in your codes. Finally, this process (sometimes called a "deep copy operation") is somewhat tedious in ordinary CUDA, if you haven't concluded that yet. Previous recommendations along these lines are to "flatten" such structures (so that they don't contain pointers), but you can also explore cudaMallocManaged, i.e. Unified Memory in CUDA 6.
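The "flattening" idea can be sketched on the host side in plain C++: instead of each structure owning a pointer, all the integers live in one contiguous array, and each structure stores an offset and a length into it. A layout like this contains no embedded pointers, so each of the two arrays could be moved to the device with one ordinary copy. The names FlatEntry, FlatStructs, flatten and element are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replaces the embedded pointer: where this structure's data starts, and its length.
struct FlatEntry {
    std::size_t offset;
    std::size_t count;
};

struct FlatStructs {
    std::vector<int> values;         // all data, back to back
    std::vector<FlatEntry> entries;  // one entry per original structure
};

// Packs a nested (pointer-like) layout into the flat layout.
FlatStructs flatten(const std::vector<std::vector<int>>& nested) {
    FlatStructs flat;
    for (const auto& row : nested) {
        flat.entries.push_back({flat.values.size(), row.size()});
        flat.values.insert(flat.values.end(), row.begin(), row.end());
    }
    return flat;
}

// Reads element i of structure s through its offset.
int element(const FlatStructs& flat, std::size_t s, std::size_t i) {
    assert(s < flat.entries.size() && i < flat.entries[s].count);
    return flat.values[flat.entries[s].offset + i];
}
```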
I have a question about void*. I have a function which captures blocks of 0.2 sec of audio from the microphone. I have to process these blocks, specifically with a convolution.
This function returns these blocks of audio as a void *. To process this information I can't use the void * directly, because I can't access the data through it, so I have to convert it to another kind of data, for example double, but I don't know what length the new pointer gets or how to do the conversion.
My code:
void Pre_proc_mono::PreProcess(void *data, int lenbytes, float t_max){
double * aux = (double*) data;
}
Is aux's length now lenbytes too? Or do I have to do something like:
int size = lenbytes/sizeof(double);
How can I make this work?
A pointer is an address in memory. This is the address of the first byte of data. The type of the pointer tells you how long is the data. So if we have
int *p
the value of p tells you where the data starts, and the type of the pointer, in this case int * tells you that from that address you need to take 4 bytes (on most architectures).
A void * pointer has only the starting address, but not the length of the data, so that's why you can't dereference a void * pointer.
sizeof(p), where p is a pointer (of any type), is the size of the pointer itself, and has nothing to do with the kind of data you find where the pointer points to.
For instance, on a typical 32-bit platform (pointers are usually 8 bytes on 64-bit platforms):
sizeof(char) == 1
sizeof(char *) == 4
sizeof(void *) == 4
In your function:
void *data, int lenbytes, float t_max
data is a pointer to where the data starts, lenbytes is how many bytes the data has.
So you can have something like:
uint8_t *aux = (uint8_t*) data;
and you have a vector of lenbytes elements of type uint8_t (uint8_t is guaranteed to have 1 byte).
Or something like this:
double * aux = (double*) data;
and you have a vector of lenbytes/sizeof(double) elements of type double. But you need to be careful that lenbytes is a multiple of sizeof(double).
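A short sketch of the double case, assuming the block really does contain doubles (the question does not confirm this). The helper name bytes_to_doubles is hypothetical. It copies the bytes into a std::vector<double>, which also sidesteps any question of whether the original buffer is suitably aligned for double.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies a raw block of lenbytes bytes into a vector of double.
// The element count is lenbytes / sizeof(double), not lenbytes.
std::vector<double> bytes_to_doubles(const void* data, std::size_t lenbytes) {
    assert(data != nullptr);
    assert(lenbytes % sizeof(double) == 0);  // must be a whole number of doubles
    std::vector<double> samples(lenbytes / sizeof(double));
    if (lenbytes > 0) {
        std::memcpy(samples.data(), data, lenbytes);
    }
    return samples;
}
```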
Edit
As for what you should convert to, the answer depends only on the format of your blocks of data. Read the documentation, or search for an example.
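As one illustration of a format-dependent conversion: many capture APIs deliver signed 16-bit PCM samples. That is an assumption here, not something the question states, so check the documentation of the capture function first. Under that assumption, a sketch (with the hypothetical name pcm16_to_doubles) that produces doubles in the range [-1, 1) for later processing could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Converts a block of signed 16-bit PCM samples (assumed format) to doubles.
std::vector<double> pcm16_to_doubles(const void* data, std::size_t lenbytes) {
    assert(data != nullptr);
    assert(lenbytes % sizeof(std::int16_t) == 0);
    const std::size_t count = lenbytes / sizeof(std::int16_t);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::vector<double> samples(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::int16_t raw;
        // memcpy reads the sample without assuming the buffer is aligned.
        std::memcpy(&raw, bytes + i * sizeof(raw), sizeof(raw));
        samples[i] = raw / 32768.0;  // scale to [-1, 1)
    }
    return samples;
}
```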
I have reached a point where realloc stops returning a pointer. I assume that there is a lack of space for the array to expand or be moved. The only problem is that I really need that memory to exist or the application can't run as expected, so I decided to try malloc, expecting it not to work since realloc would not work, but it did. Why?
Then I memcpy the array of pointers into the newly allocated array, but found that this broke it: pointers like 0x10 and 0x2b were put in the array. There are real pointers in there too, and if I replace the memcpy with a for loop, that fixes it. Why did memcpy do that? Should I not be using memcpy in my code?
Code:
float * resizeArray_by(float *array, uint size)
{
float *tmpArray = NULL;
if (!array)
{
tmpArray = (float *)malloc(size);
}
else
{
tmpArray = (float *)realloc((void *)array, size);
}
if (!tmpArray)
{
tmpArray = (float *)malloc(size);
if (tmpArray)
{
//memcpy(tmpArray, array, size - 1);
for (int k = 0; k < size - 1; k++)
{
((float**)tmpArray)[k] = ((float **)array)[k];
}
free(array);
}
}
return tmpArray;
}
void incrementArray_andPosition(float **& array, uint &total, uint &position)
{
uint prevTotal = total;
float *tmpArray = NULL;
position++;
if (position >= total)
{
total = position;
float *tmpArray = resizeArray_by((float *)array, total);
if (tmpArray)
{
array = (float **)tmpArray;
array[position - 1] = NULL;
}
else
{
position--;
total = prevTotal;
}
}
}
void addArray_toArray_atPosition(float *add, uint size, float **& array, uint &total, uint &position)
{
uint prevPosition = position;
incrementArray_andPosition(array, total, position);
if (position != prevPosition)
{
float *tmpArray = NULL;
if (!array[position - 1] || mHasLengthChanged)
{
tmpArray = resizeArray_by(array[position - 1], size);
}
if (tmpArray)
{
memcpy(tmpArray, add, size);
array[position - 1] = tmpArray;
}
}
}
After all my fixes, the code inits properly. The interesting thing here is that after sorting out the arrays, I allocate a huge array with malloc, so as to reorder the arrays into one array to be used as a GL_ARRAY_BUFFER. If realloc is not allocating because of a lack of space, then why is malloc allocating?
Finally, this results in it crashing in the end anyway. After going through the render function once, it crashes. If I removed all my fixes and just caught the case where realloc doesn't allocate, it would work fine. Which raises the question: what is wrong with mallocing my array instead of reallocing it, that causes such problems further down the line?
My arrays are pointers to pointers to floats. When I grow an array, it is converted into a pointer to floats and reallocated. I am building on Android, which is why I assumed there to be a lack of memory.
Judging from all the different bits of information (realloc not finding memory, memcpy behaving unexpectedly, crashes) this sounds much like a heap corruption. Without some code samples of exactly what you're doing it's hard to say for sure but it appears that you're mis-managing the memory at some point, causing the heap to get into an invalid state.
Are you able to compile your code on an alternate platform such as Linux (you might have to stub some android specific APIs)? If so, you could see what happens on that platform and/or use valgrind to help hunt it down.
Finally, as you have this tagged C++ why are you using malloc/realloc instead of, for example, vector (or another standard container) or new?
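For comparison, a sketch of what the standard-container route might look like for an array of float arrays. The names FloatArrays, add_array and flatten are hypothetical, and the final flattening step only mirrors the "reorder into one array" step the question describes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using FloatArrays = std::vector<std::vector<float>>;

// Appends a copy of `count` floats as a new inner array.
// The outer vector grows itself; no manual realloc or memcpy is needed.
void add_array(FloatArrays& arrays, const float* add, std::size_t count) {
    assert(add != nullptr || count == 0);
    arrays.emplace_back(add, add + count);
}

// Concatenates all inner arrays into one contiguous buffer.
std::vector<float> flatten(const FloatArrays& arrays) {
    std::vector<float> flat;
    for (const auto& a : arrays) {
        flat.insert(flat.end(), a.begin(), a.end());
    }
    return flat;
}
```

Note that add_array takes an element count, not a byte count, which removes the bytes-versus-elements ambiguity entirely.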
You are confusing size and pointer types. In the memory allocation, size is the number of bytes, and you are converting the pointer type to float *, essentially creating an array of float of size size / sizeof(float). In the memcpy-equivalent code, you are treating the array as float ** and copying size of them. This will trash the heap, assuming that sizeof(float *) > 1, and is likely the source of later problems.
Moreover, if you are copying, say, a 100-size array to a 200-size array, you need to copy over 100 elements, not 200. Copying beyond the end of an array (which is what you're doing) can lead to program crashes.
A dynamically allocated array of pointers to floats will be of type float **, not float *, and certainly not a mixture of the two. The size of the array is the number of bytes to malloc and friends, and the number of elements in all array operations.
memcpy will faithfully copy bytes, assuming the source and destination blocks don't overlap (and separately allocated memory blocks don't). However, you've specified size - 1 for the number of bytes copied, when the number copied should be the exact byte size of the old array. (Where are you getting bad pointer values anyway? If it's in the expanded part of the array, you're copying garbage in there anyway.) If memcpy is giving you nonsense, it's getting nonsense to begin with, and it isn't your problem.
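Putting the points above together, a hedged sketch of growing an array of float pointers in which every size is an element count and the byte count is computed in exactly one place. The function name grow_pointer_array and its signature are hypothetical, not a drop-in replacement for the code in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Grows `array` from old_count to new_count elements of type float*.
// On failure, returns false and leaves the original array untouched.
bool grow_pointer_array(float**& array, std::size_t old_count,
                        std::size_t new_count) {
    assert(new_count > old_count);
    // Byte size = element count * element size; the elements are float*.
    void* grown = std::realloc(array, new_count * sizeof(float*));
    if (grown == nullptr) {
        return false;  // the old block is still valid and still owned by the caller
    }
    array = static_cast<float**>(grown);
    // realloc preserved the first old_count elements; initialize only the new ones.
    for (std::size_t i = old_count; i < new_count; ++i) {
        array[i] = nullptr;
    }
    return true;
}
```

Since realloc already preserves the existing contents when it succeeds, no separate memcpy or copy loop is required.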
And btw, you don't need to test if array is NULL
You can replace
if (!array)
{
tmpArray = (float *)malloc(size);
}
else
{
tmpArray = (float *)realloc((void *)array, size);
}
by
tmpArray = realloc(array, size*sizeof (float));
realloc acts like malloc when given a NULL pointer.
Another thing: be careful that size is not 0, because on many implementations realloc with a size of 0 is the same as free (the exact behavior is not portable).
Third point: do not cast pointers when it is not strictly necessary. You cast the return value of the allocation functions, which has been considered bad practice since ANSI C. It's mandatory in C++, but as you're using the C allocation functions you're obviously not in C++ (in that case you should use new/delete).
Casting the array variable to (void *) is also unnecessary, and it could hide some warnings if your parameter was wrongly declared (it could be an int or a pointer to pointer, and by casting you would have suppressed the warning).