Casting device-side complex * to double * or float * for cublas

Problem
Is it safe to cast a complex * to a float * or double * pointer using reinterpret_cast()? For example:
thrust::complex<float> *devicePtr; // only to show type, devicePtr otherwise lives in an object
/* OR */
float _Complex *devicePtr;
/* OR */
std::complex<float> *devicePtr;
cublasScnrm2(cublasv2handle,n,(cuComplex*)xarray,1,reinterpret_cast<float *>(obj->devicePtr));
If not, are there clever ways to solve this problem?
Restrictions
obj is a C struct (so no direct operator overloading possible)
I cannot store devicePtr as a float * within obj
devicePtr only ever holds a pointer to a single value. It may be relevant given the reinterpret_cast trickery but behind the scenes devicePtr is part of a pool:
static thrust::complex<float> *pool;
/* OR */
static float _Complex *pool;
/* OR */
std::complex<float> *pool;
void giveObjectDevicePtr(object obj)
{
for (int i = 0; i < poolSize; ++i) {
    if (poolEntryIsFree(pool,i)) {
        obj->devicePtr = pool+i;
        break; // stop at the first free entry instead of overwriting with later ones
    }
}
}
The cublas call is made asynchronously on a stream, so copying the contents of devicePtr up to host and syncing stream to perform conversion is to be avoided.
Likewise, launching a micro kernel is also not ideal but perhaps unavoidable.
I have seen many questions about casting double * or float * to complex * but not many the other way around.

Underneath the hood, complex types usable in CUDA should generally be a struct of two values. You can see what I mean by looking at the cuComplex.h header file, as one possible example.
Casting a pointer to such, to a pointer type consistent with the values in that struct, should generally be less risky than the other way around (the other way around has additional alignment requirements beyond the base type).
If you posit the type you are discussing, exactly, then I claim this question has nothing to do with CUDA, and is really just a c++ question.
If you do such a cast, then provide that to a cublas function, in the general case I think you're going to be computing over both real and imaginary components, which seems weird to me. It should not be an issue for the case you have shown, however.
You also seem to have some confusion about where a device pointer lives:
copying devicePtr up to host
Any device pointer usable in a CUBLAS call for recent versions of CUBLAS lives in host memory.

According to the documentation, this should theoretically be possible, though it is not explicit.
If you were dealing with std::complex<T>, the answer would be a definitive "yes". According to cppreference, a pointer to a std::complex<T> array can be reinterpret_cast to a pointer to a T array, with the intuitive semantics. This is for compatibility with C's complex numbers.
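As a rough illustration of that guarantee, here is a minimal host-side sketch for std::complex<float> only. The helper names as_float_array, real_of and imag_of are made up for this example, and the sketch says nothing about thrust::complex or device memory:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "std::complex<float> is expected to hold exactly two floats");

// Hypothetical helper: view an array of std::complex<float> as
// interleaved floats (real, imag, real, imag, ...).
float* as_float_array(std::complex<float>* z)
{
    assert(z != nullptr);
    return reinterpret_cast<float*>(z);
}

// Real part of element i, read through the float view.
float real_of(std::complex<float>* z, std::size_t i)
{
    return as_float_array(z)[2 * i];
}

// Imaginary part of element i, read through the float view.
float imag_of(std::complex<float>* z, std::size_t i)
{
    return as_float_array(z)[2 * i + 1];
}
```

For the single-value case in the question, index 0 of the float view is the real part, which is where a real-valued result would land.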
For thrust::complex<T>, the documentation states: "It is functionally identical to it, but can also be used in device code which std::complex currently cannot." Whether "functionally identical" includes compatibility with C's complex types is not explicit. That said, the structure is laid out as one would expect std::complex<T> to be laid out, which means that (in a practical sense) such a cast will likely work just as it does for std::complex<T>.

Related

Is there a function to load a non-atomic value atomically?

In C++20 we can write:
double x;
double x_value = std::atomic_ref(x).load();
Is there a function with the same effect?
I have tried std::atomic_load but there seem to be no overloads for non-atomic objects.

Non-portably of course, there are the GNU C __atomic builtins: __atomic_load_n(&x, __ATOMIC_SEQ_CST) for integer and pointer types, or the generic __atomic_load(&x, &result, __ATOMIC_SEQ_CST) for other types such as double.
I'm pretty sure you won't find a function in ISO C++ that takes a double * or double &.
Possibly one that takes a std::atomic_ref<double> * or reference, I didn't check, but I think the intent of atomic_ref is to be constructed on the fly for free inside a function that needs it.
If you want such a function, write your own that constructs + uses an atomic_ref. It will all inline down to an __atomic load builtin on compilers where atomic uses that under the hood anyway.
But do make sure to declare your global like this, to make sure it's safe + efficient to use with atomic_ref. It's UB (I think) to take an atomic_ref to an object that's not sufficiently aligned, so the atomic_ref constructor can simply assume that the object you use is aligned the same as atomic<T> needs to be.
alignas (std::atomic_ref<double>::required_alignment) double x;
In practice that's only going to be a problem for 8-byte primitive types like double inside structs on 32-bit targets, but something like struct { char c[8]; } could in practice be not naturally aligned if you don't ask for alignment.

What is the correct way to allocate and use an untyped memory block in C++?

The answers I have received so far fall into two exactly opposite camps: "it's safe" and "it's undefined behaviour". I decided to rewrite the whole question to get some clearer answers, for me and for anyone who might arrive here via Google.
Also, I removed the C tag and now this question is C++ specific
I am making an 8-byte-aligned memory heap that will be used in my virtual machine. The most obvious approach that I can think of is by allocating an array of std::uint64_t.
std::unique_ptr<std::uint64_t[]> block(new std::uint64_t[100]);
Let's assume sizeof(float) == 4 and sizeof(double) == 8. I want to store a float and a double in block and print the value.
float* pf = reinterpret_cast<float*>(&block[0]);
double* pd = reinterpret_cast<double*>(&block[1]);
*pf = 1.1;
*pd = 2.2;
std::cout << *pf << std::endl;
std::cout << *pd << std::endl;
I'd also like to store a C-string saying "hello".
char* pc = reinterpret_cast<char*>(&block[2]);
std::strcpy(pc, "hello\n");
std::cout << pc;
Now I want to store "Hello, world!" which goes over 8 bytes, but I still can use 2 consecutive cells.
char* pc2 = reinterpret_cast<char*>(&block[3]);
std::strcpy(pc2, "Hello, world\n");
std::cout << pc2;
For integers, I don't need a reinterpret_cast.
block[5] = 1;
std::cout << block[5] << std::endl;
I'm allocating block as an array of std::uint64_t for the sole purpose of memory alignment. I also do not expect anything larger than 8 bytes by its own to be stored in there. The type of the block can be anything if the starting address is guaranteed to be 8-byte-aligned.
Some people already answered that what I'm doing is totally safe, but some others said that I'm definitely invoking undefined behaviour.
Am I writing correct code to do what I intend? If not, what is the appropriate way?

The global allocation functions
To allocate an arbitrary (untyped) block of memory, the global allocation functions (§3.7.4/2);
void* operator new(std::size_t);
void* operator new[](std::size_t);
Can be used to do this (§3.7.4.1/2).
§3.7.4.1/2
The allocation function attempts to allocate the requested amount of storage. If it is successful, it shall return the address of the start of a block of storage whose length in bytes shall be at least as large as the requested size. There are no constraints on the contents of the allocated storage on return from the allocation function. The order, contiguity, and initial value of storage allocated by successive calls to an allocation function are unspecified. The pointer returned shall be suitably aligned so that it can be converted to a pointer of any complete object type with a fundamental alignment requirement (3.11) and then used to access the object or array in the storage allocated (until the storage is explicitly deallocated by a call to a corresponding deallocation function).
And 3.11 has this to say about a fundamental alignment requirement;
§3.11/2
A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof(std::max_align_t).
Just to be sure on the requirement that the allocation functions must behave like this;
§3.7.4/3
Any allocation and/or deallocation functions defined in a C++ program, including the default versions in the library, shall conform to the semantics specified in 3.7.4.1 and 3.7.4.2.
Quotes from C++ WD n4527.
Assuming the 8-byte alignment is less than the fundamental alignment of the platform (and it looks like it is, but this can be verified on the target platform with static_assert(alignof(std::max_align_t) >= 8)) - you can use the global ::operator new to allocate the memory required. Once allocated, the memory can be segmented and used given the size and alignment requirements you have.
An alternative here is std::aligned_storage, which can give you memory aligned to whatever the requirement is (note that it is deprecated as of C++23).
typename std::aligned_storage<sizeof(T), alignof(T)>::type buffer[100];
From the question, I assume here that both the size and alignment of T would be 8.
A sample of what the final memory block could look like is (basic RAII included);
#include <cstddef> // std::size_t
#include <new>     // ::operator new, ::operator delete
struct DataBlock {
    const std::size_t element_count;
    static constexpr std::size_t element_size = 8;
    void * data = nullptr;
    explicit DataBlock(std::size_t elements) : element_count(elements)
    {
        // raw, untyped storage; throws std::bad_alloc on failure
        data = ::operator new(elements * element_size);
    }
    ~DataBlock()
    {
        ::operator delete(data);
    }
    DataBlock(const DataBlock&) = delete; // no copy
    DataBlock& operator=(const DataBlock&) = delete; // no assign
    // probably shouldn't move either
    DataBlock(DataBlock&&) = delete;
    DataBlock& operator=(DataBlock&&) = delete;
    template <class T>
    T* get_location(std::size_t index)
    {
        // https://stackoverflow.com/a/6449951/3747990
        // C++ WD n4527 3.9.2/4
        void* t = reinterpret_cast<void*>(reinterpret_cast<unsigned char*>(data) + index*element_size);
        // 5.2.9/13
        return static_cast<T*>(t);
        // C++ WD n4527 5.2.10/7 would allow this to be condensed
        //T* t = reinterpret_cast<T*>(reinterpret_cast<unsigned char*>(data) + index*element_size);
        //return t;
    }
};
// ....
DataBlock block(100);
I've constructed more detailed examples of the DataBlock with suitable template construct and get functions etc., live demo here and here with further error checking etc..
A note on the aliasing
It does look like there are some aliasing issues in the original code (strictly speaking); you allocate memory of one type and cast it to another type.
It will probably work as you expect on your target platform, but you cannot rely on it. The most practical comment I've seen on this is:
"Undefined behaviour has the nasty result of usually doing what you think it should do, until it doesn't" - hvd.
The code you have probably will work. I think it is better to use the appropriate global allocation functions and be sure that there is no undefined behaviour when allocating and using the memory you require.
Aliasing will still be applicable; once the memory is allocated - aliasing is applicable in how it is used. Once you have an arbitrary block of memory allocated (as above with the global allocation functions) and the lifetime of an object begins (§3.8/1) - aliasing rules apply.
What about std::allocator?
Whilst std::allocator is for homogeneous data containers and what you are looking for is akin to heterogeneous allocations, the implementation in your standard library (given the Allocator concept) offers some guidance on raw memory allocations and corresponding construction of the objects required.

Update for the new question:
The great news is there's a simple and easy solution to your real problem: allocate the memory with new unsigned char[size]. Memory allocated with new is guaranteed in the standard to be aligned in a way suitable for use as any type, and you can safely alias any type with char*.
The standard reference, 3.7.3.1/2, allocation functions:
The pointer returned shall be suitably aligned so that it can be
converted to a pointer of any complete object type and then used to
access the object or array in the storage allocated
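A minimal sketch of how such a byte buffer might be used, assuming the 8-byte slots described in the question. The names emplace_in_slot and slot_demo are hypothetical, and the objects are created with placement new rather than by writing through a cast pointer:

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical helper: construct a T inside 8-byte slot `index` of a raw
// byte buffer and return a pointer to the new object.
template <class T>
T* emplace_in_slot(unsigned char* heap, std::size_t index, T value)
{
    static_assert(sizeof(T) <= 8 && alignof(T) <= 8, "T must fit in one slot");
    assert(heap != nullptr);
    return ::new (static_cast<void*>(heap + index * 8)) T(value);
}

// Stores a float and a double in consecutive slots and returns their sum.
double slot_demo()
{
    unsigned char* heap = new unsigned char[100 * 8];
    float*  pf = emplace_in_slot<float>(heap, 0, 1.5f);
    double* pd = emplace_in_slot<double>(heap, 1, 2.25);
    double sum = *pf + *pd;
    delete[] heap; // float and double are trivial, so no destructor calls are needed
    return sum;
}
```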
Original answer for the original question:
At least in C++98/03 in 3.10/15 we have the following which pretty clearly makes it still undefined behavior (since you're accessing the value through a type that's not enumerated in the list of exceptions):
If a program attempts to access the stored value of an object through
an lvalue of other than one of the following types the behavior is
undefined:
— the dynamic type of the object,
— a cv-qualified version of the dynamic type of the object,
— a type that is the signed or unsigned type corresponding to the dynamic type of the object,
— a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
— a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
— a char or unsigned char type.

pc, pf and pd are pointers of different types that all access memory declared in block as uint64_t; for pf, say, the types sharing the memory are float and uint64_t.
One would violate the strict aliasing rule by writing through one type and reading through another, since the compiler could reorder the operations on the assumption that there is no shared access. This is not your case, however: since the uint64_t array is only used for assignment, it is exactly the same as using alloca to allocate the memory.
Incidentally there is no issue with the strict aliasing rule when casting from any type to a char type and vice versa. This is a common pattern used for data serialization and deserialization.

There has been a lot of discussion here, and some answers are slightly wrong while still making good points, so I'll try to summarize:
exactly following the text of the standard (no matter what version) ... yes, this is undefined behaviour. Note the standard doesn't even have the term strict aliasing -- just a set of rules to enforce it no matter what implementations could define.
understanding the reason behind the "strict aliasing" rule, it should work nicely on any implementation as long as neither float nor double takes more than 64 bits.
the standard won't guarantee you anything about the size of float or double (intentionally) and that's the reason why it is that restrictive in the first place.
you can get around all this by ensuring your "heap" is an allocated object (e.g. get it with malloc()) and access the aligned slots through char * and shifting your offset by 3 bits.
you still have to make sure that anything you store in such a slot won't take more than 64 bits. (that's the hard part when it comes to portability)
In a nutshell: your code should be safe on any "sane" implementation as long as size constraints aren't a problem (means: the answer to the question in your title is most likely no), BUT it's still undefined behaviour (means: the answer to your last paragraph is yes)

I'll make it short: All your code works with defined semantics if you allocate the block using
std::unique_ptr<char[], decltype(&std::free)>
mem(static_cast<char*>(std::malloc(800)), std::free);
Because
every type is allowed to alias with a char[] and
malloc() is guaranteed to return a block of memory sufficiently aligned for all types (except maybe SIMD ones).
We pass std::free as a custom deleter, because we used malloc(), not new[], so calling delete[], the default, would be undefined behaviour.
If you're a purist, you can also use operator new:
std::unique_ptr<char[]>
mem(static_cast<char*>(operator new[](800)));
Then we don't need a custom deleter. Or
std::unique_ptr<char[]> mem(new char[800]);
to avoid the static_cast from void* to char*. But operator new can be replaced by the user, so I'm always a bit wary of using it. OTOH; malloc cannot be replaced (only in platform-specific ways, such as LD_PRELOAD).

Yes, because the memory locations pointed to by pf could overlap depending on the size of float and double. If they didn't, then the results of reading *pd and *pf would be well defined but not the results of reading from block or pc.

The behavior of C++ and the CPU are distinct. Although the standard provides memory suitable for any object, the rules and optimizations imposed by the CPU make the alignment for any given object "undefined" - an array of short would reasonably be 2 byte aligned, but an array of a 3 byte structure may be 8 byte aligned. A union of all possible types can be created and used between your storage and the usage to ensure no alignment rules are broken.
union copyOut {
    char Buffer[200]; // max string length
    std::int16_t shortVal;
    std::int32_t intVal;
    std::int64_t longIntVal;
    float fltVal;
    double doubleVal;
} copyTarget;
// data stands for the object stored at block[n]; pass its address, not its value
std::memcpy( copyTarget.Buffer, &block[n], sizeof( data ) ); // move from unaligned space into union
// use copyTarget member here.
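Another way to sidestep both the alignment and the aliasing concerns is to copy values in and out of the block with std::memcpy instead of dereferencing a cast pointer. A minimal sketch, assuming the std::uint64_t block from the question; store_value and load_value are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy the bytes of `value` into slot `index`.
template <class T>
void store_value(std::uint64_t* block, std::size_t index, const T& value)
{
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    static_assert(sizeof(T) <= sizeof(std::uint64_t), "T must fit in one slot");
    assert(block != nullptr);
    std::memcpy(&block[index], &value, sizeof(T));
}

// Hypothetical helper: copy the bytes of slot `index` back out into a T.
template <class T>
T load_value(const std::uint64_t* block, std::size_t index)
{
    static_assert(std::is_trivially_copyable<T>::value, "T must be trivially copyable");
    static_assert(sizeof(T) <= sizeof(std::uint64_t), "T must fit in one slot");
    assert(block != nullptr);
    T out;
    std::memcpy(&out, &block[index], sizeof(T));
    return out;
}
```

Compilers typically turn such fixed-size copies into plain loads and stores, so this usually costs nothing compared with the cast.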

If you tag this as a C++ question,
(1) why use uint64_t[] rather than std::vector?
(2) in terms of memory management, your code lacks management logic: it should keep track of which blocks are in use and which are free, track contiguous blocks, and of course provide methods to allocate and release blocks.
(3) the code shows an unsafe way of using memory. For example, the char* is not const, so a write through it can potentially run past its block and overwrite the next block(s). The reinterpret_cast is considered dangerous and should be abstracted away from the memory user's logic.
(4) the code doesn't show the allocator logic. In the C world, the malloc function is untyped, and in the C++ world, operator new is typed. You should consider something like the new operator.

Pointer aliasing of pointer containers

I learned that pointer aliasing may hurt performance, and that a __restrict__ attribute (in GCC, or equivalent attributes in other implementations) may help keeping track of which pointers should or should not be aliased. Meanwhile, I also learned that GCC's implementation of valarray stores a __restrict__'ed pointer (line 517 in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.1/valarray-source.html), which I think hints the compiler (and responsible users) that the private pointer can be assumed not to be aliased anywhere in valarray methods.
But if we alias a pointer to a valarray object, for example:
#include <valarray>
int main() {
std::valarray<double> *a = new std::valarray<double>(10);
std::valarray<double> *b = a;
return 0;
}
is it valid to say that the member pointer of a is aliased too? And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise? (Is it bad practice to point to optimized pointer containers?)

Let's first understand how aliasing hurts optimization.
Consider this code,
void
process_data(float *in, float *out, float gain, int nsamps)
{
int i;
for (i = 0; i < nsamps; i++) {
out[i] = in[i] * gain;
}
}
In C or C++, it is legal for the parameters in and out to point to overlapping regions in memory.... When the compiler optimizes the function, it does not in general know whether in and out are aliases. It must therefore assume that any store through out can affect the memory pointed to by in, which severely limits its ability to reorder or parallelize the code (For some simple cases, the compiler could analyze the entire program to determine that two pointers cannot be aliases. But in general, it is impossible for the compiler to determine whether or not two pointers are aliases, so to be safe, it must assume that they are).
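For comparison, a sketch of the same loop with the GCC/Clang __restrict__ qualifier mentioned in the question. The name process_data_restrict is made up here, and the qualifier is a promise made by the caller that in and out do not overlap, which the compiler does not check:

```cpp
#include <cassert>

// __restrict__ is a compiler extension (GCC/Clang spelling), not ISO C++.
// Calling this with overlapping in/out ranges breaks the promise.
void process_data_restrict(const float* __restrict__ in,
                           float* __restrict__ out,
                           float gain, int nsamps)
{
    assert(nsamps >= 0);
    for (int i = 0; i < nsamps; i++) {
        out[i] = in[i] * gain; // stores to out may be assumed not to change in
    }
}
```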
Coming to your code,
#include <valarray>
int main() {
std::valarray<double> *a = new std::valarray<double>(10);
std::valarray<double> *b = a;
return 0;
}
Since a and b are aliases, the underlying storage used by valarray will also be aliased (I think it uses an array, though I'm not very sure about this). So, any part of your code that uses a and b in a fashion similar to that shown above will not benefit from compiler optimizations like parallelization and reordering. Note that the mere existence of b will not hurt optimization; what matters is how you use it.
Credits:
The quoted part and the code are taken from here. This should serve as a good source for more information about the topic as well.

is it valid to say that the member pointer of a is aliased too?
Yes. For example, (*a)[0] and (*b)[0] reference the same object. That's aliasing.
And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise?
No.
You haven't done anything with b in your sample code. Suppose you have a function much larger than this sample code that starts with the same construct. There's usually no problem if the first several lines of that function use a but never b, and the remaining lines use b but never a. Usually. (Optimizing compilers do rearrange lines of code however.)
If on the other hand you intermingle uses of a and b, you aren't hurting the optimizations. You are doing something much worse: You are invoking undefined behavior. "Don't do it" is the best solution to the undefined behavior problem.
Addendum
The C restrict and gcc __restrict__ keywords are not constraints on the developers of the compiler or the standard library. Those keywords are promises to the compiler/library that restricted data do not overlap other data. The compiler/library doesn't check whether the programmer violated this promise. If this promise enables certain optimizations that might otherwise be invalid with overlapping data, the compiler/library is free to apply those optimizations.
What this means is that restrict (or __restrict__) is a restriction on you, not the compiler. You can violate those restrictions even without your b pointer. For example, consider
*a = (*a)[std::slice(a->size()-1, a->size(), -1)];
This is undefined behavior.

Reinterpret_cast use in C++

Just a simple question. Given this:
fftw_complex *H_cast;
H_cast = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*M*N);
what is the difference between:
H_cast= reinterpret_cast<fftw_complex*> (H);
and
H_cast= reinterpret_cast<fftw_complex*> (&H);
Thanks so much in advance
Antonio

Answer to current question
The difference is that they do two completely different things!
Note: you do not tell us what H is, so it's impossible to answer the question with confidence. But general principles apply.
For the first case to be sensible code, H should be a pointer (typed as void* possibly?) to a fftw_complex instance. You would do this to tell the compiler that H is really a fftw_complex*, so you can then use it.
For the second case to be sensible code, H should be an instance of a class with a memory layout identical to that of class fftw_complex. I can't think of a compelling reason to put yourself in this situation, it is very unnatural. Based on this, and since you don't give us information regarding H, I think it's almost certainly a bug.
Original answer
The main difference is that in the second case you can search your source code for reinterpret_cast (and hopefully ensure that every use is clearly documented and a necessary evil).
However, if you are casting from void* to another pointer type (is this the case here?) then it's preferable to use static_cast instead (which can also be easily searched for).

H_cast= reinterpret_cast<fftw_complex*> (H);
This converts the pointer-ish type inside H (or the integer itself, if H is an integer type) and tells the compiler "this is a pointer. Stop thinking whatever it was, it's a pointer now". H is used as something where you had stored a pointer-like address.
H_cast= reinterpret_cast<fftw_complex*> (&H);
This converts the address of H (which is a pointer to whatever type H is) into a pointer to "fftw_complex". Modifying the contents of H_cast will now change H itself.
You'll want the second if H is not a pointer and usually the first if it is. There are use cases for the other way around but they're uncommon and ugly (especially reinterpreting an int or - god forbid - a double as a pointer).
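Since the question does not say what H is, the following is only a sketch under the assumption that H is a void* holding the address of some complex data. The typedef complex_t is a hypothetical stand-in for fftw_complex, and CastResult and compare_casts are made-up names:

```cpp
#include <cassert>

typedef double complex_t[2]; // hypothetical stand-in for fftw_complex

struct CastResult {
    bool first_points_at_data;     // result of casting H
    bool second_points_at_pointer; // result of casting &H
};

CastResult compare_casts()
{
    complex_t data[4] = {};
    void* H = data; // assumption: H stores the address of the data
    assert(H != nullptr);

    // Reinterprets the value stored in H: points at the data.
    complex_t* a = reinterpret_cast<complex_t*>(H);
    // Reinterprets the address of H itself: points at the pointer variable.
    complex_t* b = reinterpret_cast<complex_t*>(&H);

    return { a == data,
             static_cast<void*>(b) == static_cast<void*>(&H) };
}
```

Under that assumption, only the first cast yields a pointer that is meaningful to use as complex data.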

Pointer casts are always executed as a reinterpret_cast, so when casting from or to a void * there's no difference between a c-style cast, a static_cast or a reinterpret_cast.
Reinterpret_casts are usually reserved for the ugliest of locations, whereas c-style casts and static_casts are used for innocuous casts. You basically use reinterpret_cast to tag some code as really ugly:
float f = 3.1415f;
int x = *reinterpret_cast<int *>(&f);
That way, these ugly unsafe casts are searchable/greppable.
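If the goal is only to inspect the bits, a std::memcpy into an integer of the same size gives the same result without dereferencing a pointer of the wrong type. A minimal sketch; float_bits is a hypothetical name, and the value in the note below assumes IEEE 754 single precision:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the object representation of a float
// as a 32-bit unsigned integer.
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "float is expected to be 32 bits wide");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // copies bytes, no aliasing involved
    return bits;
}
```

On an IEEE 754 platform, float_bits(1.0f) yields 0x3F800000.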

Needless pointer-casts in C

I got a comment to my answer on this thread:
Malloc inside a function call appears to be getting freed on return?
In short I had code like this:
int * somefunc (void)
{
int * temp = (int*) malloc (sizeof (int));
temp[0] = 0;
return temp;
}
I got this comment:
Can I just say, please don't cast the
return value of malloc? It is not
required and can hide errors.
I agree that the cast is not required in C. It is mandatory in C++, so I usually add it just in case I have to port the code to C++ one day.
However, I wonder how casts like this can hide errors. Any ideas?
Edit:
Seems like there are very good and valid arguments on both sides. Thanks for posting, folks.

It seems fitting I post an answer, since I left the comment :P
Basically, if you forget to include stdlib.h the compiler will assume malloc returns an int. Without casting, you will get a warning. With casting you won't.
So by casting you get nothing, and run the risk of suppressing legitimate warnings.
Much is written about this, a quick google search will turn up more detailed explanations.
edit
It has been argued that
TYPE * p;
p = (TYPE *)malloc(n*sizeof(TYPE));
makes it obvious when you accidentally don't allocate enough memory because say, you thought p was TYPe not TYPE, and thus we should cast malloc because the advantage of this method overrides the smaller cost of accidentally suppressing compiler warnings.
I would like to point out 2 things:
you should write p = malloc(sizeof(*p)*n); to always ensure you malloc the right amount of space
with the above approach, you need to make changes in 3 places if you ever change the type of p: once in the declaration, once in the malloc, and once in the cast.
In short, I still personally believe there is no need for casting the return value of malloc and it is certainly not best practice.

This question is tagged both for C and C++, so it has at least two answers, IMHO:
C
Ahem... Do whatever you want.
I believe the reason given above ("if you don't include stdlib.h then you won't get a warning") is not a valid one, because one should not rely on this kind of hack to avoid forgetting to include a header.
The real reason not to write the cast is that the C compiler already silently converts a void * into whatever pointer type you want, so doing it yourself is overkill and useless.
If you want to have type safety, you can either switch to C++ or write your own wrapper function, like:
int * malloc_Int(size_t p_iSize) /* number of ints wanted */
{
return malloc(sizeof(int) * p_iSize) ;
}
C++
Sometimes, even in C++, you have to take advantage of the malloc/realloc/free utilities. Then you'll have to cast. But you already knew that. Using static_cast<>() will be better, as always, than a C-style cast.
And in C++, you could wrap malloc (and realloc, etc.) in templates to achieve type safety:
template <typename T>
T * myMalloc(const size_t p_iSize)
{
return static_cast<T *>(malloc(sizeof(T) * p_iSize)) ;
}
Which would be used like:
int * p = myMalloc<int>(25) ;
free(p) ;
MyStruct * p2 = myMalloc<MyStruct>(12) ;
free(p2) ;
and the following code:
// error: cannot convert ‘int*’ to ‘short int*’ in initialization
short * p = myMalloc<int>(25) ;
free(p) ;
won't compile, so, no problemo.
All in all, in pure C++, you now have no excuse if someone finds more than one C malloc inside your code...
:-)
C + C++ crossover
Sometimes, you want to produce code that will compile both in C and in C++ (for whatever reasons... Isn't it the point of the C++ extern "C" {} block?). In this case, C++ demands the cast, but C won't understand the static_cast keyword, so the solution is the C-style cast (which is still legal in C++ for exactly this kind of reasons).
Note that even with writing pure C code, compiling it with a C++ compiler will get you a lot more warnings and errors (for example attempting to use a function without declaring it first won't compile, unlike the error mentioned above).
So, to be on the safe side, write code that will compile cleanly in C++, study and correct the warnings, and then use the C compiler to produce the final binary. This means, again, write the cast, in a C-style cast.

One possible error it can introduce is if you are compiling on a 64-bit system using C (not C++).
Basically, if you forget to include stdlib.h, the default int rule will apply. Thus the compiler will happily assume that malloc has the prototype int malloc(). On many 64-bit systems an int is 32 bits and a pointer is 64 bits.
Uh oh, the value gets truncated and you only get the lower 32-bits of the pointer! Now if you cast the return value of malloc, this error is hidden by the cast. But if you don't you will get an error (something to the nature of "cannot convert int to T *").
This does not apply to C++ of course for 2 reasons. Firstly, it has no default int rule, secondly it requires the cast.
All in all though, you should just new in c++ code anyway :-P.
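The truncation described above can be modelled with plain integers. This is only an illustration of the arithmetic, not of a real malloc call, and through_32bit_int is a hypothetical name:

```cpp
#include <cstdint>

// Models a 64-bit address being squeezed through a 32-bit return value
// and then widened back to pointer size.
std::uint64_t through_32bit_int(std::uint64_t address)
{
    std::uint32_t truncated = static_cast<std::uint32_t>(address); // upper 32 bits are dropped
    return truncated; // widening back does not restore the lost bits
}
```

An address such as 0x00007f1234567890 comes back as 0x34567890, which no longer points at the allocated block.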

Well, I think it's the exact opposite - always directly cast it to the needed type. Read on here!

The "forgot stdlib.h" argument is a straw man. Modern compilers will detect and warn of the problem (gcc -Wall).
You should always cast the result of malloc immediately. Not doing so should be considered an error, and not just because it will fail as C++. If you're targeting a machine architecture with different kinds of pointers, for example, you could wind up with a very tricky bug if you don't put in the cast.
Edit: The commentor Evan Teran is correct. My mistake was thinking that the compiler didn't have to do any work on a void pointer in any context. I freak when I think of FAR pointer bugs, so my intuition is to cast everything. Thanks Evan!

Actually, the only way a cast could hide an error is if you were converting from one datatype to a smaller datatype and lost data, or if you were converting pears to apples. Take the following example:
int int_array[10];
/* initialize array */
int *p = &(int_array[3]);
short *sp = (short *)p;
short my_val = *sp;
in this case the conversion to short would be dropping some data from the int. And then this case:
struct {
/* something */
} my_struct[100];
int my_int_array[100];
/* initialize array */
struct my_struct *p = &(my_int_array[99]);
in which you'd end up pointing to the wrong kind of data, or even to invalid memory.
But in general, and if you know what you are doing, it's OK to do the casting. Even more so when you are getting memory from malloc, which happens to return a void pointer which you can't use at all unless you cast it, and most compilers will warn you if you are casting to something the lvalue (the value to the left side of the assignment) can't take anyway.

#ifdef __cplusplus
#define MALLOC_CAST(T) (T)
#else
#define MALLOC_CAST(T)
#endif
...
int * p;
p = MALLOC_CAST(int *) malloc(sizeof(int) * n);
or, alternately
#ifdef __cplusplus
#define MYMALLOC(T, N) static_cast<T*>(malloc(sizeof(T) * (N)))
#else
#define MYMALLOC(T, N) malloc(sizeof(T) * (N))
#endif
...
int * p;
p = MYMALLOC(int, n);

People have already cited the reasons I usually trot out: the old (no longer applicable to most compilers) argument about not including stdlib.h and using sizeof *p to make sure the types and sizes always match regardless of later updating. I do want to point out one other argument against casting. It's a small one, but I think it applies.
C is fairly weakly typed. Most safe type conversions happen automatically, and most unsafe ones require a cast. Consider:
int from_f(float f)
{
return *(int *)&f;
}
That's dangerous code. It's technically undefined behavior, though in practice it's going to do the same thing on nearly every platform you run it on. And the cast helps tell you "This code is a terrible hack."
Consider:
int *p = (int *)malloc(sizeof(int) * 10);
I see a cast, and I wonder, "Why is this necessary? Where is the hack?" It raises hairs on my neck that there's something evil going on, when in fact the code is completely harmless.
As long as we're using C, casts (especially pointer casts) are a way of saying "There's something evil and easily breakable going on here." They may accomplish what you need accomplished, but they indicate to you and future maintainers that the kids aren't alright.
Using casts on every malloc diminishes the "hack" indication of pointer casting. It makes it less jarring to see things like *(int *)&f;.
Note: C and C++ are different languages. C is weakly typed, C++ is more strongly typed. The casts are necessary in C++, even though they don't indicate a hack at all, because of (in my humble opinion) the unnecessarily strong C++ type system. (Really, this particular case is the only place I think the C++ type system is "too strong," but I can't think of any place where it's "too weak," which makes it overall too strong for my tastes.)
If you're worried about C++ compatibility, don't. If you're writing C, use a C compiler. There are plenty of really good ones available for every platform. If, for some inane reason, you have to write C code that compiles cleanly as C++, you're not really writing C. If you need to port C to C++, you should be making lots of changes to make your C code more idiomatic C++.
If you can't do any of that, your code won't be pretty no matter what you do, so it doesn't really matter how you decide to cast at that point. I do like the idea of using templates to make a new allocator that returns the correct type, although that's basically just reinventing the new keyword.

Casting a function which returns (void *) to instead be an (int *) is harmless: you're casting one type of pointer to another.
Casting a function which returns an integer to instead be a pointer is most likely incorrect. The compiler would have flagged it had you not explicitly cast it.

One possible error (depending on whether this is what you really want or not) is mallocing with the size of one type and assigning the result to a pointer of a different type. E.g.,
int *temp = (int *)malloc(sizeof(double));
There may be cases where you want to do this, but I suspect that they are rare.

I think you should put the cast in. Consider that there are three locations for types:
T1 *p;
p = (T2*) malloc(sizeof(T3));
The two lines of code might be widely separated. Therefore it's good that the compiler will enforce that T1 == T2. It is easier to visually verify that T2 == T3.
If you miss out the T2 cast, then you have to hope that T1 == T3.
On the other hand you have the missing stdlib.h argument - but I think it's less likely to be a problem.

On the other hand, if you ever need to port the code to C++, it is much better to use the 'new' operator.
