Fastest way to initialize a huge array of floats in C++

I need to initialize every node of a tree with something like:
this->values = (float*) _aligned_malloc(mem * sizeof(float), 32);
this->frequencies = (float*) _aligned_malloc(mem * sizeof(float), 32);
where mem is rather big (~100k-1M), values should all be 0, and frequencies should all be 1/numChildren (an arbitrary float, different for each node).
The fastest option (although only by a small amount) was std::fill_n:
std::fill_n(this->values, mem, 0.0f);
std::fill_n(this->frequencies, mem, 1 / (float)numChildren);
I thought using AVX2 intrinsics would have made it faster, something like:
float v = 1 / (float)numChildren;
__m256 nc = _mm256_set_ps(v, v, v, v, v, v, v, v);
__m256 z = _mm256_setzero_ps();
for (long i = 0; i < mem; i += 8)
{
    _mm256_store_ps(this->frequencies + i, nc);
    _mm256_store_ps(this->values + i, z);
}
This was actually a bit slower, and about as slow as the naive loop:
for (auto i = 0; i < mem; i++)
{
    this->values[i] = 0;
    this->frequencies[i] = 1 / (float)numChildren;
}
I assume the intrinsics may actually be copying their arguments on each call. Since all the values are the same, I want to load them into one register just once and then store that register to different memory locations many times, and I don't think that is what is happening here.

By _aligned_malloc I assume Windows.
On Windows you can allocate large amounts of memory with VirtualAlloc. It will be page-aligned (4096 bytes) and already zeroed by the OS, which is likely faster than zeroing it manually.
Note that VirtualAlloc is always a kernel call, but a huge _aligned_malloc is very likely to end up as a kernel call anyway.
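A minimal sketch of that route, assuming Windows and the same element count mem as above (the helper name is just for illustration); VirtualAlloc hands back page-aligned, zero-filled memory, so only frequencies still needs a fill:
#include <windows.h>
#include <algorithm>

// Hypothetical helper: commit zeroed, page-aligned memory for `mem` floats.
static float* AllocZeroedFloats(size_t mem)
{
    return static_cast<float*>(
        VirtualAlloc(nullptr, mem * sizeof(float),
                     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
}

// Usage sketch: values comes back already zeroed by the OS;
// frequencies still has to be filled with the per-node constant.
// this->values      = AllocZeroedFloats(mem);
// this->frequencies = AllocZeroedFloats(mem);
// std::fill_n(this->frequencies, mem, 1 / (float)numChildren);
// ...
// VirtualFree(this->values, 0, MEM_RELEASE);      // release with VirtualFree, not _aligned_free
// VirtualFree(this->frequencies, 0, MEM_RELEASE);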

How to transform this memcpy into a for?

What is the difference between these two?
for (int i = 0; i < numSamples; i++) {
mData[sampleIndex++] = *buffer++ * (1.0f / 32768);
}
and
memcpy(&mData[sampleIndex], buffer, (numSamples * sizeof(float)));
If I understood correctly, the first copies numSamples float values to mData, one by one. The second copies numSamples * sizeof(float) bytes to mData. Since we're copying numSamples times the number of bytes in a float, I think they do the same thing, except that the first one also multiplies each value before storing it in mData.
So, is there a way to transform the memcpy into a for? Something like:
for (int i = 0; i < numSamples * sizeof(float); i++) {
//What to put here?
}
Context:
const int32_t mChannelCount;
const int32_t mMaxFrames;
int32_t sampleIndex;
float *mData;
float *buffer;
What is the difference between these two?
for (int i = 0; i < numSamples; i++) {
mData[sampleIndex++] = *buffer++ * (1.0f / 32768);
}
// and
memcpy(&mData[sampleIndex], buffer, (numSamples * sizeof(float)));
These are quite different given the * (1.0f / 32768). I assume the comparison is setting the scaling difference aside, @Thomas Matthews.
Important: buffer and sampleIndex have different values after the for loop.
*buffer++ needs no code change should the type of buffer change; * sizeof(float) obliges a code change. It could have used * sizeof *buffer instead.
memcpy() is code optimized for the platform; for() loops can only do so much. In particular, memcpy() assumes mData and buffer do not overlap, an optimization the for() loop may not be able to make.
The for loop uses int indexing where memcpy() uses size_t, which makes a difference with huge arrays.
memcpy() tolerates unaligned pointers; mData[sampleIndex++] = *buffer++ does not.
"The first copies numSamples float values to mData, one by one" is not certain. A smart compiler may be able to perform some copies in parallel, depending on the context, and merely act as if the copying were done one by one.
Post the entire block of code/function that uses these 2 approaches for a better compare.
I gather from your post that you want to do a memcpy-like copy, but using a for loop. In that case you just need to use the same for loop without the multiplication part:
for (int i = 0; i < numSamples; i++) {
    mData[sampleIndex++] = *buffer++;
}
Note that memcpy can be more efficient than a for loop under the right conditions (see Maxim Egorushkin's and Jeremy Friesner's comments below), so you may want to keep it that way.
Another, more idiomatic and, I would argue, better way to implement the operations you are performing is to use the methods provided by the C++ standard library, as suggested by Ted Lyngmo and rustyx.
Disclaimer: As I was writing my answer, Martin York posted a comment with a similar solution, so credit to him as well.
What is the difference between these two?
The former performs a calculation on the source array while copying the result into another array, one float at a time.
The latter copies the content of the array into another, one byte at a time, without any calculation.
So, is there a way to transform the memcpy into a for?
Yes. Here is a naïve way to transform it:
auto dest_c = reinterpret_cast<unsigned char*>(mData + sampleIndex);
auto src_c = reinterpret_cast<const unsigned char*>(buffer);
auto end = src_c + numSamples * sizeof(float);
for (; src_c < end;) { // or while (src_c < end)
    *dest_c++ = *src_c++;
}
The actual implementation of the standard function is likely more complex, involving optimisations related to copying long sequences.
Since you don't appear to need the generic reinterpretation aspect of std::memcpy, perhaps a simpler alternative would suffice:
auto dest = mData + sampleIndex;
auto src = buffer;
auto end = src + numSamples;
for (; src < end;) {
    *dest++ = *src++;
}
Or perhaps another standard algorithm:
std::copy(buffer, buffer + numSamples, mData + sampleIndex);
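And for the original loop with the scaling factor, the closest standard algorithm would be std::transform; a minimal sketch using the same variables as the question (both algorithms live in <algorithm>):
std::transform(buffer, buffer + numSamples, mData + sampleIndex,
               [](float s) { return s * (1.0f / 32768); });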

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    *l_ptrLaplace_pf32 *= l_weight_f32;
    ++l_ptrLaplace_pf32;
    ++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for (uint64_t k = 0; k < (l_lastPixelInBlock_ui64 / 4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32 + 4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32 + 8);
    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
    vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32 + 4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32 + 8, l_laplElem3_f32x4);
    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector load operations and only a few multiplications, so I would recommend optimizing the loading of the vectors. There are two steps:
Use aligned memory in your arrays.
Use prefetch.
In order to do this I would recommend using the following function:
inline float32x4_t Load(const float * p)
{
    // use prefetch:
    __builtin_prefetch(p + 256);
    // tell the compiler that the address is aligned:
    float * _p = (float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
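With that helper in place, the intrinsic loop from the question could be reworked roughly like this (a sketch only; it assumes both arrays really are 16-byte aligned, otherwise __builtin_assume_aligned would be a lie):
for (uint64_t k = 0; k < (l_lastPixelInBlock_ui64 / 4); ++k)
{
    // one prefetched, aligned load for the Gaussian weights...
    float32x4_t l_gaussElem_f32x4 = Load(l_ptrGauss_pf32);
    // ...and three for the Laplacian block they scale
    float32x4_t l_laplElem1_f32x4 = Load(l_ptrLaplace_pf32);
    float32x4_t l_laplElem2_f32x4 = Load(l_ptrLaplace_pf32 + 4);
    float32x4_t l_laplElem3_f32x4 = Load(l_ptrLaplace_pf32 + 8);
    vst1q_f32(l_ptrLaplace_pf32,     vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4));
    vst1q_f32(l_ptrLaplace_pf32 + 4, vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4));
    vst1q_f32(l_ptrLaplace_pf32 + 8, vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4));
    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}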

Performance AVX/SSE assembly vs. intrinsics

I'm just trying to check the optimum approach to optimizing some basic routines. In this case I tried a very simple example of multiplying two float vectors together:
void Mul(float *src1, float *src2, float *dst)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}
Plain C implementation is very slow. I did some external ASM using AVX and also tried using intrinsics. These are the test results (time, smaller is better):
ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0
(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)
As you can see, my ASM code beat even Intel Performance Primitives (probably because I did lots of branching to ensure I can use the aligned AVX instructions). But I'd personally like to use the intrinsics approach: it's simply easier to manage, and I was thinking the compiler should do the best job of optimizing all the branches and such (my ASM code is weak in that regard, imho, yet it is faster). So here's the code using intrinsics:
int i;
for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];
if ((MINTEGER)(src1 + i) % 32 == 0)
{
    if ((MINTEGER)(src2 + i) % 32 == 0)
    {
        for (; i < cnt - 8; i += 8)
        {
            __m256 x = _mm256_load_ps(src1 + i);
            __m256 y = _mm256_load_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }
    else
    {
        for (; i < cnt - 8; i += 8)
        {
            __m256 x = _mm256_load_ps(src1 + i);
            __m256 y = _mm256_loadu_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }
}
else
{
    for (; i < cnt - 8; i += 8)
    {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        __m256 z = _mm256_mul_ps(x, y);
        _mm256_store_ps(dst + i, z);
    }
}
for (; i < cnt; i++) dst[i] = src1[i] * src2[i];
Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.
One problem is that the scalar C++ loops at the beginning and the end do not use AVX unless I enable AVX in the compiler, which I do NOT want to do: this should be just an AVX specialization, and the software should still work on platforms where AVX is not available. Sadly there also seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler generates. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
Any ideas how to optimize this so that the intrinsics reach the speed of the ASM code?
Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:
void Mul(float *src1, float *src2, float *__restrict__ dst)
or even better
void Mul(const float *src1, const float *src2, float *__restrict__ dst)
(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
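Put together, the scalar version with the no-overlap promise spelled out is just this (a sketch; cnt is assumed to be visible at this scope, as in the question, and __restrict__ is the GCC/Clang spelling, MSVC uses __restrict):
void Mul(const float *src1, const float *src2, float * __restrict__ dst)
{
    // with the no-aliasing guarantee the compiler is free to vectorize this loop itself
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}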
On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:
for (i = 0; i <= cnt - 8; i += 8)
{
    __m256 x = _mm256_loadu_ps(src1 + i);
    __m256 y = _mm256_loadu_ps(src2 + i);
    __m256 z = _mm256_mul_ps(x, y);
    _mm256_storeu_ps(dst + i, z);
}
for ( ; i < cnt; i++)
{
    dst[i] = src1[i] * src2[i];
}
Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
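As an illustration of the align-everything-up-front suggestion, a minimal sketch using _mm_malloc/_mm_free (buffer names mirror the question; error checking omitted):
#include <immintrin.h>

// 32-byte aligned buffers, so the aligned load/store forms can be used directly.
float *src1 = (float*)_mm_malloc(cnt * sizeof(float), 32);
float *src2 = (float*)_mm_malloc(cnt * sizeof(float), 32);
float *dst  = (float*)_mm_malloc(cnt * sizeof(float), 32);
// ... fill src1 and src2 ...
int i;
for (i = 0; i <= cnt - 8; i += 8)
{
    _mm256_store_ps(dst + i, _mm256_mul_ps(_mm256_load_ps(src1 + i),
                                           _mm256_load_ps(src2 + i)));
}
for (; i < cnt; i++) dst[i] = src1[i] * src2[i]; // scalar tail
// ...
_mm_free(src1); _mm_free(src2); _mm_free(dst);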

How do you iterate through a pitched CUDA array?

Having parallelized with OpenMP before, I'm trying to wrap my head around CUDA, which doesn't seem too intuitive to me. At this point, I'm trying to understand exactly how to loop through an array in a parallelized fashion.
CUDA by Example is a great start.
The snippet on page 43 shows:
__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x; // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
Whereas in OpenMP the programmer chooses the number of times the loop will run and OpenMP splits that into threads for you, in CUDA you have to tell it (via the number of blocks and number of threads in <<<...>>>) to run it sufficient times to iterate through your array, using a thread ID number as an iterator. In other words you can have a CUDA kernel always run 10,000 times which means the above code will work for any array up to N = 10,000 (and of course for smaller arrays you're wasting cycles dropping out at if (tid < N)).
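To make that mapping concrete, here is a sketch of the usual pattern (the device buffer names dev_a/dev_b/dev_c and the block size of 256 are just placeholders):
__global__ void add(int *a, int *b, int *c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // global index of this thread
    if (tid < N)                                     // excess threads simply drop out
        c[tid] = a[tid] + b[tid];
}
// Host side: enough blocks of 256 threads to cover all N elements (rounding up).
int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);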
For pitched memory (2D and 3D arrays), the CUDA Programming Guide has the following example:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}
This example doesn't seem too useful to me. First they declare an array that is 64 x 64, then the kernel is set to execute 512 x 100 times. That's fine, because the kernel does nothing other than iterate through the array (so it runs 51,200 loops through a 64 x 64 array).
According to this answer the iterator for when there are blocks of threads going on will be
int tid = (blockIdx.x * blockDim.x) + threadIdx.x;
So if I wanted to run the first snippet in my question for a pitched array, I could just make sure I had enough blocks and threads to cover every element including the padding that I don't care about. But that seems wasteful.
So how do I iterate through a pitched array without going through the padding elements?
In my particular application I have a 2D FFT and I'm trying to calculate arrays of the magnitude and angle (on the GPU to save time).
After reviewing the valuable comments and answers from JackOLantern, and re-reading the documentation, I was able to get my head straight. Of course the answer is "trivial" now that I understand it.
In the code below, I define CFPtype (Complex Floating Point) and FPtype so that I can quickly change between single and double precision. For example, #define CFPtype cufftComplex.
I still can't wrap my head around the number of threads used to call the kernel. If it's too large, it simply won't go into the function at all. The documentation doesn't seem to say anything about what number should be used - but this is all for a separate question.
The key in getting my whole program to work (2D FFT on pitched memory and calculating magnitude and argument) was realizing that even though CUDA gives you plenty of "apparent" help in allocating 2D and 3D arrays, everything is still in units of bytes. It's obvious in a malloc call that the sizeof(type) must be included, but I totally missed it in calls of the type allocate(width, height). Noob mistake, I guess. Had I written the library I would have made the type size a separate parameter, but whatever.
So given an image of dimensions width x height in pixels, this is how it comes together:
Allocating memory
I'm using pinned memory on the host side because it's supposed to be faster. That's allocated with cudaHostAlloc, which is straightforward. For pitched memory you need to store the pitch for each different width and type, because it can change. In my case the dimensions are all the same (a complex-to-complex transform), but I also have arrays of real numbers, so I store a complexPitch and a realPitch. The pitched device memory is allocated like this:
cudaMallocPitch(&inputGPU, &complexPitch, width * sizeof(CFPtype), height);
To copy memory to/from pitched arrays you cannot use cudaMemcpy; you need cudaMemcpy2D:
cudaMemcpy2D(inputGPU, complexPitch,               // destination and destination pitch
             inputPinned, width * sizeof(CFPtype), // source and source pitch (= width because it's not padded)
             width * sizeof(CFPtype), height, cudaMemcpyKind::cudaMemcpyHostToDevice);
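For reference, the pinned host buffer mentioned above is obtained with something along these lines (a sketch; the cudaHostAllocDefault flag is an assumption):
CFPtype *inputPinned = nullptr;
// page-locked (pinned) host memory: width*height complex values, unpadded on the host side
cudaHostAlloc((void**)&inputPinned, (size_t)width * height * sizeof(CFPtype), cudaHostAllocDefault);
// ... fill inputPinned, copy to the device with cudaMemcpy2D as above ...
cudaFreeHost(inputPinned);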
FFT plan for pitched arrays
JackOLantern provided this answer, which I couldn't have done without. In my case the plan looks like this:
int n[] = {height, width};
int nembed[] = {height, (int)(complexPitch / sizeof(CFPtype))};
result = cufftPlanMany(
    &plan,
    2, n,      // transform rank and dimensions
    nembed, 1, // input array physical dimensions and stride
    1,         // input distance to next batch (irrelevant because we are only doing 1)
    nembed, 1, // output array physical dimensions and stride
    1,         // output distance to next batch
    cufftType::CUFFT_C2C, 1);
Executing the FFT is trivial:
cufftExecC2C(plan, inputGPU, outputGPU, CUFFT_FORWARD);
So far I have had little to optimize. Now I wanted to get magnitude and phase out of the transform, hence the question of how to traverse a pitched array in parallel. First I define a function to call the kernel with the "correct" threads per block and enough blocks to cover the entire image. As suggested by the documentation, creating 2D structures for these numbers is a great help.
void GPUCalcMagPhase(CFPtype *data, size_t dataPitch, int width, int height, FPtype *magnitude, FPtype *phase, size_t magPhasePitch, int cudaBlockSize)
{
    dim3 threadsPerBlock(cudaBlockSize, cudaBlockSize);
    dim3 numBlocks((unsigned int)ceil(width / (double)threadsPerBlock.x), (unsigned int)ceil(height / (double)threadsPerBlock.y));
    CalcMagPhaseKernel<<<numBlocks, threadsPerBlock>>>(data, dataPitch, width, height, magnitude, phase, magPhasePitch);
}
Setting the blocks and threads per block is equivalent to writing the (up to 3) nested for-loops. So you have to have enough blocks * threads to cover the array, and then in the kernel you must make sure that you are not exceeding the array size. By using 2D elements for threadsPerBlock and numBlocks, you avoid having to go through the padding elements in the array.
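A call from the host might then look like this (outputGPU, magGPU and phaseGPU are placeholder names for the FFT result and the two real-valued result arrays):
// 16x16 threads per block; the wrapper derives the grid size from width and height
GPUCalcMagPhase(outputGPU, complexPitch, width, height, magGPU, phaseGPU, realPitch, 16);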
Traversing a pitched array in parallel
The kernel uses the standard pointer arithmetic from the documentation:
__global__ void CalcMagPhaseKernel(CFPtype *data, size_t dataPitch, int width, int height,
                                   FPtype *magnitude, FPtype *phase, size_t magPhasePitch)
{
    int threadX = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadX >= width)
        return;
    int threadY = threadIdx.y + blockDim.y * blockIdx.y;
    if (threadY >= height)
        return;
    CFPtype *threadRow = (CFPtype *)((char *)data + threadY * dataPitch);
    CFPtype complex = threadRow[threadX];
    FPtype *magRow = (FPtype *)((char *)magnitude + threadY * magPhasePitch);
    FPtype *magElement = &(magRow[threadX]);
    FPtype *phaseRow = (FPtype *)((char *)phase + threadY * magPhasePitch);
    FPtype *phaseElement = &(phaseRow[threadX]);
    *magElement = sqrt(complex.x*complex.x + complex.y*complex.y);
    *phaseElement = atan2(complex.y, complex.x);
}
The only wasted threads here are for the cases where the width or height are not multiples of the number of threads per block.

Why is this slower than memcmp

I am trying to compare two rows of pixels.
A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is because I need to return the position of the 1st different pixel, which memcmp does not do.
My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}
I then found that treating the values as integers instead of floats sped things up a bit, and is now only ~20% slower than memcmp.
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}
From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't know about? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16 byte boundary?
I am compiling with /O2 and all the optimization flags.
I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".
If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.
The second part is to unroll the loop, so you don't get so many "if loop isn't at the end, jump back to beginning of loop" - assuming the loop is quite long.
You may find that preloading the next data of the input before doing the "do we leave now" condition helps too.
Edit: The last paragraph may need an example. This code assumes a loop unrolled by at least two iterations:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for (int i = 0; i < count; i += 2)
{
    __m128i cmp = _mm_cmpeq_epi32(x, y);
    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    cmp = _mm_cmpeq_epi32(x1, y1);
    // reuse x and y (don't redeclare them) so the next iteration compares the preloaded data;
    // note this preload reads one element past the end on the final iteration
    x = _mm_load_si128((__m128i*)(a + i + 2));
    y = _mm_load_si128((__m128i*)(b + i + 2));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Roughly something like that.
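On the alignment question above: a plain struct of four floats only guarantees the alignment of float (4 bytes) by itself, so whether an array of pixels lands on a 16-byte boundary depends on the allocator. If you control the type, one option is to request the alignment explicitly (a sketch; alignas is C++11), which also documents the assumption the aligned SSE loads rely on:
struct alignas(16) Pixel
{
    float r, g, b, a; // 16 bytes, now guaranteed to start on a 16-byte boundary
};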
You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned or not:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned it compares the pointers byte by byte until the start of an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
    if(*a++ != *b++) return -1;
    --len;
}
And then compares the remaining memory with SSE instructions similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
    __m128i x = _mm_load_si128( (__m128i*)&a[i]);
    __m128i y = _mm_load_si128( (__m128i*)&b[i]);
    ....
I cannot help you directly because I'm using a Mac, but there's an easy way to figure out what happens:
You just step into memcmp in debug mode and switch to the Disassembly view. Since memcmp is a simple little function, you will easily figure out all the implementation tricks.