I have a shader I need to optimise (with lots of vector operations) and I am experimenting with SSE instructions in order to better understand the problem.
I have some very simple sample code. With the USE_SSE define it uses explicit SSE intrinsics; without it I'm hoping GCC will do the work for me. Auto-vectorisation feels a bit finicky but I'm hoping it will save me some hair.
Compiler and platform: GCC 4.7.1 (TDM64), target x86_64-w64-mingw32, on Windows 7 with an Ivy Bridge CPU.
Here's the test code:
/*
Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>
#if defined(__GNUG__) || defined(__clang__)
/* GCC & CLANG */
#define SSVEC_FINLINE __attribute__((always_inline))
#elif defined(_WIN32) && defined(_MSC_VER)
/* MSVC. */
#define SSVEC_FINLINE __forceinline
#else
#error Unsupported platform.
#endif
#ifdef USE_SSE
typedef __m128 vec4f;
inline void addvec4f(vec4f &a, vec4f const &b)
{
a = _mm_add_ps(a, b);
}
#else
typedef float vec4f[4];
inline void addvec4f(vec4f &a, vec4f const &b)
{
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
a[3] = a[3] + b[3];
}
#endif
int main(int argc, char *argv[])
{
int const count = 1e7;
#ifdef USE_SSE
printf("Using SSE.\n");
#else
printf("Not using SSE.\n");
#endif
vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};
for (int i = 0; i < count; ++i)
{
vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
addvec4f(data, val);
}
float result[4] = {0};
#ifdef USE_SSE
_mm_store_ps(result, data);
#else
result[0] = data[0];
result[1] = data[1];
result[2] = data[2];
result[3] = data[3];
#endif
printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);
return 0;
}
This is compiled with:
g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe
Apart from the explicit SSE version being a bit quicker, there is no difference in output.
Here's the assembly for the loop, first explicit SSE:
.L3:
subl $1, %eax
addps %xmm1, %xmm0
jne .L3
It inlined the call. Nice, more or less just a straight up _mm_add_ps.
Array version:
.L3:
subl $1, %eax
addss %xmm0, %xmm1
addss %xmm0, %xmm2
addss %xmm0, %xmm3
addss %xmm0, %xmm4
jne .L3
It is using SSE math alright, but on each array member. Not really desirable.
My question is: how can I help GCC better optimise the array version of vec4f?
Any Linux-specific tips are helpful too; that's where the real code will run.
This LockLess article on auto-vectorization with GCC 4.7 is hands down the best article I have seen on the topic, and I have spent a while looking for good articles on similar subjects. They also have many other articles you may find useful, dealing with all manner of low-level software development.
Here are some tips, based on your code, to make GCC auto-vectorization work:
Make the loop upper bound a constant. To vectorize, GCC needs to split the loop into chunks of 4 iterations to fill a 128-bit SSE XMM register. A constant upper bound helps GCC prove that the loop has enough iterations for vectorization to be profitable.
Remove the inline keyword. If the code is marked inline, GCC cannot know whether the start of the array is aligned without inter-procedural analysis, which is not turned on by -O3.
So, to make your code vectorize, your addvec4f function should be modified as follows:
void addvec4f(vec4f &a, vec4f const &b)
{
int i = 0;
for(;i < 4; i++)
a[i] = a[i]+b[i];
}
BTW:
GCC also has a flag to help you find out whether a loop has been vectorized: -ftree-vectorizer-verbose=2. Higher numbers produce more output; currently the value can be 0, 1 or 2. Here is the documentation of this flag, and some other related flags.
Be careful with alignment. The start address of the array should be aligned, and the compiler cannot know whether it is aligned without running the code. Usually there will be a bus error if the data is not aligned. Here is the reason.
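One way to address both tips at once is to force 16-byte alignment with alignas and then tell the compiler about it with __builtin_assume_aligned. This is a minimal sketch, not from the original answer, and it assumes GCC or Clang (the builtin is not portable):

```cpp
#include <cstdint>

// Minimal sketch (GCC/Clang only): alignas(16) guarantees the storage is
// 16-byte aligned, and __builtin_assume_aligned tells the vectorizer it may
// emit aligned SSE loads without a runtime alignment check.
struct alignas(16) vec4f_aligned {
    float v[4];
};

void addvec4f_aligned(vec4f_aligned &a, vec4f_aligned const &b)
{
    float *pa = static_cast<float *>(__builtin_assume_aligned(a.v, 16));
    float const *pb = static_cast<float const *>(__builtin_assume_aligned(b.v, 16));
    for (int i = 0; i < 4; i++)
        pa[i] += pb[i];  // a single addps is now possible for this loop
}
```

With the alignment visible at the definition, GCC no longer needs to emit a runtime prologue to handle misaligned starts.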
I have code that compares one string to another, but with the optimizer turned on (-Oz image size + LTO) I run into problems. I've tried volatile parameters, but my variables are still optimized away.
bool INA239::getBusStatus() {
volatile uint8_t ManfId[3] = "TI";
volatile uint8_t* data = readReg(INA239_REG_MANUFACTURER_ID);
volatile bool result = !memcmp((uint8_t*)data, (uint8_t*)ManfId, 2);
return result;
}
The readReg() function reads data over SPI using DMA, and as far as I can see in my SPI buffer, it performs OK.
uint8_t* INA239::readReg(uint8_t RegNo) {
SPITxBuffer[0] = (RegNo << 2) | 0x01;
SPITxBuffer[1] = 0x00;
SPITxBuffer[2] = 0x00;
while(!DSPI->ModeChange(GPort, GPin, INA239_CPOL, INA239_CPHA));
while(!DSPI->TXRX(SPITxBuffer, SPIRxBuffer, INA239_SPIBUFFERSIZE + (RegNo == INA239_REG_POWER)));
while(DSPI->Maintain);
return &SPIRxBuffer[1];
}
It sends data and waits for HAL_SPI_TxRxCpltCallback.
I've found a strange workaround that somewhat works:
Add HAL_Delay(1) after
volatile uint8_t* data = readReg(INA239_REG_MANUFACTURER_ID);
But none of this is needed if the optimizer flag is -O0.
It seems like optimized code behaves strangely with DMA.
I suspected the optimizer was discarding the empty-body while loops, so I checked:
57: while(!DSPI->TXRX(SPITxBuffer, SPIRxBuffer, INA239_SPIBUFFERSIZE + (RegNo == INA239_REG_POWER)));
58: while(DSPI->Maintain);
0x08003EFA 4631 MOV r1,r6
0x08003EFC 4622 MOV r2,r4
0x08003EFE 462B MOV r3,r5
0x08003F00 F7FDFD8B BL.W 0x08001A1A OUTLINED_FUNCTION_9
0x08003F04 2800 CMP r0,#0x00
0x08003F06 D0F8 BEQ 0x08003EFA
The first loop looks fine, I thought, since the BEQ 0x08003EFA is there; but the while(DSPI->Maintain) loop on line 58 has been optimized away entirely.
As a workaround I made Maintain a volatile bool, after which the assembly is:
58: while(DSPI->Maintain);
0x08003F86 F8901021 LDRB r1,[r0,#0x21]
0x08003F8A 2900 CMP r1,#0x00
0x08003F8C D1FB BNE 0x08003F86
But I want all of my INA239 functions to be left unoptimized.
So, is there any way to make armclang v6 not optimize a block of code?
I've tried:
1)
#pragma push
#pragma O0
void function2(void)
{
}
#pragma pop
but the compiler says these are Arm Compiler 5 options and will not work on v6
(-Warmcc-pragma-optimization)
2)
void __attribute__((optimize("O0")))
but I get an unknown attribute warning
3)
#pragma GCC push_options
#pragma GCC optimize ("O0")
your code
#pragma GCC pop_options
no messages, but it doesn't work
4)
#pragma push_options
#pragma optimize ("O0")
your code
#pragma pop_options
no messages, but it doesn't work
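For what it's worth, armclang v6 is Clang-based, so the usual Clang mechanisms should apply here. This is a hedged sketch under that assumption (the function names are hypothetical illustrations, not from the original code): the optnone function attribute for a single function, or #pragma clang optimize off/on around a group of them.

```c
/* Sketch, assuming a Clang-based compiler such as armclang v6.
 * Function names are hypothetical. */
__attribute__((optnone))
int add_unoptimized(int a, int b)
{
    return a + b;  /* this function is compiled without optimization,
                      even when the rest of the file uses -O3/-Oz */
}

/* Everything between "optimize off" and "optimize on" is unoptimized. */
#pragma clang optimize off
int mul_unoptimized(int a, int b)
{
    return a * b;
}
#pragma clang optimize on
```

Note that optnone is per-function; for whole translation units, compiling just that file at -O0 is another option.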
Can std::memmove() be used to "move" the memory to the same location to be able to alias it using different types?
For example:
#include <cstring>
#include <cstdint>
#include <iomanip>
#include <iostream>
struct Parts { std::uint16_t v[2u]; };
static_assert(sizeof(Parts) == sizeof(std::uint32_t), "");
static_assert(alignof(Parts) <= alignof(std::uint32_t), "");
int main() {
std::uint32_t u = 0xdeadbeef;
Parts * p = reinterpret_cast<Parts *>(std::memmove(&u, &u, sizeof(u)));
std::cout << std::hex << u << " ~> "
<< p->v[0] << ", " << p->v[1] << std::endl;
}
$ g++-10.2.0 -Wall -Wextra test.cpp -o test -O2 -ggdb -fsanitize=address,undefined -std=c++20 && ./test
deadbeef ~> beef, dead
Is this a safe approach? What are the caveats? Can static_cast be used instead of reinterpret_cast here?
If one is interested in knowing which constructs the clang and gcc optimizers will reliably process in the absence of -fno-strict-aliasing, rather than assuming that everything defined by the Standard will be processed meaningfully: both clang and gcc will sometimes ignore changes to active/effective types made by operations, or sequences of operations, that would not affect the bit pattern in a region of storage.
As an example:
#include <limits.h>
#include <string.h>
#if LONG_MAX == LLONG_MAX
typedef long long longish;
#elif LONG_MAX == INT_MAX
typedef int longish;
#endif
__attribute((noinline))
long test(long *p, int index, int index2, int index3)
{
if (sizeof (long) != sizeof (longish))
return -1;
p[index] = 1;
((longish*)p)[index2] = 2;
longish temp2 = ((longish*)p)[index3];
p[index3] = 5; // This should modify p[index3] and set its active/effective type
p[index3] = temp2; // Shouldn't (but seems to) reset effective type to longish
long temp3;
memmove(&temp3, p+index3, sizeof (long));
memmove(p+index3, &temp3, sizeof (long));
return p[index];
}
#include <stdio.h>
int main(void)
{
long arr[1] = {0};
long temp = test(arr, 0, 0, 0);
printf("%ld should equal %ld\n", temp, arr[0]);
}
While gcc happens to process this code correctly on 32-bit ARM (even if one uses -mcpu=cortex-m3 to avoid the call to memmove), clang processes it incorrectly on both platforms. Interestingly, while clang makes no attempt to reload p[index], gcc does reload it on both platforms; the code it generates for test on x64, however, is:
test(long*, int, int, int):
movsx rsi, esi
movsx rdx, edx
lea rax, [rdi+rsi*8]
mov QWORD PTR [rax], 1
mov rax, QWORD PTR [rax]
mov QWORD PTR [rdi+rdx*8], 2
ret
This code writes the value 1 to p[index], then reads p[index] back, stores 2 to p[index2], and returns the value read before that store.
It's possible that memmove will scrub active/effective type on all implementations that correctly handle all of the corner cases mandated by the Standards, but it's not necessary on the -fno-strict-aliasing dialects of clang and gcc, and it's insufficient on the -fstrict-aliasing dialects processed by those compilers.
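For completeness, the approach both compilers handle reliably is to copy the bytes into a separate object of the destination type with memcpy, instead of reading through a cast pointer. A minimal sketch, reusing the Parts type from the question above:

```cpp
#include <cstdint>
#include <cstring>

struct Parts { std::uint16_t v[2]; };

// memcpy into a real Parts object gives it a well-defined value with no
// aliasing question; optimizers typically reduce this to a register move.
Parts to_parts(std::uint32_t u)
{
    Parts p;
    std::memcpy(&p, &u, sizeof p);
    return p;
}
```

The halves come out in the machine's byte order, so on a little-endian target to_parts(0xdeadbeef) yields {0xbeef, 0xdead}.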
I want to speed up image-processing code using OpenMP, and I found some strange behavior in my code. I'm using Visual Studio 2019, and I also tried the Intel C++ compiler with the same result.
I'm not sure why the OpenMP code is much slower in some situations than in others. For example: the function divideImageDataWithParam(), or the difference between copyFirstPixelOnRow() and copyFirstPixelOnRowUsingTSize(), which takes the image size as a TSize struct. Why is the performance of boxFilterRow() and boxFilterRow_OpenMP() so different, and why does the gap vary with the radius size?
I created github repository for this little testing project:
https://github.com/Tb45/OpenMP-Strange-Behavior
Here are all results summarized:
https://github.com/Tb45/OpenMP-Strange-Behavior/blob/master/resuts.txt
I didn't find any explanation of why this happens, or of what I am doing wrong.
Thanks for your help.
I'm working on faster box filter and others for image processing algorithms.
typedef intptr_t int_t;
struct TSize
{
int_t width;
int_t height;
};
void divideImageDataWithParam(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = src[y*srcStep + x]/param;
}
}
}
void divideImageDataWithParam_OpenMP(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param, bool parallel)
{
#pragma omp parallel for if(parallel)
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = src[y*srcStep + x]/param;
}
}
}
Results of divideImageDataWithParam():
generateRandomImageData :: 3840x2160
numberOfIterations = 100
With Visual C++ 2019:
32bit 64bit
336.906ms 344.251ms divideImageDataWithParam
1832.120ms 6395.861ms divideImageDataWithParam_OpenMP single-threaded parallel=false
387.152ms 1204.302ms divideImageDataWithParam_OpenMP multi-threaded parallel=true
With Intel C++ 19:
32bit 64bit
15.162ms 8.927ms divideImageDataWithParam
266.646ms 294.134ms divideImageDataWithParam_OpenMP single-threaded parallel=false
239.564ms 1195.556ms divideImageDataWithParam_OpenMP multi-threaded parallel=true
Screenshot from Intel VTune Amplifier, where divideImageDataWithParam_OpenMP() with parallel=false take most of the time in instruction mov to dst memory.
648trindade is right; it has to do with optimizations that cannot be done with OpenMP. But it's not loop unrolling or vectorization, it's inlining, which allows for a smart substitution.
Let me explain: integer divisions are incredibly slow (64-bit IDIV: ~40-100 cycles). So whenever possible, people (and compilers) try to avoid divisions. One trick is to substitute a division with a multiplication and a shift. That only works if the divisor is known at compile time, which is the case here because your function divideImageDataWithParam is inlined and param is known. You can verify this by prepending the function with __declspec(noinline); you will then get the timings you expected.
The OpenMP parallelization does not allow this trick, because the outlined function cannot be inlined; param is therefore not known at compile time and an expensive IDIV instruction is generated.
Compiler output of divideImageDataWithParam (WIN10, MSVC2017, x64):
0x7ff67d151480 <+ 336> movzx ecx,byte ptr [r10+r8]
0x7ff67d151485 <+ 341> mov rax,r12
0x7ff67d151488 <+ 344> mul rax,rcx <------- multiply
0x7ff67d15148b <+ 347> shr rdx,3 <------- shift
0x7ff67d15148f <+ 351> mov byte ptr [r8],dl
0x7ff67d151492 <+ 354> lea r8,[r8+1]
0x7ff67d151496 <+ 358> sub r9,1
0x7ff67d15149a <+ 362> jne test!main+0x150 (00007ff6`7d151480)
And the openmp-version:
0x7ff67d151210 <+ 192> movzx eax,byte ptr [r10+rcx]
0x7ff67d151215 <+ 197> lea rcx,[rcx+1]
0x7ff67d151219 <+ 201> cqo
0x7ff67d15121b <+ 203> idiv rax,rbp <------- idiv
0x7ff67d15121e <+ 206> mov byte ptr [rcx-1],al
0x7ff67d151221 <+ 209> lea rax,[r8+rcx]
0x7ff67d151225 <+ 213> mov rdx,qword ptr [rbx]
0x7ff67d151228 <+ 216> cmp rax,rdx
0x7ff67d15122b <+ 219> jl test!divideImageDataWithParam$omp$1+0xc0 (00007ff6`7d151210)
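The mul/shr pair in the non-OpenMP listing above is the classic multiply-and-shift replacement for division by a compile-time constant. A minimal C sketch of the same trick, using the well-known magic constant for unsigned division by 3 (an illustrative divisor, not necessarily the one from the question):

```c
#include <stdint.h>

/* 0xAAAAAAAB is ceil(2^33 / 3), so (x * 0xAAAAAAAB) >> 33 computes
 * floor(x / 3) for every 32-bit unsigned x, with a multiply and a shift
 * instead of an expensive division instruction. The 64-bit product never
 * overflows: UINT32_MAX * 0xAAAAAAAB < 2^64. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

This is exactly what the compiler does internally when the divisor is visible; when it is not (as in the outlined OpenMP body), it must fall back to IDIV.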
Note 1) If you try the Compiler Explorer (https://godbolt.org/) you will see that some compilers do the substitution for the OpenMP version too.
Note 2) As soon as the parameter is not known at compile time, this optimization cannot be done anyway. So if you put your function into a library, it will be slow. I'd do something like precomputing the division for all 256 possible byte values and then doing a lookup. This is even faster, because the lookup table fits into 4-5 cache lines and L1 latency is only 3-4 cycles:
void divideImageDataWithParam(
const unsigned char * src, int_t srcStep, unsigned char * dst, int_t dstStep, TSize size, int_t param)
{
uint8_t tbl[256];
for(int i = 0; i < 256; i++) {
tbl[i] = i / param;
}
for (int_t y = 0; y < size.height; y++)
{
for (int_t x = 0; x < size.width; x++)
{
dst[y*dstStep + x] = tbl[src[y*srcStep + x]];
}
}
}
Also thanks for the interesting question, I learned a thing or two along the way! ;-)
This behavior is explained by compiler optimizations: when they are enabled, the sequential divideImageDataWithParam code is subjected to a series of optimizations (loop unrolling, vectorization, etc.) that the parallel divideImageDataWithParam_OpenMP code probably is not, since the parallel region is outlined into a separate function by the compiler.
If you compile this same code without optimizations, you will find that the runtime of the sequential version is very similar to that of the parallel version with a single thread.
The maximum speedup of the parallel version is therefore limited by the unoptimized original workload it divides up; such optimizations have to be written manually.
The Xeon Phi Knights Landing cores have a fast exp2 instruction, vexp2pd (intrinsic _mm512_exp2a23_pd). The Intel C++ compiler can vectorize the exp function using the Short Vector Math Library (SVML) that comes with the compiler; specifically, it calls the function __svml_exp8.
However, when I step through with a debugger I don't see __svml_exp8 using the vexp2pd instruction. It is a complicated function with many FMA operations. I understand that vexp2pd is less accurate than exp, but if I use -fp-model fast=1 (the default) or -fp-model fast=2, I expect the compiler to use this instruction, yet it does not.
I have two questions.
Is there a way to get the compiler to use vexp2pd?
How do I safely override the call to __svml_exp8?
As to the second question this is what I have done so far.
//exp(x) = exp2(log2(e)*x)
extern "C" __m512d __svml_exp8(__m512d x) {
return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
Is this safe? Is there a better solution e.g. one that inlines the function? In the test code below this is about 3 times faster than if I don't override.
//https://godbolt.org/g/adI11c
//icpc -O3 -xMIC-AVX512 foo.cpp
#include <math.h>
#include <stdio.h>
#include <x86intrin.h>
extern "C" __m512d __svml_exp8(__m512d x) {
//exp(x) = exp2(log2(e)*x)
return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}
void foo(double * __restrict x, double * __restrict y) {
__assume_aligned(x, 64);
__assume_aligned(y, 64);
for(int i=0; i<1024; i++) y[i] = exp(x[i]);
}
int main(void) {
double x[1024], y[1024];
for(int i=0; i<1024; i++) x[i] = 1.0*i;
for(int r=0; r<1000000; r++) foo(x,y);
double sum=0;
//for(int i=0; i<1024; i++) sum+=y[i];
for(int i=0; i<8; i++) printf("%f ", y[i]); puts("");
//printf("%lf",sum);
}
ICC will generate vexp2pd, but only under very relaxed math requirements, as specified by targeted -fimf-* switches.
#include <math.h>
void vfoo(int n, double * a, double * r)
{
int i;
#pragma simd
for ( i = 0; i < n; i++ )
{
r[i] = exp(a[i]);
}
}
E.g. compile with -xMIC-AVX512 -fimf-domain-exclusion=1 -fimf-accuracy-bits=22
..B1.12:
vmovups (%rsi,%rax,8), %zmm0
vmulpd .L_2il0floatpacket.2(%rip){1to8}, %zmm0, %zmm1
vexp2pd %zmm1, %zmm2
vmovupd %zmm2, (%rcx,%rax,8)
addq $8, %rax
cmpq %r8, %rax
jb ..B1.12
Please be sure to understand the accuracy implications as not only the end result is only about 22 bits accurate, but the vexp2pd also flushes to zero any denormalized results irrespective of the FTZ/DAZ bits set in the MXCSR.
To the second question: "How do I safely override the call to __svml_exp8?"
Your approach is generally not safe. SVML routines are internal to Intel Compiler and rely on custom calling conventions, so a generic routine with the same name can potentially clobber more registers than a library routine would, and you may end up in a hard-to-debug ABI mismatch.
A better way of providing your own vector functions would be to utilize #pragma omp declare simd, e.g. see https://software.intel.com/en-us/node/524514, and possibly the vector_variant attribute if you prefer coding with intrinsics, see https://software.intel.com/en-us/node/523350. Just don't try to override standard math names, or you'll get an error.
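As a sketch of that suggestion (function names are hypothetical; the exp(x) = 2^(x*log2(e)) identity is the one from the question), a scalar function declared with #pragma omp declare simd lets the compiler generate callable vector variants itself, without touching SVML internals:

```cpp
#include <cmath>

// Sketch: with -qopenmp-simd (ICC) or -fopenmp-simd (GCC/Clang), the
// compiler generates vector variants of fast_exp and calls them from the
// vectorized loop below. Without those flags the pragmas are ignored and
// the scalar version runs, so the results are the same either way.
#pragma omp declare simd simdlen(8) notinbranch
double fast_exp(double x)
{
    return std::exp2(1.4426950408889634 * x);  // exp(x) = 2^(x*log2(e))
}

void apply_exp(int n, const double *a, double *r)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        r[i] = fast_exp(a[i]);
}
```

Unlike overriding __svml_exp8, this keeps the standard calling convention, so there is no risk of clobbering registers the compiler assumed were preserved.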
I've understood it's best to avoid _mm_set_epi*, and instead rely on _mm_load_si128 (or even _mm_loadu_si128 with a small performance hit if the data is not aligned). However, the impact this has on performance seems inconsistent to me. The following is a good example.
Consider the two following functions that utilize SSE intrinsics:
static uint32_t clmul_load(uint16_t x, uint16_t y)
{
const __m128i c = _mm_clmulepi64_si128(
_mm_load_si128((__m128i const*)(&x)),
_mm_load_si128((__m128i const*)(&y)), 0);
return _mm_extract_epi32(c, 0);
}
static uint32_t clmul_set(uint16_t x, uint16_t y)
{
const __m128i c = _mm_clmulepi64_si128(
_mm_set_epi16(0, 0, 0, 0, 0, 0, 0, x),
_mm_set_epi16(0, 0, 0, 0, 0, 0, 0, y), 0);
return _mm_extract_epi32(c, 0);
}
The following function benchmarks the performance of the two:
template <typename F>
void benchmark(int t, F f)
{
std::mt19937 rng(static_cast<unsigned int>(std::time(0)));
std::uniform_int_distribution<uint32_t> uint_dist10(
0, std::numeric_limits<uint32_t>::max());
std::vector<uint32_t> vec(t);
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < t; ++i)
{
vec[i] = f(uint_dist10(rng), uint_dist10(rng));
}
auto duration = std::chrono::duration_cast<
std::chrono::milliseconds>(
std::chrono::high_resolution_clock::now() -
start);
std::cout << (duration.count() / 1000.0) << " seconds.\n";
}
Finally, the following main program does some testing:
int main()
{
const int N = 10000000;
benchmark(N, clmul_load);
benchmark(N, clmul_set);
}
On an i7 Haswell with MSVC 2013, a typical output is
0.208 seconds. // _mm_load_si128
0.129 seconds. // _mm_set_epi16
Using GCC with parameters -O3 -std=c++11 -march=native (with slightly older hardware), a typical output is
0.312 seconds. // _mm_load_si128
0.262 seconds. // _mm_set_epi16
What explains this? Are there actually cases where _mm_set_epi* is preferable to _mm_load_si128? There are other times where I've noticed _mm_load_si128 perform better, but I can't really characterize those observations.
Your compiler is optimizing away the "gather" behavior of your _mm_set_epi16() call since it really isn't needed. From g++ 4.8 (-O3) and gdb:
(gdb) disas clmul_load
Dump of assembler code for function clmul_load(uint16_t, uint16_t):
0x0000000000400b80 <+0>: mov %di,-0xc(%rsp)
0x0000000000400b85 <+5>: mov %si,-0x10(%rsp)
0x0000000000400b8a <+10>: vmovdqu -0xc(%rsp),%xmm0
0x0000000000400b90 <+16>: vmovdqu -0x10(%rsp),%xmm1
0x0000000000400b96 <+22>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400b9c <+28>: vmovd %xmm0,%eax
0x0000000000400ba0 <+32>: retq
End of assembler dump.
(gdb) disas clmul_set
Dump of assembler code for function clmul_set(uint16_t, uint16_t):
0x0000000000400bb0 <+0>: vpxor %xmm0,%xmm0,%xmm0
0x0000000000400bb4 <+4>: vpxor %xmm1,%xmm1,%xmm1
0x0000000000400bb8 <+8>: vpinsrw $0x0,%edi,%xmm0,%xmm0
0x0000000000400bbd <+13>: vpinsrw $0x0,%esi,%xmm1,%xmm1
0x0000000000400bc2 <+18>: vpclmullqlqdq %xmm1,%xmm0,%xmm0
0x0000000000400bc8 <+24>: vmovd %xmm0,%eax
0x0000000000400bcc <+28>: retq
End of assembler dump.
The vpinsrw (insert word) is ever-so-slightly faster than the unaligned double-quadword move in clmul_load, likely because the internal load/store unit can do the smaller reads simultaneously but not the 16-byte ones. If you were doing more arbitrary loads, this advantage would obviously go away.
The slowness of _mm_set_epi* comes from the need to scrape together various variables into a single vector. You'd have to examine the generated assembly to be certain, but my guess is that since most of the arguments to your _mm_set_epi16 calls are constants (and zeroes, at that), GCC is generating a fairly short and fast set of instructions for the intrinsic.
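Worth noting, as a sketch that neither answer shows: for a single scalar there is a third option, _mm_cvtsi32_si128, which moves the value into the low lanes of an XMM register and zeroes the rest. It avoids both the insert sequence of _mm_set_epi16 and the problems of clmul_load, which reads 16 bytes from a 2-byte object and assumes 16-byte alignment.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Sketch: _mm_cvtsi32_si128 compiles to a single movd that zero-fills the
// upper lanes; no stack spill, no alignment requirement, no out-of-bounds
// read.
static inline __m128i load_scalar(uint16_t x)
{
    return _mm_cvtsi32_si128(x);  // uint16_t zero-extends into the int arg
}
```

This keeps the single-instruction load while remaining well-defined for arguments that live in registers or small stack slots.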