I am trying to understand vectorization, but to my surprise this very simple code is not being vectorized:
#define n 1024
int main() {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
}
The Intel compiler, for some reason, vectorizes only the initialization loop (line 5):
> icc -vec-report a.c
a.c(5): (col. 3) remark: LOOP WAS VECTORIZED
With GCC, it seems I get nothing:
> gcc -ftree-vectorize -ftree-vectorizer-verbose=2 a.c
Am I doing something wrong? Shouldn't this be a very simple vectorizable loop? Same operation across all elements, contiguous memory, etc. My CPU supports SSE1/2/3/4.
--- update ---
Following the answer below, this example works for me.
#include <stdio.h>
#define n 1024
int main() {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
    printf("%d\n", c[1023]);
}
With icc
> icc -vec-report a.c
a.c(7): (col. 3) remark: LOOP WAS VECTORIZED
a.c(8): (col. 3) remark: LOOP WAS VECTORIZED
And gcc
> gcc -ftree-vectorize -fopt-info-vec -O a.c
a.c:8:3: note: loop vectorized
a.c:7:3: note: loop vectorized
I've slightly modified your source code to be sure that GCC couldn't remove the loops:
#include <stdio.h>
#define n 1024
int main() {
    int i, a[n], b[n], c[n];
    for (i = 0; i < n; i++) { a[i] = i; b[i] = i*i; }
    for (i = 0; i < n; i++) c[i] = a[i] + b[i];
    printf("%d\n", c[1023]);
}
GCC (v4.8.2) can vectorize both loops, but it needs optimization enabled (the -O flag or higher):
gcc -ftree-vectorize -ftree-vectorizer-verbose=1 -O2 a.c
and I get:
Analyzing loop at a.c:8
Vectorizing loop at a.c:8
a.c:8: note: LOOP VECTORIZED.
Analyzing loop at a.c:7
Vectorizing loop at a.c:7
a.c:7: note: LOOP VECTORIZED.
a.c: note: vectorized 2 loops in function.
Using the -fdump-tree-vect switch, GCC will dump more information into the a.c.##t.vect file (quite useful to get an idea of what is happening "inside").
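For instance, one possible invocation (the exact dump file name varies with the GCC version and pass number):

gcc -O2 -ftree-vectorize -fdump-tree-vect a.c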
Also consider that:
the -march= switch could be essential to perform vectorization
-ftree-vectorizer-verbose=n is now deprecated in favor of -fopt-info-vec and -fopt-info-vec-missed (see http://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html)
Most of the time, the options -Ofast -march=native will vectorize your code, if your processor is capable of it.
$ gcc compute_simple.c -Ofast -march=native -fopt-info-vec -o compute_simple.bin
compute_simple.c:14:5: note: loop vectorized
compute_simple.c:14:5: note: loop versioned for vectorization because of possible aliasing
compute_simple.c:14:5: note: loop vectorized
To find out whether your processor can do it, use lscpu and look at the available flags:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
...
Vendor ID: GenuineIntel
...
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand
lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb
stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1
hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt
xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify
hwp_act_window hwp_epp md_clear flush_l1d
You need SSE/AVX on Intel, NEON on ARM, and other extensions on AMD (such as XOP).
You can find much more information on vectorization in the GCC documentation.
Here is a nice article on the subject, with flags that can be used for many platforms:
https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Finally, as written above, use -ftree-vectorizer-verbose=n in old versions of GCC, and -fopt-info-vec / -fopt-info-vec-missed in recent ones, to see what is vectorized.
Related
I tested the following code on my machine to see how much throughput I can get. The code does not do much except assign each thread a pair of nested loops:
#include <chrono>
#include <iostream>
int main() {
    auto start_time = std::chrono::high_resolution_clock::now();
    #pragma omp parallel for
    for (int thread = 0; thread < 24; thread++) {
        float i = 0.0f;
        while (i < 100000.0f) {
            float j = 0.0f;
            while (j < 100000.0f) {
                j = j + 1.0f;
            }
            i = i + 1.0f;
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = end_time - start_time;
    std::cout << time / std::chrono::milliseconds(1) << std::endl;
    return 0;
}
To my surprise, the throughput is very low according to perf
$ perf stat -e all_dc_accesses -e fp_ret_sse_avx_ops.all cmake-build-release/roofline_prediction
8907
Performance counter stats for 'cmake-build-release/roofline_prediction':
325.372.690 all_dc_accesses
240.002.400.000 fp_ret_sse_avx_ops.all
8,909514307 seconds time elapsed
202,819795000 seconds user
0,059613000 seconds sys
With 240.002.400.000 FLOPs in 8.83 seconds, the machine achieved only 27.1 GFLOPs/second, way below the CPU's capacity of 392 GFLOPs/sec (I got this number from a roofline modelling software).
My question is: how can I achieve higher throughput?
Compiler: GCC 9.3.0
CPU: AMD Threadripper 1920X
Optimization level: -O3
OpenMP flag: -fopenmp
Compiled with GCC 9.3 with those options, the inner loop looks like this:
.L3:
addss xmm0, xmm2
comiss xmm1, xmm0
ja .L3
Some other combinations of GCC version / options may result in the loop being elided entirely; after all, it doesn't really do anything (except waste time).
The addss forms a loop-carried dependency chain with only itself in it. That is not fast, though: on Zen 1 it takes 3 cycles per iteration, so the number of additions per cycle is 1/3. The maximum number of floating-point additions per cycle could be attained by having at least 6 independent addps instructions (256-bit vaddps may help a bit, but Zen 1 executes such 256-bit SIMD instructions as 2 internal 128-bit operations), to cover the latency of 3 cycles at a throughput of 2 per cycle (so 6 operations need to be in flight at any time). Two 4-wide addps per cycle correspond to 8 additions per cycle, 24 times as much as the current code.
From a C++ program, it may be possible to coax the compiler into generating suitable machine code by:
Using -ffast-math (if possible, which it isn't always)
Using explicit vectorization with _mm_add_ps
Manually unrolling the loop, using (at least 6) independent accumulators; a sketch combining these last two ideas follows below
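As an illustration only (not a definitive implementation), here is a minimal sketch; the function name, the array-summing framing, and the assumption that the element count is a multiple of 24 are all inventions for the example:

#include <cstddef>
#include <immintrin.h>

// Six independent accumulators hide the 3-cycle addps latency on Zen 1:
// at a throughput of 2 adds per cycle, 3 * 2 = 6 additions must be in
// flight at any time.
float sum_unrolled(const float *data, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps(), acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps(), acc4 = _mm_setzero_ps(), acc5 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 24) {  // 6 vectors of 4 floats each
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(data + i));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(data + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(data + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(data + i + 12));
        acc4 = _mm_add_ps(acc4, _mm_loadu_ps(data + i + 16));
        acc5 = _mm_add_ps(acc5, _mm_loadu_ps(data + i + 20));
    }
    // Combine the six accumulators, then reduce the final vector to a scalar.
    __m128 acc = _mm_add_ps(_mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3)),
                            _mm_add_ps(acc4, acc5));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}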
I read an article from Igor's blog. The article said:
... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
The article also provides C# code to verify the above conclusion:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The two for-loops take about the same time: 80 and 78 ms respectively on Igor's machine, so the cache-line mechanism is verified.
I then applied the above idea to implement a C++ version that verifies the cache line size, as follows:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>
using namespace std::chrono;
const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;
int testCacheHit(int * pBuffer, int size, int step)
{
int result = 0;
for (int i = 0; i < size;) {
result += pBuffer[i];
i += step;
}
return result;
}
int main()
{
int * pBuffer = new int[buff_size*total_buff_count];
for (int i = 0; i < total_buff_count; ++i) {
int step = (int)pow(2, i);
auto start = std::chrono::system_clock::now();
volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
}
delete[] pBuffer;
}
But my test results are totally different from those in Igor's article. If the step is 1, the time cost is about 114 ms; if the step is 16, it is about 78 ms. The test application is built in the release configuration; my machine has 32 GB of memory and the CPU is an Intel Xeon E5-2420 v2 @ 2.2 GHz.
The interesting finding is that the time cost decreases significantly at step 2 and again at step 2048. My question is: how can the drops at step 2 and step 2048 in my test be explained? Why are my results so different from Igor's? Thanks.
My own explanation for the first question is that the time cost of the code has two parts: one is "memory read/write", covering the memory access time, and the other is "other cost", covering the loop and the computation. If the step is 2, the "memory read/write" cost stays almost unchanged (because of the cache line), but the computation and loop cost is halved, so we see an obvious gap. I would also guess that the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes, which would explain the other gap at step 2048. But it's just my guess. Any help from you guys is appreciated, thanks.
Drop between 1024 and 2048
Note that you are using an uninitialized array. This basically means that
int * pBuffer = new int[buff_size*total_buff_count];
does not cause your program to actually ask for any physical memory. Instead, just some virtual address space is reserved.
Then, when you first touch an array element, a page fault is triggered and the OS maps the page to physical memory. This is a relatively slow operation which may significantly influence your experiment. Since the page size on your system is likely 4 kB, one page holds 1024 4-byte integers. When you go to a step of 2048, only every second page is actually accessed, and the runtime drops proportionally.
You can avoid the negative effect of this mechanism by "touching" the memory in advance:
int * pBuffer = new int[buff_size*total_buff_count]{};
When I tried that, I got an almost linear decrease of time between the 64 and 8192 step sizes.
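Equivalently, as a rough alternative sketch, you can pre-fault the pages with an explicit loop before timing (this assumes the 4 kB page size mentioned above):

// Touch one int per assumed 4 kB page (1024 ints) so that every page
// is mapped to physical memory before the measurements start.
for (int i = 0; i < buff_size * total_buff_count; i += 1024)
    pBuffer[i] = 0;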
Drop between 1 and 2
A cache line size on your system is definitely not 2048 bytes, it's very likely 64 bytes (generally, it may have different values and even different values for different cache levels).
As for the first part: with a step of 1, there are simply many more arithmetic operations involved (additions of array elements and increments of i).
Difference from Igor's experiment
We can only speculate about why Igor's experiment gave practically the same times in both cases. I would guess that the runtime of the arithmetic is negligible there, since only a single loop-counter increment is involved, and he writes into the array, which requires an additional transfer of cache lines back to memory. (We can say that the byte/op ratio is much higher than in your experiment.)
How to verify CPU cache line size with c++ code?
There is std::hardware_destructive_interference_size in C++17, which should give the smallest cache line size. Note that it is a compile-time value, and the compiler relies on what you tell it about the targeted machine. When targeting an entire architecture, the number may be inaccurate.
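A minimal sketch, assuming a C++17 standard library that actually defines these constants in <new> (not all do):

#include <iostream>
#include <new>

int main() {
    // Compile-time hints about cache line size, not runtime measurements.
    std::cout << "destructive:  " << std::hardware_destructive_interference_size << '\n'
              << "constructive: " << std::hardware_constructive_interference_size << '\n';
}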
How to verify CPU cache line size with c++ code?
You cannot do so reliably.
And you should write portable C++ code. Read n3337.
Imagine that you did not enable compiler optimizations in your C++ compiler. And imagine that you run your C++ compiler in some emulator (like these).
On Linux specifically, you could parse the /proc/cpuinfo pseudo file and get the CPU cache line size from it.
For example:
% head -20 /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 8
model name : AMD Ryzen Threadripper 2970WX 24-Core Processor
stepping : 2
microcode : 0x800820b
cpu MHz : 1776.031
cache size : 512 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
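As a minimal, Linux-only sketch of that parsing idea (it assumes an x86 /proc/cpuinfo containing a "clflush size" line, which usually matches the cache line size):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    // Print the first "clflush size : NN" line, e.g. "clflush size : 64".
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("clflush size", 0) == 0) {
            std::cout << line << '\n';
            break;
        }
    }
}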
BTW, there are many different organizations and levels of caches.
You could imagine a C++ application on Linux parsing the output of /proc/cpuinfo and then making HTTP requests (using libcurl) to the Web to learn more about the CPU.
See also this answer.
I was trying to understand the overhead of using std::optional in interface design: whether it's worth using std::optional for a single interface, or a combination of "isNull and getX". The main concern is the overhead of std::optional, so I crafted this simple program to test it.
The code snippet and test results are summarized in this gist https://gist.github.com/shawncao/fa801d300c4a0a2826240fd9f5961db2
Surprisingly, there are two things I don't understand, and I would like to hear from experts:
On Linux with GCC (8 and 9 both tested), I consistently see std::optional do better than the direct condition; this is something I don't understand.
On my Mac laptop, the program executes far faster than on Linux (12 ms vs. >60 ms). I don't see why this is the case (is it the Apple LLVM implementation?); judging from the CPU info, I don't think the Mac is faster, but maybe I missed something.
Any pointers will be appreciated.
Basically, I want to understand the runtime difference between these two platforms; it may be processor-related or compiler-related, I'm not sure. A repro app is attached in the gist. Also, I don't quite understand why std::optional is faster on Linux with GCC.
Code snippet (copied from the gist, as suggested):
inline bool f1ni(int i) {
    return i % 2 == 0;
}

TEST(OptionalTest, TestOptionalPerf) {
    auto f1 = [](int i) -> int {
        return i + 1;
    };
    auto f2 = [](int i) -> std::optional<int> {
        if (i % 2 == 0) {
            return {};
        }
        return i + 1;
    };

    // run 100M cycles
    constexpr auto cycles = 100000000;
    nebula::common::Evidence::Duration duration;

    long sum2 = 0;
    for (int i = 0; i < cycles; ++i) {
        auto x = f2(i);
        if (x) {
            sum2 += x.value();
        }
    }
    LOG(INFO) << fmt::format("optional approach: sum={0}, time={1}", sum2, duration.elapsedMs());

    duration.reset();
    long sum1 = 0;
    for (int i = 0; i < cycles; ++i) {
        if (i % 2 != 0) {
            sum1 += f1(i);
        }
    }
    LOG(INFO) << fmt::format("special value approach: sum={0}, time={1}", sum1, duration.elapsedMs());
    EXPECT_EQ(sum1, sum2);
}
Mac laptop, macOS Mojave
Processor: 2.6 GHz Intel Core i7
/usr/bin/g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
[Test Output]
I0812 09:36:20.678683 348009920 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=13
I0812 09:36:20.692001 348009920 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=12
Linux Ubuntu 18.04 LTS
36 cores; one of them:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping : 4
microcode : 0x100014a
cpu MHz : 1685.329
cache size : 25344 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
/usr/bin/g++ --version
g++ (Ubuntu 9.1.0-2ubuntu2~18.04) 9.1.0
Copyright (C) 2019 Free Software Foundation, Inc.
[Test Output]
I0812 16:42:07.391607 15080 TestCommon.cpp:354] optional approach: sum=2500000050000000, time=59
I0812 16:42:07.473335 15080 TestCommon.cpp:363] special value approach: sum=2500000050000000, time=81
===== build flags =====
[build commands capture for Mac]
/usr/bin/g++ -I/Users/shawncao/nebula/include -I/Users/shawncao/nebula/src -I/usr/local/Cellar/boost/1.70.0/include -I/usr/local/Cellar/folly/2019.08.05.00/include -I/Users/shawncao/nebula/build/cuckoofilter/src/cuckoofilter/src -I/Users/shawncao/nebula/build/fmtproj-prefix/src/fmtproj/include -I/Users/shawncao/nebula/src/service/gen/nebula -isystem /usr/local/include -isystem /Users/shawncao/nebula/build/gflagsp-prefix/src/gflagsp-build/include -isystem /Users/shawncao/nebula/build/glogp-prefix/src/glogp-build -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/include -isystem /Users/shawncao/nebula/build/roaringproj-prefix/src/roaringproj/cpp -isystem /Users/shawncao/nebula/build/yomm2-prefix/src/yomm2/include -isystem /Users/shawncao/nebula/build/xxhash-prefix/src/xxhash -isystem /Users/shawncao/nebula/build/bloom/src/bloom -isystem /usr/local/Cellar/openssl/1.0.2s/include -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -std=gnu++1z -o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -c /Users/shawncao/nebula/src/common/test/TestExts.cpp
[100%] Linking CXX executable CommonTests
/Applications/CMake.app/Contents/bin/cmake -E cmake_link_script CMakeFiles/CommonTests.dir/link.txt --verbose=1
/usr/bin/g++ -Wall -Wextra -Werror -Wno-error=nullability-completeness -Wno-error=sign-compare -Wno-error=unknown-warning-option -O3 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/Users/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a googletest-prefix/src/googletest-build/lib/libgtest.a googletest-prefix/src/googletest-build/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a glogp-prefix/src/glogp-build/libglog.a gflagsp-prefix/src/gflagsp-build/lib/libgflags.a bloom/src/bloom-build/lib/libbf.dylib libcflib.a /usr/local/Cellar/openssl/1.0.2s/lib/libssl.a fmtproj-prefix/src/fmtproj-build/libfmt.a
[build commands capture for Linux]
/usr/bin/g++ -Wall -Wextra -Werror -lstdc++ -Wl,--no-as-needed -no-pie -ldl -lunwind -I/usr/include/ -I/usr/local/include -L/usr/local/lib -L/usr/lib -O3 -s CMakeFiles/CommonTests.dir/src/common/test/TestCommon.cpp.o CMakeFiles/CommonTests.dir/src/common/test/TestExts.cpp.o -o CommonTests -Wl,-rpath,/home/shawncao/nebula/build/bloom/src/bloom-build/lib libNCommon.a yomm2-prefix/src/yomm2-build/src/libyomm2.a /usr/local/lib/libgtest.a /usr/local/lib/libgtest_main.a xxhash-prefix/src/xxhash/libxxhash.a roaringproj-prefix/src/roaringproj-build/src/libroaring.a /usr/local/lib/libglog.a /usr/local/lib/libgflags.a bloom/src/bloom-build/lib/libbf.so libcflib.a /usr/local/lib/libssl.a /usr/local/lib/libcrypto.a fmtproj-prefix/src/fmtproj-build/libfmt.a -lpthread
I want to install pkgconfig from source, because I want it in a particular path to be accessed by my application. However, I am encountering a problem which is probably basic to someone very familiar with C/C++; I have searched for a solution without success. The problem is with the definition of end_time, as this extract from make shows:
In file included from gthread.c:42:0:
gthread-posix.c: In function 'g_cond_timed_wait_posix_impl':
gthread-posix.c:112:19: error: storage size of 'end_time' isn't known
struct timespec end_time;
gthread-posix.c:112:19: warning: unused variable 'end_time' [-Wunused-variable]
Makefile:285: recipe for target 'gthread.lo' failed
Line 112 of "pkgconfig-0.23/glib-1.2.10/gthread/gthread-posix.c", the line at the origin of the error, is shown below (with one line before and a few lines after):
{
  int result;
  struct timespec end_time;
  gboolean timed_out;

  g_return_val_if_fail (cond != NULL, FALSE);
  g_return_val_if_fail (entered_mutex != NULL, FALSE);

  if (!abs_time)
    {
      g_cond_wait (cond, entered_mutex);
      return TRUE;
    }

  end_time.tv_sec = abs_time->tv_sec;
  end_time.tv_nsec = abs_time->tv_usec * (G_NANOSEC / G_MICROSEC);
  g_assert (end_time.tv_nsec < G_NANOSEC);

  result = pthread_cond_timedwait ((pthread_cond_t *) cond,
                                   (pthread_mutex_t *) entered_mutex,
                                   &end_time);

#ifdef HAVE_PTHREAD_COND_TIMEDWAIT_POSIX
  timed_out = (result == ETIMEDOUT);
#else
  timed_out = (result == -1 && errno == EAGAIN);
#endif

  if (!timed_out)
    posix_check_for_error (result);

  return !timed_out;
}
Any assistance will be appreciated.
My OS is Ubuntu 16.04.
Output of cat /proc/cpuinfo
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss nx pdpe1gb rdtscp constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor abm epb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 5202.00
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
I have two tables (arrays) of floats. I need to multiply elements from the first table by the corresponding elements from the second table and store the results in a third table.
I would like to use NEON to parallelize the float multiplications: four multiplications simultaneously instead of one.
I expected significant acceleration, but I achieved only about a 20% reduction in execution time. This is my code:
#include <stdlib.h>
#include <iostream>
#include <arm_neon.h>
const int n = 100; // table size
/* fill a tab with random floats */
void rand_tab(float *t) {
    for (int i = 0; i < n; i++)
        t[i] = (float)rand() / (float)RAND_MAX;
}

/* Multiply elements of two tabs and store results in third tab
   - STANDARD processing. */
void mul_tab_standard(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i++)
        tr[i] = t1[i] * t2[i];
}

/* Multiply elements of two tabs and store results in third tab
   - NEON processing. */
void mul_tab_neon(float *t1, float *t2, float *tr) {
    for (int i = 0; i < n; i += 4)
        vst1q_f32(tr + i, vmulq_f32(vld1q_f32(t1 + i), vld1q_f32(t2 + i)));
}

int main() {
    float t1[n], t2[n], tr[n];
    /* fill tables with random values */
    srand(1); rand_tab(t1); rand_tab(t2);
    // I repeat the table multiplication function 1000000 times for measuring purposes:
    for (int k = 0; k < 1000000; k++)
        mul_tab_standard(t1, t2, tr); // switch to the next line for comparison:
        //mul_tab_neon(t1, t2, tr);
    return 1;
}
I run the following command to compile:
g++ -mfpu=neon -ffast-math neon_test.cpp
My CPU: ARMv7 Processor rev 0 (v7l)
Do you have any ideas how I can achieve more significant speed-up?
Cortex-A8 and Cortex-A9 can do only two SP FP multiplications per cycle, so you may at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: GCC's code generator for ARM is nowhere near as good as for x86.
I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, replace the -mcpu and -mtune values with cortex-a15/a8/a5 accordingly. GCC does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and unroll even more than you usually would), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of GCC which supports them).
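To illustrate the unrolling advice, here is a hedged sketch of the NEON loop processing 16 floats per iteration with four independent multiplies; the function name is invented, and it assumes n is a multiple of 16 (the original n = 100 is not, so a scalar tail loop would be needed in practice):

/* Four independent NEON multiplies per iteration reduce loop overhead
   and give the pipeline independent work. Assumes n % 16 == 0. */
void mul_tab_neon_unrolled(float *t1, float *t2, float *tr, int n) {
    for (int i = 0; i < n; i += 16) {
        vst1q_f32(tr + i,      vmulq_f32(vld1q_f32(t1 + i),      vld1q_f32(t2 + i)));
        vst1q_f32(tr + i + 4,  vmulq_f32(vld1q_f32(t1 + i + 4),  vld1q_f32(t2 + i + 4)));
        vst1q_f32(tr + i + 8,  vmulq_f32(vld1q_f32(t1 + i + 8),  vld1q_f32(t2 + i + 8)));
        vst1q_f32(tr + i + 12, vmulq_f32(vld1q_f32(t1 + i + 12), vld1q_f32(t2 + i + 12)));
    }
}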
One shortcoming of NEON intrinsics is that you can't use post-increment addressing on loads, which shows up as extra address-computation instructions in your NEON implementation.
Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3, and dumped with objdump, this is the loop part of mul_tab_neon:
000000a4 <mul_tab_neon>:
ac: e0805003 add r5, r0, r3
b0: e0814003 add r4, r1, r3
b4: e082c003 add ip, r2, r3
b8: e2833010 add r3, r3, #16
bc: f4650a8f vld1.32 {d16-d17}, [r5]
c0: f4642a8f vld1.32 {d18-d19}, [r4]
c4: e3530e19 cmp r3, #400 ; 0x190
c8: f3400df2 vmul.f32 q8, q8, q9
cc: f44c0a8f vst1.32 {d16-d17}, [ip]
d0: 1afffff5 bne ac <mul_tab_neon+0x8>
and this is the loop part of mul_tab_standard:
00000000 <mul_tab_standard>:
58: ecf01b02 vldmia r0!, {d17}
5c: ecf10b02 vldmia r1!, {d16}
60: f3410db0 vmul.f32 d16, d17, d16
64: ece20b02 vstmia r2!, {d16}
68: e1520003 cmp r2, r3
6c: 1afffff9 bne 58 <mul_tab_standard+0x58>
As you can see, in the standard case the compiler creates a much tighter loop.