Fully Connected Layer (dot product) using AVX - c++

I have the following C++ code to perform the multiply and accumulate steps of a fully connected layer (without the bias). Basically I just do a dot product using a vector (inputs) and a matrix (weights). I used AVX vectors to speed up the operation.
const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();
SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();
const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);
for(size_t n = 0; n < out_neurons; n++){
    float accum = 0.0;
    float temp[4] = {0,0,0,0};
    float *p = temp;
    __m128 in, ws, dp;
    for(size_t i = 0; i < in_neurons; i+=4){
        // gather four strided weights (column n of the weight matrix) into a contiguous buffer
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];
        // load input neurons sequentially
        in = _mm_load_ps(&src[i]);
        // load weights
        ws = _mm_load_ps(p);
        // dot product
        dp = _mm_dp_ps(in, ws, 0xff);
        // accumulator
        accum += dp.m128_f32[0];
    }
    // save the final result
    dst[n] = accum;
}
It works, but the speedup is far from what I expected. As you can see below, a convolutional layer with about 24x more operations than my custom dot product layer takes less time. This makes no sense, and there should be much more room for improvement. What are my major mistakes when trying to use AVX? (I'm new to AVX programming, so I don't yet know where to start looking in order to fully optimize the code.)
**Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32
Input: 28x28x32 = 25K
Weights: (3*3*32)*32 = 9K
Number of MACs: 3*3*27*27*32*32 = 7M
Execution Time on OpenVINO framework: 0.049 ms
**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576
Weights: 3*3*64*512 = 295K
Number of MACs: 295K
Execution Time on OpenVINO framework: 0.197 ms
Thanks for all help in advance!

Addendum: What you are doing is actually a Matrix-Vector-product. It is well-understood how to implement this efficiently (although with caching and instruction-level parallelism it is not completely trivial). The rest of the answer just shows a very simple vectorized implementation.
You can drastically simplify your implementation by incrementing n+=8 and i+=1 (assuming out_neurons is a multiple of 8, otherwise, some special handling needs to be done for the last elements), i.e., you can accumulate 8 dst values at once.
A very simple implementation assuming FMA is available (otherwise use multiplication and addition):
void dot_product(const float* src, const float* scl, float* dst,
                 const int in_neurons, const int out_neurons)
{
    for(size_t n = 0; n < out_neurons; n+=8){
        __m256 accum = _mm256_setzero_ps();
        for(size_t i = 0; i < in_neurons; i++){
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
        }
        // save the result
        _mm256_storeu_ps(dst+n ,accum);
    }
}
This could still be optimized e.g., by accumulating 2, 4, or 8 dst packets inside the inner loop, which would not only save some broadcast operations (the _mm256_set1_ps instruction), but also compensate latencies of the FMA instruction.
Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi
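As a rough illustration of the blocking idea mentioned above, here is a portable sketch (plain C++, no intrinsics) in which each pass over the inputs updates 32 outputs, the equivalent of four 8-float AVX packets. The names fc_blocked, fc_reference and kBlock are made up for this example, the block size is an assumption that would need tuning, and the sketch relies on the compiler auto-vectorizing the fixed-length inner loops; an intrinsics version would keep four __m256 accumulators instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of outputs updated per pass over the inputs (assumed value:
// 32 floats correspond to four 8-float AVX packets).
constexpr std::size_t kBlock = 32;

// weights are laid out as in the question: weights[i * out_neurons + n].
void fc_blocked(const float* src, const float* weights, float* dst,
                std::size_t in_neurons, std::size_t out_neurons)
{
    assert(out_neurons % kBlock == 0);  // no tail handling in this sketch
    for (std::size_t n = 0; n < out_neurons; n += kBlock) {
        float accum[kBlock] = {};       // independent accumulators
        for (std::size_t i = 0; i < in_neurons; ++i) {
            const float x = src[i];     // loaded once, reused kBlock times
            const float* w = &weights[i * out_neurons + n];  // contiguous row slice
            for (std::size_t k = 0; k < kBlock; ++k)
                accum[k] += w[k] * x;
        }
        for (std::size_t k = 0; k < kBlock; ++k)
            dst[n + k] = accum[k];
    }
}

// Straightforward version, used only to check the blocked one.
void fc_reference(const float* src, const float* weights, float* dst,
                  std::size_t in_neurons, std::size_t out_neurons)
{
    for (std::size_t n = 0; n < out_neurons; ++n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < in_neurons; ++i)
            sum += weights[i * out_neurons + n] * src[i];
        dst[n] = sum;
    }
}
```

For the layer in the question (512 inputs, 576 outputs), 576 is a multiple of 32, so this block size would not need any tail handling.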


Accumulating Doubles Into Bins via intrinsics

I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS]; // Observations
int8_t offsets[N_OBS];
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
    acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using simd/avx intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.
Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I’m getting 48ms for 100M random numbers on input; that’s 18 GB/sec RAM read bandwidth, or about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
No SIMD gonna help, I’m afraid. Still, the best version I got is the following. Compile with OpenMP support, the switch depends on your C++ compiler.
void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
{
    // Count of OpenMP threads = CPU cores to use
    constexpr int ompThreadsCount = 4;
    // Use independent set of accumulators per thread, otherwise concurrency gonna corrupt data.
    // Aligning by 64 = cache line, we want to assign cache lines to CPU cores, sharing them is extremely expensive
    alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
    memset( &accumulators, 0, sizeof( accumulators ) );
    // Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
    for( int i = 0; i < ompThreadsCount; i++ )
    {
        // Grab a slice of the output buffer
        double* const acc = &accumulators[ i * 16 ];
        // Compute a slice of the source data for this thread
        const size_t first = i * length / ompThreadsCount;
        const size_t last = ( i + 1 ) * length / ompThreadsCount;
        // Accumulate into thread-local portion of the buffer
        for( size_t i = first; i < last; i++ )
        {
            const int8_t idx = indices[ i ];
            acc[ idx ] += rsi[ i ];
        }
    }
    // Reduce 16*N scalars to 16 with a few AVX instructions
    for( int i = 0; i < 16; i += 4 )
    {
        __m256d v = _mm256_load_pd( &accumulators[ i ] );
        for( int j = 1; j < ompThreadsCount; j++ )
        {
            __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
            v = _mm256_add_pd( v, v2 );
        }
        _mm256_storeu_pd( rdi + i, v );
    }
}
The above version takes 20.5ms, which translates to 88% of the RAM bandwidth limit.
P.S. I have no idea why the optimal threads count is 4 here, I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.
If indeed the offsets do not change for thousands (probably even tens) of times, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product with the matrix in compressed-column-storage format (without explicitly storing the 1-values).
As shown in this answer sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
using sparse_matrix = std::vector<std::vector<int> >;
// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs){
    sparse_matrix res(n_bins);
    // store each observation index in the row of its bin:
    for(int i=0; i<n_obs; ++i) {
        // assert(offsets[i] < n_bins);
        res[offsets[i]].push_back(i);
    }
    return res;
}
void accumulate(double acc[], sparse_matrix const& indexes, double const* obs){
    for(std::size_t row=0; row<indexes.size(); ++row) {
        double sum = 0;
        for(int col : indexes[row]) {
            // you can manually vectorize this using _mm256_i32gather_pd,
            // but clang/gcc should autovectorize this with -ffast-math -O3 -march=native
            sum += obs[col];
        }
        acc[row] = sum;
    }
}
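A more compact variant of the same idea keeps all indices in one contiguous array with per-bin start positions (a CSR-style layout built with a counting sort). This is only a sketch: BinIndex, build_index and the field names are hypothetical, and it assumes the offsets are non-negative and smaller than n_bins.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CSR-style grouping of observation indices by bin (hypothetical names).
struct BinIndex {
    std::vector<std::size_t> start;  // start[b] .. start[b + 1] delimit bin b
    std::vector<std::uint32_t> idx;  // observation indices, grouped by bin
};

// Build once while the offsets stay unchanged.
BinIndex build_index(const std::int8_t* offsets, std::size_t n_obs, int n_bins)
{
    BinIndex r;
    r.start.assign(static_cast<std::size_t>(n_bins) + 1, 0);
    for (std::size_t i = 0; i < n_obs; ++i) {
        assert(offsets[i] >= 0 && offsets[i] < n_bins);
        ++r.start[offsets[i] + 1];               // count entries per bin
    }
    for (int b = 0; b < n_bins; ++b)
        r.start[b + 1] += r.start[b];            // prefix sum gives start positions
    r.idx.resize(n_obs);
    std::vector<std::size_t> next(r.start.begin(), r.start.end() - 1);
    for (std::size_t i = 0; i < n_obs; ++i)
        r.idx[next[offsets[i]]++] = static_cast<std::uint32_t>(i);
    return r;
}

// One running sum per bin; acc[] is written once per bin, not once per observation.
void accumulate(double* acc, const BinIndex& r, const double* obs)
{
    for (std::size_t b = 0; b + 1 < r.start.size(); ++b) {
        double sum = 0.0;
        for (std::size_t k = r.start[b]; k < r.start[b + 1]; ++k)
            sum += obs[r.idx[k]];
        acc[b] = sum;
    }
}
```

The index costs 4 bytes per observation instead of 1, so whether this pays off depends on how often the same offsets are reused.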

What is the optimum OpenCL 2 kernel to sum floats?

C++ 17 introduced a number of new algorithms to support parallel execution, in particular std::reduce is a parallel version of std::accumulate which permits non-deterministic behaviour for non-commutative operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.
Intel have an example here which uses OpenCL 2 work group kernel functions to implement a std::exclusive_scan OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan example:
kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;
  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }
  sum[0] = sum_val;
}
The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.
OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:
kernel void sum_int (global int* sum, global int* values)
{
  atomic_int sum_val;
  atomic_init(&sum_val, 0);
  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    int value = work_group_reduce_add(values[index]);
    atomic_fetch_add(&sum_val, value);
  }
  sum[0] = atomic_load(&sum_val);
}
OpenCL 2 also provides an atomic_float but it doesn't support atomic_fetch_add.
What is the best way to implement an OpenCL2 kernel to sum floats?
kernel void sum_float (global float* sum, global float* values)
{
  float sum_val = 0.0f;
  for (size_t i = 0u; i < get_num_groups(0); ++i)
  {
    size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
    float value = work_group_reduce_add(values[index]);
    sum_val += work_group_broadcast(value, 0u);
  }
  sum[0] = sum_val;
}
This kernel has a race condition on the write to element zero of sum, and every work group performs the same computation, which makes it O(N*N) instead of O(N); it takes more than 1100 milliseconds to sum a 1M-element array.
For the same 1M-element array, this kernel (global=1M, local=256)
kernel void sum_float2 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
  sum[get_group_id(0)] = sum_partial;
}
followed by this one (global=4k, local=256)
kernel void sum_float3 (global float* sum, global float* values)
{
  float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
  values[get_group_id(0)] = sum_partial;
}
does the same thing in a few milliseconds, apart from a third step. The first kernel writes each group's sum into the element indexed by its group id, the second kernel reduces those 4096 partial sums to 16 values, and these 16 values can easily be summed by the CPU (in microseconds or less) as the third step.
Program works like this:
values: 1.0 1.0 .... 1.0 1.0
sum: 256.0 256.0 256.0
values: 65536.0 65536.0 .... 16 items total to be summed by cpu
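The trace above can be reproduced on the CPU with a small stand-in for the two kernel launches. This is only a sketch of the data flow, not OpenCL host code: reduce_groups and sum_three_steps are hypothetical names, and each call to reduce_groups plays the role of one launch in which every work group of 256 items calls work_group_reduce_add.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for one kernel launch: every group of `local` consecutive
// elements is reduced to a single partial sum.
std::vector<float> reduce_groups(const std::vector<float>& in, std::size_t local)
{
    assert(in.size() % local == 0);  // global size is a multiple of the local size
    std::vector<float> out(in.size() / local, 0.0f);
    for (std::size_t g = 0; g < out.size(); ++g)
        for (std::size_t l = 0; l < local; ++l)
            out[g] += in[g * local + l];
    return out;
}

// 1M values -> 4096 partial sums -> 16 partial sums -> final sum on the host.
float sum_three_steps(const std::vector<float>& values)
{
    const std::size_t local = 256;
    const std::vector<float> pass1 = reduce_groups(values, local);  // role of sum_float2
    const std::vector<float> pass2 = reduce_groups(pass1, local);   // role of sum_float3
    float total = 0.0f;
    for (float v : pass2)  // third step, done by the CPU
        total += v;
    return total;
}
```

With 1048576 inputs equal to 1.0, the first pass yields 4096 values of 256.0 and the second yields 16 values of 65536.0, matching the trace.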
If you need to use atomics, you should use them as sparingly as possible. The easiest example is to use local atomics to sum many values within each group, and then do the last step with a single global atomic operation per group to add everything up. I don't have a C++ setup ready for OpenCL right now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices with the same memory resource (probably streaming mode or in SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on the same area at the same time, then I suppose these new atomics can only be a micro-optimization on top of already working OpenCL 1.2 atomics. I haven't used these new atomics, so take all of this with a grain of salt.

NEON increasing run time

I am currently trying to optimize some of my image processing code to use NEON instructions.
Let's say I have two very large float arrays and I want to multiply each value of the first one with three consecutive values of the second one. (The second one is three times as large.)
float* l_ptrGauss_pf32 = [...];
float* l_ptrLaplace_pf32 = [...]; // Three times as large
for (uint64_t k = 0; k < l_numPixels_ui64; ++k)
{
    float l_weight_f32 = *l_ptrGauss_pf32;
    *l_ptrLaplace_pf32++ *= l_weight_f32;
    *l_ptrLaplace_pf32++ *= l_weight_f32;
    *l_ptrLaplace_pf32++ *= l_weight_f32;
    ++l_ptrGauss_pf32;
}
So when I replace the above code with NEON intrinsics, the run time is about 10% longer.
float32x4_t l_gaussElem_f32x4;
float32x4_t l_laplElem1_f32x4;
float32x4_t l_laplElem2_f32x4;
float32x4_t l_laplElem3_f32x4;
for( uint64_t k=0; k<(l_lastPixelInBlock_ui64/4); ++k)
{
    l_gaussElem_f32x4 = vld1q_f32(l_ptrGauss_pf32);
    l_laplElem1_f32x4 = vld1q_f32(l_ptrLaplace_pf32);
    l_laplElem2_f32x4 = vld1q_f32(l_ptrLaplace_pf32+4);
    l_laplElem3_f32x4 = vld1q_f32(l_ptrLaplace_pf32+8);
    l_laplElem1_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem1_f32x4);
    l_laplElem2_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem2_f32x4);
    l_laplElem3_f32x4 = vmulq_f32(l_gaussElem_f32x4, l_laplElem3_f32x4);
    vst1q_f32(l_ptrLaplace_pf32, l_laplElem1_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+4, l_laplElem2_f32x4);
    vst1q_f32(l_ptrLaplace_pf32+8, l_laplElem3_f32x4);
    l_ptrLaplace_pf32 += 12;
    l_ptrGauss_pf32 += 4;
}
Both versions are compiled with -Ofast using Apple LLVM 8.0. Is the compiler really so good at optimizing this code even without NEON intrinsics?
Your code contains relatively many vector loads and only a few multiplications, so I would recommend optimizing the loads. There are two steps:
Use aligned memory in your arrays.
Use prefetch.
In order to do this, I would recommend using the following function:
inline float32x4_t Load(const float * p)
{
    // use prefetch:
    __builtin_prefetch(p + 256);
    // tell compiler that address is aligned:
    const float * _p = (const float *)__builtin_assume_aligned(p, 16);
    return vld1q_f32(_p);
}
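To show the two hints outside of NEON code, here is a scalar sketch of the question's loop that uses the same GCC/Clang builtins on 16-byte aligned buffers. The buffer names, their sizes and the prefetch distance are assumptions made for the example; whether prefetching helps at all has to be measured on the target device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16-byte aligned example buffers (hypothetical sizes). Heap buffers would
// need an aligned allocator such as posix_memalign instead.
alignas(16) static float g_weights[64];
alignas(16) static float g_values[3 * 64];

// Multiplies each group of three consecutive values by one weight.
void scale_triples(const float* weights, float* values, std::size_t count)
{
    // The alignment promise below is undefined behaviour if it is false.
    assert(reinterpret_cast<std::uintptr_t>(weights) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(values) % 16 == 0);
    const float* w = static_cast<const float*>(__builtin_assume_aligned(weights, 16));
    float* v = static_cast<float*>(__builtin_assume_aligned(values, 16));
    const std::size_t ahead = 16;  // prefetch distance in weights, a guess to be tuned
    for (std::size_t k = 0; k < count; ++k) {
        if (k + ahead < count)                        // stay inside the buffer
            __builtin_prefetch(v + 3 * (k + ahead));  // issued every iteration for simplicity
        const float weight = w[k];
        v[3 * k + 0] *= weight;
        v[3 * k + 1] *= weight;
        v[3 * k + 2] *= weight;
    }
}
```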

How to fast calculate the normalized l1 and l2 norm of a vector in C++?

I have a matrix X that has n column data vectors in d dimensional space.
Given a vector xj, v[j] is its l1 norm (the summation of all abs(xji)), w[j] is the square of its l2 norm (the summation of all xji^2), and pj[i] is the combination of entries divided by l1 and l2 norm. Finally, I need the outputs: pj, v, w for subsequent applications.
// X = new double [d*n]; is the input.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
    unsigned long jd = j*d;
    for (unsigned long i=0; i<d; ++i) {
        x_abs[i] = std::abs(X[i+jd]);
        v[j] += x_abs[i];
        x_2[i] = x_abs[i]*x_abs[i];
        w[j] += x_2[i];
    }
    for (unsigned long i=0; i<d; ++i){
        pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];
    }
    // functionA(pj){ ... ...} for subsequent applications
}
// functionB(v, w){ ... ...} for subsequent applications
My algorithm above takes O(nd) flops. Can anyone help me speed it up, either with built-in functions or with a new implementation in C++? Reducing the constant factor in O(nd) would also be very helpful.
Let me guess: since you have performance problems, the dimension of your vectors is quite large. If this is the case, then it is worth considering CPU cache locality; there is some interesting info on this in a cppcon14 presentation.
If the data is not available in the CPU caches, then the cost of abs-ing or squaring it once it is available is dwarfed by the time the CPU spends just waiting for the data.
With this in mind, you may want to try the following solution (with no warranty that it will improve performance: the compiler may already apply these techniques when optimizing the code)
for (unsigned long j=0; j<n; ++j) {
    // use pointer arithmetic - at > -O0 the compiler will do it anyway
    double *start=X+j*d, *end=X+(j+1)*d;
    // this part avoids, as much as possible, the competition
    // on CPU caches between X and v/w.
    // Don't store the norms in v/w as yet, keep them in registers
    double l1norm=0, l2norm=0;
    for(double *src=start; src!=end; src++) {
        double val=*src;
        l1norm+= std::abs(val);
        l2norm+= val*val;
    }
    double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
    for(double *src=start, *dst=pj; src!=end; src++, dst++) {
        // Yes, recomputing abs/sqr may actually save time by not
        // creating competition on CPU caches with x_abs and x_2
        double val=*src;
        *dst = pl1*std::abs(val) + pl2*val*val;
    }
    // functionA(pj){ ... ...} for subsequent applications
    // Think well if you really need v/w. If you really do,
    // at least there are two values to be sent for storage into memory,
    // meanwhile the CPU can actually load the next vector into cache
    v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
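For checking results, the two-pass scheme can be written as a small self-contained function. This is a sketch under assumptions: normalized_norms is a hypothetical name, and it stores every pj in an output matrix P of the same shape as X, whereas the original reuses a single pj buffer that is consumed by functionA inside the loop. It also assumes that no column is all zeros.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// X holds n column vectors of dimension d, stored contiguously: X[i + j*d].
// For column j: v[j] = sum |x_i|, w[j] = sum x_i^2, and
// P[i + j*d] = alpha*|x_i|/v[j] + (1 - alpha)*x_i^2/w[j].
void normalized_norms(const double* X, std::size_t d, std::size_t n, double alpha,
                      double* P, double* v, double* w)
{
    for (std::size_t j = 0; j < n; ++j) {
        const double* x = X + j * d;
        double l1 = 0.0, l2 = 0.0;       // kept in registers during the first pass
        for (std::size_t i = 0; i < d; ++i) {
            const double a = std::fabs(x[i]);
            l1 += a;
            l2 += a * a;
        }
        assert(l1 > 0.0);                // an all-zero column would divide by zero
        const double s1 = alpha / l1;    // two divisions per column, not per entry
        const double s2 = (1.0 - alpha) / l2;
        double* p = P + j * d;
        for (std::size_t i = 0; i < d; ++i) {
            const double a = std::fabs(x[i]);
            p[i] = s1 * a + s2 * a * a;
        }
        v[j] = l1;
        w[j] = l2;
    }
}
```

A quick sanity check is that the entries of each pj sum to 1, because the two terms sum to alpha and 1 - alpha respectively.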

How to parallelize this for loop for rapidly converting YUV422 to RGB888?

I am using v4l2 api to grab images from a Microsoft Lifecam and then transferring these images over TCP to a remote computer. I am also encoding the video frames into a MPEG2VIDEO using ffmpeg API. These recorded videos play too fast which is probably because not enough frames have been captured and due to incorrect FPS settings.
The following is the code which converts a YUV422 source to a RGB888 image. This code fragment is the bottleneck in my code as it takes nearly 100 - 150 ms to execute which means I can't log more than 6 - 10 FPS at 1280 x 720 resolution. The CPU usage is 100% as well.
for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        *dst++ = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
        *dst++ = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
        *dst++ = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
        vid_frame->data[0][line * frame->linesize[0] + column] = *py;
        // increment py, pu, pv here
    }
}
'dst' is then compressed as jpeg and sent over TCP and 'vid_frame' is saved to the disk.
How can I make this code fragment faster so that I can get at least 30 FPS at 1280x720 resolution, compared to the present 5-6 FPS?
I've tried parallelizing the for loop across three threads using pthreads, processing one third of the rows in each thread.
for (int line = 0; line < image_height/3; line++) // thread 1
for (int line = image_height/3; line < 2*image_height/3; line++) // thread 2
for (int line = 2*image_height/3; line < image_height; line++) // thread 3
This gave me only a minor improvement of 20-30 milliseconds per frame.
What would be the best way to parallelize such loops? Can I use GPU computing or something like OpenMP? Say, spawning some 100 threads to do the calculations?
I also noticed higher frame rates with my laptop webcam as compared to the Microsoft USB Lifecam.
Here are other details:
Ubuntu 12.04, ffmpeg 2.6
AMD A8 quad core processor with 6GB RAM
Encoder settings:
bitrate: 4000000
time_base: (AVRational){1, 20}
pix_fmt: AV_PIX_FMT_YUV420P
gop: 10
max_b_frames: 1
If all you care about is fps and not ms per frame (latency), another option would be a separate thread per frame.
Threading is not the only option for speed improvements. You could also perform integer operations instead of floating point, and SIMD is an option. Using an existing library function like sws_scale will probably give you the best performance.
Make sure you are compiling with -O3 (or -Os).
Make sure debug symbols are disabled.
Move repeated operations outside the loop e.g.
// the compiler may be unable to hoist this itself, since the store below could alias frame->linesize[0]
int row = line * frame->linesize[0];
for (int column = 0; column < image_width; column++) {
    vid_frame->data[0][row + column] = *py;
}
You can precompute tables, so there is no math in the loop:
void init() {
    // table for the R channel, indexed by [y][v]
    for(int y = 0; y <= 255 ; ++y)
        for(int v = 0; v <= 255 ; ++v)
            rtable[y][v] = CLAMP(y + 1.402*(v - 128.0));
}
for (int column = 0; column < image_width; column++) {
    *dst++ = rtable[*py][*pv];
}
Just to name a few options.
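A minimal sketch of the lookup-table option for the red channel, using the coefficient from the question. The names r_table, init_r_table, red and clamp_u8 are hypothetical, and clamp_u8 is only a stand-in for the question's CLAMP macro, whose definition is not shown; here it rounds to nearest.

```cpp
#include <cassert>
#include <cstdint>

static std::uint8_t r_table[256][256];  // indexed [y][v], 64 KiB
static bool r_table_ready = false;

// Stand-in for CLAMP: saturate to 0..255 and round to nearest.
inline std::uint8_t clamp_u8(double x)
{
    if (x <= 0.0) return 0;
    if (x >= 255.0) return 255;
    return static_cast<std::uint8_t>(x + 0.5);
}

// Fill the table once at start-up.
void init_r_table()
{
    for (int y = 0; y < 256; ++y)
        for (int v = 0; v < 256; ++v)
            r_table[y][v] = clamp_u8(y + 1.402 * (v - 128.0));
    r_table_ready = true;
}

// Per-pixel work is now a single table lookup.
inline std::uint8_t red(std::uint8_t y, std::uint8_t v)
{
    assert(r_table_ready);  // looking up before init_r_table() is a bug
    return r_table[y][v];
}
```

Blue works the same way with a [y][u] table. Green depends on all three inputs, so one approach would be a [u][v] table for the combined chroma term, followed by an add and a clamp.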
I think that unless you want to reinvent the painful wheel, using pre-existing options (ffmpeg's libswscale or ffmpeg's scale filter, gstreamer's scale plugin, etc.) is a much better option.
But if you want to reinvent the wheel for whatever reason, show the code you used. For example, thread startup is expensive, so you'd want to create the threads before measuring your looptime and reuse threads from frame-to-frame. Better yet is frame-threading, but that adds latency. This is usually ok but depends on your use case. More importantly, don't write C code, learn to write x86 assembly (simd), all previously mentioned libraries use simd for such conversions, and that'll give you a 3-4x speedup (since it allows you to do 4-8 pixels instead of 1 per iteration).
You could build blocks of x lines and convert each block in a separate thread
do not mix integer and floating point arithmetic!
char x;
char y=((double)x*1.5); /* ouch casting double<->int is slow! */
char z=(x*3)>>1; /* fixed point arithmetic rulez */
use SIMD (though this would be easier if both input and output data were properly aligned...e.g. by using RGB8888 as output)
use openMP
an alternative that does not require any coding of the processing, would be to simply do your entire processing using a framework that does proper timestamping throughout the pipeline (starting at image acquisition time), and is hopefully optimized enough to deal with big data. e.g. gstreamer
Would something like this not work?
#pragma omp parallel for
for (int line = 0; line < image_height; line++) {
    for (int column = 0; column < image_width; column++) {
        dst[ ( image_width*line + column )*3 ] = CLAMP((double)*py + 1.402*((double)*pv - 128.0)); // R - first byte
        dst[ ( image_width*line + column )*3 + 1] = CLAMP((double)*py - 0.344*((double)*pu - 128.0) - 0.714*((double)*pv - 128.0)); // G - next byte
        dst[ ( image_width*line + column )*3 + 2] = CLAMP((double)*py + 1.772*((double)*pu - 128.0)); // B - next byte
        vid_frame->data[0][line * frame->linesize[0] + column] = *py;
        // increment py, pu, pv here
    }
}
Of course, you also have to handle the incrementing of py, pu, pv accordingly.
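One way to handle the pointers is to derive them from the line index, so that every iteration of the outer loop is independent. The sketch below assumes a packed YUYV source (Y0 U Y1 V for each pair of pixels), which is an assumption about the camera format and not something stated in the question; yuyv_to_rgb and clamp_u8 are hypothetical names, with clamp_u8 standing in for CLAMP.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for CLAMP: saturate to 0..255 and round to nearest.
inline std::uint8_t clamp_u8(double x)
{
    if (x <= 0.0) return 0;
    if (x >= 255.0) return 255;
    return static_cast<std::uint8_t>(x + 0.5);
}

// src: packed YUYV, 2 bytes per pixel. dst: RGB888, 3 bytes per pixel.
void yuyv_to_rgb(const std::uint8_t* src, std::uint8_t* dst, int width, int height)
{
    assert(width % 2 == 0);  // pixels come in pairs that share U and V
    #pragma omp parallel for
    for (int line = 0; line < height; ++line) {
        // Row pointers computed from `line`: no state is carried between rows.
        const std::uint8_t* s = src + static_cast<std::size_t>(line) * width * 2;
        std::uint8_t* d = dst + static_cast<std::size_t>(line) * width * 3;
        for (int col = 0; col < width; col += 2, s += 4) {
            const double u = s[1] - 128.0;
            const double v = s[3] - 128.0;
            for (int k = 0; k < 2; ++k) {
                const double y = s[2 * k];
                *d++ = clamp_u8(y + 1.402 * v);              // R
                *d++ = clamp_u8(y - 0.344 * u - 0.714 * v);  // G
                *d++ = clamp_u8(y + 1.772 * u);              // B
            }
        }
    }
}
```

Without OpenMP enabled at compile time (for example -fopenmp with GCC or Clang), the pragma is ignored and the loop simply runs serially.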
Pixel format conversion is usually performed using only integer variables.
This avoids conversions between floating-point and integer values.
It also allows more effective use of the SIMD extensions of modern CPUs.
For example, this is code for converting YUV to BGR:
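const int Y_ADJUST = 16;
const int UV_ADJUST = 128;
// The next two definitions are missing from the excerpt; the values are assumed.
const int YUV_TO_BGR_AVERAGING_SHIFT = 13;
const int YUV_TO_BGR_ROUND_TERM = 1 << (YUV_TO_BGR_AVERAGING_SHIFT - 1);
const int Y_TO_RGB_WEIGHT = int(1.164*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_BLUE_WEIGHT = int(2.018*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int U_TO_GREEN_WEIGHT = -int(0.391*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_GREEN_WEIGHT = -int(0.813*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
const int V_TO_RED_WEIGHT = int(1.596*(1 << YUV_TO_BGR_AVERAGING_SHIFT) + 0.5);
inline int RestrictRange(int value, int min = 0, int max = 255)
{
    return value < min ? min : (value > max ? max : value);
}
inline int YuvToBlue(int y, int u)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + U_TO_BLUE_WEIGHT*(u - UV_ADJUST) +
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToGreen(int y, int u, int v)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + U_TO_GREEN_WEIGHT*(u - UV_ADJUST) +
        V_TO_GREEN_WEIGHT*(v - UV_ADJUST) + YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
inline int YuvToRed(int y, int v)
{
    return RestrictRange((Y_TO_RGB_WEIGHT*(y - Y_ADJUST) + V_TO_RED_WEIGHT*(v - UV_ADJUST) +
        YUV_TO_BGR_ROUND_TERM) >> YUV_TO_BGR_AVERAGING_SHIFT);
}
This code is taken from here (http://simd.sourceforge.net/). That library also contains versions optimized for different SIMD instruction sets.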
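The same integer technique can be applied to the full-range coefficients used in the question (1.402, 0.344, 0.714, 1.772). The following is a sketch, not code from any library: the 16-bit fixed-point scale, the names and the Rgb struct are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kShift = 16;                // fractional bits (assumed choice)
constexpr int kHalf = 1 << (kShift - 1);  // rounding term
constexpr int to_fixed(double c) { return static_cast<int>(c * (1 << kShift) + 0.5); }
constexpr int kVtoR = to_fixed(1.402);
constexpr int kUtoG = to_fixed(0.344);
constexpr int kVtoG = to_fixed(0.714);
constexpr int kUtoB = to_fixed(1.772);

struct Rgb { std::uint8_t r, g, b; };

inline std::uint8_t clamp_u8(int x)
{
    return static_cast<std::uint8_t>(x < 0 ? 0 : (x > 255 ? 255 : x));
}

// Integer-only conversion of one pixel; inputs must be in 0..255.
// Relies on >> being an arithmetic shift for negative values, which is
// guaranteed since C++20 and is what mainstream compilers did before.
inline Rgb yuv_to_rgb_fixed(int y, int u, int v)
{
    assert(y >= 0 && y <= 255 && u >= 0 && u <= 255 && v >= 0 && v <= 255);
    u -= 128;
    v -= 128;
    const int base = (y << kShift) + kHalf;  // y in fixed point, plus rounding
    return Rgb{ clamp_u8((base + kVtoR * v) >> kShift),
                clamp_u8((base - kUtoG * u - kVtoG * v) >> kShift),
                clamp_u8((base + kUtoB * u) >> kShift) };
}
```

All intermediate values stay well below 2^31, so 32-bit int is sufficient, and the result should agree with the floating-point formula to within one unit per channel.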