I am currently writing a physical simulation (to be more precise, solving a stochastic differential equation) and I need to parallelize it.
Now this could be achieved with MPI, and I think I will have to do that at some point, but for now I want to utilize all 8 cores of my local machine. A normal run takes 2 - 17 hours for one parameter set. I therefore want to use multithreading; specifically, the following function should be executed in parallel. It essentially solves the same SDE Nrep times for Nsteps timesteps. The results are averaged and stored, for each thread, into a separate row of an Nthreads x Nsteps array JpmArr.
double **JpmArr;

void worker(const dtype gamma, const dtype dt, const uint seed, const uint Nsteps, const uint Nrep,
            const ESpMatD& Jplus, const ESpMatD& Jminus, const ESpMatD& Jz, const uint tId)
{
    dtype dW(0), stdDev( sqrt(dt) );
    std::normal_distribution<> WienerDistr(0, stdDev);

    //create the array for the values of <t|J+J-|t>
    dtype* JpmExpect = JpmArr[tId];

    //execute Nrep repetitions of the experiment
    for (uint r(0); r < Nrep; ++r) {
        //reinitialize the wave function
        psiVecArr[tId] = globalIstate;
        //<t|J+J-|t>
        tmpVecArr[tId] = Jminus * psiVecArr[tId];
        JpmExpect[0] += tmpVecArr[tId].squaredNorm();

        //iterate over the timesteps
        for (uint s(1); s < Nsteps; ++s) {
            //get a random number
            dW = WienerDistr(RNGarr[tId]);

            //execute one step of the RK-s.o. 1 scheme
            tmpPsiVecArr[tId] = F2(gamma, std::ref(Jminus), std::ref(psiVecArr[tId]));
            tmpVecArr[tId] = psiVecArr[tId] + tmpPsiVecArr[tId] * sqrt(dt);
            psiVecArr[tId] = psiVecArr[tId]
                + F1(gamma, std::ref(Jminus), std::ref(Jplus), std::ref(psiVecArr[tId])) * dt
                + tmpPsiVecArr[tId] * dW
                + 0.5 * (F2(gamma, std::ref(Jminus), std::ref(tmpVecArr[tId]))
                         - F2(gamma, std::ref(Jminus), std::ref(psiVecArr[tId]))) * (dW * dW - dt) / sqrt(dt);

            //normalise
            psiVecArr[tId].normalize();

            //compute <t|J+J-|t>
            tmpVecArr[tId] = Jminus * psiVecArr[tId];
            JpmExpect[s] += tmpVecArr[tId].squaredNorm();
        }
    }

    //average over the repetitions
    for (uint j(0); j < Nsteps; ++j) {
        JpmExpect[j] /= Nrep;
    }
}
I am using Eigen as a library for linear algebra, thus:
typedef Eigen::SparseMatrix<dtype, Eigen::RowMajor> ESpMatD;
typedef Eigen::Matrix<dtype, Eigen::Dynamic, Eigen::RowMajor> VectorXdrm;
are used as types. The above worker function calls:
VectorXdrm& F1(const dtype a, const ESpMatD& A, const ESpMatD& B, const VectorXdrm& v)
{
    z.setZero(v.size());
    y.setZero(v.size());

    // z is for simplification
    z = A * v;
    //scalar intermediate value c = <v, Av>
    dtype c = v.dot(z);
    y = a * (2.0 * c * z - B * z - c * c * v);
    return y;
}
VectorXdrm& F2(const dtype a, const ESpMatD& A, const VectorXdrm& v)
{
    //zero the data
    z.setZero(v.size());
    y.setZero(v.size());

    z = A * v;
    dtype c = v.dot(z);
    y = sqrt(2.0 * a) * (z - c * v);
    return y;
}
where the vectors z, y are of type VectorXdrm and are declared in the same file (module-global).
All the arrays (RNGarr, JpmArr, tmpPsiVecArr, tmpVecArr, psiVecArr) are initialized in main() (via extern declarations in main.cpp). After that setup is done, I run the worker function using std::async, wait for all tasks to finish, and then collect the data from JpmArr into a single array in main() and write it to a file.
Problem:
The results are nonsense if I use std::launch::async.
If I use std::launch::deferred the computed and averaged results match (as far as the numerical method permits) the results I obtain by analytic means.
I have no idea anymore where stuff fails. I used to use Armadillo for linear algebra, but its normalize routine delivered NaNs, so I switched to Eigen, whose documentation hints at it being usable with multiple threads - it still fails.
Having not worked with threads before, I have spent four days now trying to get this working and reading up on things. That reading led me to use the global arrays RNGarr, JpmArr, tmpPsiVecArr, tmpVecArr, psiVecArr (before that I simply created the appropriate arrays inside worker and passed the results back to main via a struct workerResult), as well as to use std::ref() to pass the matrices Jplus, Jminus, Jz to the function (the last is omitted from the function shown above, for brevity).
But the results I get are still wrong, and I have no idea anymore what is wrong or what I should do to get the right results.
Any input and/or pointers to examples of solutions of such (threading) problems or references will be greatly appreciated.
There is clearly some interaction between the calculations in each thread -- either because your global data is being updated by multiple threads, or because some of the structures passed by reference are mutating while the threads run. z and y cannot be globals if they are updated by multiple threads -- but there may be many other problems.
I would suggest you refactor the code as follows:
Make it object oriented -- define a class that is self-contained. Take all the parameters that are currently given to worker and make them members of the class.
If you are not sure whether a data structure is mutated, do not pass it by reference. If in doubt, assume the worst and make a complete copy inside the class.
In the cases where threads really do have to update shared structures (you should have none in your use case), you will need to protect reads and writes with a mutex for exclusive access.
Remove all globals -- instead of having a global JpmArr, z and y, define the data that is needed inside the class.
Make worker, F1 and F2 member functions of your class.
Then, in main, create an instance of your new class for each thread you need and start each of them as a thread -- making sure that you wait for a thread to complete before you read any of its data. Each thread will have its own stack and its own data inside its class instance, so interference between the parallel calculations becomes much less likely. A minimal sketch of this structure is given below.
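As a rough, self-contained illustration of that structure (the class name, members, and the stand-in computation are all invented for the example; the real SDE stepping, F1 and F2 would become private member functions operating only on members):

#include <cstddef>
#include <random>
#include <thread>
#include <vector>

// Each worker owns everything it touches: its RNG, its scratch data and its
// output row, so concurrent workers cannot interfere with one another.
class Worker {
public:
    Worker(unsigned seed, std::size_t nsteps) : rng_(seed), result_(nsteps, 0.0) {}

    void run() {
        std::normal_distribution<double> dist(0.0, 1.0);
        for (double& r : result_)
            r = dist(rng_);            // stand-in for the actual SDE step
    }

    const std::vector<double>& result() const { return result_; }

private:
    std::mt19937 rng_;                 // one generator per worker, never shared
    std::vector<double> result_;       // this worker's own row of results
};

int main() {
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 8;

    std::vector<Worker> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back(1234u + t, 1000);

    std::vector<std::thread> threads;
    for (auto& w : workers)
        threads.emplace_back(&Worker::run, &w);
    for (auto& t : threads)
        t.join();                      // wait before reading any results

    // collect workers[i].result() into a single array and write it to file
}

Once each worker is this self-contained, std::async or std::thread both work; the important part is that nothing outside the object is written to while the threads run.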
If you want to optimize further, you will need to consider each matrix calculation as a job, create a pool of threads matching the number of cores, and let each thread pick up jobs sequentially. That reduces context-switching overhead and the CPU L1/L2 cache misses that occur when the number of threads becomes much larger than the number of cores -- however, this is a lot more elaborate than what you need for your immediate problem.
Stop using globals. It's bad style anyway, and here multiple threads will be zeroing and mutating z and y at the same time.
The simplest fix is to replace your globals with local variables in the worker function - so each concurrent call to worker has its own copy - and pass them into F1 and F2 by reference.
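A sketch of what that change looks like, using the question's typedefs (dtype assumed to be double here, and the vector type written as a plain dynamic column vector):

#include <cmath>
#include <Eigen/Sparse>

using dtype      = double;
using ESpMatD    = Eigen::SparseMatrix<dtype, Eigen::RowMajor>;
using VectorXdrm = Eigen::Matrix<dtype, Eigen::Dynamic, 1>;

// F2 now receives its scratch/result vectors from the caller, so each thread's
// call works on that thread's own storage instead of on shared globals.
// F1 would change in exactly the same way.
VectorXdrm& F2(const dtype a, const ESpMatD& A, const VectorXdrm& v,
               VectorXdrm& z, VectorXdrm& y)
{
    z = A * v;
    const dtype c = v.dot(z);
    y = std::sqrt(2.0 * a) * (z - c * v);
    return y;
}

// Inside worker(): declare the scratch vectors locally and pass them down, e.g.
//   VectorXdrm z, y;
//   tmpPsiVecArr[tId] = F2(gamma, Jminus, psiVecArr[tId], z, y);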
I'd like to implement an extended Kalman filter in C++ using the Eigen library. I'm interested in robotics, this seems like a good exercise to get better at C++, and it seems like a fun project. I was hoping I could post my code to get some feedback on writing classes and on what my next steps should be from here. I got the equations from an online class.
What I have so far is below. I've hardcoded a state vector of size 2x1 and an array of measurements as a test, but I would like to change it so I can declare a state vector of any size, and I'll move the array of measurements to a main.cpp file. I did this just so I can simply declare an object of this class and quickly test out the functions, and everything seems to be working so far. What I'm thinking of doing next is to make another class that takes measurements from some source and converts them into Eigen matrices to pass on to this Kalman filter class. The main questions I have are:
Should I have the measurement update and state prediction as two different functions? Does it really matter? I did it that way in the first place because I thought it was easier to read.
Should I set the size of things like the state vector in the class constructor or is it better to have something like an initializer function for that?
I read that it's better practice to have class members that are matrices actually be pointers to the matrices, because it makes the class lighter. What does this mean? Is that important if I want to run this on a PC vs. something like a Raspberry Pi?
In the measurementUpdate function, should y, S, and K be class members? That would make the class larger, but then I wouldn't be constructing and destroying the Eigen objects every time the program loops. Is that good practice?
Should there be a class member that takes the measurement inputs or is it better to just pass a value to the measurement update function? Does it matter?
Is it even worth it to try and implement a class for this or is it better to just have a single function that implements the filter?
I was thinking of implementing some getter functions so I can check the state variable and covariance matrix; is it better to just make those members public and not have the getter functions?
Apologies if this is posted in the wrong place and if these are super basic questions; I'm pretty new to this stuff. Thanks for all the help, all advice is appreciated.
header:
#include "eigen3/Eigen/Dense"
#include <iostream>
#include <vector>
class EKF {
public:
    EKF();
    void filter(Eigen::MatrixXd Z);

private:
    void measurementUpdate(Eigen::MatrixXd Z);
    void statePrediction();

    Eigen::MatrixXd P_; //Initial uncertainty
    Eigen::MatrixXd F_; //Linearized state approximation function
    Eigen::MatrixXd H_; //Jacobian of the linearized measurement function
    Eigen::MatrixXd R_; //Measurement uncertainty
    Eigen::MatrixXd I_; //Identity matrix
    Eigen::MatrixXd u_; //Mean of state function
    Eigen::MatrixXd x_; //Matrix of initial state variables
};
source:
EKF::EKF() {
    double meas[5] = {1.0, 2.1, 1.6, 3.1, 2.4};
    x_.resize(2, 1);
    P_.resize(2, 2);
    u_.resize(2, 1);
    F_.resize(2, 2);
    H_.resize(1, 2);
    R_.resize(1, 1);
    I_.resize(2, 2);

    Eigen::MatrixXd Z(1, 1);
    for (int i = 0; i < 5; i++) {
        Z << meas[i];
        measurementUpdate(Z);
        //statePrediction();
    }
}
void EKF::measurementUpdate(Eigen::MatrixXd Z) {
    //Calculate the measurement residual
    Eigen::MatrixXd y = Z - (H_ * x_);
    Eigen::MatrixXd S = H_ * P_ * H_.transpose() + R_;
    Eigen::MatrixXd K = P_ * H_.transpose() * S.inverse();
    //Calculate the posterior state vector and covariance matrix
    x_ = x_ + (K * y);
    P_ = (I_ - (K * H_)) * P_;
}

void EKF::statePrediction() {
    //Predict the next state vector
    x_ = (F_ * x_) + u_;
    P_ = F_ * P_ * F_.transpose();
}

void EKF::filter(Eigen::MatrixXd Z) {
    measurementUpdate(Z);
    statePrediction();
}
One thing to consider, which will affect the answers to your questions, is how 'generic' a filter you want to make.
There is no restriction in a Kalman filter that the sampling rate of the measurements be constant, nor that you get all the measurements every time. The only restriction is that the measurements appear in increasing time order. If you want to support this, then your measurement function will be passed arrays of variable size, the H and R matrices will also be of variable size, and moreover the F and Q matrices (though of constant shape) will need to know the time step between updates -- in particular you will need a function to compute Q.
As an example of what I mean, consider some sort of survey boat that has a dgps sensor that gives a position every second, a gyro compass that gives the ship's heading twice a second, and an rgps system that gives the range and bearing to a towed buoy every two seconds. In this case we could get measurements like this:
at 0.0 gyro and dgps
at 0.5 gyro
at 1.0 gyro and dgps
at 1.5 gyro
at 2.0 gyro, dgps and rgps
and so on. So we get different numbers of observations at different times.
On a different topic: I've always found it useful to have a way of seeing how well the filter is doing. Somewhat surprisingly, the state covariance matrix is not a way of seeing this. In the linear (as opposed to extended) filter the state covariance could be computed for all times before you see any data! This is not true for the extended case, as the state covariance depends on the states through the measurement Jacobian, but that is a very weak dependence on the observations.

I think the most useful quality measures are those based on the measurements. Easy ones to compute are the 'innovations' -- the difference between the measured values and the values calculated using the predicted state -- and the residuals -- the difference between the measured values and the values calculated using the updated state. Each of these, over time, should have mean 0. If you want to get fancier there are the normalised residuals. If ita is the innovations vector, the normalised residuals are
T = inv(S)
u = T*ita
nr[i] = u[i]/sqrt( T[i][i])
The nice thing about the normalised residuals is that each (over time) should have mean 0 but also sd 1 -- if the filter is correctly tuned.
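As a rough illustration only (not part of the original answer; Eigen types chosen to stay consistent with the question), that check could be computed like this:

#include <cmath>
#include <Eigen/Dense>

// Normalised residuals as described above: nr[i] = u[i] / sqrt(T[i][i]),
// with T = inv(S) and u = T * ita, where ita is the innovations vector
// and S its covariance.
Eigen::VectorXd normalisedResiduals(const Eigen::VectorXd& ita,
                                    const Eigen::MatrixXd& S) {
    const Eigen::MatrixXd T = S.inverse();
    const Eigen::VectorXd u = T * ita;
    Eigen::VectorXd nr(ita.size());
    for (int i = 0; i < ita.size(); ++i)
        nr(i) = u(i) / std::sqrt(T(i, i));
    return nr;
}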
I'm currently writing a CUDA kernel for a custom operation (an activation) for PyTorch, but I'm quite unfamiliar with any form of GPU programming. For reference, I was following the Custom C++ & CUDA extension tutorial.
A simplified example of the sort of operation I want to do:
Let's say I have an input tensor, X_in, which can be of any shape and dims (e.g. something like (16, 3, 50, 100) ). Let's say I also have a 1D tensor, T (for example, T can be a tensor of shape (100,) ).
For each value in X_in, I want to calculate an "index" value that should be < len(T). The output would then basically be the value at that index in T, multiplied or added with some constant. This is something like a "lookup table" operation.
An example kernel:
__global__ void inplace_lookup_kernel(
    const scalar_t* __restrict__ T,
    scalar_t* __restrict__ X_in,
    const int N) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  const int idx = int(X_in[i]) % N;
  X_in[i] = 5 * T[idx] - 3;
}
I also wish to do the operation in-place, which is why the output is being computed into X_in.
My question is: for an operation like this, which is applied pointwise to each value of X_in, how do I determine a good number of threads/blocks to launch? In the Custom C++ & CUDA extension tutorial, they do so by:
const int threads = 1024;
const dim3 blocks((state_size + threads - 1) / threads, batch_size);
For their use-case, the operation (an LSTM variant) has a specific input format, and thus a fixed number of dimensions, from which the blocks can be calculated.
But the operation I'm writing should accept inputs of any dimensions and shape. What is the right way to calculate the block number for this situation?
I'm currently doing the following:
const int threads = 1024;
const int nelems = int(X_in.numel());
const dim3 blocks((nelems + threads - 1) / threads);
However I'm doing this by intuition, and not with any certainty. Is there a better or correct way to do this? And is there any computational advantage if I define blocks in the format blocks(otherdim_size, batch_size) like in the tutorial?
I'm speculating here, but - since your operation seems to be completely elementwise (w.r.t. X_in), and you don't seem to be using interesting SM-specific resources like shared memory, nor a lot of registers per thread - I don't think the grid partition matters all that much. Just treat X_in as a 1-D array according to its layout in memory, and use a 1D grid with a block size of, oh, 256, or 512, or 1024.
Of course - always try out your choices to make sure you don't get bitten by unexpected behavior.
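If you would rather not think about the grid size at all, another common pattern is a grid-stride loop: launch a fixed, modest grid and let each thread walk the flattened tensor. This is only a sketch (not from the tutorial), with float standing in for the templated scalar_t and an invented kernel name:

#include <cstdint>

__global__ void lookup_kernel_gridstride(
    const float* __restrict__ T,
    float* __restrict__ X_in,
    const int64_t numel,   // total number of elements in X_in
    const int N) {         // length of T
  // Each thread strides over the flattened tensor, so the grid size only
  // affects how many elements each thread handles, not correctness.
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
       i < numel;
       i += (int64_t)blockDim.x * gridDim.x) {
    const int idx = static_cast<int>(X_in[i]) % N;
    X_in[i] = 5 * T[idx] - 3;
  }
}

// launch with a fixed grid, e.g.
// lookup_kernel_gridstride<<<1024, 256>>>(T, X_in, numel, N);

With this pattern the bounds check is built into the loop condition, which also covers the case where numel is not a multiple of the block size.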
I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
    // TODO Put N elements in x
    return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
    // Do a lot of operations that result in some matrices like
    Eigen::Matrix<T, N, 1> A = ...
    Eigen::Matrix<T, N, 1> B = ...

    Eigen::Map<Eigen::Matrix<T, N, 1>> constraint(x);
    constraint = A - B;
    return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the line
constraint = A - B;
is almost always the bottleneck. This is sort of understandable, because such a line is potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for (int i = 0; i < N; ++i)
    x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler), and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).
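On the placement-new idea mentioned in the question: Eigen's documentation describes a placement-new idiom for re-seating an existing Map onto a different buffer, but it only avoids re-constructing Map objects (which, as noted above, are already free); it will not speed up the A - B assignment itself. A minimal sketch, with made-up function and variable names:

#include <new>
#include <Eigen/Dense>

void remap_example(double *x, double *y, int n) {
    Eigen::Map<Eigen::VectorXd> view(x, n);         // maps the buffer x; construction costs nothing
    view.setZero();
    new (&view) Eigen::Map<Eigen::VectorXd>(y, n);  // re-seat the same Map onto y
    view.setOnes();
}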
A long time ago, inspired by "Numerical Recipes in C", I started to use the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;

    x = (double **)malloc(NumRows * sizeof(double *));
    for (i = 0; i < NumRows; ++i)
        x[i] = (double *)calloc(NumCol, sizeof(double));
    return x;
}

double **x = allocate_matrix(1000, 2000);
x[m][n] = ...;
But recently I noticed that many people implement matrices as follows:
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, with the auxiliary array of double* pointers, really bad, or will the compiler eventually optimize it such that it is more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
P.S. Do not worry about the extra memory for storing the row pointers; for large matrices it is just a small percentage.
P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed, the x array is first loaded into the CPU cache and then the x[m] array, making each access run at the speed of talking to RAM? Or am I wrong, and the first method is also OK from a data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
    double **x;
    int i;

    x = (double **)malloc(NumRows * sizeof(double *));
    x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // single contiguous memory allocation for the entire array
    for (i = 1; i < NumRows; ++i)
        x[i] = x[i - 1] + NumCol;   // each row starts NumCol elements after the previous one
    return x;
}
This way you get data locality and its associated cache/memory-access benefits, and you can treat the array either as a double ** or as a flattened 2D array (array[i * NumCol + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
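For completeness, a sketch of the matching deallocation (assuming the allocation scheme above), which also needs only two calls:

void free_matrix(double **x)
{
    free(x[0]);  // frees the single contiguous block holding all the values
    free(x);     // frees the array of row pointers
}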
No need to guess whether the compiler will optimize the first method. Just use the second method, which you know is fast, and use a wrapper class that implements, for example, these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
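To make the operator() suggestion concrete, here is one possible minimal sketch of such a wrapper (the class name and layout are just an example):

#include <cstddef>
#include <vector>

class Matrix {
public:
    Matrix(int rows, int cols)
        : rows_(rows), cols_(cols), data_(static_cast<std::size_t>(rows) * cols) {}

    double&       operator()(int r, int c)       { return data_[static_cast<std::size_t>(r) * cols_ + c]; }
    double const& operator()(int r, int c) const { return data_[static_cast<std::size_t>(r) * cols_ + c]; }

private:
    int rows_, cols_;
    std::vector<double> data_;  // single contiguous, zero-initialized block
};

// Usage:
// Matrix arr(1000, 2000);
// arr(2, 3) = 5.0;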
The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, so the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* entries in the master array point to logical rows (arrays of NumCol elements each).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCol * m + n]), you would still get the same benefit, because you are treating the array as being in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses, given that the array is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.
I'm currently trying to do, as efficiently as possible, an in-place multiplication of an array of complex numbers (memory-aligned the same way std::complex would be, but currently using our own ADT) by an array of scalar values that is the same size as the complex number array.
The algorithm is already parallelized, i.e. the calling object splits the work up into threads. This calculation is done on arrays in the 100s of millions - so, it can take some time to complete. CUDA is not a solution for this product, although I wish it was. I do have access to boost and thus have some potential to use BLAS/uBLAS.
I'm thinking, however, that SIMD might yield much better results, but I'm not familiar enough with how to do this with complex numbers. The code I have now is as follows (remember this is chunked up into threads which correspond to the number of cores on the target machine). The target machine is also unknown. So, a generic approach is probably best.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (register int idx = start; idx < end; ++idx)
    {
        values[idx].real *= scalar[idx];
        values[idx].imag *= scalar[idx];
    }
}
fcomplex is defined as follows:
struct fcomplex
{
    float real;
    float imag;
};
I've tried manually unrolling the loop, since my final loop count will always be a power of 2, but the compiler is already doing that for me (I've unrolled as far as 32). I've tried a const float reference to the scalar - thinking I'd save one access - and that proved to be equal to what the compiler was already doing. I've tried STL and transform, which gave close results, but still worse. I've also tried casting to std::complex and letting it use the overloaded operator for scalar * complex for the multiplication, but this ultimately produced the same results.
So, anyone with any ideas? Much appreciation is given for your time in considering this! The target platform is Windows. I'm using Visual Studio 2008. The product cannot contain GPL code, either! Thanks so much.
You can do this fairly easily with SSE, e.g.
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 2)
    {
        __m128 vc = _mm_load_ps((float *)&values[idx]);   // r0, i0, r1, i1
        __m128 vk = _mm_set_ps(scalar[idx + 1], scalar[idx + 1], scalar[idx], scalar[idx]);
                                                           // s0, s0, s1, s1 (_mm_set_ps takes arguments high-to-low)
        vc = _mm_mul_ps(vc, vk);
        _mm_store_ps((float *)&values[idx], vc);
    }
}
Note that values and scalar need to be 16 byte aligned.
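In case it helps, one way (of several) to get that 16-byte alignment is _mm_malloc/_mm_free; a small sketch, reusing the question's fcomplex (on MSVC, _mm_malloc may come from <malloc.h> rather than <xmmintrin.h>):

#include <cstddef>
#include <xmmintrin.h>

struct fcomplex { float real; float imag; };  // as in the question

void allocate_aligned_buffers(std::size_t n)
{
    // 16-byte alignment satisfies the requirements of _mm_load_ps/_mm_store_ps
    fcomplex *values = static_cast<fcomplex *>(_mm_malloc(n * sizeof(fcomplex), 16));
    float    *scalar = static_cast<float *>(_mm_malloc(n * sizeof(float), 16));

    // ... fill the buffers and call cmult_scalar_inplace(values, 0, (int)n, scalar) ...

    _mm_free(scalar);
    _mm_free(values);
}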
Or you could just use the Intel ICC compiler and let it do the hard work for you.
UPDATE
Here is an improved version which unrolls the loop by a factor of 2 and uses a single load instruction to get 4 scalar values which are then unpacked into two vectors:
void cmult_scalar_inplace(fcomplex *values, const int start, const int end, const float *scalar)
{
    for (int idx = start; idx < end; idx += 4)
    {
        __m128 vc0 = _mm_load_ps((float *)&values[idx]);      // r0, i0, r1, i1
        __m128 vc1 = _mm_load_ps((float *)&values[idx + 2]);  // r2, i2, r3, i3
        __m128 vk = _mm_load_ps(&scalar[idx]);                // s0, s1, s2, s3
        __m128 vk0 = _mm_shuffle_ps(vk, vk, 0x50);            // s0, s0, s1, s1
        __m128 vk1 = _mm_shuffle_ps(vk, vk, 0xfa);            // s2, s2, s3, s3
        vc0 = _mm_mul_ps(vc0, vk0);
        vc1 = _mm_mul_ps(vc1, vk1);
        _mm_store_ps((float *)&values[idx], vc0);
        _mm_store_ps((float *)&values[idx + 2], vc1);
    }
}
Your best bet will be to use an optimised BLAS which will take advantage of whatever is available on your target platform.
One problem I see is that in the function it's hard for the compiler to know that the scalar pointer does not point into the middle of the complex array (scalar could, in theory, be pointing at the real or imaginary part of one of the complex values).
This actually forces the order of evaluation.
Another problem I see is that here the computation is so simple that other factors will influence the raw speed; therefore, if you really care about performance, the only solution in my opinion is to implement several variations and test them at runtime on the user's machine to discover which is the fastest.
What I'd consider is using different unrolling sizes, and also playing with the alignment of scalar and values (the memory access pattern can have a big influence on caching effects).
For the problem of the unwanted serialization, an option is to see what code is generated for something like:
float r0 = values[i].real, i0 = values[i].imag, s0 = scalar[i];
float r1 = values[i+1].real, i1 = values[i+1].imag, s1 = scalar[i+1];
float r2 = values[i+2].real, i2 = values[i+2].imag, s2 = scalar[i+2];
values[i].real = r0*s0; values[i].imag = i0*s0;
values[i+1].real = r1*s1; values[i+1].imag = i1*s1;
values[i+2].real = r2*s2; values[i+2].imag = i2*s2;
because here the optimizer has in theory a little bit more freedom.
Do you have access to Intel's Integrated Performance Primitives (IPP)? They have a number of functions that handle cases like this with pretty decent performance. You might have some success with your particular problem, but I would not be surprised if your compiler already does a decent job of optimizing the code.