Optimizing the backward solve for a sparse lower triangular linear system - c++

I have the compressed sparse column (CSC) representation of the n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for x in
(A + I)' * x = b
This is the routine I have for computing this:
void backsolve(const int* __restrict__ Lp,
               const int* __restrict__ Li,
               const double* __restrict__ Lx,
               const int n,
               double* __restrict__ x) {
    for (int i = n - 1; i >= 0; --i) {
        for (int j = Lp[i]; j < Lp[i + 1]; ++j) {
            x[i] -= Lx[j] * x[Li[j]];
        }
    }
}
Thus, b is passed in via the argument x and is overwritten by the solution. Lp, Li, Lx are respectively the column pointers, row indices, and values of the standard CSC representation of sparse matrices. This function is the top hotspot in the program, with the line
x[i] -= Lx[j] * x[Li[j]];
being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives
backsolve(int const*, int const*, double const*, int, double*):
lea eax, [rcx-1]
movsx r11, eax
lea r9, [r8+r11*8]
test eax, eax
js .L9
.L5:
movsx rax, DWORD PTR [rdi+r11*4]
mov r10d, DWORD PTR [rdi+4+r11*4]
cmp eax, r10d
jge .L6
vmovsd xmm0, QWORD PTR [r9]
.L7:
movsx rcx, DWORD PTR [rsi+rax*4]
vmovsd xmm1, QWORD PTR [rdx+rax*8]
add rax, 1
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
vmovsd QWORD PTR [r9], xmm0
cmp r10d, eax
jg .L7
.L6:
sub r11, 1
sub r9, 8
test r11d, r11d
jns .L5
ret
.L9:
ret
According to VTune,
vmovsd QWORD PTR [r9], xmm0
is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc, but nothing has worked.
Processor: Xeon Skylake
Question: What can I do to optimize this function?

This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these can give you an idea of where to start.
For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline which benchmarks the implementation given in the question and benchBacksolveOptimized which benchmarks the proposed "optimized" implementations. There is also benchFillRhs which separately benchmarks the function that is used in both to generate some not completely trivial values for the right hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
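For reference, here is a minimal sketch of how such a harness can be structured (this is not the linked source file; the CSC matrix is a tiny hard-coded stand-in, fillRhs is a placeholder for the real right-hand-side generator, and backsolve is the function from the question):
#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

// The routine under test, as defined in the question.
void backsolve(const int* Lp, const int* Li, const double* Lx, int n, double* x);

// Stand-in for the right-hand-side generator measured by benchFillRhs.
static void fillRhs(std::vector<double>& x) {
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = 1.0 + 0.5 * static_cast<double>(i % 7);
}

static void benchBacksolveBaseline(benchmark::State& state) {
    // Strictly lower-triangular 3x3 example in CSC form (placeholder for the real matrix).
    std::vector<int> Lp = {0, 2, 3, 3};
    std::vector<int> Li = {1, 2, 2};
    std::vector<double> Lx = {0.5, 0.25, 0.125};
    std::vector<double> x(3);
    for (auto _ : state) {
        fillRhs(x);  // measured separately by benchFillRhs, so its time can be subtracted
        backsolve(Lp.data(), Li.data(), Lx.data(), 3, x.data());
        benchmark::ClobberMemory();  // keep the stores to x from being optimized away
    }
}
BENCHMARK(benchBacksolveBaseline);

BENCHMARK_MAIN();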
1. Iterating strictly backwards
The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. Seems like it would be more consistent to iterate through each column backwards as well:
for (int i = n - 1; i >= 0; --i) {
    for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
        x[i] -= Lx[j] * x[Li[j]];
    }
}
This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measurable improvement:
------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs              2737 ns         2734 ns      5120000
benchBacksolveBaseline   17412 ns        17421 ns       829630
benchBacksolveOptimized  16046 ns        16040 ns       853333
I assume this is caused by somehow more predictable cache access, but I did not look into it much further.
2. Fewer loads/stores in the inner loop
As A is lower triangular, we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:
for (int i = n - 1; i >= 0; --i) {
    double xi_temp = x[i];
    for (int j = Lp[i + 1] - 1; j >= Lp[i]; --j) {
        xi_temp -= Lx[j] * x[Li[j]];
    }
    x[i] = xi_temp;
}
This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:
------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs              2737 ns         2740 ns      5120000
benchBacksolveBaseline   17410 ns        17418 ns       814545
benchBacksolveOptimized  15155 ns        15147 ns       887129
3. Unroll the loop
While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still does not. So let's give that a try by additionally passing -funroll-loops.
------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs              2733 ns         2734 ns      5120000
benchBacksolveBaseline   15079 ns        15081 ns       953191
benchBacksolveOptimized  14392 ns        14385 ns       963441
Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version also benefits a bit from loop unrolling, but maybe not as much as we may have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with 8 unrolls. For my setup, I can in fact do a little better with just one simple manual unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:
for (int i = n - 1; i >= 0; --i) {
    const int col_begin = Lp[i];
    const int col_end = Lp[i + 1];
    const bool is_col_nnz_odd = (col_end - col_begin) & 1;
    double xi_temp = x[i];
    int j = col_end - 1;
    if (is_col_nnz_odd) {
        xi_temp -= Lx[j] * x[Li[j]];
        --j;
    }
    for (; j >= col_begin; j -= 2) {
        xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
                   Lx[j - 1] * x[Li[j - 1]];
    }
    x[i] = xi_temp;
}
With that I measure:
------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs              2728 ns         2729 ns      5090909
benchBacksolveBaseline   17451 ns        17449 ns       822018
benchBacksolveOptimized  13440 ns        13443 ns      1018182
Other algorithms
All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse matrix structures like these can suffer significantly from memory traffic. At least for matrix factorizations, there are more sophisticated methods that operate on dense submatrices assembled from the sparse structure, such as supernodal and multifrontal methods. I am a bit fuzzy on this, but I think those methods also apply this idea to the layout and use dense matrix operations for lower triangular backward solves (for example for Cholesky-type factorizations). So it might be worth looking into those kinds of methods if you are not forced to stick to the simple method that works on the sparse structure directly. See for example this survey by Davis.

You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:
void backsolve(const unsigned* __restrict__ Lp,
               const unsigned* __restrict__ Li,
               const double* __restrict__ Lx,
               const unsigned n,
               double* __restrict__ x) {
    for (unsigned i = n; i-- > 0; ) {
        for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
            x[i] -= Lx[j] * x[Li[j]];
        }
    }
}
Compiling with Godbolt's compiler explorer shows slightly different code for the inner loop, potentially making better use of the CPU pipeline. I cannot test, but you could try.
Here is the generated code for the inner loop:
.L8:
mov rax, rcx
.L5:
mov ecx, DWORD PTR [r10+rax*4]
vmovsd xmm1, QWORD PTR [r11+rax*8]
vfnmadd231sd xmm0, xmm1, QWORD PTR [r8+rcx*8]
lea rcx, [rax+1]
vmovsd QWORD PTR [r9], xmm0
cmp rdi, rax
jne .L8

Related

Cycles Per Element V.S. actual performance of Polynomial Evaluation

Chapter 5 of the book "Computer Systems: A Programmer's Perspective (3rd edition)", exercises 5.5 and 5.6, discusses polynomial evaluation:
It gives two implementations, poly() and polyh(), and says poly()'s CPE (Cycles Per Element) is 5.0 and polyh()'s CPE is 8.0, thus concluding that poly() runs faster than polyh(). But with clang-12 or clang-14 on my Ubuntu 20.04, polyh() is much faster, contrary to what these exercises say. I'm confused.
The Polynomial Evaluation implementations:
// the naive method
double poly(double a[], double x, long degree)
{
    long i;
    double result = a[0];
    double xpwr = x;
    for (i = 1; i <= degree; i++)
    {
        result += a[i] * xpwr;
        xpwr = x * xpwr;
    }
    return result;
}

// Horner's method
double polyh(double a[], double x, long degree)
{
    long i;
    double result = a[degree];
    for (i = degree - 1; i >= 0; i--)
    {
        result = a[i] + x * result;
    }
    return result;
}
My compilation flags: -O1. Full implementation (including timer) is: https://godbolt.org/z/3eW8Wzr7z
My timing results:
polyh: took 2.318 ms, loop=10, avg = 0.232 ms
poly: took 78.980 ms, loop=10, avg = 7.898 ms
Why does polyh() run faster despite its larger CPE?
Update: Based on the comments from @Passer By, I used the quick-bench website for timing, and with different array sizes the benchmark results differ:
n = 1000: poly() is faster (https://quick-bench.com/q/EpDmf22VD_E0CvLN0-6TY_Ye8bU)
n = 10000: polyh() is much faster (https://quick-bench.com/q/yuzoVzz_KhWv1gJ-_j9wlZtfWVM)
I think there is some confusion regarding the statements in the book. The link you have provided clearly shows polyh() to have a lower CPE than poly():
polyh(double*, double, long):
# skipping non-loop code...
mulsd xmm0, xmm1
addsd xmm0, qword ptr [rdi + 8*rsi - 16]
add rsi, -1
cmp rsi, 1
jg .LBB1_2
vs
poly(double*, double, long):
# skipping non-loop code...
movsd xmm3, qword ptr [rdi + 8*rax + 8]
mulsd xmm3, xmm2
addsd xmm0, xmm3
mulsd xmm2, xmm1
add rax, 1
cmp rsi, rax
jne .LBB0_2
Clearly polyh() is tighter code in comparison with poly().
Now let's talk about optimization. First of all, -O0 disables optimization; -O1 enables only minimal optimizations.
But even if you throw optimization out of the window, the code in polyh() is already optimized before compilation: it has only one multiplication, one addition, and one assignment per iteration, while poly() has two multiplications and two assignments.
Clearly polyh() is leaner and faster code.
UPDATE:
After the updated question, here is what I found. I tested with the same quick-bench setup but used GCC instead of the Clang I was using on my computer, and the results are still the same: polyh() wins even with 1000 iterations.
https://quick-bench.com/q/_0IppR0fGBncrR60s5WtUiTq5U8

why is it faster to print number in binary using arithmetic instead of _bittest

The purpose of the next two code sections is to print a number in binary.
The first one does this with two instructions (_bittest), while the second does it with pure arithmetic, which takes three instructions.
The first code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = _bittest(&num, nBit);
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop compiles to the instructions bt and setb.
The second code section:
#include <intrin.h>
#include <stdio.h>
#include <Windows.h>

long num = 78002;

int main()
{
    unsigned char bits[32];
    long nBit;
    LARGE_INTEGER a, b, f;
    QueryPerformanceCounter(&a);
    for (size_t i = 0; i < 100000000; i++)
    {
        long curBit = 1;
        for (nBit = 0; nBit < 31; nBit++)
        {
            bits[nBit] = (num & curBit) >> nBit;
            curBit <<= 1;
        }
    }
    QueryPerformanceCounter(&b);
    QueryPerformanceFrequency(&f);
    printf_s("time is: %f\n", ((float)b.QuadPart - (float)a.QuadPart) / (float)f.QuadPart);
    printf_s("Binary representation:\n");
    while (nBit--)
    {
        if (bits[nBit])
            printf_s("1");
        else
            printf_s("0");
    }
    return 0;
}
The inner loop compiles to and, add (as a left shift), and sar.
The second code section runs three times faster than the first one.
Why do three CPU instructions run faster than two?
Not an answer (Bo gave one), but the second inner-loop version can be simplified a bit:
long numCopy = num;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = numCopy & 1;
    numCopy >>= 1;
}
Has a subtle difference (one instruction less) with gcc 7.2 targeting 32b.
(I'm assuming a 32-bit target, as you convert a long into a 32-element array of bits, which makes sense only on a 32-bit target... and I assume x86, as it includes <windows.h>, so it's clearly for an obsolete OS target, although I think Windows now has a 64-bit version too? (I don't care.))
Answer:
Why do three CPU instructions run faster than two?
Because the count of instructions only correlates loosely with performance (usually fewer is better), and a modern x86 CPU is a much more complex machine. It translates the actual x86 instructions into micro-ops before execution, transforms them further with things like out-of-order execution and register renaming (to break false dependency chains), and then executes the resulting micro-ops, with different CPU units capable of executing only certain micro-ops. In the ideal case you may get 2-3 micro-ops executed in parallel by 2-3 units in a single cycle; in the worst case you may be executing a complete microcode loop implementing some complex x86 instruction that takes several cycles to finish, blocking most of the CPU units.
Another factor is the availability of data from memory, and memory writes: a single cache miss, when the data must be fetched from a higher-level cache or even from memory itself, creates a stall of tens to hundreds of cycles. Having compact data structures that favour predictable access patterns and don't exhaust all the cache lines is paramount for exploiting maximum CPU performance.
If you are at the stage of asking "why are 3 instructions faster than 2 instructions?", you can pretty much start with any x86 optimization article/book and keep reading for a few months or years; it's quite a complex topic.
You may want to check this answer https://gamedev.stackexchange.com/q/27196 for further reading...
I'm assuming you're using x86-64 MSVC CL19 (or something that makes similar code).
_bittest is slower because MSVC does a horrible job and keeps the value in memory and bt [mem], reg is much slower than bt reg,reg. This is a compiler missed-optimization. It happens even if you make num a local variable instead of a global, even when the initializer is still constant!
I included some perf analysis for Intel Sandybridge-family CPUs because they're common; you didn't say and yes it matters: bt [mem], reg has one per 3 cycle throughput on Ryzen, one per 5 cycle throughput on Haswell. And other perf characteristics differ...
(For just looking at the asm, it's usually a good idea to make a function with args to get code the compiler can't do constant-propagation on. It can't in this case because it doesn't know if anything modifies num before main runs, because it's not static.)
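For example, a stand-alone helper along these lines (a hypothetical function, not code from the question or the original answer) is a convenient shape for asm inspection, because num stays a runtime argument:
// Hypothetical helper purely for looking at the generated asm: num and the output
// buffer are arguments, so the compiler cannot constant-propagate num away.
void bits_of(long num, unsigned char bits[32])
{
    for (long nBit = 0; nBit < 31; nBit++)
        bits[nBit] = (num >> nBit) & 1;
}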
Your instruction-counting didn't include the whole loop so your counts are wrong, but more importantly you didn't consider the different costs of different instructions. (See Agner Fog's instruction tables and optimization manual.)
This is your whole inner loop with the _bittest intrinsic, with uop counts for Haswell / Skylake:
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = _bittest(&num, nBit);
    //bits[nBit] = (bool)(num & (1UL << nBit)); // much more efficient
}
Asm output from MSVC CL19 -Ox on the Godbolt compiler explorer
$LL7@main:
bt DWORD PTR num, ebx ; 10 uops (microcoded), one per 5 cycle throughput
lea rcx, QWORD PTR [rcx+1] ; 1 uop
setb al ; 1 uop
inc ebx ; 1 uop
mov BYTE PTR [rcx-1], al ; 1 uop (micro-fused store-address and store-data)
cmp ebx, 31
jb SHORT $LL7@main ; 1 uop (macro-fused with cmp)
That's 15 fused-domain uops, so it can issue (at 4 per clock) in 3.75 cycles. But that's not the bottleneck: Agner Fog's testing found that bt [mem], reg has a throughput of one per 5 clock cycles.
IDK why it's 3x slower than your other loop. Maybe the other ALU instructions compete for the same port as the bt, or the data dependency it's part of causes a problem, or just being a micro-coded instruction is a problem, or maybe the outer loop is less efficient?
Anyway, using bt [mem], reg instead of bt reg, reg is a major missed optimization. This loop would have been faster than your other loop with a 1 uop, 1c latency, 2-per-clock throughput bt r9d, ebx.
The inner loop compiles to and, add (as a left shift), and sar.
Huh? Those are the instructions MSVC associates with the curBit <<= 1; source line (even though that line is fully implemented by the add self,self, and the variable-count arithmetic right shift is part of a different line.)
But the whole loop is this clunky mess:
long curBit = 1;
for (nBit = 0; nBit < 31; nBit++) {
    bits[nBit] = (num & curBit) >> nBit;
    curBit <<= 1;
}
$LL18@main: # MSVC CL19 -Ox
mov ecx, ebx ; 1 uop
lea r8, QWORD PTR [r8+1] ; 1 uop pointer-increment for bits
mov eax, r9d ; 1 uop. r9d holds num
inc ebx ; 1 uop
and eax, edx ; 1 uop
# MSVC says all the rest of these instructions are from curBit <<= 1; but they're obviously not.
add edx, edx ; 1 uop
sar eax, cl ; 3 uops (variable-count shifts suck)
mov BYTE PTR [r8-1], al ; 1 uop (micro-fused)
cmp ebx, 31
jb SHORT $LL18@main ; 1 uop (macro-fused with cmp)
So this is 11 fused-domain uops, and takes 2.75 clock cycles per iteration to issue from the front-end.
I don't see any loop-carried dep chains longer than that front-end bottleneck, so it probably runs about that fast.
Copying ebx to ecx every iteration instead of just using ecx as the loop counter (nBit) is an obvious missed optimization. The shift-count is needed in cl for a variable-count shift (unless you enable BMI2 instructions, if MSVC can even do that.)
There are major missed optimizations here (in the "fast" version), so you should probably write your source differently to hand-hold your compiler into making less-bad code. The compiler implements this fairly literally instead of transforming it into something the CPU can do efficiently, or using bt reg,reg / setc.
How to do this fast in asm or with intrinsics
Use SSE2 / AVX. Get the right byte (containing the corresponding bit) into each byte element of a vector, and PANDN (to invert your vector) with a mask that has the right bit for that element. PCMPEQB against zero. That gives you 0 / -1. To get ASCII digits, use _mm_sub_epi8(set1('0'), mask) to subtract 0 or -1 (add 0 or 1) to ASCII '0', conditionally turning it into '1'.
The first step of this (getting a vector of 0 / -1 from a bitmask) is covered in How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
Fastest way to unpack 32 bits to a 32 byte SIMD vector (has a 128b version). Without SSSE3 (pshufb), I think punpcklbw / punpcklwd (and maybe pshufd) is what you need to repeat each byte of num 8 times and make two 16-byte vectors.
is there an inverse instruction to the movemask instruction in intel avx2?.
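For illustration, here is a minimal SSE2 sketch of that unpack idea (an assumed example, not the exact code behind those links; it uses pand + pcmpeqb against the mask rather than the pandn-then-compare-against-zero phrasing above, and it produces 0/1 bytes like the question's bits[] array rather than ASCII digits):
#include <emmintrin.h>  // SSE2

// Expand the 32 bits of num into 32 bytes of 0/1, 16 bits per vector.
static void bits_to_bytes_sse2(long num, unsigned char* bits)
{
    const __m128i mask = _mm_set_epi8((char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1,
                                      (char)0x80, 0x40, 0x20, 0x10, 8, 4, 2, 1);
    __m128i v  = _mm_cvtsi32_si128((int)num);
    __m128i b  = _mm_unpacklo_epi8(v, v);      // each source byte repeated 2x
    __m128i b4 = _mm_unpacklo_epi16(b, b);     // each source byte repeated 4x
    __m128i lo = _mm_unpacklo_epi32(b4, b4);   // source bytes 0,1 repeated 8x each
    __m128i hi = _mm_unpackhi_epi32(b4, b4);   // source bytes 2,3 repeated 8x each
    __m128i lo_set = _mm_cmpeq_epi8(_mm_and_si128(lo, mask), mask);  // 0 / -1 per bit
    __m128i hi_set = _mm_cmpeq_epi8(_mm_and_si128(hi, mask), mask);
    const __m128i one = _mm_set1_epi8(1);
    _mm_storeu_si128((__m128i*)bits,        _mm_and_si128(lo_set, one));  // bits[0..15]
    _mm_storeu_si128((__m128i*)(bits + 16), _mm_and_si128(hi_set, one));  // bits[16..31]
}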
In scalar code, this is one way that runs at 1 bit->byte per clock. There are probably ways to do better without using SSE2 (storing multiple bytes at once to get around the 1 store per clock bottleneck that exists on all current CPUs), but why bother? Just use SSE2.
mov eax, [num]
lea rdi, [rsp + xxx] ; bits[]
.loop:
shr eax, 1 ; constant-count shift is efficient (1 uop). CF = last bit shifted out
setc [rdi] ; 2 uops, but just as efficient as setc reg / mov [mem], reg
shr eax, 1
setc [rdi+1]
add rdi, 2
cmp end_pointer ; compare against another register instead of a separate counter.
jb .loop
Unrolled by two to avoid bottlenecking on the front-end, so this can run at 1 bit per clock.
The difference is that the code _bittest(&num, nBit); uses a pointer to num, which makes the compiler store it in memory. And the memory access makes the code a lot slower.
bits[nBit] = _bittest(&num, nBit);
00007FF6D25110A0 bt dword ptr [num (07FF6D2513034h)],ebx ; <-----
00007FF6D25110A7 lea rcx,[rcx+1]
00007FF6D25110AB setb al
00007FF6D25110AE inc ebx
00007FF6D25110B0 mov byte ptr [rcx-1],al
The other version stores all the variables in registers, and uses very fast register shifts and adds. No memory accesses.

Bit shift whole block of memory efficiently

I have the following code, which runs after performing a Sobel operation:
short* tempBufferVert = new short[width * height];
ippiFilterSobelVertBorder_8u16s_C1R(pImg, width, tempBufferVert, width * 2, dstSize, IppiMaskSize::ippMskSize3x3, IppiBorderType::ippBorderConst, 0, pBufferVert);
for (int i = 0; i < width * height; i++)
    tempBufferVert[i] >>= 2;
The frustrating thing is that the bit shift is the longest-running operation of it all; the IPP Sobel is so optimized that it runs faster than my stupid bit shift. How can I optimize the bit shift, or are there IPP or other options (AVX?) to perform a bit shift on the whole block of memory (while preserving the sign of the short, which >>= does in the Visual Studio implementation)?
C++ optimisers perform a lot better with iterator-based loops than with indexing loops.
This is because the compiler can make assumptions about how address arithmetic works when the index overflows. For it to make the same assumptions when using an index into an array, you must happen to pick the correct datatype for the index by luck.
The shift code can be expressed as:
void shift(short* first, short* last, int bits)
{
    while (first != last) {
        *first++ >>= bits;
    }
}

int test(int width, int height)
{
    short* tempBufferVert = new short[width * height];
    shift(tempBufferVert, tempBufferVert + (width * height), 2);
}
Which will (with correct optimisations enabled) be vectorised: https://godbolt.org/g/oJ8Boj
note how the middle of the loop becomes:
.L76:
vmovdqa ymm0, YMMWORD PTR [r9+rdx]
add r8, 1
vpsraw ymm0, ymm0, 2
vmovdqa YMMWORD PTR [r9+rdx], ymm0
add rdx, 32
cmp rsi, r8
ja .L76
lea rax, [rax+rdi*2]
cmp rcx, rdi
je .L127
vzeroupper
Firstly make sure you are compiling with optimisation enabled (e.g. -O3), and then check whether your compiler is auto-vectorizing the right shift loop. If it's not then you can probably get a significant improvement with SSE:
#include <emmintrin.h> // SSE2

for (int i = 0; i < width * height; i += 8)
{
    __m128i v = _mm_loadu_si128((__m128i *)&tempBufferVert[i]);
    v = _mm_srai_epi16(v, 2); // v >>= 2
    _mm_storeu_si128((__m128i *)&tempBufferVert[i], v);
}
(Note: assumes width*height is a multiple of 8.)
You can probably do even better with some loop unrolling and/or using AVX2, but this may be enough for your needs as it stands.
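For reference, an AVX2 variant of the same loop might look like this (a sketch: it assumes AVX2 is available and that width * height is a multiple of 16):
#include <immintrin.h> // AVX2

for (int i = 0; i < width * height; i += 16)
{
    __m256i v = _mm256_loadu_si256((__m256i *)&tempBufferVert[i]);
    v = _mm256_srai_epi16(v, 2); // arithmetic shift keeps the sign of each short
    _mm256_storeu_si256((__m256i *)&tempBufferVert[i], v);
}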

SIMD XOR operation is not as effective as Integer XOR?

I have a task to calculate xor-sum of bytes in an array:
X = char1 XOR char2 XOR char3 ... charN;
I'm trying to parallelize it, XORing __m128 instead. This should give a speedup factor of 16.
Also, to recheck the algorithm I use int. This should give a speedup factor of 4.
The test program is 100 lines long, I can't make it shorter, but it is simple:
#include "xmmintrin.h" // simulation of the SSE instruction
#include <ctime>
#include <iostream>
using namespace std;
#include <stdlib.h> // rand
const int NIter = 100;
const int N = 40000000; // matrix size. Has to be dividable by 4.
unsigned char str[N] __attribute__ ((aligned(16)));
template< typename T >
T Sum(const T* data, const int N)
{
T sum = 0;
for ( int i = 0; i < N; ++i )
sum = sum ^ data[i];
return sum;
}
template<>
__m128 Sum(const __m128* data, const int N)
{
__m128 sum = _mm_set_ps1(0);
for ( int i = 0; i < N; ++i )
sum = _mm_xor_ps(sum,data[i]);
return sum;
}
int main() {
// fill string by random values
for( int i = 0; i < N; i++ ) {
str[i] = 256 * ( double(rand()) / RAND_MAX ); // put a random value, from 0 to 255
}
/// -- CALCULATE --
/// SCALAR
unsigned char sumS = 0;
std::clock_t c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ )
sumS = Sum<unsigned char>( str, N );
double tScal = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SIMD
unsigned char sumV = 0;
const int m128CharLen = 4*4;
const int NV = N/m128CharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
__m128 sumVV = _mm_set_ps1(0);
sumVV = Sum<__m128>( reinterpret_cast<__m128*>(str), NV );
unsigned char *sumVS = reinterpret_cast<unsigned char*>(&sumVV);
sumV = sumVS[0];
for ( int iE = 1; iE < m128CharLen; ++iE )
sumV ^= sumVS[iE];
}
double tSIMD = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// SCALAR INTEGER
unsigned char sumI = 0;
const int intCharLen = 4;
const int NI = N/intCharLen;
c_start = std::clock();
for( int ii = 0; ii < NIter; ii++ ) {
int sumII = Sum<int>( reinterpret_cast<int*>(str), NI );
unsigned char *sumIS = reinterpret_cast<unsigned char*>(&sumII);
sumI = sumIS[0];
for ( int iE = 1; iE < intCharLen; ++iE )
sumI ^= sumIS[iE];
}
double tINT = 1000.0 * (std::clock()-c_start) / CLOCKS_PER_SEC;
/// -- OUTPUT --
cout << "Time scalar: " << tScal << " ms " << endl;
cout << "Time INT: " << tINT << " ms, speed up " << tScal/tINT << endl;
cout << "Time SIMD: " << tSIMD << " ms, speed up " << tScal/tSIMD << endl;
if(sumV == sumS && sumI == sumS )
std::cout << "Results are the same." << std::endl;
else
std::cout << "ERROR! Results are not the same." << std::endl;
return 1;
}
The typical results:
[10:46:20]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:27]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3540 ms
Time INT: 890 ms, speed up 3.97753
Time SIMD: 280 ms, speed up 12.6429
Results are the same.
[10:46:35]$ g++ test.cpp -O3 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 290 ms, speed up 12.5517
Results are the same.
As you see, the int version works ideally, but the SIMD version loses 25% of the speed, and this is stable. I tried changing the array sizes; this doesn't help.
Also, if I switch to -O2, I lose 75% of the speed in the SIMD version:
[10:50:25]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 880 ms, speed up 4.13636
Time SIMD: 890 ms, speed up 4.08989
Results are the same.
[10:51:16]$ g++ test.cpp -O2 -fno-tree-vectorize; ./a.out
Time scalar: 3640 ms
Time INT: 900 ms, speed up 4.04444
Time SIMD: 880 ms, speed up 4.13636
Results are the same.
Can someone explain this to me?
Additional info:
I have g++ (GCC) 4.7.3; Intel(R) Xeon(R) CPU E7-4860
I use -fno-tree-vectorize to prevent auto-vectorization. Without this flag, with -O3, the expected speedup is 1, since the task is simple. This is what I get:
[10:55:40]$ g++ test.cpp -O3; ./a.out
Time scalar: 270 ms
Time INT: 270 ms, speed up 1
Time SIMD: 280 ms, speed up 0.964286
Results are the same.
but with -O2 the result is still strange:
[10:55:02]$ g++ test.cpp -O2; ./a.out
Time scalar: 3540 ms
Time INT: 990 ms, speed up 3.57576
Time SIMD: 880 ms, speed up 4.02273
Results are the same.
When I change
for ( int i = 0; i < N; i+=1 )
sum = sum ^ data[i];
to the equivalent of:
for ( int i = 0; i < N; i+=8 )
sum = (data[i] ^ data[i+1]) ^ (data[i+2] ^ data[i+3]) ^ (data[i+4] ^ data[i+5]) ^ (data[i+6] ^ data[i+7]) ^ sum;
I do see an improvement in scalar speed by a factor of 2, but I don't see improvements in the speedups. Before: intSpeedUp 3.98416, SIMDSpeedUP 12.5283. After: intSpeedUp 3.5572, SIMDSpeedUP 6.8523.
I think you may be bumping into the upper limits of memory bandwidth. This might be the reason for the 12.6x speedup instead of 16x speedup in the -O3 case.
However, gcc 4.7.3 puts a useless store instruction into the tiny not-unrolled vector loop when inlining, but not in the scalar or int SWAR loops (see below), so that might be the explanation instead.
The -O2 reduction in vector throughput is all due to gcc 4.7.3 doing an even worse job there and sending the accumulator on a round trip to memory (store-forwarding).
For analysis of the implications of that extra store instruction, see the section at the end.
TL;DR: Nehalem likes a bit more loop unrolling than SnB-family requires, and gcc has made major improvements in SSE code-generation in gcc5.
And typically use _mm_xor_si128, not _mm_xor_ps for bulk xor work like this.
Memory bandwidth.
N is huge (40MB), so memory/cache bandwidth is a concern. A Xeon E7-4860 is a 32nm Nehalem microarchitecture, with 256kiB of L2 cache (per core), and 24MiB of shared L3 cache. It has a quad-channel memory controller supporting up to DDR3-1066 (compared to dual-channel DDR3-1333 or DDR3-1600 for typical desktop CPUs like SnB or Haswell).
A typical 3GHz desktop Intel CPU can sustain a load bandwidth of something like ~8B / cycle from DRAM, in theory. (e.g. 25.6GB/s theoretical max memory BW for an i5-4670 with dual channel DDR3-1600). Achieving this in an actual single thread might not work, esp. when using integer 4B or 8B loads. For a slower CPU like a 2267MHz Nehalem Xeon, with quad-channel (but also slower) memory, 16B per clock is probably pushing the upper limits.
I had a look at the asm from the original unchanged code with gcc 4.7.3 on godbolt.
The stand-alone version looks fine (but the inlined version isn't, see below!), with the loop being
## float __vector Sum(...) non-inlined version
.L3:
xorps xmm0, XMMWORD PTR [rdi]
add rdi, 16
cmp rdi, rax
jne .L3
That's 3 fused-domain uops, and should issue and execute at one iteration per clock. Actually, it can't because xorps and fused compare-and-branch both need port5.
N is huge, so the overhead of the clunky char-at-a-time horizontal XOR doesn't come into play, even though gcc 4.7 emits abysmal code for it (multiple copies of sumVV stored to the stack, etc. etc.). (See Fastest way to do horizontal float vector sum on x86 for ways to reduce down to 4B with SIMD. It might be faster to then movd the data into integer regs and use integer shift/xor there for the last 4B -> 1B, esp. if you're not using AVX. The compiler might be able to take advantage of al/ah low and high 8bit component regs.)
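As a concrete illustration of that last idea (a sketch, not tuned for any particular uarch; if the accumulator is a float vector __m128, cast it first with _mm_castps_si128):
#include <emmintrin.h>  // SSE2

// Horizontal XOR of 16 bytes down to a single byte: fold with shuffles,
// then finish the last 4B -> 1B with integer shifts/xors as suggested above.
static unsigned char hxor_epi8(__m128i v)
{
    v = _mm_xor_si128(v, _mm_srli_si128(v, 8));   // fold high 8 bytes into low 8
    v = _mm_xor_si128(v, _mm_srli_si128(v, 4));   // fold down to the low 4 bytes
    unsigned x = (unsigned)_mm_cvtsi128_si32(v);  // movd to an integer register
    x ^= x >> 16;
    x ^= x >> 8;
    return (unsigned char)x;
}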
The vector loop was inlined stupidly:
## float __vector Sum(...) inlined into main at -O3
.L12:
xorps xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+64], xmm0
jne .L12
It's storing the accumulator every iteration, instead of just after the last iteration! Since gcc doesn't / didn't default to optimizing for macro-fusion, it didn't even put the cmp/jne next to each other where they can fuse into a single uop on Intel and AMD CPUs, so the loop has 5 fused-domain uops. This means it can only issue at one per 2 clocks, if the Nehalem frontend / loop buffer is anything like the Sandybridge loop buffer. uops issue in groups of 4, and a predicted-taken branch ends an issue block. So it issues in a 4/1/4/1 uop pattern, not 4/4/4/4. This means we can get at best one 16B load per 2 clocks of sustained throughput.
-mtune=core2 might double the throughput, because it puts the cmp/jne together. The store can micro-fuse into a single uop, and so can the xorps with a memory source operand. A gcc that old doesn't support -mtune=nehalem, or the more generic -mtune=intel. Nehalem can sustain one load and one store per clock, but obviously it would be far better not to have a store in the loop at all.
Compiling with -O2 makes even worse code with that gcc version:
The inlined inner loop now loads the accumulator from memory as well as storing it, so there's a store-forwarding round trip in the loop-carried dependency that the accumulator is part of:
## float __vector Sum(...) inlined at -O2
.L14:
movaps xmm0, XMMWORD PTR [rsp+16] # reload sum
xorps xmm0, XMMWORD PTR [rdx] # load data[i]
add rdx, 16
cmp rdx, rbx
movaps XMMWORD PTR [rsp+16], xmm0 # spill sum
jne .L14
At least with -O2 the horizontal byte-xor compiles to just a plain integer byte loop without spewing 15 copies of xmm0 onto the stack.
This is just totally braindead code, because we haven't let a reference / pointer to sumVV escape the function, so there are no other threads that could be observing the accumulator in progress. (And even if so, there's no synchronization stopping gcc from just accumulating in a reg and storing the final result). The non-inlined version is still fine.
That massive performance bug is still present all the way up to gcc 4.9.2, with -O2 -fno-tree-vectorize, even when I rename the function from main to something else, so it gets the full benefit of gcc's optimization efforts. (Don't put microbenchmarks inside main, because gcc marks it as "cold" and optimizes less.)
gcc 5.1 makes good code for the inlined version of template<>
__m128 Sum(const __m128* data, const int N). I didn't check with clang.
This extra loop-carried dep chain is almost certainly why the vector version has a smaller speedup with -O2. i.e. it's a compiler bug that's fixed in gcc5.
The scalar version with -O2 is
.L12:
xor bpl, BYTE PTR [rdx] # sumS, MEM[base: D.27594_156, offset: 0B]
add rdx, 1 # ivtmp.135,
cmp rdx, rbx # ivtmp.135, D.27613
jne .L12 #,
so it's basically optimal. Nehalem can only sustain one load per clock, so there's no need to use more accumulators.
The int version is
.L18:
xor ecx, DWORD PTR [rdx] # sum, MEM[base: D.27549_296, offset: 0B]
add rdx, 4 # ivtmp.135,
cmp rbx, rdx # D.27613, ivtmp.135
jne .L18 #,
so again, it's what you'd expect. It should be sustaining one load per clock.
For uarches that can sustain two loads per clock (Intel SnB-family, and AMD), you should be using two accumulators. Compiler-implemented -funroll-loops usually just reduces loop overhead without introducing multiple accumulators. :(
You want the compiler to make code like:
xorps xmm0, xmm0
xorps xmm1, xmm1
.Lunrolled:
pxor xmm0, XMMWORD PTR [rdi]
pxor xmm1, XMMWORD PTR [rdi+16]
pxor xmm0, XMMWORD PTR [rdi+32]
pxor xmm1, XMMWORD PTR [rdi+48]
add rdi, 64
cmp rdi, rax
jb .Lunrolled
pxor xmm0, xmm1
# horizontal xor of xmm0
movhlps xmm1, xmm0
pxor xmm0, xmm1
...
Unrolling by two (pxor / pxor / add / cmp/jne) would make a loop that can issue at one iteration per 1c, but requires four ALU execution ports. Only Haswell and later can keep up with that throughput. (Or AMD Bulldozer-family, because vector and integer instructions don't compete for execution ports, but conversely there are only two integer ALU pipes, so they only max out their instruction throughput with mixed code.)
This unroll by four is 6 fused-domain uops in the loop, so it can easily issue at one per 2c, and SnB/IvB can keep up with three ALU uops per clock.
Note that on Intel Nehalem through Broadwell, pxor (_mm_xor_si128) has better throughput than xorps (_mm_xor_ps), because it can run on more execution ports. If you're using AVX but not AVX2, it can make sense to use 256b _mm256_xor_ps instead of _mm_xor_si128, because _mm256_xor_si256 requires AVX2.
If it's not memory bandwidth, why is it only 12.6x speedup?
Nehalem's loop buffer (aka Loop Stream Decoder or LSD) has a "one clock delay" (according to Agner Fog's microarch pdf), so a loop with N uops will take ceil(N/4.0) + 1 cycles to issue out of the loop buffer if I understand him correctly. He doesn't explicitly say what happens to the last group of uops if there are less than 4, but SnB-family CPUs work this way (divide by 4 and round up). They can't issue uops from the next iteration following the taken branch. I tried to google about nehalem, but couldn't find anything useful.
So the char and int loops are presumably running at one load & xor per 2 clocks (since they're 3 fused-domain uops). Loop unrolling could ~double their throughput up to the point where they saturate the load port. SnB-family CPUs don't have that one clock delay, so they can run tiny loops at one clock per iteration.
Using perf counters or at least microbenchmarks to make sure that your absolute throughput is what you expect is a good idea. With just your relative measurements, you have no indication without this kind of analysis that you're leaving half your performance on the table.
The vector -O3 loop is 5 fused-domain uops, so it should be taking three clock cycles to issue. Doing 16x as much work, but taking 3 cycles per iteration instead of 2 would give us a speedup of 16 * 2/3 = 10.66. We're actually getting somewhat better than that, which I don't understand.
I'm going to stop here, instead of digging out a nehalem laptop and running actual benchmarks, since Nehalem is too old to be interesting to tune for at this level of detail.
Did you maybe compile with -mtune=core2? Or maybe your gcc had a different default tune setting, and didn't split up the compare-and-branch? In that case, the frontend probably wasn't the bottleneck, and throughput was maybe slightly limited by memory bandwidth, or by memory false dependencies:
Core 2 and Nehalem both have a false dependence between memory
addresses with the same set and offset, i.e. with a distance that is a
multiple of 4 kB.
This might cause a short bubble in the pipeline every 4k.
Before I checked on Nehalem's loop buffer and found the extra 1c per loop, I had a theory which I'm now confident is incorrect:
I thought the extra store uop in the loop that bumps it up over 4 uops would essentially cut the speed in half, so you'd see a speedup of ~6. However, maybe there are some execution bottlenecks that make the frontend issue throughput not the bottleneck after all?
Or maybe Nehalem's loop buffer is different from SnB's, and doesn't end an issue group at a predicted-taken branch. This would give a throughput speedup of 16 * 4/5 = 12.8, for the -O3 vector loop, if its 5 fused-domain uops can issue at a consistent 4 per clock. This matches the experimental data of 12.6429 speedup factor very well: slightly less than 12.8 is to be expected because of increased bandwidth requirements (occasional cache miss stalls when the prefetcher falls behind).
(The scalar loops still just run one iteration per clock: issuing more than one iteration per clock just means they bottleneck on one load per clock, and the 1 cycle xor loop-carried dependency.)
This can't be right because xorps in Nehalem can only run on port5, same as a fused compare-and-branch. So there's no way the non-unrolled vector loop could be running at more than one iteration per 2 cycles.
According to Agner Fog's tables, conditional branches have a throughput of one per 2c on Nehalem, further confirming that this is a bogus theory.
SSE2 is optimal when operating on completely parallel data. e.g.
for (int i = 0; i < N; ++i)
    z[i] = _mm_xor_ps(x[i], y[i]);
But in your case, each iteration of the loop depends upon the output of the previous iteration. This is known as a dependency chain. In short, it means that each consecutive xor is going to have to wait for the entire latency of the previous one before it can continue so it lowers the throughput.
jaket has already explained the likely problem: a dependency chain. I'll give it a try:
template<>
__m128 Sum(const __m128* data, const int N)
{
    __m128 sum1 = _mm_set_ps1(0);
    __m128 sum2 = _mm_set_ps1(0);
    for (int i = 0; i < N; i += 2) {
        sum1 = _mm_xor_ps(sum1, data[i + 0]);
        sum2 = _mm_xor_ps(sum2, data[i + 1]);
    }
    return _mm_xor_ps(sum1, sum2);
}
Now there are no dependencies at all between the two lanes. Try expanding this to more lanes (e.g. 4).
You could also try using the integer version of these instructions (using __m128i). I do not understand the difference so this is just a hint.
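Combining both suggestions, a sketch with four accumulators and the integer XOR (the name Sum4 and the lane count are arbitrary choices for illustration; N is assumed to be a multiple of 4, and the data would be reinterpret_cast from the char buffer as in the question):
#include <emmintrin.h>

// Four independent dependency chains, using integer XOR (_mm_xor_si128 / pxor).
__m128i Sum4(const __m128i* data, const int N)
{
    __m128i s0 = _mm_setzero_si128();
    __m128i s1 = _mm_setzero_si128();
    __m128i s2 = _mm_setzero_si128();
    __m128i s3 = _mm_setzero_si128();
    for (int i = 0; i < N; i += 4) {
        s0 = _mm_xor_si128(s0, data[i + 0]);
        s1 = _mm_xor_si128(s1, data[i + 1]);
        s2 = _mm_xor_si128(s2, data[i + 2]);
        s3 = _mm_xor_si128(s3, data[i + 3]);
    }
    return _mm_xor_si128(_mm_xor_si128(s0, s1), _mm_xor_si128(s2, s3));
}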
In fact, the gcc compiler is optimized for SIMD. That explains why the performance decreases significantly when you use -O2. You can re-check with -O1.

How can I optimize conversion from half-precision float16 to single-precision float32?

I'm trying to improve the performance of my function. The profiler points to the code in the inner loop. Can I improve the performance of that code, maybe using SSE intrinsics?
void ConvertImageFrom_R16_FLOAT_To_R32_FLOAT(char* buffer, void* convertedData, DWORD width, DWORD height, UINT rowPitch)
{
    struct SINGLE_FLOAT
    {
        union {
            struct {
                unsigned __int32 R_m : 23;
                unsigned __int32 R_e : 8;
                unsigned __int32 R_s : 1;
            };
            struct {
                float r;
            };
        };
    };
    C_ASSERT(sizeof(SINGLE_FLOAT) == 4); // 4 bytes
    struct HALF_FLOAT
    {
        unsigned __int16 R_m : 10;
        unsigned __int16 R_e : 5;
        unsigned __int16 R_s : 1;
    };
    C_ASSERT(sizeof(HALF_FLOAT) == 2);
    SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
    for (DWORD j = 0; j < height; j++)
    {
        HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
        for (DWORD i = 0; i < width; i++)
        {
            d->R_s = s->R_s;
            d->R_e = s->R_e - 15 + 127;
            d->R_m = s->R_m << (23-10);
            d++;
            s++;
        }
    }
}
Update:
Disassembly
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE Utils.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
; Function compile flags: /Ogtp
; COMDAT ?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z
_TEXT SEGMENT
_buffer$ = 8 ; size = 4
tv83 = 12 ; size = 4
_convertedData$ = 12 ; size = 4
_width$ = 16 ; size = 4
_height$ = 20 ; size = 4
_rowPitch$ = 24 ; size = 4
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z PROC ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT, COMDAT
; 323 : {
push ebp
mov ebp, esp
; 343 : for(DWORD j = 0; j< height; j++)
mov eax, DWORD PTR _height$[ebp]
push esi
mov esi, DWORD PTR _convertedData$[ebp]
test eax, eax
je SHORT $LN4@ConvertIma
; 324 : union SINGLE_FLOAT {
; 325 : struct {
; 326 : unsigned __int32 R_m : 23;
; 327 : unsigned __int32 R_e : 8;
; 328 : unsigned __int32 R_s : 1;
; 329 : };
; 330 : struct {
; 331 : float r;
; 332 : };
; 333 : };
; 334 : C_ASSERT(sizeof(SINGLE_FLOAT) == 4);
; 335 : struct HALF_FLOAT
; 336 : {
; 337 : unsigned __int16 R_m : 10;
; 338 : unsigned __int16 R_e : 5;
; 339 : unsigned __int16 R_s : 1;
; 340 : };
; 341 : C_ASSERT(sizeof(HALF_FLOAT) == 2);
; 342 : SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
push ebx
mov ebx, DWORD PTR _buffer$[ebp]
push edi
mov DWORD PTR tv83[ebp], eax
$LL13@ConvertIma:
; 344 : {
; 345 : HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
; 346 : for(DWORD i = 0; i< width; i++)
mov edi, DWORD PTR _width$[ebp]
mov edx, ebx
test edi, edi
je SHORT $LN5@ConvertIma
npad 1
$LL3@ConvertIma:
; 347 : {
; 348 : d->R_s = s->R_s;
movzx ecx, WORD PTR [edx]
movzx eax, WORD PTR [edx]
shl ecx, 16 ; 00000010H
xor ecx, DWORD PTR [esi]
shl eax, 16 ; 00000010H
and ecx, 2147483647 ; 7fffffffH
xor ecx, eax
mov DWORD PTR [esi], ecx
; 349 : d->R_e = s->R_e - 15 + 127;
movzx eax, WORD PTR [edx]
shr eax, 10 ; 0000000aH
and eax, 31 ; 0000001fH
add eax, 112 ; 00000070H
shl eax, 23 ; 00000017H
xor eax, ecx
and eax, 2139095040 ; 7f800000H
xor eax, ecx
mov DWORD PTR [esi], eax
; 350 : d->R_m = s->R_m << (23-10);
movzx ecx, WORD PTR [edx]
and ecx, 1023 ; 000003ffH
shl ecx, 13 ; 0000000dH
and eax, -8388608 ; ff800000H
or ecx, eax
mov DWORD PTR [esi], ecx
; 351 : d++;
add esi, 4
; 352 : s++;
add edx, 2
dec edi
jne SHORT $LL3@ConvertIma
$LN5@ConvertIma:
; 343 : for(DWORD j = 0; j< height; j++)
add ebx, DWORD PTR _rowPitch$[ebp]
dec DWORD PTR tv83[ebp]
jne SHORT $LL13@ConvertIma
pop edi
pop ebx
$LN4@ConvertIma:
pop esi
; 353 : }
; 354 : }
; 355 : }
pop ebp
ret 0
?ConvertImageFrom_R16_FLOAT_To_R32_FLOAT@@YAXPADPAXKKI@Z ENDP ; ConvertImageFrom_R16_FLOAT_To_R32_FLOAT
_TEXT ENDS
The x86 F16C instruction-set extension adds hardware support for converting single-precision float vectors to/from vectors of half-precision float.
The format is the same IEEE 754 half-precision binary16 that you describe. I didn't check that the endianness is the same as your struct, but that's easy to fix if needed (with a pshufb).
F16C is supported starting from Intel IvyBridge and AMD Piledriver. (And has its own CPUID feature bit, which your code should check for, otherwise fall back to SIMD integer shifts and shuffles).
The intrinsics for VCVTPS2PH are:
__m128i _mm_cvtps_ph ( __m128 m1, const int imm);
__m128i _mm256_cvtps_ph(__m256 m1, const int imm);
The immediate byte is a rounding control. The compiler can use it as a convert-and-store directly to memory (unlike most instructions that can optionally use a memory operand, where it's the source operand that can be memory instead of a register.)
VCVTPH2PS goes the other way, and is just like most other SSE instructions (can be used between registers or as a load).
__m128 _mm_cvtph_ps ( __m128i m1);
__m256 _mm256_cvtph_ps ( __m128i m1)
F16C is so efficient that you might want to consider leaving your image in half-precision format, and converting on the fly every time you need a vector of data from it. This is great for your cache footprint.
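For illustration, a conversion loop using these intrinsics might look like this (a sketch: the function name and signature are made up, F16C support is assumed to have been checked already, width is assumed to be a multiple of 8, and it needs e.g. -mf16c with gcc/clang):
#include <immintrin.h>   // F16C
#include <stdint.h>

void half_to_float_row(const uint16_t* src, float* dst, size_t width)
{
    for (size_t i = 0; i < width; i += 8) {
        __m128i h = _mm_loadu_si128((const __m128i*)&src[i]);  // 8 half-precision values
        __m256  f = _mm256_cvtph_ps(h);                        // vcvtph2ps
        _mm256_storeu_ps(&dst[i], f);
    }
}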
Accessing bitfields in memory can be really tricky, depending on the architecture, of course.
You might achieve better performance if you made a union of a float and a 32-bit integer, and simply performed all decomposition and composition using local variables. That way the generated code could perform the entire operation using only processor registers.
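A sketch of that idea (the helper name is made up; like the bitfield code in the question, it handles only normal, finite values):
#include <stdint.h>
#include <string.h>

// half layout: 1 sign | 5 exponent | 10 mantissa -> float: 1 | 8 | 23
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (((uint32_t)(h >> 10) & 0x1Fu) - 15u + 127u) << 23;
    uint32_t mant = (uint32_t)(h & 0x3FFu) << 13;
    uint32_t bits = sign | exp | mant;
    float f;
    memcpy(&f, &bits, sizeof f);   // assemble in an integer register, then reinterpret
    return f;
}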
The loops are independent of each other, so you could easily parallelize this code, either by using SIMD or OpenMP. A simple version would be splitting the top half and the bottom half of the image across two threads running concurrently.
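A sketch of the OpenMP variant, reusing the SINGLE_FLOAT/HALF_FLOAT structs from the question (it assumes the output rows are densely packed, width floats per row, as the original d++ traversal implies, and that OpenMP is enabled, e.g. with -fopenmp or /openmp):
#pragma omp parallel for
for (int j = 0; j < (int)height; j++)
{
    SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData + (size_t)j * width;
    HALF_FLOAT* s = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
    for (DWORD i = 0; i < width; i++)
    {
        d[i].R_s = s[i].R_s;
        d[i].R_e = s[i].R_e - 15 + 127;
        d[i].R_m = s[i].R_m << (23 - 10);
    }
}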
You're processing the data as a two-dimensional array. If you consider how it's laid out in memory, you may be able to process it as a single-dimensional array, and you can save a little overhead by having one loop instead of nested loops.
I'd also compile to assembly code and make sure the compiler optimization worked and it isn't recalculating (15 + 127) hundreds of times.
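For example, regarding the single-array suggestion: when rowPitch == width * sizeof(HALF_FLOAT), i.e. the rows are packed with no padding, the nested loops collapse into one (a sketch, again reusing the structs from the question):
HALF_FLOAT* s = (HALF_FLOAT*)buffer;
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
for (DWORD k = 0; k < width * height; k++)
{
    d[k].R_s = s[k].R_s;
    d[k].R_e = s[k].R_e - 15 + 127;
    d[k].R_m = s[k].R_m << (23 - 10);
}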
You should be able to reduce this to a single instruction on chips which use the upcoming CVT16 instruction set. According to that Wikipedia article:
The CVT16 instructions allow conversion of floating point vectors between single precision and half precision.
SSE Intrinsics seem to be an excellent idea. Before you go down that road, you should
look at the assembly code generated by the compiler (is there potential for optimization?),
search your compiler documentation for how to generate SSE code automatically,
search your software library's documentation (or wherever the 16-bit float type originated) for a function to bulk-convert this type (a conversion to 64-bit floating point could be helpful too). You are very likely not the first person to encounter this problem!
If all that fails, go and try your luck with some SSE intrinsics. To get some idea, here is some SSE code to convert from 32 to 16 bit floating point. (you want the reverse)
Besides SSE you should also consider multi-threading and offloading the task to the GPU.
Here are some ideas:
Put the constants into const register variables.
Some processors don't like fetching constants from memory; it is awkward and may take many instruction cycles.
Loop Unrolling
Repeat the statements in the loop, and increase the increment.
Processors prefer continuous instructions; jumps and branches anger them.
Data Prefetching (or loading the cache)
Use more variables in the loop, and declare them as volatile so the compiler doesn't optimize them:
SINGLE_FLOAT* d = (SINGLE_FLOAT*)convertedData;
SINGLE_FLOAT* d1 = d + 1;
SINGLE_FLOAT* d2 = d + 2;
SINGLE_FLOAT* d3 = d + 3;
for (DWORD j = 0; j < height; j++)
{
    // Four pointers striding through the same row (assumes width is a multiple of 4).
    HALF_FLOAT* s  = (HALF_FLOAT*)((char*)buffer + rowPitch * j);
    HALF_FLOAT* s1 = s + 1;
    HALF_FLOAT* s2 = s + 2;
    HALF_FLOAT* s3 = s + 3;
    for (DWORD i = 0; i < width; i += 4)
    {
        d->R_s = s->R_s;
        d->R_e = s->R_e - 15 + 127;
        d->R_m = s->R_m << (23 - 10);
        d1->R_s = s1->R_s;
        d1->R_e = s1->R_e - 15 + 127;
        d1->R_m = s1->R_m << (23 - 10);
        d2->R_s = s2->R_s;
        d2->R_e = s2->R_e - 15 + 127;
        d2->R_m = s2->R_m << (23 - 10);
        d3->R_s = s3->R_s;
        d3->R_e = s3->R_e - 15 + 127;
        d3->R_m = s3->R_m << (23 - 10);
        d  += 4;
        d1 += 4;
        d2 += 4;
        d3 += 4;
        s  += 4;
        s1 += 4;
        s2 += 4;
        s3 += 4;
    }
}
I don't know about SSE intrinsics but it would be interesting to see a disassembly of your inner loop. An old-school way (that may not help much but that would be easy to try out) would be to reduce the number of iterations by doing two inner loops: one that does N (say 32) repeats of the processing (loop count of width/N) and then one to finish the remainder (loop count of width%N)... with those divs and modulos calculated outside the first loop to avoid recalculating them. Apologies if that sounds obvious!
The function is only doing a few small things. It is going to be tough to shave much off the time by optimisation, but as somebody already said, parallelisation has promise.
Check how many cache misses you are getting. If the data is paging in and out, you might be able to speed it up by applying more intelligence into the ordering to minimise cache swaps.
Also consider macro-optimisations. Are there any redundancies in the data computation that might be avoided (e.g. caching old results instead of recomputing them when needed)? Do you really need to convert the whole data set or could you just convert the bits you need? I don't know your application so I'm just guessing wildly here, but there might be scope for that kind of optimisation.
My suspicion is that this operation will be already bottlenecked on memory access, and making it more efficient (e.g., using SSE) would not make it execute more quickly. However this is only a suspicion.
Other things to try, assuming x86/x64, might be:
Don't d++ and s++, but use d[i] and s[i] on each iteration. (Then of course bump d after each scanline.) Since the elements of d are 4 bytes and those of s 2, this operation can be folded into the address calculation. (Unfortunately I can't guarantee that this would necessarily make execution more efficient.)
Remove the bitfield operations and do the operations manually. (When extracting, shift first and mask second, to maximize the likelihood that the mask can fit into a small immediate value.)
Unroll the loop, though with a loop as easily-predicted as this one it might not make much difference.
Count along each line from width down to zero. This stops the compiler having to fetch width each time round. Probably more important for x86, because it has so few registers. (If the CPU likes my "d[i] and s[i]" suggestion, you could make width signed, count from width-1 instead, and walk backwards.)
These would all be quicker to try than converting to SSE and would hopefully make it memory-bound, if it isn't already, at which point you can give up.
Finally if the output is in write-combined memory (e.g., it's a texture or vertex buffer or something accessed over AGP, or PCI Express, or whatever it is PCs have these days) then this could well result in poor performance, depending on what code the compiler has generated for the inner loop. So if that is the case you may get better results converting each scanline into a local buffer then using memcpy to copy it to its final destination.