UPDATED - Check Below
Will keep this as short as possible. Happy to add any more details if required.
I have some SSE code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.
If I measure like this
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_sse);
NormaliseSSE( vectors_sse+j);
}
The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).
for( int j = 0; j < NUM_VECTORS; ++j )
{
Timer t(norm_dbl);
NormaliseDBL( vectors_dbl+j);
}
However, timing just the entirety of the loop like this
{
Timer t(norm_sse);
for( int j = 0; j < NUM_VECTORS; ++j ){
NormaliseSSE( vectors_sse+j );
}
}
shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version.
I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.
For example, I know there can be penalties when casting the results to float, but none of that is going on here.
Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?
Thanks for reading :)
More details below:
Both normalise methods are inlined (verified in disassembly)
Running in release
32 bit compilation
Simple Vector struct
__declspec(align(16)) struct FVECTOR{
typedef float REAL;
union{
struct { REAL x, y, z, w; };
__m128 Vec;
};
};
Code to Normalise SSE:
__m128 Vec = _v->Vec;
__m128 sqr = _mm_mul_ps( Vec, Vec );                        // x*x, y*y, z*z, w*w
__m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e );            // swap the 64-bit halves: z*z, w*w, x*x, y*y
__m128 addOne = _mm_add_ps( sqr, yxwz );                    // x*x+z*z, y*y+w*w, ...
__m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 ); // swap within the low pair, broadcast to both halves
__m128 addTwo = _mm_add_ps( addOne, swapPairs );            // x*x+y*y+z*z+w*w in every lane
__m128 invSqrOne = _mm_rsqrt_ps( addTwo );                  // approximate 1/sqrt(length^2)
_v->Vec = _mm_mul_ps( invSqrOne, Vec );
Code to normalise doubles
double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;
Helper struct
struct Timer{
Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
QueryPerformanceCounter( &PStart );
}
~Timer(){
LARGE_INTEGER PEnd;
QueryPerformanceCounter( &PEnd );
Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
}
LARGE_INTEGER& Storage;
LARGE_INTEGER PStart;
};
Update
So thanks to John's comments, I think I've managed to confirm that it is QueryPerformanceCounter that's doing bad things to my SIMD code.
I added a new timer struct that uses RDTSC directly, and it seems to give results consistent with what I would expect. The result is still far slower than timing the entire loop rather than each iteration separately, but I expect that's because getting the RDTSC value involves flushing the instruction pipeline (see http://www.strchr.com/performance_measurements_with_rdtsc for more info).
struct PreciseTimer{
PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
StartVal.QuadPart = GetRDTSC();
}
~PreciseTimer(){
Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
}
unsigned __int64 inline GetRDTSC() {
unsigned int lo, hi;
__asm {
; Flush the pipeline
xor eax, eax
CPUID
; Get RDTSC counter in edx:eax
RDTSC
mov DWORD PTR [hi], edx
mov DWORD PTR [lo], eax
}
return ((unsigned __int64)hi << 32) | lo;
}
LARGE_INTEGER StartVal;
LARGE_INTEGER& Storage;
};
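For what it's worth, the same measurement can be written with compiler intrinsics instead of inline assembly. This is only a sketch of that idea (using MSVC's __cpuid and __rdtsc; GetRDTSCIntrinsic is just a name for this illustration), but it also works in 64-bit builds where __asm blocks aren't available:
#include <intrin.h>   // __cpuid, __rdtsc
// Sketch: serialize with CPUID, then read the time-stamp counter.
unsigned __int64 GetRDTSCIntrinsic() {
    int cpuInfo[4];
    __cpuid( cpuInfo, 0 );   // serializing instruction: drains the pipeline
    return __rdtsc();        // read the TSC as a 64-bit value
}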
When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).
You might try commenting out the actual calls to QPC in your Timer class and see how it performs--this may help you discover if it's the construction and destruction of the Timer objects that is the problem, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.
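For example, a minimal sketch of the second suggestion, reusing the question's own names (NUM_VECTORS, vectors_sse, NormaliseSSE) and accumulating into a separate LARGE_INTEGER so it can be compared against the Timer-based total:
LARGE_INTEGER qpcStart, qpcEnd, norm_sse_raw = {};
for( int j = 0; j < NUM_VECTORS; ++j )
{
    QueryPerformanceCounter( &qpcStart );
    NormaliseSSE( vectors_sse + j );
    QueryPerformanceCounter( &qpcEnd );
    norm_sse_raw.QuadPart += qpcEnd.QuadPart - qpcStart.QuadPart;
}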
QPC is an OS-level timing function; depending on the hardware it may have to transition into the kernel, which is far more expensive and disruptive than an ordinary user-mode function call and will certainly hurt the processor's ability to run at its normal speed. In addition to that, remember that QPC/QPF are abstractions and require their own processing, which likely involves the use of SSE itself.
Related
I needed a way of initializing a scalar value given either a single float, or three floating point values (corresponding to RGB). So I just threw together a very simple struct:
struct Mono {
float value;
Mono(){
this->value = 0;
}
Mono(float value) {
this->value = value;
};
Mono(float red, float green, float blue){
this->value = (red+green+blue)/3;
};
};
// Multiplication operator overloads:
Mono operator*( Mono const& lhs, Mono const& rhs){
return Mono(lhs.value*rhs.value);
};
Mono operator*( float const& lhs, Mono const& rhs){
return Mono(lhs*rhs.value);
};
Mono operator*( Mono const& lhs, float const& rhs){
return Mono(lhs.value*rhs);
};
This worked as expected, but then I wanted to benchmark it to see if this wrapper would impact performance at all, so I wrote the following benchmark test where I simply multiplied a float by the struct 100,000,000 times, and multiplied a float by a float 100,000,000 times:
#include <vector>
#include <chrono>
#include <iostream>
using namespace std::chrono;
int main() {
size_t N = 100000000;
std::vector<float> inputs(N);
std::vector<Mono> outputs_c(N);
std::vector<float> outputs_f(N);
Mono color(3.24);
float color_f = 3.24;
for (size_t i = 0; i < N; i++){
inputs[i] = i;
};
auto start_c = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_c[i] = color*inputs[i];
}
auto stop_c = high_resolution_clock::now();
auto duration_c = duration_cast<microseconds>(stop_c - start_c);
std::cout << "Mono*float duration: " << duration_c.count() << "\n";
auto start_f = high_resolution_clock::now();
for (size_t i = 0; i < N; i++){
outputs_f[i] = color_f*inputs[i];
}
auto stop_f = high_resolution_clock::now();
auto duration_f = duration_cast<microseconds>(stop_f - start_f);
std::cout << "float*float duration: " << duration_f.count() << "\n";
return 0;
}
When I compile it without any optimizations (g++ test.cpp), it prints the following times (in microseconds) very reliably:
Mono*float duration: 841122
float*float duration: 656197
So the Mono*float is clearly slower in that case. But then if I turn on optimizations (g++ test.cpp -O3), it prints the following times (in microseconds) very reliably:
Mono*float duration: 75494
float*float duration: 86176
I'm assuming that something is getting optimized weirdly here and it is NOT actually faster to wrap a float in a struct like this... but I'm struggling to see what is going wrong with my test.
On my system (i7-6700k with GCC 12.2.1), whichever loop I do second runs slower, and the asm for the two loops is identical.
Perhaps because cache is still partly primed from the inputs[i] = i; init loop when the first loop runs. (See https://blog.stuffedcow.net/2013/01/ivb-cache-replacement/ re: Intel's adaptive replacement policy for L3 which might explain some but not all of the entries surviving that big init loop. 100000000 floats is 400 MB per array, and my CPU has 8 MiB of L3 cache.)
So, as expected from the low computational intensity (one vector math instruction per 16 bytes loaded + stored), it's just a cache / memory bandwidth benchmark, since you used one huge array instead of repeated passes over a smaller array. It has nothing to do with whether you have a bare float or a struct { float; }.
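To illustrate what "repeated passes over a smaller array" could look like, here is a rough sketch reusing the question's own names (N, color, high_resolution_clock); N_SMALL and REPEATS are made-up names for this illustration. With the working set in L1d you would be measuring the multiply rather than DRAM bandwidth:
// Sketch: shrink the working set so it stays in cache, and loop over it
// many times so the total amount of work stays comparable.
const size_t N_SMALL = 2048;            // 8 KB in + 8 KB out, comfortably inside L1d
const size_t REPEATS = N / N_SMALL;
std::vector<float> small_in(N_SMALL);
std::vector<Mono>  small_out(N_SMALL);
for (size_t i = 0; i < N_SMALL; i++) small_in[i] = (float)i;

auto start_s = high_resolution_clock::now();
for (size_t r = 0; r < REPEATS; r++)
    for (size_t i = 0; i < N_SMALL; i++)
        small_out[i] = color * small_in[i];
auto stop_s = high_resolution_clock::now();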
As expected, both loops compile to the same asm - https://godbolt.org/z/7eTh4ojYf - doing a movups load, mulps to multiply 4 floats, and a movups unaligned store. For some reason, GCC reloads the vector constant of 3.24 instead of hoisting it out of the loop, so it's doing 2 loads and 1 store per multiply. Cache misses on the big arrays should give plenty of time for out-of-order exec to do those extra loads from the same .rodata address that hit in L1d cache every time.
I tried How can I mitigate the impact of the Intel jcc erratum on gcc? but it didn't make a difference; still about the same performance delta with -Wa,-mbranches-within-32B-boundaries, so as expected it's not a front-end bottleneck; IPC is plenty low. Maybe some quirk of cache.
On my system (Linux 6.1.8 on i7-6700k at 3.9GHz, compiled with GCC 12.2.1 -O3 without -march=native or -ffast-math), your whole program spends nearly half its time in the kernel's page-fault handler (perf stat vs. perf stat --all-user cycle counts). That's not great if you're not trying to benchmark memory allocation and TLB misses.
But that's total time; you do touch the input and output arrays before the loop (std::vector<float> outputs_c(N); allocates and zeros space for N elements, same for your custom struct with a constructor.) There shouldn't be page faults inside your timed regions, only potentially TLB misses. And of course lots of cache misses.
BTW, clang correctly optimizes away all the loops, because none of the results are ever used. benchmark::DoNotOptimize(outputs_c[argc]) might help with that. Or some manual use of asm with dummy memory inputs / outputs to force the compiler to materialize arrays in memory and forget their contents.
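The "asm with dummy memory inputs / outputs" trick usually looks something like the following sketch (GNU-style inline asm for GCC/clang); the empty asm statements cost nothing at run time but stop the compiler from deleting or hoisting the work:
// Sketch: optimization barriers for benchmarks on GCC/clang.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");   // pretend *p is read by unknown code
}
static void clobber() {
    asm volatile("" : : : "memory");          // pretend all memory may be modified
}
// e.g. escape(outputs_c.data()); before the timed loop, clobber(); after it.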
See also Idiomatic way of performance evaluation?
I am trying to modify a vector in place using multiple threads. It is a very simple operation, subtracting 1 from each element, but speed is the highest priority here since both the vector size and the number of times I need to do this operation can be quite large (10k elements, 500 increments). Right now I have loops of the sort:
#include <vector>
#include <algorithm>
using namespace std;
int main() {
vector<int> my_vec(10000);
fill(my_vec.begin(), my_vec.end(), 10);
for (int i = my_vec.size(); i--;) {
my_vec[i] -= 1;
}
}
I am coming back to C/C++ after several years working primarily in R, where splitting this embarrassingly parallel loop across multiple threads is trivial (i.e., each of n threads operates over a portion of the indices, and then the results are concatenated).
What is the best way to do this in C++ so that it a) avoids copying the entire vector, and b) is not ultimately slower than the original loop?
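For reference, here is a minimal sketch of the "each thread takes a slice of the indices" idea with std::thread (no copying; each thread writes a disjoint range). As the answer below explains, for a 10k-element vector the thread start/join overhead will usually dwarf the work itself:
#include <vector>
#include <thread>
#include <algorithm>

// Sketch: each thread decrements a disjoint contiguous slice in place,
// so nothing is copied and no synchronisation is needed inside the loop.
void decrement_parallel(std::vector<int>& v,
                        unsigned num_threads = std::thread::hardware_concurrency()) {
    if (num_threads == 0) num_threads = 1;
    const size_t chunk = (v.size() + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const size_t begin = t * chunk;
        const size_t end = std::min(v.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&v, begin, end] {
            for (size_t i = begin; i < end; ++i) v[i] -= 1;
        });
    }
    for (auto& w : workers) w.join();
}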
As I expected, this is purely a memory I/O problem for the given size of your vector.
I took your initial example and built an AVX2-enabled version and it does not fare much better than a simple loop - it might be that the simple loop gets optimized with AVX too btw.
The reverse loop:
for (int i = my_vec.size(); i--;) {
my_vec[i] -= 1;
}
The forward loop:
for ( int i = 0; i<my_vec.size(); ++i ) {
my_vec[i] -= 1;
}
The AVX2 unaligned loop:
__m256i* ptr = (__m256i*)my_vec.data();
constexpr size_t per_block = sizeof(__m256i)/sizeof(int);
size_t num_blocks = my_vec.size() / per_block;
size_t remaining = my_vec.size() % per_block;
__m256i ones = _mm256_set1_epi32( 1 );
for ( size_t j=0; j<num_blocks; ++j, ++ptr ) {
__m256i val = _mm256_lddqu_si256( ptr );
val = _mm256_sub_epi32( val, ones );
_mm256_storeu_si256( ptr, val );
}
The AVX2 aligned loop:
__m256i* ptr = (__m256i*)my_vec.data();
constexpr size_t per_block = sizeof(__m256i)/sizeof(int);
size_t num_blocks = my_vec.size() / per_block;
size_t remaining = my_vec.size() % per_block;
__m256i ones = _mm256_set1_epi32( 1 );
for ( size_t j=0; j<num_blocks; ++j, ++ptr ) {
__m256i val = _mm256_load_si256( ptr );
val = _mm256_sub_epi32( val, ones );
_mm256_store_si256( ptr, val );
}
The tests all run in pretty much the same elapsed time range:
Test:reverse Elapsed:0.295494 ticks/int
Test:forward Elapsed:0.313866 ticks/int
Test:avx2forward Elapsed:0.367432 ticks/int
Test:avx2aligned Elapsed:0.298912 ticks/int
The entire test is here: https://godbolt.org/z/MWjrorncs
As far as why this is a memory-limited problem: your 10k-element array is only about 40 KB, so it lives in the L1/L2 caches of any CPU today, and the loop is limited by load/store throughput rather than arithmetic. Since the L1 cache is private to each core, adding threads will only make things worse, because it increases contention for memory between threads. You can see, for example, that the main solution for high-throughput FizzBuzz on Code Golf is single-threaded for this exact reason.
I'm trying to improve the performance of a copy operation via SSE and AVX:
#include <immintrin.h>
#include <algorithm>
#include <chrono>
#include <iostream>
const int sz = 1024;
float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
float a=0;
std::generate(mas, mas+sz, [&](){return ++a;});
const int nn = 1000;//Number of iteration in tester loops
std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
//std::copy testing
start1 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
end1 = std::chrono::system_clock::now();
float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=4, _tar+=4)
{
__m128 buffer = _mm_load_ps(_mas);
_mm_store_ps(_tar, buffer);
}
}
end2 = std::chrono::system_clock::now();
float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
//AVX-copy testing
start3 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m256 buffer = _mm256_load_ps(_mas);
_mm256_store_ps(_tar, buffer);
}
}
end3 = std::chrono::system_clock::now();
float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
_mm_free(mas);
_mm_free(tar);
It works. However, as the number of iterations in the tester loops (nn) increases, the performance gain of the SIMD copy decreases:
nn=10: SSE-gain=3, AVX-gain=6;
nn=100: SSE-gain=0.75, AVX-gain=1.5;
nn=1000: SSE-gain=0.55, AVX-gain=1.1;
Can anybody explain the reason for this performance decrease, and is it advisable to manually vectorize the copy operation?
The problem is that your test does a poor job of mitigating some factors in the hardware that make benchmarking hard. To test this, I've made my own test case. Something like this:
for blah blah:
sleep(500ms)
std::copy
sse
avx
output:
SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy
So in this case, AVX is a bunch faster than std::copy. What happens when I change to test case to..
for blah blah:
sleep(500ms)
sse
avx
std::copy
Notice that absolutely nothing changed, except the order of the tests.
SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy
Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that are run later have an advantage. This question has 3 answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.
This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information about their setup. It isn't just the code that can go wrong. Power-saving features and weird drivers can completely mess up your benchmark. Once I measured a factor-of-7 difference in performance by toggling a switch in the BIOS that fewer than 1% of notebooks offer.
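One cheap mitigation, sketched with the question's own variables (mas, tar, sz): run an untimed warm-up pass before any measurement so every variant sees the CPU at the same clock frequency and the caches in the same state, and ideally repeat each timed section several times in rotated order and keep the median.
// Sketch: untimed warm-up before the timed sections.
for (int warm = 0; warm < 10; ++warm)
    std::copy(mas, mas + sz, tar);   // touches the data and spins the core up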
This is a very interesting question, but I believe none of the answers so far is correct, because the question itself is misleading.
The title should be changed to "How does one reach the theoretical memory I/O bandwidth?"
No matter what instruction set is used, the CPU is so much faster than RAM that a pure block memory copy is 100% I/O bound. And this explains why there is little difference between SSE and AVX performance.
For small buffers hot in L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell where 256b loads/stores really do use a 256b data path to L1D cache instead of splitting into two 128b operations.
Ironically, the ancient x86 string instruction rep movsb/movsq performs much better than SSE and AVX for memory copies!
The article here explains how to saturate memory bandwidth really well and it has rich references to explore further as well.
See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and the no-RFO stores done by rep movsb/stosb) vs. regular stores, and how single-core memory bandwidth is often limited by max concurrency / latency, not by the memory controller itself.
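To make the NT-store idea concrete, here is a sketch of what such a copy loop can look like (my own illustration, not code from the linked answers); _mm_stream_ps bypasses the cache and avoids the read-for-ownership traffic that ordinary stores generate on large copies:
#include <immintrin.h>
#include <cstddef>
// Sketch: copy floats with non-temporal stores. Assumes 16-byte-aligned
// pointers and n a multiple of 4; a real version needs a scalar tail.
void copy_nt(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i);
        _mm_stream_ps(dst + i, v);   // NT store: no RFO, doesn't pollute the cache
    }
    _mm_sfence();                    // order the NT stores before later stores
}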
Writing fast SSE is not as simple as using SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair and your time is dominated by stalls caused by using the output of one low-throughput operation (the load) in the very next instruction (the store).
You can test this idea by manually unrolling one notch:
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m128 buffer1 = _mm_load_ps(_mas);
__m128 buffer2 = _mm_load_ps(_mas+4);
_mm_store_ps(_tar, buffer1);
_mm_store_ps(_tar+4, buffer2);
}
}
Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops the right tool to use is the Intel Architecture Code Analyzer (IACA). It's a static analysis tool which can tell you things like "you have pipeline stalls".
I think this is because the measurement is not accurate enough for such short operations.
When measuring performance on an Intel CPU:
Disable "Turbo Boost" and "SpeedStep". You can do this in the system BIOS.
Change the process/thread priority to High or Realtime. This will keep your thread running.
Set the process CPU affinity mask to a single core. CPU masking together with higher priority will minimize context switching.
Use the __rdtsc() intrinsic function. The Intel Core series returns the CPU's internal clock counter with __rdtsc(); you will get 3,400,000,000 counts/second from a 3.4 GHz CPU. Note that rdtsc on its own is not a serializing instruction, so fence the reads (or use __rdtscp()) if you need to keep scheduled operations from drifting across the measurement; a fenced sketch follows below.
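A fenced read could look like this sketch (MSVC intrinsics; __rdtscp() followed by _mm_lfence() is a common alternative):
#include <intrin.h>   // __rdtsc, _mm_lfence
// Sketch: fence on both sides so surrounding instructions cannot drift
// into or out of the measured region.
unsigned __int64 ReadTSCFenced() {
    _mm_lfence();
    unsigned __int64 t = __rdtsc();
    _mm_lfence();
    return t;
}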
This is my test-bed startup code for testing SSE/AVX codes.
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
#include <intrin.h>
int GetMSB(DWORD_PTR dwordPtr)
{
if(dwordPtr)
{
int result = 1;
#if defined(_WIN64)
if(dwordPtr & 0xFFFFFFFF00000000) { result += 32; dwordPtr &= 0xFFFFFFFF00000000; }
if(dwordPtr & 0xFFFF0000FFFF0000) { result += 16; dwordPtr &= 0xFFFF0000FFFF0000; }
if(dwordPtr & 0xFF00FF00FF00FF00) { result += 8; dwordPtr &= 0xFF00FF00FF00FF00; }
if(dwordPtr & 0xF0F0F0F0F0F0F0F0) { result += 4; dwordPtr &= 0xF0F0F0F0F0F0F0F0; }
if(dwordPtr & 0xCCCCCCCCCCCCCCCC) { result += 2; dwordPtr &= 0xCCCCCCCCCCCCCCCC; }
if(dwordPtr & 0xAAAAAAAAAAAAAAAA) { result += 1; }
#else
if(dwordPtr & 0xFFFF0000) { result += 16; dwordPtr &= 0xFFFF0000; }
if(dwordPtr & 0xFF00FF00) { result += 8; dwordPtr &= 0xFF00FF00; }
if(dwordPtr & 0xF0F0F0F0) { result += 4; dwordPtr &= 0xF0F0F0F0; }
if(dwordPtr & 0xCCCCCCCC) { result += 2; dwordPtr &= 0xCCCCCCCC; }
if(dwordPtr & 0xAAAAAAAA) { result += 1; }
#endif
return result;
}
else
{
return 0;
}
}
int _tmain(int argc, _TCHAR* argv[])
{
// Set Core Affinity
DWORD_PTR processMask, systemMask;
GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)1 << (GetMSB(processMask) - 1) );
// Set Process Priority. you can use REALTIME_PRIORITY_CLASS.
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
DWORD64 start, end;
start = __rdtsc();
// your code here.
end = __rdtsc();
printf("%I64d\n", end - start);
return 0;
}
I think that your main problem/bottleneck is your _mm_malloc.
I highly suggest using std::vector as your main data structure if you are concerned about locality in C++.
Intrinsics are not exactly a "library"; they are more like built-in functions provided by your compiler, and you should be familiar with your compiler's internals/docs before using these functions.
Also note that the fact that AVX is newer than SSE doesn't make AVX faster. Whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs SSE" argument; for example, see this answer.
Try with a POD int array[] or an std::vector.
I'm trying to use Intel intrinsics to beat the compiler optimized code. Sometimes I can do it, other times I can't.
I guess the question is, why can I sometimes beat the compiler, but other times not? I got a time of 0.006 seconds for operator+= below using Intel intrinsics (vs 0.009 when using bare C++), but a time of 0.07 s for operator+ using intrinsics, while bare C++ was only 0.03 s.
#include <windows.h>
#include <stdio.h>
#include <intrin.h>
class Timer
{
LARGE_INTEGER startTime ;
double fFreq ;
public:
Timer() {
LARGE_INTEGER freq ;
QueryPerformanceFrequency( &freq ) ;
fFreq = (double)freq.QuadPart ;
reset();
}
void reset() { QueryPerformanceCounter( &startTime ) ; }
double getTime() {
LARGE_INTEGER endTime ;
QueryPerformanceCounter( &endTime ) ;
return ( endTime.QuadPart - startTime.QuadPart ) / fFreq ; // as double
}
} ;
inline float randFloat(){
return (float)rand()/RAND_MAX ;
}
// Use my optimized code,
#define OPTIMIZED_PLUS_EQUALS
#define OPTIMIZED_PLUS
union Vector
{
struct { float x,y,z,w ; } ;
__m128 reg ;
Vector():x(0.f),y(0.f),z(0.f),w(0.f) {}
Vector( float ix, float iy, float iz, float iw ):x(ix),y(iy),z(iz),w(iw) {}
//Vector( __m128 val ):x(val.m128_f32[0]),y(val.m128_f32[1]),z(val.m128_f32[2]),w(val.m128_f32[3]) {}
Vector( __m128 val ):reg( val ) {} // 2x speed, above
inline Vector& operator+=( const Vector& o ) {
#ifdef OPTIMIZED_PLUS_EQUALS
// YES! I beat it! Using this intrinsic is faster than just C++.
reg = _mm_add_ps( reg, o.reg ) ;
#else
x+=o.x, y+=o.y, z+=o.z, w+=o.w ;
#endif
return *this ;
}
inline Vector operator+( const Vector& o )
{
#ifdef OPTIMIZED_PLUS
// This is slower
return Vector( _mm_add_ps( reg, o.reg ) ) ;
#else
return Vector( x+o.x, y+o.y, z+o.z, w+o.w ) ;
#endif
}
static Vector random(){
return Vector( randFloat(), randFloat(), randFloat(), randFloat() ) ;
}
void print() {
printf( "%.2f %.2f %.2f\n", x,y,z,w ) ;
}
} ;
int runs = 8000000 ;
Vector sum ;
// OPTIMIZED_PLUS_EQUALS (intrinsics) runs FASTER 0.006 intrinsics, vs 0.009 (std C++)
void test1(){
for( int i = 0 ; i < runs ; i++ )
sum += Vector(1.f,0.25f,0.5f,0.5f) ;//Vector::random() ;
}
// OPTIMIZED* runs SLOWER (0.03 for reg.C++, vs 0.07 for intrinsics)
void test2(){
float j = 27.f ;
for( int i = 0 ; i < runs ; i++ )
{
sum += Vector( j*i, i, i/j, i ) + Vector( i, 2*i*j, 3*i*j*j, 4*i ) ;
}
}
int main()
{
Timer timer ;
//test1() ;
test2() ;
printf( "Time: %f\n", timer.getTime() ) ;
sum.print() ;
}
Edit
Why am I doing this? The VS 2012 profiler is telling me my vector arithmetic operations could use some tuning.
As noted by Mysticial, the union hack is the most likely culprit in test2. It forces the data to go through L1 cache, which, while fast, has some latency that is much more than your gain of 2 cycles that the vector code offers (see below).
But also consider that the CPU can run multiple instructions out of order and in parallel (it is a superscalar CPU). For example, Sandy Bridge has six execution ports, p0-p5; floating-point multiplication/division runs on p0, and floating-point addition and integer multiplication run on p1. Also, division takes 3-4 times more cycles than multiplication/addition, and is not pipelined (i.e. the execution unit cannot start another instruction while a division is being performed). So in test2, while the vector code is waiting for the expensive division and some multiplications to finish on port p0, the scalar code can be performing the extra 2 add instructions on p1, which most likely obliterates any advantage of the vector instructions.
test1 is different: the constant vector can be kept in an xmm register, and in that case the loop contains only the add instruction. But the code is not 3x faster as might be expected. The reason is pipelining: each add instruction has a latency of 3 cycles, but the CPU can start a new one every cycle when they are independent of each other, which is the case for the scalar per-component additions. Therefore the vector code executes one add instruction per loop iteration with 3-cycle latency, while the scalar code executes 3 add instructions, taking only 5 cycles (1 started per cycle, and the 3rd has latency 3: 2 + 3 = 5).
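One way to test the union theory from the first paragraph, as a sketch (SSEVector is a made-up name, not the question's class): keep the data purely in a __m128 member so chained + operations stay in XMM registers and never bounce through the scalar fields. Note that in test2 the constructors still have to do the scalar j*i, i/j, ... math, so the division bottleneck described above remains either way.
// Sketch: a value type with only a __m128 member, so operator+ never
// touches individual float fields and no store/reload is forced.
struct SSEVector {
    __m128 reg;
    explicit SSEVector( __m128 r ) : reg( r ) {}
    SSEVector( float x, float y, float z, float w ) : reg( _mm_set_ps( w, z, y, x ) ) {}
    SSEVector operator+( const SSEVector& o ) const { return SSEVector( _mm_add_ps( reg, o.reg ) ); }
    SSEVector& operator+=( const SSEVector& o ) { reg = _mm_add_ps( reg, o.reg ); return *this; }
};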
A very good resource on CPU architectures and optimization is http://www.agner.org/optimize/
Just calculating the sum of two arrays, with a slight modification in the code:
#include <stdio.h>
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
OR
#include <stdio.h>
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
}
for(int i=0; i<10000; i++)
{
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
Which code will be faster?
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).
The size of the data in your arrays (ints in your example) would affect the result, as it has an impact on the use of the processor cache. Often you will find that example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
LARGE_INTEGER
start,
middle,
end;
const int
count = 1000000;
int
*a = new int [count],
*b = new int [count],
*c = new int [count],
*d = new int [count],
suma = 0,
sumb = 0,
sumc = 0,
sumd = 0;
QueryPerformanceCounter (&start);
for (int i = 0 ; i < count ; ++i)
{
suma += a [i];
sumb += b [i];
}
QueryPerformanceCounter (&middle);
for (int i = 0 ; i < count ; ++i)
{
sumc += c [i];
}
for (int i = 0 ; i < count ; ++i)
{
sumd += d [i];
}
QueryPerformanceCounter (&end);
cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
cout << "Done." << endl << suma << sumb << sumc << sumd;
return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache optimizations the second one should be faster.
Caches are optimized to bring in and keep chunks of data, so with the first access you'll pull a big chunk of the first array into the cache. In the first code, accessing the second array may force some of the first array's data out of the cache, requiring more memory accesses.
In practice both approaches will take more or less the same time, with the first being a little better given the size of actual caches and the likelihood that no data is evicted from the cache at all.
Note: This sounds a lot like homework. In real life, for those sizes the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays or especially smaller cache sizes would have a significant impact on performance depending on the order.
The first one will be faster. The compiler will not need to run the loop twice. Although it is not much work, some cycles are lost on incrementing the loop variable and evaluating the loop condition.
For me (GCC -O3), measuring shows that the second version is faster by some 25%, which can be explained by a more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the numeric header, which is the simple way to implement the second version, and it was in turn a tiny amount faster than the second version (probably due to a more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you run the loop over the 10000 elements only once.
The C++ Standard says nothing about it; it is implementation dependent. It looks like you are trying to do premature optimization. It shouldn't bother you until it is a bottleneck in your program. If it is, you should use a profiler to find out which one is faster on a given platform.
Until then, I'd prefer the first variant because it looks more readable (or better, std::accumulate).
If the arrays are large enough that both cannot be kept in cache at the same time (as in example 1) but a single one can (as in example 2), then the code of the first example will be slower than the code of the second example.
Otherwise the code of the first example will be faster than the second one.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster if your CPU allows it and the arrays are aligned: use SSE2 packed-integer instructions to process 4 ints at a time (see the sketch below).
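As a sketch of that idea (my own illustration): SSE2's packed 32-bit add is enough to sum 4 ints per iteration, assuming 16-byte-aligned data and a length that is a multiple of 4.
#include <emmintrin.h>
// Sketch: accumulate 4 partial sums in one XMM register, reduce at the end.
int sum_sse2(const int* a, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i*)(a + i)));
    int lanes[4];
    _mm_storeu_si128((__m128i*)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}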
If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?