Why in C++ overwritingis is slower than writing?

Why in C++ overwritingis is slower than writing? - c++

USELESS QUESTION - ASKED TO BE DELETED
I have to run a piece of code that manages a video stream from camera.
I am trying to boost it, and I realized a weird C++ behaviour. (I have to admit I am realizing I do not know C++)
The first piece of code run faster than the seconds, why? It might be possible that the stack is almost full?
Faster version
double* temp = new double[N];
for(int i = 0; i < N; i++){
temp[i] = operation(x[i],y[i]);
res = res + (temp[i]*temp[i])*coeff[i];
}
Slower version1
double temp;
for(int i = 0; i < N; i++){
temp = operation(x[i],y[i]);
res = res + (temp*temp)*coeff[i];
}
Slower version2
for(int i = 0; i < N; i++){
double temp = operation(x[i],y[i]);
res = res + (temp*temp)*coeff[i];
}
EDIT
I realized the compiler was optimizing the product between elemnts of coeff and temp. I beg your pardon for the unuseful question. I will delete this post.

This has obviously nothing to do with "writing vs overwriting".
Assuming your results are indeed correct, I can guess that your "faster" version can be vectorized (i.e. pipelined) by the compiler more efficiently.
The difference in that in this version you allocate a storage space for temp, whereas each iteration uses its own member of the array, hence all the iterations can be executed independently.
Your "slow 1" version creates a (sort of) false dependence on a single temp variable. A primitive compiler might "buy" it, producing a non-pipelined code.
Your "slow 2" version seems to be ok actually, loop iterations are independent.
Why is this still slower?
I can guess that this is due to the use of the same CPU registers. That is, arithmetic on double is usually implemented via FPU stack registers, this is the interference between loop iterations.

Related

My C++ program gets slower as computation proceeds

I wrote a neural network program in C++ to test something, and I found that my program gets slower as computation proceeds. Since this kind of phenomenon is somewhat I've never seen before, I checked possible causes. Memory used by program did not change when it got slower. RAM and CPU status were fine when I ran the program.
Fortunately, the previous version of the program did not have such problem. So I finally found that a single statement that makes the program slow. The program does not get slower when I use this statement:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
However, the program gets slower and slower as soon as I replace above statement with:
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
To solve this problem, I did my best to find and remove the cause but I failed... The below is the simple code structure. For the case that this is not the problem that is related to local statement, I uploaded my code to google drive. The URL is located at the end of this question.
MLP.h
class MLP
{
private:
...
double lambda;
double ***w;
double ***dw;
neuron **hidden;
...
MLP.cpp
...
for(k = n_depth - 1; k > 0; k--)
{
if(k == n_depth - 1)
...
else
{
...
for(j = 1; n_neuron > j; j++)
{
for(i = 0; n_neuron > i; i++)
{
//dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi;
dw[k][i][j] = hidden[k-1][i].y * hidden[k][j].phi - lambda*w[k][i][j];
}
}
}
}
...
Full source code: https://drive.google.com/open?id=1A8Uw0hNDADp3-3VWAgO4eTtj4sVk_LZh

I'm not sure exactly why it gets slower and slower, but I do see where you can gain some performance.
Two and higher dimensional arrays are still stored in one dimensional
memory. This means (for C/C++ arrays) array[i][j] and array[i][j+1]
are adjacent to each other, whereas array[i][j] and array[i+1][j] may
be arbitrarily far apart.
Accessing data in a more-or-less sequential fashion, as stored in
physical memory, can dramatically speed up your code (sometimes by an
order of magnitude, or more)!
When modern CPUs load data from main memory into processor cache,
they fetch more than a single value. Instead they fetch a block of
memory containing the requested data and adjacent data (a cache line
). This means after array[i][j] is in the CPU cache, array[i][j+1] has
a good chance of already being in cache, whereas array[i+1][j] is
likely to still be in main memory.
Source: https://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf
With your current code, w[k][i][j] will be read, and on the next iteration, w[k][i+1][j] will be read. You should invert i and j so that w is read in sequential order:
for(j = 1; n_neuron > j; ++j)
{
for(i = 0; n_neuron > i; ++i)
{
dw[k][j][i] = hidden[k-1][j].y * hidden[k][i].phi - lambda*w[k][j][i];
}
}
Also note that ++x should be slightly faster than x++, since x++ has to create a temporary containing the old value of x as the expression result. The compiler might optimize it when the value is unused though, but do not count on it.

for loop optimization c ++

this is my first time posting in this site and I hope I get some help/hint. I have an assignment where I need to optimize the performance to the inner for loop but I have no idea how to do that. the code was given in the assignment. I need to count the time(which I was able to do) and improve the performance.
Here is the code:
//header files
#define N_TIMES 200 //This is originally 200000 but changed it to test the program faster
#define ARRAY_SIZE 9973
int main (void) {
int *array = (int*)calloc(ARRAY_SIZE, sizeof(int));
int sum = 0;
int checksum = 0;
int i;
int j;
int x;
// Initialize the array with random values 0 to 13.
srand(time(NULL));
for (j=0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
//printf("Checksum is %d.\n",checksum);
for (i = 0; i < N_TIMES; i++) {
// Do not alter anything above this line.
// Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
printf("Sum is now: %d\n",sum);
}
// Do not alter anything below this line.
// ---------------------------------------------------------------
// Check each iteration.
//
if (sum != checksum) {
printf("Checksum error!\n");
}
sum = 0;
}
return 0;
}
The code takes about 695 seconds to run. Any help on how to optimize it please?
thanks a lot.

The bottleneck in that loop is obviously the IO done by printf; since you are probably writing the output on a console, the output is line buffered, which means that the stdio buffer is flushed at each iteration, which slows down things a lot.
If you have to do all that prints, you can greatly enhance the performance by forcing the stream to do block buffering: before the for add a
setvbuf(stdout, NULL, _IOFBF, 0);
In alternative, if this approach is not considered valid, you can do your own buffering by allocating a big buffer on your own and do your own buffering: write in your buffer using sprintf, periodically emptying it in the output stream with a fwrite.
Also, you can use the poor man's approach to buffering - just use a buffer big enough to write all that stuff (you can calculate how big it must be quite easily) and write in it without worrying about when it's full, when to empty it, ... - just empty it at the end of the loop. edit: see #paxdiablo's answer for an example of this
Applying just the first optimization, what I get with time is
real 0m6.580s
user 0m0.236s
sys 0m2.400s
vs the original
real 0m8.451s
user 0m0.700s
sys 0m3.156s
So, we got down of ~3 seconds in real time, half a second in user time and ~0.7 seconds in system time. But what we can see here is the huge difference between user+sys and real, which means that the time is not spent in doing something inside the process, but waiting.
Thus, the real bottleneck here is not in our process, but in the process of the virtual terminal emulator: sending huge quantities of text to the console is going to be slow no matter what optimizations we do in our program; in other words, your task is not CPU-bound, but IO-bound, so CPU-targeted optimizations won't be of much benefit, since at the end you have to wait anyway for your IO device to do his slow stuff.
The real way to speed up such a program would be much simpler: avoid the slow IO device (the console) and just write the data to file (which, by the way, is block-buffered by default).
matteo#teokubuntu:~/cpp/test$ time ./a.out > test
real 0m0.369s
user 0m0.240s
sys 0m0.068s

Since there's absolutely no variation in that loop based on i (the outer loop), you don't need to calculate it each time.
In addition, the printing of the data should be outside the inner loop so as not to impose I/O costs on the calculation.
With those two things in mind, one possibility is:
static int sumCalculated = 0;
if (!sumCalculated) {
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
sumCalculated = 1;
}
printf("Sum is now: %d\n",sum);
although that has different output to the original which may be an issue (one line at the end rather than one line per addition).
If you do need to print the accumulating sum within the loop, I'd simply buffer that as well (since it doesn't vary each time through the i loop.
The string Sum is now: 999999999999\n (12 digits, it may vary depending on your int size) takes up 25 bytes (excluding terminating NUL). Multiply that by 9973 and you need a buffer of about 250K (including a terminating NUL). So something like this:
static char buff[250000];
static int sumCalculated = 0;
if (!sumCalculated) {
int offset = 0;
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
offset += sprintf (buff[offset], "Sum is now: %d\n",sum);
}
sumCalculated = 1;
}
printf ("%s", buff);
Now that sort of defeats the whole intent of the outer loop as a benchmark tool but loop-invariant removal is a valid approach to optimisation.

Move the printf outside the for loop.
// Do not alter anything above this line.
//Need to optimize this for loop----------------------------------------
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
printf("Sum is now: %d\n",sum);
// Do not alter anything below this line.
// ---------------------------------------------------------------

Getting the I/O out of the loop is a big help.
Depending on the compiler and machine, you might get a tiny increase in speed by using pointers rather than indexing (though on modern hardware, it generally doesn't make a difference).
Loop unrolling might help to increase the ratio of useful work to loop overhead.
You could use vector instructions (e.g., SIMD) to do a bunch of calculation in parallel.
Are you allowed to pack the array? Can you use an array of a smaller type than int (given that all the values are very small)? Making the array physically shorter improves locality.
Loop unrolling might look something like this:
for (int j = 0; j < ARRAY_SIZE; j += 2) {
sum += array[j] + array[j+1];
}
You'd have to figure out what to do if the array isn't an exact multiple of the unrolling size (which is probably why the assignment uses a prime number).
You would have to experiment to see how much unrolling would be the right amount.

Where is the loop-carried dependency here?

Does anyone see anything obvious about the loop code below that I'm not seeing as to why this cannot be auto-vectorized by VS2012's C++ compiler?
All the compiler gives me is info C5002: loop not vectorized due to reason '1200' when I use the /Qvec-report:2 command-line switch.
Reason 1200 is documented in MSDN as:
Loop contains loop-carried data dependences that prevent
vectorization. Different iterations of the loop interfere with each
other such that vectorizing the loop would produce wrong answers, and
the auto-vectorizer cannot prove to itself that there are no such data
dependences.
I know (or I'm pretty sure that) there aren't any loop-carried data dependencies but I'm not sure what's preventing the compiler from realizing this.
These source and dest pointers do not ever overlap nor alias the same memory and I'm trying to provide the compiler with that hint via __restrict.
pitch is always a positive integer value, something like 4096, depending on the screen resolution, since this is a 8bpp->32bpp rendering/conversion function, operating column-by-column.
byte * __restrict source;
DWORD * __restrict dest;
int pitch;
for (int i = 0; i < count; ++i) {
dest[(i*2*pitch)+0] = (source[(i*8)+0]);
dest[(i*2*pitch)+1] = (source[(i*8)+1]);
dest[(i*2*pitch)+2] = (source[(i*8)+2]);
dest[(i*2*pitch)+3] = (source[(i*8)+3]);
dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
The parens around each source[] are remnants of a function call which I have elided here because the loop still won't be auto-vectorized without the function call, in its most simplest form.
EDIT:
I've simplified the loop to its most trivial form that I can:
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
This still produces the same 1200 reason code.
EDIT (2):
This minimal test case with local allocations and identical pointer types still fails to auto-vectorize. I'm simply baffled at this point.
const byte * __restrict source;
byte * __restrict dest;
source = (const byte * __restrict ) new byte[1600];
dest = (byte * __restrict ) new byte[1600];
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}

Let's just say there's more than just a couple of things preventing this loop from vectorizing...
Consider this:
int main(){
byte *source = new byte[1000];
DWORD *dest = new DWORD[1000];
for (int i = 0; i < 200; ++i) {
dest[(i*2*4096)+0] = (source[(i*8)+0]);
}
for (int i = 0; i < 200; ++i) {
dest[i*2*4096] = source[i*8];
}
for (int i = 0; i < 200; ++i) {
dest[i*8192] = source[i*8];
}
for (int i = 0; i < 200; ++i) {
dest[i] = source[i];
}
}
Compiler Output:
main.cpp(10) : info C5002: loop not vectorized due to reason '1200'
main.cpp(13) : info C5002: loop not vectorized due to reason '1200'
main.cpp(16) : info C5002: loop not vectorized due to reason '1203'
main.cpp(19) : info C5002: loop not vectorized due to reason '1101'
Let's break this down:
The first two loops are the same. So they give the original reason 1200 which is the loop-carried dependency.
The 3rd loop is the same as the 2nd loop. Yet the compiler gives a different reason 1203:
Loop body includes non-contiguous accesses into an array
Okay... Why a different reason? I dunno. But this time, the reason is correct.
The 4th loop gives 1101:
Loop contains a non-vectorizable conversion operation (may be implicit)
So VC++ doesn't isn't smart enough to issue the SSE4.1 pmovzxbd instruction.
That's a pretty niche case, I wouldn't have expected any modern compiler to be able to do this. And if it could, you'd need to specify SSE4.1.
So the only thing that's out-of-the ordinary is why the initial loop reports a loop-carried dependency.Well, that's a tough call... I would go so far to say that the compiler just isn't issuing the correct reason. (When it really should be non-contiguous access.)
Getting back to the point, I wouldn't expect MSVC or any compiler to be able to vectorize your original loop. Your original loop has accesses grouped in chunks of 4 - which makes it contiguous enough to vectorize. But it's a long-shot to expect the compiler to be able to recognize that.
So if it matters, I suggest manually vectorizing this loop. The intrinsic that you will need is _mm_cvtepu8_epi32().
Your original loop:
for (int i = 0; i < count; ++i) {
dest[(i*2*pitch)+0] = (source[(i*8)+0]);
dest[(i*2*pitch)+1] = (source[(i*8)+1]);
dest[(i*2*pitch)+2] = (source[(i*8)+2]);
dest[(i*2*pitch)+3] = (source[(i*8)+3]);
dest[((i*2+1)*pitch)+0] = (source[(i*8)+4]);
dest[((i*2+1)*pitch)+1] = (source[(i*8)+5]);
dest[((i*2+1)*pitch)+2] = (source[(i*8)+6]);
dest[((i*2+1)*pitch)+3] = (source[(i*8)+7]);
}
vectorizes as follows:
for (int i = 0; i < count; ++i) {
__m128i s0 = _mm_loadl_epi64((__m128i*)(source + i*8));
__m128i s1 = _mm_unpackhi_epi64(s0,s0);
*(__m128i*)(dest + (i*2 + 0)*pitch) = _mm_cvtepu8_epi32(s0);
*(__m128i*)(dest + (i*2 + 1)*pitch) = _mm_cvtepu8_epi32(s1);
}
Disclaimer: This is untested and ignores alignment.

From the MSDN documentation, a case in which an error 1203 would be reported
void code_1203(int *A)
{
// Code 1203 is emitted when non-vectorizable memory references
// are present in the loop body. Vectorization of some non-contiguous
// memory access is supported - for example, the gather/scatter pattern.
for (int i=0; i<1000; ++i)
{
A[i] += A[0] + 1; // constant memory access not vectorized
A[i] += A[i*2+2] + 2; // non-contiguous memory access not vectorized
}
}
It really could be the computations at the indexes that mess with the auto-vectorizer. Funny the error code shown is not 1203, though.
MSDN Parallelizer and Vectorizer Messages

Optimize indexed array summation

I have the following C++ code:
const int N = 1000000
int id[N]; //Value can range from 0 to 9
float value[N];
// load id and value from an external source...
int size[10] = { 0 };
float sum[10] = { 0 };
for (int i = 0; i < N; ++i)
{
++size[id[i]];
sum[id[i]] += value[i];
}
How should I optimize the loop?
I considered using SSE to add every 4 floats to a sum and then after N iterations, the sum is just the sum of the 4 floats in the xmm register but this doesn't work when the source is indexed like this and needs to write out to 10 different arrays.

This kind of loop is very hard to optimize using SIMD instructions. Not only isn't there an easy way in most SIMD instruction sets to do this kind of indexed read ("gather") or write ("scatter"), even if there was, this particular loop still has the problem that you might have two values that map to the same id in one SIMD register, e.g. when
id[0] == 0
id[1] == 1
id[2] == 2
id[3] == 0
in this case, the obvious approach (pseudocode here)
x = gather(size, id[i]);
y = gather(sum, id[i]);
x += 1; // componentwise
y += value[i];
scatter(x, size, id[i]);
scatter(y, sum, id[i]);
won't work either!
You can get by if there's a really small number of possible cases (e.g. assume that sum and size only had 3 elements each) by just doing brute-force compares, but that doesn't really scale.
One way to get this somewhat faster without using SIMD is by breaking up the dependencies between instructions a bit using unrolling:
int size[10] = { 0 }, size2[10] = { 0 };
int sum[10] = { 0 }, sum2[10] = { 0 };
for (int i = 0; i < N/2; i++) {
int id0 = id[i*2+0], id1 = id[i*2+1];
++size[id0];
++size2[id1];
sum[id0] += value[i*2+0];
sum2[id1] += value[i*2+1];
}
// if N was odd, process last element
if (N & 1) {
++size[id[N]];
sum[id[N]] += value[N];
}
// add partial sums together
for (int i = 0; i < 10; i++) {
size[i] += size2[i];
sum[i] += sum2[i];
}
Whether this helps or not depends on the target CPU though.

Well, you are calling id[i] twice in your loop. You could store it in a variable, or a register int if you wanted to.
register int index;
for(int i = 0; i < N; ++i)
{
index = id[i];
++size[index];
sum[index] += value[i];
}
The MSDN docs state this about register:
The register keyword specifies that
the variable is to be stored in a
machine register.. Microsoft Specific
The compiler does not accept user
requests for register variables;
instead, it makes its own register
choices when global
register-allocation optimization (/Oe
option) is on. However, all other
semantics associated with the register
keyword are honored.

Something you can do is to compile it with the -S flag (or equivalent if you aren't using gcc) and compare the various assembly outputs using -O, -O2, and -O3 flags. One common way to optimize a loop is to do some degree of unrolling, for (a very simple, naive) example:
int end = N/2;
int index = 0;
for (int i = 0; i < end; ++i)
{
index = 2 * i;
++size[id[index]];
sum[id[index]] += value[index];
index++;
++size[id[index]];
sum[id[index]] += value[index];
}
which will cut the number of cmp instructions in half. However, any half-decent optimizing compiler will do this for you.

Are you sure it will make much difference? The likelihood is that the loading of "id from an external source" will take significantly longer than adding up the values.
Do not optimise until you KNOW where the bottleneck is.
Edit in answer to the comment: You misunderstand me. If it takes 10 seconds to load the ids from a hard disk then the fractions of a second spent on processing the list are immaterial in the grander scheme of things. Lets say it takes 10 seconds to load and 1 second to process:
You optimise the processing loop so it takes 0 seconds (almost impossible but its to illustrate a point) then it is STILL taking 10 seconds. 11 Seconds really isn't that ba a performance hit and you would be better off focusing your optimisation time on the actual data load as this is far more likely to be the slow part.
In fact it can be quite optimal to do double buffered data loads. ie you load buffer 0, then you start the load of buffer 1. While buffer 1 is loading you process buffer 0. when finished start the load of the next buffer while processing buffer 1 and so on. this way you can completely amortise the cost of procesing.
Further edit: In fact your best optimisation would probably come from loading things into a set of buckets that eliminate the "id[i]" part of te calculation. You could then simply offload to 3 threads where each uses SSE adds. This way you could have them all going simultaneously and, provided you have at least a triple core machine, process the whole data in a 10th of the time. Organising data for optimal processing will always allow for the best optimisation, IMO.

Depending on your target machine and compiler, see if you have the _mm_prefetch intrinsic and give it a shot. Back in the Pentium D days, pre-fetching data using the asm instruction for that intrinsic was a real speed win as long as you were pre-fetching a few loop iterations before you needed the data.
See here (Page 95 in the PDF) for more info from Intel.

This computation is trivially parallelizable; just add
#pragma omp parallel_for reduction(+:size,+:sum) schedule(static)
immediately above the loop if you have OpenMP support (-fopenmp in GCC.) However, I would not expect much speedup on a typical multicore desktop machine; you're doing so little computation per item fetched that you're almost certainly going to be constrained by memory bandwidth.
If you need to perform the summation several times for a given id mapping (i.e. the value[] array changes more often than id[]), you can halve your memory bandwidth requirements by pre-sorting the value[] elements into id order and eliminating the per-element fetch from id[]:
for (i = 0, j = 0, k = 0; j < 10; sum[j] += tmp, j++)
for (k += size[j], tmp = 0; i < k; i++)
tmp += value[i];

Which one will be faster

Just calculating sum of two arrays with slight modification in code
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
OR
int main()
{
int a[10000]={0}; //initialize something
int b[10000]={0}; //initialize something
int sumA=0, sumB=0;
for(int i=0; i<10000; i++)
{
sumA += a[i];
}
for(int i=0; i<10000; i++)
{
sumB += b[i];
}
printf("%d %d",sumA,sumB);
}
Which code will be faster.

There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (cpu, memory bandwidth etc).
The size of the data in your array (int's in your example) would affect the result, as this would have an impact into the use of the processor cache. Often, you will find example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 will access memory in a more efficient way).

Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>
using namespace std;
int main ()
{
LARGE_INTEGER
start,
middle,
end;
const int
count = 1000000;
int
*a = new int [count],
*b = new int [count],
*c = new int [count],
*d = new int [count],
suma = 0,
sumb = 0,
sumc = 0,
sumd = 0;
QueryPerformanceCounter (&start);
for (int i = 0 ; i < count ; ++i)
{
suma += a [i];
sumb += b [i];
}
QueryPerformanceCounter (&middle);
for (int i = 0 ; i < count ; ++i)
{
sumc += c [i];
}
for (int i = 0 ; i < count ; ++i)
{
sumd += d [i];
}
QueryPerformanceCounter (&end);
cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
cout << "Done." << endl << suma << sumb << sumc << sumd;
return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.

In theory, due to cache optimizations the second one should be faster.
Caches are optimized to bring and keep chunks of data so that for the first access you'll get a big chunk of the first array into cache. In the first code, it may happen that when you access the second array you might have to take out some of the data of the first array, therefore requiring more accesses.
In practice both approach will take more or less the same time, being the first a little better given the size of actual caches and the likehood of no data at all being taken out of the cache.
Note: This sounds a lot like homework. In real life for those sizes first option will be slightly faster, but this only applies to this concrete example, nested loops, bigger arrays or specially smaller cache sizes would have a significant impact in performance depending on the order.

The first one will be faster. The compiler will not need to repeat the loop twice. Although not much work, bu some cycles are lost on incrementing the cycle variable and performing the check condition.

For me (GCC -O3) measuring shows that the second version is faster by some 25%, which can be explained with more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the numeric header which is the simple way to implement the second version and was in turn a tiny amount faster than the second version (probably due to more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);

The first one will be faster because you loop from 1 to 10000 only one time.

C++ Standard says nothing about it, it is implementation dependent. It is looks like you are trying to do premature optimization. It is shouldn't bother you until it is not a bottleneck in your program. If it so, you should use some profiler to find out which one will be faster on certain platform.
Until that, I'd prefer first variant because it looks more readable (or better std::accumulate).

If the data type size is enough large not to cache both variables (as example 1), but single variable (example 2), then the code of first example will be slower than the code of second example.
Otherwise code of first example will be faster than the second one.

The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster if your CPU allows it and the arrays are aligned: use SSE3 instructions to process 4 int at a time.

If you meant a[i] instead of a[10000] (and for b, respectively) and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js