Why are my threaded Thrift calls slow? - c++

My Thrift definition is something like this:
list<i32> getValues()
I implemented it in C++.
Server.cpp has the following piece of code:
.....
std::vector<int32_t> store;

TransferServiceHandler() {
    for (int i = 0; i < 100000000; i++)
        store.push_back(i);
}

void getValues(std::vector<int32_t>& _return) {
    // Your implementation goes here
    _return = store;
}
.....
Client.cpp has a simple loop in which it calls getValues():
for (int k = 0; k < 10; k++) {
    clock_gettime(CLOCK_REALTIME, &ds_spec);
    int64_t dstarted = ds_spec.tv_sec * 1000 + (ds_spec.tv_nsec / 1.0e6);

    std::vector<int32_t> values;
    client.getValues(values);

    clock_gettime(CLOCK_REALTIME, &de_spec);
    int64_t dended = de_spec.tv_sec * 1000 + (de_spec.tv_nsec / 1.0e6);

    std::cout << "Values size: " << values.size() << " in " << (dended - dstarted) << " ms\n";
}
Connections are initialized and closed outside the loop.
Usually a few hundred thousand entries are returned by this call.
When there is no data (when the lists are empty) I can see the call completing in 1-2 ms; when I vary the amount of data, there is an unpredictable delay in the transfer. Both the client and the server run on the same machine (equipped with 10 Gb/s Ethernet, 8 cores and 30 GB of memory).
How do you normally debug a situation like this? I don't think the issue is the network, since the machine has a 10 Gb/s link and the data is hardly a few MB.
I ran a benchmark with various data sizes, and the delay isn't stable from call to call.

I am not sure I fully understand the interaction between client and server; however, your getValues method could be improved by using move semantics (C++11). You could move the store vector rather than making a copy (memory operations are quite expensive), as follows:
void getValues(std::vector<int32_t>& _return) {
    // Your implementation goes here
    _return = std::move(store);
}
Note that this works fine only as long as the content of store (which has now been moved into _return) does not need to persist beyond the call to getValues.
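To make that caveat concrete, here is a small usage sketch (assuming the handler class from the question, with the move version above):

TransferServiceHandler h;           // constructor fills `store`
std::vector<int32_t> first, second;
h.getValues(first);                 // buffer stolen: `first` now holds the data
h.getValues(second);                // `second` comes back (typically) empty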

It looks to me like you are losing resolution here:
clock_gettime(CLOCK_REALTIME, &ds_spec);
int64_t dstarted = ds_spec.tv_sec * 1000 + (ds_spec.tv_nsec / 1.0e6);
That goes against the reason for using clock_gettime() to begin with.
Here is a link on how to profile code using clock_gettime(); hopefully it will solve your problem.
I'm pointing at resolution because it is a common cause of unexpected profiling results.
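For comparison, a minimal sketch of a timing loop that keeps full resolution: subtract the two timespecs in integer nanoseconds and convert to milliseconds only for display. CLOCK_MONOTONIC is used here because it is immune to wall-clock adjustments.

#include <cstdint>
#include <cstdio>
#include <ctime>

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    // ... the call being timed goes here ...
    clock_gettime(CLOCK_MONOTONIC, &end);
    int64_t ns = (end.tv_sec - start.tv_sec) * 1000000000LL
               + (end.tv_nsec - start.tv_nsec);
    std::printf("elapsed: %lld ns (%.3f ms)\n", (long long)ns, ns / 1.0e6);
    return 0;
}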

I saw a significant improvement in performance after transferring the data as binary rather than as a list.
In the Thrift definition file, I changed the list<i32> to binary.
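For illustration, a hedged sketch of the corresponding server-side handler (Thrift's generated C++ represents binary as std::string; the method name getValuesBinary is an assumption). Note that this ships raw native-endian bytes, so client and server must agree on the int32 layout:

// Hypothetical handler for `binary getValuesBinary()`: pack the int32 vector
// into one opaque byte blob instead of a Thrift list.
void getValuesBinary(std::string& _return) {
    _return.assign(reinterpret_cast<const char*>(store.data()),
                   store.size() * sizeof(int32_t));
}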
The new benchmark on the same amount of data confirmed the improvement (chart omitted).

Related

Using std::async slower than non-async method to populate a vector

I am experimenting with std::async to populate a vector. The idea behind it is to use multi-threading to save time. However, running some benchmark tests I find that my non-async method is faster!
#include <algorithm>
#include <vector>
#include <future>

std::vector<int> Generate(int i)
{
    std::vector<int> v;
    for (int j = i; j < i + 10; ++j)
    {
        v.push_back(j);
    }
    return v;
}
Async:
std::vector<std::future<std::vector<int>>> futures;
for (int i = 0; i < 200; i += 10)
{
    futures.push_back(std::async(
        [](int i) { return Generate(i); }, i));
}

std::vector<int> res;
for (auto&& f : futures)
{
    auto vec = f.get();
    res.insert(std::end(res), std::begin(vec), std::end(vec));
}
Non-async:
std::vector<int> res;
for (int i = 0; i < 200; i += 10)
{
    auto vec = Generate(i);
    res.insert(std::end(res), std::begin(vec), std::end(vec));
}
My benchmark test shows that the async method is 71 times slower than non-async. What am I doing wrong?
std::async has two modes of operation:
std::launch::async
std::launch::deferred
In this case, you've called std::async without specifying either one, which means it's allowed to choose either one. std::launch::deferred basically means do the work on the calling thread. So std::async returns a future, and with std::launch::deferred, the action you've requested won't be carried out until you call .get on that future. It can be kind of handy under a few circumstances, but it's probably not what you want here.
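For example, the policy can be passed explicitly (a minimal sketch reusing the question's Generate):

// Forcing true asynchrony: without this policy argument the implementation
// may pick std::launch::deferred and run Generate only when .get() is called.
auto fut = std::async(std::launch::async, Generate, 0);
std::vector<int> vec = fut.get();   // waits for the thread that actually ran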
Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you've requested. It then has to create a future, and use some sort of signalling from the thread to the future to let you know when the computation you've requested is done.
All of that adds a fair amount of overhead--anywhere from microseconds to milliseconds or so, depending on the OS, CPU, etc.
So, for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take tens of milliseconds at the very least (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
So, filling an array asynchronously probably only makes sense if the array is quite a lot larger than you're dealing with here.
For filling memory, you'll quickly run into another problem though: most CPUs are enough faster than main memory that if all you're doing is writing to memory, there's a pretty good chance that a single thread will already saturate the path to memory, so even at best doing the job asynchronously will only gain a little, and may still pretty easily cause a slow-down.
The ideal case for asynchronous operation would be something like one thread that's heavily memory bound, but another that (for example) reads a little bit of data, and does a lot of computation on that small amount of data. In this case, the computation thread will mostly operate on its data in the cache, so it won't get in the way of the memory thread doing its thing.
There are multiple factors that are causing the Multithreaded code to perform (much) slower than the Singlethreaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 200 integers, and each logical thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a software thread) is only generating 10 integers. The cost of spooling up a thread every 10 integers far outweighs the benefit of generating those integers in parallel.
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers each, but you'll probably instead have a different issue:
All your code is bottlenecked by an inherently serial process
Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.
So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
Fixing the code
Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is a poor fit for multithreading.
But here's what I came up with:
#include <algorithm>
#include <vector>
#include <future>
#include <chrono>
#include <iostream>
#include <iomanip>
#include <thread>

//#1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;

std::vector<int> Generate(int i) {
    std::vector<int> v;
    for (int j = i; j < i + BLOCK_SIZE; ++j) {
        v.push_back(j);
    }
    return v;
}

void asynchronous_attempt() {
    std::vector<std::future<void>> futures;
    //#2: Preallocated Vector
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE)
    {
        futures.push_back(std::async(
            [it](int i) {
                auto vec = Generate(i);
                //#3: Copying done multithreaded
                std::copy(vec.begin(), vec.end(), it + i);
            }, i));
    }
    for (auto&& f : futures) {
        f.get();
    }
}

void serial_attempt() {
    //#4: Changes here to show fair comparison
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE) {
        auto vec = Generate(i);
        it = std::copy(vec.begin(), vec.end(), it);
    }
}

int main() {
    using clock = std::chrono::steady_clock;
    std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
    auto begin = clock::now();
    asynchronous_attempt();
    auto end = clock::now();
    std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
    begin = clock::now();
    serial_attempt();
    end = clock::now();
    std::cout << "Duration of Serial Attempt:        " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
}
This resulted in the following output:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt: 361149213ns
Duration of Serial Attempt: 364785676ns
Given that this was run on an online compiler (here), I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates that the two methods are now roughly on par.
Below are the changes I made, which are ID'd in the code:
We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work, instead of getting bogged down on OS-level housekeeping
The vector has its size pre-allocated. No more frequent resizing.
Now that the space has been preallocated, we can multithread the copying instead of doing it in serial later.
We have to change the serial code so it also preallocates + copies so that it's a fair comparison.
Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
First of all, you are not forcing std::async to work asynchronously (you would need to specify the std::launch::async policy to do so). Second, it'd be overkill to asynchronously create a std::vector of 10 ints; it's just not worth it. Remember: using more threads does not mean you will see a performance benefit! Creating a thread (or even using a thread pool) introduces overhead which, in this case, seems to dwarf the benefits of running the tasks asynchronously.
Thanks @NathanOliver ;>

c++ pass value by reference vs by copy of POD

I've heard that passing a variable by reference is not always faster than passing by value. Passing by reference is faster for big variables, but for small ones the question is trickier.
Passing by value costs time to create the copy, but reading a local variable should be fast.
Passing by reference wastes no time creating a copy, but there is a pointer lookup before the actual data is reached.
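To make the two conventions concrete, here is a minimal sketch (the function names are invented for illustration):

struct Data { double x[16]; };

// By value: the caller copies the whole struct (128 bytes here) into the
// parameter, but the callee then reads a local object with no indirection.
double byValue(Data d) { return d.x[0]; }

// By reference: nothing is copied, but each access goes through the
// reference, which at -O0 typically costs an extra pointer dereference.
double byReference(const Data& d) { return d.x[0]; }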
I am aware that this detail is not so important for optimization, but it was interesting for me to measure it (I know that -O0 is passé for optimization, but this code is too simple; after optimization I was not sure what I was measuring).
g++ -std=c++14 -O0 -g3 -DSIZE_OF_DATA_ARRAY=16 main.cpp && ./a.out
g++ (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406
SIZE_OF_DATA_ARRAY | copy time [s] | reference time [s]
                 4 | 0.04          | 0.045
                 8 | 0.04          | 0.46
                16 | 0.04          | 0.05
                17 | 0.07          | 0.05
                24 | 0.07          | 0.05
My questions:
Why is the execution time for copying roughly constant as the struct size varies?
Why is there a threshold between 16 and 17 for copying?
My guess: it is connected with the cache.
My code:
#include <iostream>
#include <vector>
#include <limits>
#include <chrono>
#include <iomanip>
#include <algorithm>
#include <ctime>

struct Data {
    double x[SIZE_OF_DATA_ARRAY];
};

double workOnData(Data &data) {
    for (auto i = 0; i < 10; ++i) {
        data.x[0] -= 0.5 * (data.x[0] - 1);
    }
    return data.x[0];
}

void runTestSuite() {
    auto queries = 1000000;
    Data data;
    for (auto i = 0; i < queries; ++i) {
        data.x[0] = i;
        auto val = workOnData(data);
        if (val == -357)
            data.x[0] = 1;
    }
}

int main() {
    std::cout << "sizeof(Data) = " << sizeof(Data) << "\n";
    size_t numberOfTests = 99;
    std::vector<std::chrono::duration<double>> timeMeasurements{numberOfTests};
    std::chrono::time_point<std::chrono::system_clock> startTime, endTime;
    for (size_t i = 0; i < numberOfTests; ++i) {
        startTime = std::chrono::system_clock::now();
        runTestSuite();
        endTime = std::chrono::system_clock::now();
        timeMeasurements[i] = endTime - startTime;
    }
    std::sort(timeMeasurements.begin(), timeMeasurements.end());

    std::chrono::system_clock::time_point now = std::chrono::system_clock::now();
    std::time_t now_c = std::chrono::system_clock::to_time_t(now);
    std::cout << std::put_time(std::localtime(&now_c), "%F %T")
              << ": median time = " << timeMeasurements[numberOfTests * 0.5].count() << "s\n";
    return 0;
}
Why is the execution time for copying roughly constant as the struct size varies?
The best way to understand this is to look at the assembly the compiler emitted. What gets optimized here depends on the compiler's optimization settings and whether you are building a release or debug configuration.
It also depends on the processor. For example, some processors have specialized instructions for copying large blocks of memory. Other processors may copy the data in parallel chunks, depending on the size of the structure. Some platforms may even have hardware assistance, such as a DMA controller.
Also, sometimes plain unrolled copying may be faster than using special instructions or hardware assistance (it depends on the data size).
Why is there a threshold between 16 and 17 for copying?
The threshold likely comes from crossing an alignment boundary.
Let's take a 32-bit processor. It likes to access (fetch) 4 bytes at a time. Accessing 24 bytes requires 6 fetches. Access 16 bytes takes 4 fetches.
However, accessing 17, 18, or 19 bytes requires 5 fetches. It may fetch another 4 bytes to get those remainder bytes.
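That arithmetic, spelled out as a quick sketch:

// Fetches needed on a bus that moves busWidth bytes per access:
// fetches = ceil(bytes / busWidth).
constexpr int fetches(int bytes, int busWidth = 4) {
    return (bytes + busWidth - 1) / busWidth;
}
static_assert(fetches(16) == 4, "16 bytes: four 4-byte fetches");
static_assert(fetches(17) == 5, "17 bytes: a fifth fetch for the tail");
static_assert(fetches(24) == 6, "24 bytes: six fetches");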
Another scenario is the implementation of the copy function. Some copy functions use 32-bit copies for the first run of 4-byte quantities, then switch to byte copies for the remainder. Others may use byte copies for everything, depending on the size of the data. Many possibilities.
The truth for your system lies in inspecting the assembly of the code, or of the function used to copy the data.
Cache Hits & Misses
Your performance metrics may be skewed by processor cache operations. If the processor already has your data in cache, the loop will run a lot faster. Usually there is a performance hit on the first access of the data, and more time is wasted reloading the data cache if your data is too big to fit in it.
Instruction Cache Issues
A lot of processors have deep pipelines and instruction caches. When they encounter a branch (such as the end of a for loop), the processor may have to flush the pipeline and fetch instructions from another location in the program. This takes time. You can demonstrate it by unrolling the loop in different size chunks and measuring the performance.
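As a hedged sketch of that experiment, reusing Data and workOnData from the question, here is a 4x unrolled variant you could time against the original:

// 4x manually unrolled version of runTestSuite's inner loop
// (assumes the trip count is a multiple of 4). Timing this against the
// rolled loop exposes the cost of the loop's branch structure itself.
void runTestSuiteUnrolled() {
    const int queries = 1000000;
    Data data;
    for (int i = 0; i < queries; i += 4) {
        data.x[0] = i;     workOnData(data);
        data.x[0] = i + 1; workOnData(data);
        data.x[0] = i + 2; workOnData(data);
        data.x[0] = i + 3; workOnData(data);
    }
}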

c++ stack efficient for multicore application

I am trying to code a multicore Markov chain in C++, and while I am trying to take advantage of the many CPUs (up to 24) to run a different chain on each one, I have a problem picking the right container to gather the results of the numerical evaluations from each CPU. What I am trying to measure is basically the average value of an array of boolean variables. I have tried coding a wrapper around a std::vector object, looking like this:
#include <fstream>
#include <vector>
using namespace std;

struct densityStack {
    vector<int> density; // will store the sum of boolean variables
    int card;            // will store the number of elements we summed over, for normalizing at the end

    densityStack(int size) { // constructor taking as only parameter the size of the array, usually size = 30
        density = vector<int>(size, 0);
        card = 0;
    }

    void push_back(vector<int>& toBeAdded) { // method summing a new array (of measurements) into our stack
        for (auto valStack = density.begin(), newVal = toBeAdded.begin(); valStack != density.end(); ++valStack, ++newVal)
            *valStack += *newVal;
        card++;
    }

    void savef(const char* fname) { // method writing the result to a file
        ofstream out(fname);
        out.precision(10);
        out << card << "\n"; // saving the cardinal on the first line
        for (auto val = density.begin(); val != density.end(); ++val)
            out << (double)*val / card << "\n";
        out.close();
    }
};
Then, in my code I use a single densityStack object, and every time a CPU core has data (which can be 100 times per second) it calls push_back to send the data back to the densityStack.
My issue is that this seems to be slower than my first raw approach, where each core stored each array of measurements in a file and I then used a Python script to average and clean up (I was unhappy with that because it stored too much information and put too much useless stress on the hard drives).
Do you see where I could be losing a lot of performance? I mean, is there an obvious source of overhead? Because to me, copying back the vector even at frequencies of 1000 Hz should not be too much.
How are you synchronizing your shared densityStack instance?
From the limited info here my guess is that the CPUs are blocked waiting to write data every time they have a tiny chunk of data. If that is the issue, a simple technique to improve performance would be to reduce the number of writes. Keep a buffer of data for each CPU and write to the densityStack less frequently.
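A minimal sketch of that buffering idea (all names and the batch size are assumptions): each worker accumulates into a private batch and takes the shared lock only once per flush.

#include <cstddef>
#include <mutex>
#include <vector>

// Shared accumulator, as in the question, plus a mutex guarding it.
struct DensityStack {
    std::vector<int> density;
    int card = 0;
    explicit DensityStack(int size) : density(size, 0) {}
    void pushBack(const std::vector<int>& v) {
        for (std::size_t i = 0; i < density.size(); ++i) density[i] += v[i];
        ++card;
    }
};

// Called by each worker: one lock acquisition per batch, not per sample.
void flushBatch(DensityStack& shared, std::mutex& m,
                std::vector<std::vector<int>>& batch) {
    std::lock_guard<std::mutex> lock(m);
    for (const auto& v : batch) shared.pushBack(v);
    batch.clear();
}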

A code works slowly because of compiler optimizations

I have 3 separate global functions and I want to test their speed. I'm using this code:
// case 1
{
    chrono::duration<double, milli> totalTime;
    for (uint32_t i{ 0 }; i < REPEATS; ++i)
    {
        auto start = chrono::steady_clock::now();
        func1(); // "normal" C++ code
        auto end = chrono::steady_clock::now();
        auto diff = end - start;
        cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
    }
}
// case 2
{
    chrono::duration<double, milli> totalTime;
    for (uint32_t i{ 0 }; i < REPEATS; ++i)
    {
        auto start = chrono::steady_clock::now();
        func2(); // multithreaded C++ code
        auto end = chrono::steady_clock::now();
        auto diff = end - start;
        cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
    }
}
// case 3
{
    chrono::duration<double, milli> totalTime;
    for (uint32_t i{ 0 }; i < REPEATS; ++i)
    {
        auto start = chrono::steady_clock::now();
        func3(); // SIMD C++ code
        auto end = chrono::steady_clock::now();
        auto diff = end - start;
        cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
    }
}
func1(), func2() and func3() are global functions that don't change the state of the program (I don't have any global variables).
The output depends on which cases I run. If I run cases 1 and 2, I get 100 ms and 10 ms respectively. If I run cases 1 and 3, I get 100 ms and 130 ms. If I run cases 1, 2 and 3, I get 130 ms, 10 ms and 120 ms: the first case became about 30% slower and the third one became faster! If I run the cases separately, I get 100 ms, 10 ms and 130 ms. I tried turning optimization off; the code became (surprise, surprise) much slower, but at least the results were the same regardless of case order. So I came to the conclusion that the compiler is doing something special. Is that true?
I'm using Win7 and VS 2013.
A few things can happen:
Your test gets preempted by the kernel. Not much you can do to prevent that. Run your tests multiple times to make sure the results are consistent.
The compiler can inline your functions and optimize on the inlined code.
There is interaction between the functions. The simplest thing I would guess is memory allocation (i.e. func1() is optimized to request a memory block sufficient for the entire program, or one of the function brings some memory blocks into cache).
So, suggestions:
First, run each function once before the benchmark to get rid of some of the memory artifacts.
Run the benchmark a few times and see how the values fluctuate, to eliminate some of the OS artifacts.
Shuffle the order of the functions on each run, or take the order as a parameter, to make sure you eliminate ordering artifacts.
Don't do cout inside your loop, because that interacts with the OS and can mess up your cache or even get your process preempted. Write the results to a vector and output everything at the end.
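A sketch of that suggestion (the helper name is invented): record the timings into a preallocated vector and do all the output after the measured region.

#include <chrono>
#include <iostream>
#include <vector>

// Times `fn` `repeats` times without touching the OS inside the loop;
// the cout calls happen only after all measurements are taken.
void benchmark(void (*fn)(), unsigned repeats) {
    std::vector<double> ms;
    ms.reserve(repeats);   // no reallocation during timing
    for (unsigned i = 0; i < repeats; ++i) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto end = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(end - start).count());
    }
    for (double t : ms) std::cout << t << " ms\n";
}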
There can be other things affecting your performance (disk, networking, other processes, memory and CPU load), so take the values with a grain of salt.
Compiler optimizations vary. There are numerous things a compiler can do; one optimization (at least in GNU GCC) is aggressive loop unrolling. This can create faster code, but you must be aware that it can also cause cache misses, effectively slowing your code down. That is, if we take just the compiler optimizations into consideration.
Now you have three different cases that give different output when run separately. This might be an alignment issue: if your code is properly aligned it will be faster, and if it isn't, the additional padding might slow it down. I've seen a similar thing happen in C#, but I can't find that thread now.
And the last thing that could happen is that you run too few tests to be sure: 10k tests is a decent set, and then you can start comparing speed output. One-off tests can be affected by the OS, so keep that in mind.
Oh, and because Microsoft is brilliant at writing compilers, there are bugs in certain versions. I don't think the world of Microsoft's C++ compiler (there are many hacks and workarounds, and it's not as up to date as other popular compilers), but that's simply my opinion. So another option is that the compiler is simply malfunctioning. Also, see this and this beautiful typedef.

How do you calculate memory access time?

I create a large boolean 2D array (5000x5000, for a total of 25 million elements at ~23 MB). Then I loop through and set every element to a random true or false. Then I loop through and read every single element. All 25 million elements are read in ~100 ms.
23 MB is too big to fit in the CPU's cache, and I think my program is too simple to benefit from any compiler optimization, so am I right to conclude that the program is reading 25 million elements from RAM in ~100 ms?
#include "stdafx.h"
#include <iostream>
#include <chrono>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
bool **locs;
locs = new bool*[5000];
for(int i = 0; i < 5000; i++)
locs[i] = new bool[5000];
for(int i = 0; i < 5000; i++)
for(int i2 = 0; i2 < 5000; i2++)
locs[i][i2] = rand() % 2 == 0 ? true : false;
int *idx = new int [5000*5000];
for(int i = 0; i < 5000*5000; i++)
*(idx + i) = rand() % 4999;
bool val;
int memAccesses = 0;
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 5000*5000; i++) {
val = locs[*(idx + i)][*(idx + ++i)];
memAccesses += 2;
}
auto finish = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(finish-start).count() << " ns\n";
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(finish-start).count() << " ms\n";
cout << "TOTAL MEMORY ACCESSES: " << memAccesses << endl;
cout << "The size of the array in memory is " << ((sizeof(bool)*5000*5000)/1048576) << "MB";
int exit; cin >> exit;
return 0;
}
/*
OUTPUT IS:
137013700 ns
137 ms
TOTAL MEMORY ACCESSES: 25000000
The size of the array in memory is 23MB
*/
As other answers have mentioned, the "speed" you are seeing (even if the CPU is executing your code and it has not been stripped out by the compiler) is about 250 MB/s, which is a very low number for modern systems.
However, your methodology seems flawed to me (admittedly, I'm not an expert in benchmarking). Here are the problems I see:
For any benchmark like this, even in its simplest form, you need to distinguish random access from sequential access. Memory is not a random-access device (despite its name) and performs very poorly when accessed randomly. Your code seems to access memory randomly, so you should add that qualifier to your conclusion: that you are "reading 25 million elements from random locations in RAM in ~100 ms."
Another aspect of this sort of benchmark is the concept of latency vs. throughput. Again, if you want to conclude anything from your numbers and timings, you need to be aware of exactly what you are measuring.
You are counting memory accesses incorrectly. Depending on the exact code your compiler generates, this line:
val = locs[*(idx + i)][*(idx + ++i)];
might realistically access the memory system anywhere from 4 to 9 times.
At best, if i, idx, locs and val are all either in registers or access to them is eliminated, then you need to read *(idx + i), read locs[*(idx + i)] (remember that locs is an array of pointers to arrays, not a 2D array,) read *(idx + ++i), and finally read locs[*(idx + i)][*(idx + ++i)]. A few of these might be cached, but it's unlikely, with the cache-thrashing that's going on.
At worst, in addition to the above, you need two accesses for ++i (read, then write back,) one for idx, one for locs and one for val. I don't know, you might even need another read for the single i and/or two reads for the two idx occurrences (due to pointer aliasing and whatnot.)
You need to be aware that memory is never accessed in single bytes or even words. Memory is always read and written in units of cache-line. And cache line size can be different from system to system, although the most common size these days is 64 bytes. So, each time you read a memory location that is not in the cache, you are loading 64-bytes (or more) from RAM. If the memory locations you are reading are at the cache line boundary (some of the bytes in one cache line and some in the next) then you are loading two cache lines from RAM. Given a sane compiler and properly aligned variables in memory, this doesn't happen very often, but it might. So you have to at least multiply your calculated bandwidth used by the size of your cache line.
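As a back-of-the-envelope illustration of that multiplication (a hypothetical worst case in which every access misses the cache):

// 25,000,000 accesses, each pulling a full 64-byte line, would move
// 25e6 * 64 = 1.6e9 bytes; over the measured 137 ms that is roughly
// 11.7 GB/s of raw memory traffic, far more than the naive
// "bytes read / time" figure suggests.
constexpr long long kAccesses  = 25000000LL;
constexpr long long kLineBytes = 64LL;
constexpr long long kTraffic   = kAccesses * kLineBytes; // 1.6e9 bytes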
However, if you are accessing a memory location that is already in cache, then you don't load anything from RAM. You need to consider this in your conclusions too.
You also need to consider cache line eviction, your cache's associativity, number of levels, the fact that some cache levels are shared between instructions and data and some aren't, some are shared between cores and some aren't, and a lot of other things when evaluating the performance of caches and memory.
The DRAM chips also have a lot of weird and complex behaviors and characteristics. Some memory locations are faster to read after some others (due to the arrangements of rows and columns,) some accesses might get delayed a long time (at CPU speeds) because of the refresh cycle, other devices might be using the RAM or the bus that RAM is on, etc., etc. I'm far from familiar with the operations of modern memory chips, and even I know that it's a complete mess.
You have to consider the effects of compiler optimization on your code. That means looking at your code after the compiler is done with it, in assembly form. You need to look at the generated assembly to know what your code is actually doing: whether, and which of, your memory accesses were optimized out.
All in all, I don't think that you can conclude much useful information from your program. Sorry about that, but memory is very complex!
Portions (blocks) of memory are stored in the processor cache, which lets the processor access those items quickly. And that speed is perfectly reasonable for modern memory: even the slowest DDR3 RAM can transfer data at about 6 GB/s.
Cache usage is independent of a program's complexity. Whenever data is read from RAM it goes into the cache, and since the cache has a fixed size, the most recently used blocks are always available there. If you access a memory location next to a previously accessed one, there is a good chance it is already cached, in which case RAM is not touched.
I would suggest reading the CPU cache Wikipedia entry to broaden your knowledge.
BTW, regarding val = locs[*(idx + i)][*(idx + ++i)]; are you certain this is evaluated from left to right? I am not: it is undefined behavior, since the modification and the use of i are unsequenced. I'd suggest moving the ++i below the accessor line.
//EDIT:
Nothing is done with the value read from memory, so it is quite possible that these reads are not executed at all! Check the generated assembly, or add a (void) val; statement, which should force the read to be emitted.
No. The reads won't always go all the way down to the RAM. Blocks of memory get pulled into the cache when a read (or write) is performed. As long as the block from which you are reading is already in the cache, the cache is used. If you request data from a block that is not in the cache, then the RAM is accessed to fetch the block of memory and place it in the cache. Reading from the cache is significantly cheaper than reading from RAM.
EDIT
Again, write operations cause blocks of memory to be pulled into the cache. Because you store the values in your program before reading them, the data you are reading is most likely still in the cache from when you stored it. Therefore, it is likely that your read loop never needs to access RAM.