OpenMP first kernel much slower than the second kernel

I have a huge 98306 by 98306 2D array initialized. I created a kernel function that counts the total number of elements below a certain threshold.
#pragma omp parallel for reduction(+:num_below_threshold)
for (row) {
    for (col) {
        index = get_corresponding_index(row, col);
        if (array[index] < threshold)
            num_below_threshold++;
    }
}
For benchmarking purposes, I measured the execution time of the kernel with the number of threads set to 1. The first time the kernel executed, it took around 11 seconds. The next call to the kernel, on the same array and again with one thread, took only around 3 seconds. I thought it might be a cache-related problem, but it doesn't seem to be. What are the possible reasons for this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
array[i] = rand() % 10;
}
The same kernel is applied to this array twice, and the second execution is much faster than the first. I thought of lazy allocation on Linux, but that shouldn't be a problem because of the initialization loop. Any explanations will be helpful. Thanks!

Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636 which is way larger than the maximum value a signed 32 bit integer can store (which is 2,147,483,647). Therefore, the upper limit of your for initialization loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior so strictly speaking, anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop goes over all the elements of the array, and since for about 89% of them it is the first access, the OS does the actual memory allocation there and then, which takes a significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
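The arithmetic above can be checked with a small sketch. It assumes a 64-bit platform (so std::size_t can hold the element count) and emulates the 32-bit wraparound with unsigned arithmetic, because signed overflow is undefined behavior. The names elementCount, wrappedCount and initArray are hypothetical and not part of the question's code.

```cpp
#include <cstddef>
#include <cstdint>

// Side length from the question.
constexpr std::uint64_t kSide = 98306;

// Element count computed in 64 bits: no overflow.
constexpr std::uint64_t elementCount() { return kSide * kSide; }

// What a 32-bit computation keeps after wraparound. Unsigned arithmetic is
// used because signed overflow is undefined behavior; this only emulates the
// value commonly observed on two's-complement machines.
constexpr std::uint32_t wrappedCount() {
    return static_cast<std::uint32_t>(elementCount());
}

// Initialization loop with a std::size_t index, so every element is touched.
inline void initArray(float* array, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        array[i] = static_cast<float>(i % 10);
    }
}
```

With a loop like initArray, called with the full 64-bit element count, every page is touched before the first timed kernel runs, so both kernel executions start from the same state.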
So that's what I believe happened. That said, many other parameters can enter into play here, such as:
Swapping: the array you try to allocate takes about 36GB of memory. If your machine doesn't have that much memory available, your code might swap, which can make a mess of any performance measurement.
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences.
Compiler optimization: you didn't mention which compiler you used or which optimization level you requested. Depending on that, you'd be amazed at how much your code can be shortened. For example, the compiler could remove the second loop entirely, since it does the same thing as the first and its result will be the same. Many other unexpected transformations of this kind can render your benchmark meaningless.

Related

Reducing memory footprint of c++ program utilising large vectors

In scaling up the problem size I'm handing to a self-coded program, I started to bump into Linux's OOM killer. Neither Valgrind (when run on the CPU) nor cuda-memcheck (when run on the GPU) reports any memory leaks. The memory usage keeps expanding while iterating through the inner loop, even though I explicitly clear the vectors holding the biggest chunk of data at the end of this loop. How can I ensure this memory hogging will disappear?
Checks for memory leaks were performed, all the memory leaks are fixed. Despite this, Out of Memory errors keep killing the program (via the OOM Killer). Manual monitoring of memory consumption shows an increase in memory utilisation, even after explicitly clearing the vectors containing the data.
Key to know is having three nested loops, one outer containing the sub-problems at hand. The middle loop loops over the Monte Carlo trials, with an inner loop running some sequential process required inside the trial. Pseudo-code looks as follows:
std::vector<object*> sub_problems;
sub_problems.push_back(retrieved_subproblem_from_database);
for(int sub_problem_index = 0; sub_problem_index < sub_problems.size(); ++sub_problem_index){
    std::vector< std::vector<float> > mc_results(100000, std::vector<float>(5, 0.0));
    for(int mc_trial = 0; mc_trial < 100000; ++mc_trial){
        for(int sequential_process_index = 0; sequential_process_index < 5; ++sequential_process_index){
            mc_results[mc_trial][sequential_process_index] = specific_result;
        }
    }
    sub_problems[sub_problem_index]->storeResultsInObject(mc_results);
    // Do some other things
    sub_problems[sub_problem_index]->deleteMCResults();
}
deleteMCResults looks as follows:
bool deleteMCResults() {
    for (int i = 0; i < asset_values.size(); ++i){
        object_mc_results[i].clear();
        object_mc_results[i].shrink_to_fit();
    }
    object_mc_results.clear();
    object_mc_results.shrink_to_fit();
    return true;
}
How can I ensure that memory consumption depends solely on the middle and inner loops instead of the outer loop? The second, third, fourth and later iterations could theoretically use exactly the same memory space/addresses as the first iteration.

Perhaps I'm reading your pseudocode too literally, but it looks like you have two mc_results variables, one declared inside the for loop and one that deleteMCResults is accessing.
In any case, I have two suggestions for how to debug this. First, rather than letting the OOM killer strike, which takes a long time, is unpredictable, and might kill something important, use ulimit -v to put a limit on process size. Set it to something reasonable like, say, 1000000 (about 1GB) and work on keeping your process under that.
Second, start deleting or commenting out everything except the parts of the program that allocate and deallocate memory. Either you will find your culprit or you will make a program small enough to post in its entirety.

deleteMCResults() can be written much more simply.
void deleteMCResults() {
    decltype(object_mc_results) empty;
    std::swap(object_mc_results, empty);
}
But in this case, I'm wondering whether you really want to release the memory. As you say, the iterations could reuse the same memory, so perhaps you should replace deleteMCResults() with returnMCResultsMemory(). Then hoist the declaration of mc_results out of the loop, and just reset its values to 0.0 after returnMCResultsMemory() returns.
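A minimal sketch of that reuse idea, assuming the results can be consumed before the next sub-problem starts. The names Results, resetResults and runAll are hypothetical, and the value written in the inner loop is a placeholder for the real trial result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Results = std::vector<std::vector<float>>;

// Reset every stored value to 0.0f without releasing any memory.
inline void resetResults(Results& results) {
    for (auto& trial : results) {
        std::fill(trial.begin(), trial.end(), 0.0f);
    }
}

// Hypothetical driver: one buffer, allocated once, reused for each sub-problem.
inline void runAll(std::size_t subProblems, std::size_t trials) {
    Results mc_results(trials, std::vector<float>(5, 0.0f));
    for (std::size_t p = 0; p < subProblems; ++p) {
        for (std::size_t t = 0; t < trials; ++t) {
            for (std::size_t s = 0; s < 5; ++s) {
                mc_results[t][s] = static_cast<float>(p + t + s);  // placeholder result
            }
        }
        // ... consume mc_results here, e.g. pass it by const reference ...
        resetResults(mc_results);  // same memory is reused in the next iteration
    }
}
```

Because the buffer outlives the outer loop, the peak footprint of the results no longer depends on the number of sub-problems, provided nothing else keeps a copy of them.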

There is one thing that could easily be improved from the code you show. However, it is really not enough and not precise enough info to make a full analysis. Extracting a relevant example ([mcve]) and perhaps asking for a review on codereview.stackexchange.com might improve the outcome.
The simple thing that could be done is to replace the inner vector of five floats with an array of five floats. Each vector consists (in typical implementations) of three pointers: one to the beginning and one to the end of the allocated memory, and another one to mark the used amount. The actual storage requires a separate allocation, which in turn incurs some overhead (and also performance overhead when accessing the data, keyword "locality of reference"). These three pointers require 24 octets on a common 64-bit machine. Compare that with five floats, which only require 20 octets. Even if those floats were padded to 24 octets, you would still benefit from eliding the separate allocation.
In order to try this out, just replace the inner vector with a std::array (https://en.cppreference.com/w/cpp/container/array). Odds are that you won't have to change much code, since raw arrays, std::array and std::vector have very similar interfaces.
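As a rough sketch of that replacement (the alias Trial and the helper makeResults are hypothetical names), the results container could look like this:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One Monte Carlo trial: five floats stored inline, no separate heap block.
using Trial = std::array<float, 5>;

// All trials live in one contiguous allocation owned by the outer vector.
inline std::vector<Trial> makeResults(std::size_t trials) {
    return std::vector<Trial>(trials, Trial{});  // elements value-initialized to 0.0f
}
```

Element access keeps the same shape as before, e.g. results[mc_trial][sequential_process_index] = value, so the loops in the question should need little or no change.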

How to set number of threads in C++

I have written the following multi-threaded program for sorting using std::sort. In my program, grainSize is a parameter. Since grainSize, or the number of threads that can be spawned, is system dependent, I don't know what the optimal value for grainSize is. I work on Linux.
int compare(const char*,const char*)
{
    //some complex user defined logic
}
void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}
int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize=10;
    multThreadedSort(items.begin(),items.size(),grainSize);
    std::sort(items.begin(),items.end(),CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting. So, that for sorting large vectors I can take advantage of multiple cores present in today's processor. If anyone is aware of an efficient algorithm then please do share.
I don't know why the result of multThreadedSort() is not sorted. If you see a logical error in it, please let me know.

This gives you a hint for the optimal number of threads (typically the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not equal to grainSize: it depends on the list size, and can be much larger than grainSize.
Just replace grainSize with:
unsigned int grainSize = std::max<unsigned int>(items.size() / nThreads, 40);
The 40 is arbitrary, but it avoids starting threads to sort too few items, which is suboptimal (starting the thread takes longer than sorting the items). It can be tuned by trial and error, and the best value is probably larger than 40. Note that hardware_concurrency() may return 0 when the value cannot be determined, so guard the division in real code.
You have at least one bug here:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
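Putting the fix together, a minimal sketch of the recursive sort might look as follows. It is not the asker's exact code: the user-defined comparator is dropped in favor of the default operator<, lengths are std::ptrdiff_t, and std::launch::async is passed explicitly so the left half really runs on another thread. The name parallelSort is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

using Iter = std::vector<unsigned>::iterator;

// Sorts [data, data + len). At or below grainSize elements it uses std::sort.
inline void parallelSort(Iter data, std::ptrdiff_t len, std::ptrdiff_t grainSize) {
    if (len < 2) return;  // nothing to sort
    if (len <= grainSize) {
        std::sort(data, data + len);
        return;
    }
    const std::ptrdiff_t half = len / 2;
    // Left half on another thread; std::launch::async requests a real thread.
    auto left = std::async(std::launch::async, parallelSort, data, half, grainSize);
    // Right half gets the remaining len - half elements, so odd lengths work.
    parallelSort(data + half, len - half, grainSize);
    left.wait();
    std::inplace_merge(data, data + half, data + len);
}
```

A call such as parallelSort(items.begin(), static_cast<std::ptrdiff_t>(items.size()), grainSize) sorts the whole vector. With a very small grain size this spawns many threads, and std::async may throw std::system_error if a thread cannot be created, so the grain size should stay reasonably large.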

Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
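If the hint is used anyway, it helps to guard against the zero case. The following is only a sketch: workerCount and grainFor are hypothetical helpers, and the fallback of 2 workers and the default floor of 10000 elements are arbitrary assumptions that should be replaced by measured values.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Returns a usable worker count even when the hint is unavailable (0).
inline unsigned workerCount(unsigned fallback = 2) {
    const unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : std::max(fallback, 1u);
}

// Grain size that yields roughly workerCount() chunks, with a lower bound so
// that tiny ranges are never handed to a separate thread.
inline std::size_t grainFor(std::size_t items, std::size_t minGrain = 10000) {
    return std::max<std::size_t>(items / workerCount(), minGrain);
}
```

This does not make the hint any more trustworthy; it only keeps the arithmetic that depends on it well defined.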
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "ten thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).

Ideally, you should not assume that more than n*2 threads can run at once on your system, where n is the number of CPU cores.
Modern CPUs support Hyper-Threading, so a single core can run 2 threads at a time.
As mentioned in another answer, in C++11 you can get a hint for the optimal number of threads using std::thread::hardware_concurrency();

What is the theoretical impact of direct index access with "high" memory usage vs. "shifted" index access with "low" memory usage?

Well I am really curious as to what practice is better to keep, I know it (probably?) does not make any performance difference at all (even in performance critical applications?) but I am more curious about the impact on the generated code with optimization in mind (and for the sake of completeness, also "performance", if it makes any difference).
So the problem is as following:
element indexes range from A to B where A > 0 and B > A (eg, A = 1000 and B = 2000).
To store information about each element there are a few possible solutions, two of those which use plain arrays include direct index access and access by manipulating the index:
example 1
//declare the array with less memory, "just" 1000 elements, all elements used
std::array<T, B-A> Foo;
//but make accessing by index slower?
//accessing index N where B > N >= A
Foo[N-A];
example 2
//or declare the array with more memory, 2000 elements, 50% elements not used, not very "efficient" for memory
std::array<T, B> Foo;
//but make accessing by index faster?
//accessing index N where B > N >= A
Foo[N];
I'd personally go for #2 because I really like performance, but I think in reality:
the compiler will take care of both situations?
What is the impact on optimizations?
What about performance?
does it matter at all?
Or is this just the next "micro optimization" thing that no human being should worry about?
Is there some Tradeoff ratio between memory usage : speed which is recommended?

Accessing any array with an index involves multiplying the index by the element size and adding the result to the base address of the array itself.
Since we are already adding one number to another, the adjustment for Foo[N-A] can easily be made by adjusting the base address down by A * sizeof(T) before adding N * sizeof(T), rather than actually calculating (N-A)*sizeof(T).
In other words, any decent compiler should completely hide this subtraction, assuming A is a constant value.
If it's not a constant (say you are using std::vector instead of std::array), then you will indeed subtract A from N at some point in the code. It is still pretty cheap to do this. Most modern processors can do this in one cycle with no extra latency for the result, so at worst it adds a single clock cycle to the access.
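To keep the shifted indexing out of the calling code, the subtraction can be wrapped once. This is only an illustrative sketch; OffsetArray is a hypothetical helper and does no bounds checking.

```cpp
#include <array>
#include <cstddef>

// Fixed-size storage for indexes in [A, B); A and B are compile-time constants.
template <typename T, std::size_t A, std::size_t B>
class OffsetArray {
    static_assert(A < B, "index range must be non-empty");
public:
    // A is a constant, so the compiler can fold the subtraction into the
    // address computation.
    T& operator[](std::size_t n) { return data_[n - A]; }
    const T& operator[](std::size_t n) const { return data_[n - A]; }
    static constexpr std::size_t size() { return B - A; }
private:
    std::array<T, B - A> data_{};  // only B - A elements are stored
};
```

With OffsetArray<int, 1000, 2000> Foo, the expression Foo[N] reads like example 2 in the question while using the memory of example 1.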
Of course, if the numbers are 1000-2000, it probably makes very little difference in the whole scheme of things: either the total time to process that is nearly nothing, or it's a lot because you do complicated stuff. But if you were to make it a million elements, offset by half a million, it may make the difference between a simple or complex method of allocating them, or some such.
Also, as Hans Passant implies: on modern OSes with virtual memory handling, memory that isn't actually used doesn't get populated with "real memory". At work I was investigating a strange crash on a board that has 2GB of RAM, and when viewing the memory usage, it showed that this one application had allocated 3GB of virtual memory. This board does not have a swap disk (it's an embedded system). It turns out that some code was simply allocating large chunks of memory that weren't filled with anything, and it only stopped working when it reached 3GB (32-bit processor, 3+1GB memory split between user/kernel space). So even for LARGE lumps of memory, if you only use half of it, the unused half won't actually take up any RAM, as long as you do not access it.
As ALWAYS when it comes to performance, compilers and such, if it's important, do not trust "the internet" to tell you the answer. Set up a test with the code you actually intend to use, using the actual compiler(s) and processor type(s) that you plan to produce your code with/for, and run benchmarks. Some compiler may well have a misfeature (on processor type XYZ9278) that makes it produce horrible code for a case that most other compilers handle "with no overhead at all".

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());
    int i, j;
    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve the random number generator's state between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, it turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
    // constructors destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};
void Camera::computePixel(int i, int j) const {
    Colour col;
    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?

Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
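For comparison, here is a sketch of a different technique from the one above: C++11 thread_local storage together with the <random> header, which avoids both the OpenMP thread number lookup and the fixed-size state array. It is not the code used in the raytracer; threadEngine and threadRandom are hypothetical names.

```cpp
#include <random>

// One generator per thread, seeded once on first use in that thread.
inline std::mt19937& threadEngine() {
    thread_local std::mt19937 engine{std::random_device{}()};
    return engine;
}

// Uniform integer in [0, maxValue], drawn from the calling thread's generator.
inline int threadRandom(int maxValue) {
    std::uniform_int_distribution<int> dist(0, maxValue);
    return dist(threadEngine());
}
```

Each thread owns its engine, so no locking is involved, and the per-thread state is not packed into a shared array that several threads write to.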

The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, so when a thread will run is not predictable. If it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will try to keep it on that core, unless there's a good reason to move it to another one. If more than one of your threads is running on a single core, they will have to share that core, which is another reason things could slow down.
There are some other factors, which I don't remember off the top of my head; however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?

One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for example 1..10, 11..20, 21..30, 31..40). This would greatly reduce the sharing.
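The block partitioning described here can be sketched as follows. This version uses std::thread instead of OpenMP so it stays self-contained; rowBlock and renderBlocks are hypothetical names, and the value written per pixel is a placeholder for the real colour computation. In OpenMP, schedule(static) without a chunk size should give a similar contiguous split of the row loop.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Half-open row range [first, second) owned by worker `index` out of `workers`.
inline std::pair<std::size_t, std::size_t>
rowBlock(std::size_t index, std::size_t workers, std::size_t height) {
    const std::size_t base = height / workers;
    const std::size_t extra = height % workers;  // first `extra` blocks get one more row
    const std::size_t begin = index * base + std::min(index, extra);
    const std::size_t end = begin + base + (index < extra ? 1 : 0);
    return {begin, end};
}

// Fills a width x height image; each worker writes only its own block of rows.
inline std::vector<int> renderBlocks(std::size_t width, std::size_t height,
                                     std::size_t workers) {
    std::vector<int> pixels(width * height, 0);
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&pixels, width, height, workers, w] {
            const auto rows = rowBlock(w, workers, height);
            for (std::size_t i = rows.first; i < rows.second; ++i)
                for (std::size_t j = 0; j < width; ++j)
                    pixels[i * width + j] = static_cast<int>(w) + 1;  // placeholder colour
        });
    }
    for (auto& t : pool) t.join();
    return pixels;
}
```

Adjacent blocks can still share a cache line where one block ends and the next begins, but that happens once per block boundary rather than once per row.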
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.

I just looked and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

why the memory access is so slow?

I've got very strange performance issue, related to access to memory.
The code snippet is:
#include <vector>
using namespace std;
vector<int> arrx(M,-1);
vector< vector<int> > arr(N,arrx);
...
for(i=0;i<N;i++){
    for(j=0;j<M;j++){
        //>>>>>>>>>>>> Part 1 <<<<<<<<<<<<<<
        // Simple arithmetic operations
        int n1 = 1 + 2; // does not matter what (actually more complicated)
        // Integer assignment, without access to array
        int n2 = n1;
        //>>>>>>>>>>>> Part 2 <<<<<<<<<<<<<<
        // This turns out to be most expensive part
        arr[i][j] = n1;
    }
}
N and M - are some constants of the order of 1000 - 10000 or so.
When I compile this code (release version) it takes approximately 15 clocks to finish it if the Part 2 is commented. With this part the execution time goes up to 100+ clocks, so almost 10 times slower. I expected the assignment operation to be much cheaper than even simple arithmetic operations. This is actually true if we do not use arrays. But with that array the assignment seems to be much more expensive.
I also tried 1-D array instead of 2-D - the same result (for 2D is obviously slower).
I also used int** or int* instead of vector< vector< int > > or vector< int > - again the result is the same.
Why do I get such poor performance in array assignment and can I fix it?
Edit:
One more observation: in the part 2 of the given code if we change the assignment from
arr[i][j] = n1; // 172 clocks
to
n1 = arr[i][j]; // 16 clocks
the speed (numbers in comments) goes up. More interestingly, if we change the line:
arr[i][j] = n1; // 172 clocks
to
arr[i][j] = arr[i][j] * arr[i][j]; // 110 clocks
speed is also higher than for simple assignment
Is there any difference in reading and writing from/to memory? Why do I get such strange performance?
Thanks in advance!

Your assumptions are really wrong...
Writing to main memory is much slower than doing a simple addition.
If you don't do the write, it is likely that the loop will be optimized away entirely.

Unless your actual "part 1" is significantly more complex than your example would lead us to believe, there's no surprise here -- memory access is slow compared to basic arithmetic. Additionally, if you're compiling with optimizations, most or all of "part 1" may be getting optimized away because the results are never used.
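One way to keep such a measurement honest is to consume the result of the work being timed, so the optimizer cannot discard it. The sketch below is an assumption about how the test could be structured, not the asker's benchmark; Timing and timeWrites are hypothetical names, and a flat row-major std::vector stands in for the nested vectors.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Timing {
    double seconds;      // wall-clock time spent in the write loop
    long long checksum;  // consuming the result keeps the writes observable
};

// Fills an n x m row-major array, timing only the writes, then sums it.
inline Timing timeWrites(std::size_t n, std::size_t m) {
    std::vector<int> arr(n * m, -1);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j)
            arr[i * m + j] = static_cast<int>(i + j);
    const auto stop = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int v : arr) sum += v;
    return {std::chrono::duration<double>(stop - start).count(), sum};
}
```

Printing or otherwise using the checksum after the call makes the stores part of the program's observable behavior, which is what the loop without "part 2" is missing.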

You have discovered a sad truth about modern computers: memory access is really slow compared to arithmetic. There's nothing you can do about it. It is because electric fields in a copper wire only travel at about two-thirds of the speed of light.

Your nested arrays are going to be somewhere in the region of 50–500MB of memory. It takes time to write that much memory, and no amount of clever hardware memory caching is going to help that much. Moreover, even a single memory write will take time as it has to make its way over some copper wires on a circuit board to a lump of silicon some distance away; physics wins.
But if you want to dig in more, try the cachegrind tool (assuming it's present on your platform). Just be aware that the code you're using above doesn't really permit a lot of optimization; its data access pattern doesn't have enormous amounts of reuse potential.

Let's make a simple estimate. A typical CPU clock speed nowadays is about 1--2 GHz (giga=10 to power nine). Simplifying a (really) great deal, it means that a single processor operation takes about 1ns (nano=10 to power negative nine). A simple arithmetic like int addition takes several CPU cycles, of order ten.
Now, memory: typical memory access time is about 50ns (again, it's not necessary just now to go into gory details, which are aplenty).
You see that even in the very best case scenario, the memory is slower than CPU by a factor of about 5 to 10.
In fact, I'm sweeping under the rug an enormous amount of detail, but I hope the basic idea is clear. If you're interested, there are books around (keywords: cache, cache misses, data locality etc). This one is dated, but still very good at explaining the general concepts.
