Performance decreases with a higher number of threads (no synchronization) - c++

I have a data structure (a vector) whose elements have to be parsed by a function; the elements can be parsed by different threads.
Following is the parsing method:
void ConsumerPool::parse(size_t n_threads, size_t id)
{
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        // parse node
        //parse(nodes[idx]);
        parse(idx);
    }
}
Where:
n_threads is the total number of threads
id is the (unique) index of the current thread
and the threads are created as follows:
std::vector<std::thread> threads;
for (size_t i = 0; i < n_threads; i++)
    threads.emplace_back(&ConsumerPool::parse, this, n_threads, i);
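For reference, the elapsed times reported below are measured around thread creation and the final join(), roughly like this (the timing code itself is an assumption, not part of the original snippet):

#include <chrono>

// Timing sketch: the elapsed times below span thread creation up to the last join().
auto t0 = std::chrono::steady_clock::now();

// ... create the threads as shown above ...

for (auto& t : threads)
    t.join();

auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();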
Unfortunately, although this method works, the performance of my application decreases when the number of threads is too high. I would like to understand why the performance decreases even though there is no synchronization between these threads.
Following are the elapsed times (between the threads start and the last join() return) according to the number of threads used:
2 threads: 500 ms
3 threads: 385 ms
4 threads: 360 ms
5 threads: 475 ms
6 threads: 580 ms
7 threads: 635 ms
8 threads: 660 ms
The time needed for thread creation is always between 1 and 2 ms.
The software has been tested by using its release build. Following is my configuration:
2x Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
Maximum speed: 2.26 GHz
Sockets: 2
Cores: 8
Logical processors: 8
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 8.0 MB
EDIT:
What the parse() function does is the following:
// data shared between threads (around 300k elements)
std::vector<std::unique_ptr<Foo>> vfoo;
std::vector<rapidxml::xml_node<>*> nodes;
std::vector<std::string> layers;

void parse(int idx)
{
    auto& p = vfoo[idx]; // reference: a std::unique_ptr cannot be copied
    // p->parse() allocates memory according to the content of the XML node
    if (!p->parse(nodes[idx], layers))
        vfoo[idx].reset();
}

The processor you are using, the Intel(R) Xeon(R) CPU E5507, has only 4 cores (see http://ark.intel.com/products/37100/Intel-Xeon-Processor-E5507-4M-Cache-2_26-GHz-4_80-GTs-Intel-QPI).
So using more than 4 threads causes the slowdown because of context switching, as is visible from the data you have provided.
You can read more about the context switching at the following link: https://en.wikipedia.org/wiki/Context_switch

update:
We still don't have a lot of info about the memory access patterns of parse(), or how much time it spends reading input data from memory vs. how much it spends writing/reading private scratch memory.
You say p->parse() "allocates memory according to the content of the XML node". If it frees it again, you may see a big speedup from keeping a big-enough scratch buffer allocated in each thread. Memory allocation/deallocation is a "global" thing that requires synchronization between threads. A thread-aware allocator can hopefully handle an allocate/free / allocate/free pattern by satisfying allocations from memory just freed by that thread, so it's probably still hot in private L1 or L2 cache on that core.
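For example, a minimal sketch of that reuse idea (hypothetical: it assumes parse() can be given an external scratch buffer, which the question doesn't show):

// Hypothetical sketch: each thread keeps one scratch buffer alive for all of its
// nodes, so repeated allocations are replaced by reuse of memory that is likely
// still hot in that core's private L1/L2 cache.
void ConsumerPool::parse(size_t n_threads, size_t id)
{
    std::vector<char> scratch;              // thread-private, grows to the largest node seen
    for (size_t idx = id; idx < nodes.size(); idx += n_threads)
    {
        scratch.clear();                    // keeps capacity, drops contents
        parse(idx, scratch);                // hypothetical overload that uses the buffer
    }
}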
Use some kind of profiling to find the real hotspots. It might be memory allocation/deallocation, or it might be code that reads some memory.
Your dual-socket Nehalem Xeon doesn't have hyperthreading, so you can't be running into issues with threads slowing each other down because a non-HT-aware OS scheduled two of them onto the logical cores of the same physical core.
You should investigate with performance counters (e.g. Linux perf stat, or Intel's VTune) whether you're getting more cache misses per thread once you pass 4 threads. Nehalem uses large shared (for the whole socket) L3 (aka last-level) caches, so more threads running on the same socket creates more pressure on that. The relevant perf events will be something like LLC_something, IIRC.
You should definitely look at L1/L2 misses, and see how those scale with number of threads, and how that changes with strided vs. contiguous access to node[].
There are other perf counters you can check to look for false sharing (one thread's private variable sharing a cache line with another thread's private variable, so the cache line bounces between cores). Really just look for any perf events that change with number of threads; that could point the way towards an explanation.
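For example, on Linux something along these lines would be a starting point (the exact event names available depend on your kernel and CPU; these are the generic ones, and ./your_app stands for your binary):

perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses,L1-dcache-loads,L1-dcache-load-misses ./your_app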
A multi-socket system like your 2-socket Nehalem will have NUMA (non-uniform memory access). A NUMA-aware OS will try to allocate memory that's fast for the core doing the allocation.
So presumably your buffer has all of its physical pages in memory attached to one of your two sockets. In this case it's probably not something you can or should avoid, since I assume you're filling the array in a single-threaded way before handing it off to multiple threads for parsing. In general, though, try to allocate memory (especially scratch buffers) in the thread that will use it most, when that's convenient.
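A common way to do that is first-touch placement: each thread writes the part of the buffer it will later use, so a NUMA-aware OS places those pages on that thread's socket. A sketch (assuming, unlike your single-threaded XML loading, that the data can be produced per-thread; total_size and n_threads are placeholders):

// First-touch sketch (hypothetical): needs <memory>, <thread>, <vector>.
// new double[] of a trivial type doesn't write to the pages, so they are
// physically allocated only when each worker thread writes its own range.
std::unique_ptr<double[]> buf(new double[total_size]);
std::vector<std::thread> fillers;
size_t per_thread = total_size / n_threads;
for (size_t t = 0; t < n_threads; ++t)
    fillers.emplace_back([&, t] {
        size_t begin = t * per_thread;
        size_t end   = (t + 1 == n_threads) ? total_size : begin + per_thread;
        for (size_t i = begin; i < end; ++i)
            buf[i] = 0.0;   // first write faults the page on this thread's NUMA node
    });
for (auto& th : fillers) th.join();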
This may partially explain less-than-perfect scaling with the number of threads, although it's more likely unrelated if AntonMalyshev's answer didn't help. Having each thread work on a contiguous range, instead of striding through the array with a stride of n_threads, should be better for L2 / L1 cache efficiency.
node[] is a vector of pointers (so with 8 threads, each thread uses only 8 bytes of each 64 byte cache line it touches in node[]). However, each thread presumably touches way more memory in the pointed-to data structures and strings. If node entries point to monotonically-increasing positions in other data structures and the string, then the strided access to node[] creates non-contiguous access patterns to most of the memory touched by the thread.
One possible benefit of the strided access pattern: Strided means that if all threads run at more or less the same speed, they're all looking at the same part of memory at the same time. Threads that get ahead will slow down from L3 misses, while other threads catch up because they see L3 hits. (Unless something happens that lets one thread get too far behind, like the OS de-scheduling it for a time slice.)
So maybe L3 vs. RAM bandwidth / latency is more of an issue than efficient use of per-core L2/L1. Maybe with more threads, L3 bandwidth can't keep up with all the requests for the same cache lines from the L2 caches of multiple cores. (L3 isn't fast enough to satisfy constant L2 misses from all cores at once, even if they all hit in L3.)
This argument applies to everything pointed to by node[] only if contiguous ranges of node[] point to contiguous ranges of other memory.

Try to parse contiguous ranges of elements inside threads, e.g. change

for (size_t idx = id; idx < nodes.size(); idx += n_threads)
{
    // parse node
    parse(nodes[idx]);
}

to

for (size_t idx = id * nodes.size()/n_threads; idx < (id+1)*nodes.size()/n_threads; idx++)
{
    // parse node
    parse(nodes[idx]);
}

That should be better for caching.
Also, it's better to precompute size = (id+1)*nodes.size()/n_threads and use it in the loop's stop condition instead of computing it on each iteration.
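In code, that looks like this (same split as above, with the end index computed once):

size_t begin = id * nodes.size() / n_threads;
size_t end   = (id + 1) * nodes.size() / n_threads;   // computed once, not every iteration
for (size_t idx = begin; idx < end; idx++)
{
    // parse node
    parse(nodes[idx]);
}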

For CPU-bound processes, adding threads beyond the number of available cores will decrease overall performance. The decrease is due to scheduling and other kernel interactions. For such situations, the optimal number of threads is often the number of cores minus 1. The remaining core will be used by the kernel and other running processes.
I address this topic in a bit more detail here: A case for minimal multithreading
Looking at the hardware and numbers a bit closer, I suspect you are hitting hyper-threading contention. For a 4-core CPU, 8 logical cores are presented with hyper-threading. For a fully CPU-bound process, hyper-threading will actually decrease performance. There is some interesting discussion here: Hyper-threading, and details here: Wikipedia hyper-threading

2 CPUs (4 cores each)
Threads run in a shared memory space. The performance decrease is caused by moving shared data between the CPUs (threads cannot directly access the cache of a different CPU; more threads => more data movement => a bigger performance decrease).

How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex-values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);

    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB. The total memory bandwidth is 115 GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, the same processor). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the BW and it turns out to be 105 GB/s.
Question: The meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clarify this?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
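A sketch of that experiment (hypothetical harness; run_kernel() stands in for the addition loop above):

#include <omp.h>
#include <cstdio>

// Fixed problem size, varying thread count: performance should grow roughly
// linearly until memory bandwidth saturates, then flatten out.
for (int threads = 1; threads <= 24; threads *= 2) {
    omp_set_num_threads(threads);
    double t0 = omp_get_wtime();
    run_kernel();                               // hypothetical: runs the kernel above once
    std::printf("%2d threads: %.3f s\n", threads, omp_get_wtime() - t0);
}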
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
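In symbols, with T(p) the runtime on p processing elements: strong-scaling speedup is S(p) = T(1) / T(p) with parallel efficiency E(p) = S(p) / p, while for weak scaling (constant work per processing element) the efficiency is E_w(p) = T(1) / T(p), ideally staying close to 1.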
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.

Missing small primes in C++ atomic prime sieve

I'm trying to develop a concurrent prime sieve implementation using C++ atomics. However, when core_count is increased, more and more small primes are missing from the output.
My guess is that the producer threads overwrite each other's results before they are read by the consumer, even though the construction should protect against that by using the magic number 0 to indicate it's ready to accept the next prime. It seems the compare_exchange_weak is not really atomic in this case.
Things I've tried:
Replacing compare_exchange_weak with compare_exchange_strong
Changing the memory_order to anything else.
Swapping around the 'crossing-out' and the write.
I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.
Any ideas on this are welcome, including some best practices I might have missed.
#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>

int main() {
    using namespace std;
    constexpr memory_order order = memory_order_relaxed;
    atomic<int> output{0};
    vector<atomic_bool> sieve(10000);
    for (auto& each : sieve) atomic_init(&each, false);
    atomic<unsigned> finished_worker_count{0};

    auto const worker = [&output, &sieve, &finished_worker_count]() {
        for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
            current = find_if(current, sieve.end(), [](atomic_bool& value) {
                bool untrue = false;
                return value.compare_exchange_strong(untrue, true, order);
            });
            if (current == sieve.end()) break;
            int const claimed = static_cast<int>(distance(sieve.begin(), current));
            int zero = 0;
            while (!output.compare_exchange_weak(zero, claimed, order))
                ;
            for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
                 product += claimed)
                sieve[product].store(true, order);
        }
        finished_worker_count.fetch_add(1, order);
    };

    const auto core_count = thread::hardware_concurrency();
    vector<future<void>> futures;
    futures.reserve(core_count);
    generate_n(back_inserter(futures), core_count,
               [&worker]() { return async(worker); });
    vector<int> result;
    while (finished_worker_count < core_count) {
        auto current = output.exchange(0, order);
        if (current > 0) result.push_back(current);
    }
    sort(result.begin(), result.end());
    for (auto each : result) cout << each << " ";
    cout << '\n';
    return 0;
}
compare_exchange_weak will update (change) the "expected" value (the local variable zero) if the update cannot be made. This will allow overwriting one prime number with another if the main thread doesn't quickly handle the first prime.
You'll want to reset zero back to zero before rechecking:
while (!output.compare_exchange_weak(zero, claimed, order))
    zero = 0;
Even with correctness bugs fixed, I think this approach is going to be low performance with multiple threads writing to the same cache lines.
As 1201ProgramAlarm points out in their answer, you need to reset zero before retrying the CAS, but even with that fixed I wouldn't expect good performance! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write the retry loop as follows, so the zero = 0 only appears once in the source, but it still happens before every CAS attempt.
do {
    zero = 0;
} while (!output.compare_exchange_weak(zero, claimed, order));
Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element (like in your find_if). That's almost certainly not a problem in practice in this case; find_if is just written in C++ in a header (in mainstream implementations) and likely in a way that there isn't actually any UB in the resulting program.
And it would be straightforward to replace the find_if with a loop. In fact that would probably make the code more readable, since you can just use array indexing the whole time instead of iterators; let the compiler optimize that to a pointer and then pointer-subtraction if it wants.
Scan read-only until you find a candidate to try to claim, don't try to atomic-RMW every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
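A minimal sketch of that scan-then-claim idea, replacing the find_if with an index loop (first_unchecked is a hypothetical variable holding the position this thread last reached):

// Read-only scan until a candidate looks unclaimed, then one CAS to claim it.
int claimed = -1;
for (int i = first_unchecked; i < static_cast<int>(sieve.size()); ++i) {
    if (sieve[i].load(order))                 // plain load: no lock prefix, no RFO
        continue;
    bool expected = false;
    if (sieve[i].compare_exchange_strong(expected, true, order)) {
        claimed = i;                          // we won the race for this position
        break;
    }
    // lost the race: another thread claimed or crossed it off; keep scanning
}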
Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.
Multiple threads storing to the array will cause contention
Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line until after it gets exclusive ownership of that cache line. (Usually 64 bytes on modern systems.)
Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).
But with other threads writing the array, at best you'll get hits in L3 cache. But since the sieve size is small, other threads won't be evicting their copies from their own L1d caches, so the RFO (read for ownership) from a core that wants to write will typically find that some other core has it Modified, so the L3 cache (or other tag directory) will have to send on a request to that core to write back or transfer directly. (Intel CPUs from Nehalem onwards use Inclusive L3 cache where the tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but client CPUs still I think have inclusive L3 cache where the tags also work as a snoop filter / coherence directory.)
With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)
A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I made some benchmarks of a hand-written x86 asm version that used a bitmap version in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing off at i*i).
Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. Unlike with DRAM where the shared bus is the bottleneck, unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active).
You'd have to benchmark to see whether all the threads pile up in a bad way or not, e.g. whether they tend to end up close together, or whether one with a larger stride can pull ahead and get enough distance that its write-prefetches don't create extra contention.
The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.
(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)
At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)
Better strategy: each thread owns a chunk of the sieve
Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)
You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps appending them to an array of primes, and updating a valid-index (with a release store so readers can keep reading in that array up to that point, then spin-wait or use C++20 .wait() for a value other than what they last saw. But .wait would need a .notify in the writer, like maybe every 10 primes?)
If you want to move on to a larger sieve, divide up the next chunk of the full array between threads and have them each cross off the known primes from the first part of the array. No thread has to wait for any other; the first set of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
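A rough sketch of the per-chunk marking step (a hypothetical helper, not code from the question; a plain byte array is enough because each chunk is written by exactly one thread):

#include <vector>

// Cross off the known primes inside [chunk_begin, chunk_end), which this thread owns.
void sieve_chunk(std::vector<char>& sieve, const std::vector<int>& known_primes,
                 int chunk_begin, int chunk_end)
{
    for (int p : known_primes) {
        int start = ((chunk_begin + p - 1) / p) * p;   // first multiple of p in the chunk
        if (start < 2 * p) start = 2 * p;              // never cross off p itself
        for (int m = start; m < chunk_end; m += p)
            sieve[m] = 1;                              // only this thread writes these cache lines
    }
}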
Probably you don't want an actual array of atomic_int; probably all threads should be scanning the sieve to find not-crossed-off positions. Especially if you can do that efficiently with SIMD, or with bit-scan like tzcnt if you use packed bitmaps for this.
(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)

What causes increasing memory consumption in OpenMP-based simulation?

The problem
I am having a big struggle with memory consumption in my Monte Carlo particle simulation, where I am using OpenMP for parallelization. Without going into the details of the simulation method, one parallel part is "particle moves", using some number of threads, and the other is "scaling moves", using a possibly different number of threads. These two parallel parts are run alternately, separated by some serial code, and each takes milliseconds to run.
I have an 8-core, 16-thread machine running Linux Ubuntu 18.04 LTS and I'm using gcc and the GNU OpenMP implementation. Now:
using 8 threads for "particle moves" and 8 threads for "scaling moves" yields stable 8-9 MB memory usage
using 8 threads for "particle moves" and 16 threads for "scaling moves" causes increasing memory consumption from those 8 MB to tens of GB for long simulations, resulting in the end in an OOM kill
using 16 threads and 16 threads is ok
using 16 threads and 8 threads causes increasing consumption
So something is wrong if the numbers of threads for those two types of moves don't match.
Unfortunately, I was not able to reproduce the issue in a minimal example and I can only give a summary of the OpenMP code. A link to a minimal example is at the bottom.
In the simulation I have N particles with some positions. "Particle moves" are organized in a grid; I am using collapse(3) to distribute the threads. The code looks more or less like this:
// Each thread has its own cell in a 2 x 2 x 2 grid
#pragma omp parallel for collapse(3) num_threads(8 or 16)
for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t j = 0; j < 2; j++) {
        for (std::size_t k = 0; k < 2; k++) {
            std::array<std::size_t, 3> gridCoords = {i, j, k};
            // This does something for all particles in the {i, j, k} grid cell
            doIndependentParticleMovesInAGridCellGivenByCoords(gridCoords);
        }
    }
}
(Note that only 8 threads have any work to do in both cases - 8 and 16 - but using those additional, jobless 8 threads magically fixes the problem when 16 scaling threads are used.)
In "volume moves" I am doing an overlap check on each particle independently and exit when a first overlap is found. It looks like this:
// We independently check for each particle
std::atomic<bool> overlapFound = false;
#pragma omp parallel for num_threads(8 or 16)
for (std::size_t i = 0; i < N; i++) {
    if (overlapFound)
        continue;
    if (isParticleOverlappingAnything(i))
        overlapFound = true;
}
Now, in parallel regions I don't allocate any new memory and don't need any critical sections - there should be no race conditions.
Moreover, all memory management in the whole program is done in a RAII fashion by std::vector, std::unique_ptr, etc. - I don't use new or delete anywhere.
Investigation
I tried to use some Valgrind tools. I ran a simulation for a while, which produces about 16 MB of (still increasing) memory consumption in the non-matching thread numbers case, while it stays at 8 MB in the matching case.
Valgrind Memcheck does not show any memory leaks (only a couple of kB "still reachable" or "possibly lost" from OpenMP control structures, see here) in either case.
Valgrind Massif reports only those "correct" 8 MB of allocated memory in both cases.
I also tried to surround the contents of main in { } and add while(true):
int main() {
    {
        // Do the simulation and let RAII do all the cleanup when destructors are called
    }

    // Hang
    while(true) { }
}
During the simulation memory consumption increases, let's say up to 100 MB. When { ... } finishes executing, memory consumption drops by around 6 MB and stays at 94 MB in while(true). Those 6 MB are the actual size of the biggest data structures (I estimated it), but the origin of the remaining part is unknown.
Hypothesis
So I assume it must be something with OpenMP memory management. Maybe using 8 and 16 threads alternately causes OpenMP to constantly create new thread pools, abandoning old ones without releasing their resources? I found something like this here, but it seems to concern another OpenMP implementation.
I would be very grateful for ideas about what else I can check and where the issue might be.
re #1201ProgramAlarm: I have changed volatile to std::atomic
re #Gilles: I have checked 16 threads case for "particle moves" and updated accordingly
Minimal example
I was finally able to reproduce the issue in a minimal example, it ended up being extremely simple and all the details here are unnecessary. I created a new question without all the mess here.
Where lies the problem?
It seems that the problem is not connected with what this particular code does or with how the OpenMP clauses are structured, but solely with two alternating OpenMP parallel regions having different numbers of threads. After millions of those alternations there is a substantial amount of memory used by the process, regardless of what is in the sections. They may even be as simple as sleeping for a couple of milliseconds.
As this question contains too much unnecessary detail, I have moved the discussion to a more direct question here. I refer the interested reader there.
A brief summary of what happens
Here I give a brief summary of what StackOverflow members and I were able to determine. Let's say we have 2 OpenMP sections with different number of threads, such as here:
#include <unistd.h>

int main() {
    while (true) {
        #pragma omp parallel num_threads(16)
        usleep(30);

        #pragma omp parallel num_threads(8)
        usleep(30);
    }
    return 0;
}
As described with more details here, OpenMP reuses the common 8 threads, but the other 8 needed for the 16-thread section are constantly created and destroyed. This constant thread creation causes the increasing memory consumption, either because of an actual memory leak or because of memory fragmentation, I don't know. Moreover, the problem seems to be specific to the GOMP OpenMP implementation in GCC (up to at least version 10). Clang and Intel compilers do not seem to reproduce the issue.
Although not stated explicitly by the OpenMP standard, most implementations tend to reuse already spawned threads, but that seems not to be the case for GOMP and it is probably a bug. I will file the bug report and update the answer. For now, the only workaround is to use the same number of threads in every parallel region (then GOMP properly reuses old threads). In cases like the collapse loop from the question, where there are fewer threads to distribute than in the other section, one can always request 16 threads instead of 8 and let the other 8 just do nothing, as sketched below. It worked quite well in my "production" code.
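For illustration, the workaround applied to the minimal example above (a sketch: both regions request 16 threads and the second region simply leaves half of them idle):

#include <unistd.h>
#include <omp.h>

int main() {
    while (true) {
        #pragma omp parallel num_threads(16)
        usleep(30);

        // Also 16 threads here: threads with id >= 8 do nothing, so GOMP keeps
        // reusing the same pool instead of creating and destroying threads.
        #pragma omp parallel num_threads(16)
        {
            if (omp_get_thread_num() < 8)
                usleep(30);
        }
    }
    return 0;
}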

OpenMP overhead with a very parallel loop

I'm trying to parallelize a loop with OpenMP where each iteration is independent (code sample below).
!$OMP PARALLEL DO DEFAULT(PRIVATE)
do i = 1, 16
    begin = omp_get_wtime()
    allocate(array(100000000))
    do j = 1, 100000000
        array(j) = j
    end do
    deallocate(array)
    end = omp_get_wtime()
    write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$OMP END PARALLEL DO
I would expect a linear speedup out of this piece of code, with each iteration taking as much time as in the sequential version, as there are no possible race conditions or false sharing issues. However, I obtain the following results on a machine with 2 Xeon E5-2670 (8 cores each):
With only one thread:
It 1 Thread 0 time 0.435683965682983
It 2 Thread 0 time 0.435048103332520
It 3 Thread 0 time 0.435137987136841
It 4 Thread 0 time 0.434695959091187
It 5 Thread 0 time 0.434970140457153
It 6 Thread 0 time 0.434894084930420
It 7 Thread 0 time 0.433521986007690
It 8 Thread 0 time 0.434685945510864
It 9 Thread 0 time 0.433223009109497
It 10 Thread 0 time 0.434834957122803
It 11 Thread 0 time 0.435106039047241
It 12 Thread 0 time 0.434649944305420
It 13 Thread 0 time 0.434831142425537
It 14 Thread 0 time 0.434768199920654
It 15 Thread 0 time 0.435182094573975
It 16 Thread 0 time 0.435090065002441
And with 16 threads :
It 1 Thread 0 time 1.14882898330688
It 3 Thread 2 time 1.19775915145874
It 4 Thread 3 time 1.24406099319458
It 14 Thread 13 time 1.28723978996277
It 8 Thread 7 time 1.39885497093201
It 10 Thread 9 time 1.46112895011902
It 6 Thread 5 time 1.50975203514099
It 11 Thread 10 time 1.63096308708191
It 16 Thread 15 time 1.69229602813721
It 7 Thread 6 time 1.74118590354919
It 9 Thread 8 time 1.78044819831848
It 15 Thread 14 time 1.82169485092163
It 12 Thread 11 time 1.86312794685364
It 2 Thread 1 time 1.90681600570679
It 5 Thread 4 time 1.96404480934143
It 13 Thread 12 time 2.00902700424194
Any idea where the 4x factor in the iteration time is coming from ?
I have tested with both the GNU compiler and the Intel compiler, with the -O3 optimization flag.
The speed of the operation
do j=1, 100000000
array(j) = j
end do
is limited not by the ALU speed but by the memory bandwidth. Typically, you have several channels to main memory per CPU socket, but still fewer than the number of cores.
The allocation and deallocation are also memory access bound. I am not sure whether some synchronization is also needed for the allocate and deallocate.
For the same reason, the STREAM benchmark http://www.cs.virginia.edu/stream/ gives different speed-ups than purely arithmetically intensive problems.
I'm sure I've covered that topic before, but since I cannot seem to find my earlier posts, here I go again...
Large memory allocations on Linux (and possibly on other platforms) are handled via anonymous memory mappings. That is, some area gets reserved in the virtual address space of the process by calling mmap(2) with flags MAP_ANONYMOUS. The maps are initially empty - there is no physical memory backing them up. Instead, they are associated with the so-called zero page, which is a read-only frame in physical memory filled with zeros. Since the zero page is not writeable, an attempt to write into a memory location still backed by it results in a segmentation fault. The kernel handles the fault by finding a free frame in physical memory and associating it with the virtual memory page where the fault has occurred. This process is known as faulting the memory.
Faulting the memory is a relatively slow process as it involves modifications to the process' PTEs (page table entries) and flushes of the TLB (Translation Lookaside Buffer) cache. On multicore and multisocket systems it is even slower as it involves invalidation of the remote TLBs (known as remote TLB shootdown) via expensive inter-processor interrupts. Freeing an allocation results in removal of the memory mapping and reset of the PTEs. Therefore, the whole process is repeated during the next iteration.
Indeed, if you look at the effective memory bandwidth in your serial case, it is (assuming an array of double precision floats):
(100000000 * 8) / 0.435 = 1.71 GiB/s
Should your array be of REAL or INTEGER elements, the bandwidth should be cut in half. This is nowhere near the memory bandwidth that even the very first generation of E5-2670 systems provides.
For the parallel case, the situation is even worse, since the kernel locks the page tables while faulting the pages. That's why the average bandwidth for a single thread varies from 664 MiB/s down to 380 MiB/s for a total of 7.68 GiB/s, which is almost an order of magnitude slower than the memory bandwidth of a single CPU (and your system has two, hence twice the available bandwidth!).
A completely different picture will emerge if you move the allocation outside of the loop:
!$omp parallel default(private)
allocate(array(100000000))
!$omp do
do i = 1, 16
    begin = omp_get_wtime()
    do j = 1, 100000000
        array(j) = j
    end do
    end = omp_get_wtime()
    write(*,*) "It", i, "Thread", omp_get_thread_num(), "time", end - begin
end do
!$omp end do
deallocate(array)
!$omp end parallel
Now the second and later iterations take roughly half the time (at least on E5-2650). This is because after the first iteration, all the memory is already faulted. The gain is even larger for the multithreaded case (increase the loop count to 32 to have each thread do two iterations).
The time to fault the memory depends heavily on the system configuration. On systems that have THP (transparent huge pages) enabled, the kernel automatically uses huge pages to implement large mappings. This reduces the number of faults by a factor of 512 (for huge pages of 2 MiB). The above cited speed gains of 2x for the serial case and 2.5x for the parallel one are from a system with THP enabled. The mere use of huge pages decreases the time for the first iteration on E5-2650 to 1/4 (1/8 if your array is of integers or single-precision floats) of the time in your case.
This is usually not the case for smaller arrays, which are allocated via subdivision of a larger, reused, persistent memory allocation known as an arena. Newer memory allocators in glibc usually have one arena per CPU core in order to facilitate lock-less multithreaded allocation.
That is the reason why many benchmark applications simply throw away the very first measurement.
Just to substantiate the above with real-life measurements, my E5-2650 needs 0.183 seconds to perform in serial one iteration over already faulted memory and 0.209 seconds to perform it with 16 threads (on a dual-socket system).
They're not independent. Allocate/deallocate will be sharing the heap.
Try allocating a bigger array outside of the parallel section, then timing just the memory access.
It's also a non-uniform memory architecture (NUMA): if all the allocations come from one CPU's local memory, accesses from the other CPU will be relatively slow, as they get routed via the first CPU. This is tedious to work around.

Parallelization of elementwise matrix multiplication

I'm currently optimizing parts of my code and therefore perform some benchmarking.
I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this array assignment cannot be parallelized directly with an OpenMP DO directive, I expanded it into
!$OMP PARALLEL DO
do j = 1, N
    do i = 1, N
        A(i,j) = T(i,j) * A(i,j)
    end do
end do
!$OMP END PARALLEL DO
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing happening now is that regardless of the number of threads (between 1 and 4) the execution time stays more or less the same (within 10%), while the CPU time increases with the number of threads. That made me think that all the threads do the same work (because I made a mistake regarding OpenMP) and therefore take the same time.
But on another computer (that has 96 CPU cores available) the program behaves as expected: With increasing thread number the execution time decreases. Surprisingly the CPU time decreases as well (up to ~10 threads, then rising again).
It might be that there are different versions of OpenMP or gfortran installed. If this could be the cause it'd be great if you could tell me how to find that out.
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE
Note that though this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of the SINGLE construct, therefore giving no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.
Also, the multiplication code has very low compute intensity - it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.
It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory should provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This quantity more or less exceeds the real memory bandwidth of current generation x86 CPUs. And this is only for a single core with non-vectorised code.
Adding more cores and threads would not increase the performance since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer since having more threads adds more overhead. When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), thus the performance increases, but only because there are more sockets and not because there are more cores.