Erroneous single-thread memory bandwidth benchmark - C++

In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.
Code (for the Intel compiler)
#include <omp.h>
#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937
int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};

    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////

    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        // note: posix_memalign reports failure via its return value, not the pointer
        double* data = nullptr;
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        // deallocate resources
        free(data);
    }

    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double)) != 0)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }

        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;

        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }

        // deallocate resources
        free(data);
    }

    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
              << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    return 0;
}
Notes on code
This is admittedly a 'naive' approach, and Linux-only, but it should still serve as a rough indicator of the machine's performance.
Compiled with ICC using the flags -O3 -ffast-math -march=coffeelake.
The size (150 MiB) is much bigger than the lowest-level cache of the system (9 MiB on the i5-8400 Coffee Lake), which has 2x 16 GiB DDR4 DIMMs at 3200 MT/s.
New allocations on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits).
The minimum duration is recorded to counteract the effects of interrupts and OS scheduling: threads being taken off cores for a short while, etc.
A warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature, which can alternatively be turned off by using the userspace governor).
Results of code
On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question, Intel Advisor's bandwidth information, where a previous version of this code was getting page faults inside the timed region.)
Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE
I am not able to understand how I'm getting such an unreasonably high result for my bandwidth. Any tips to help facilitate my understanding would be greatly appreciated.

Actually, the question seems to be, "why is the obtained bandwidth so high?", to which I have gotten quite a lot of input from @PeterCordes and @Sebastian. This information needs to be digested in its own time.
I can still offer an auxiliary 'answer' to the topic of interest. By substituting the plain write operation (which, as I now understand, cannot be properly modeled in a benchmark without delving into the assembly) with a cheap one, e.g. a bitwise operation, we can prevent the compiler from doing its job a little too well.
Updated code
#include <omp.h>
#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
int main()
{
    // test-parameters
    const auto size = std::size_t{100 * 1024 * 1024};
    const auto experiment_count = std::size_t{250};

    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////

    // allocate the exp. data and load the memory pages
    // note: posix_memalign reports failure via its return value, not the pointer
    char* data = nullptr;
    if (posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size) != 0)
    {
        std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
        std::abort();
    }
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 0;
    }

    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // run
        const auto dur1 = omp_get_wtime() * 1E+6;
        #pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] ^= 1;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;

        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
    }

    // deallocate resources
    free(data);

    // REPORT
    const auto traffic = size * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
              << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    return 0;
}
The benchmark remains a 'naive' one and only serves as an indicator of the machine's performance (as opposed to a program which can exactly calculate the memory bandwidth).
With the updated code, I get 24 GiB/s for single thread and 37 GiB/s when all 6 cores get involved. When compared to Intel Advisor's measured values of 25.5 GiB/s and 37.5 GiB/s, I think this is acceptable.
@PeterCordes: I have retained the warm-up loop to once do an exactly identical run of the whole procedure, so as to counteract unknown effects (healthy programmer's paranoia).
Edit: In this case, the warm-up loop is indeed redundant, because the minimum duration is being clocked.
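On the same theme of keeping the compiler honest: another way to stop it from eliminating or transforming the timed stores, without changing the arithmetic, is an empty inline-asm "escape" in the style of Google Benchmark's DoNotOptimize. A minimal sketch, assuming a GCC/Clang/ICC-compatible toolchain (the helper name do_not_optimize is mine, not from the discussion above):
#include <cstddef>

// Opaque "escape": the empty asm with a "memory" clobber tells the optimizer
// that memory reachable through ptr may be read or written by code it cannot
// see, so the preceding stores must actually be performed.
inline void do_not_optimize(void* ptr)
{
    asm volatile("" : : "r"(ptr) : "memory");
}

void timed_kernel(double* data, std::size_t size)
{
    for (auto index = std::size_t{}; index < size; ++index)
    {
        data[index] = 10.0;
    }
    do_not_optimize(data); // prevents dead-store elimination of the loop
}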

Related

memcpy beats SIMD intrinsics

I have been looking at fast ways to copy various amounts of data, when NEON vector instructions are available on an ARM device.
I've done some benchmarks, and have some interesting results. I'm trying to understand what I'm looking at.
I have got four versions to copy data:
1. Baseline
Copies element by element:
for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}
2. NEON
This code loads four values into a temporary register, then copies the register to output.
Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found a way.
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i);  // load 4 elements to tmp SIMD register
    vst1q_s32(&copy2[i], tmp);  // copy 4 elements from tmp SIMD register
}
3. Stepped memcpy
Uses memcpy, but copies 4 elements at a time; this is to compare against the NEON version.
for (int i = 0; i < size; i += 4)
{
    memcpy(orig + i, copy3 + i, 4);
}
4. Normal memcpy
Uses memcpy with the full amount of data.
memcpy(orig, copy4, size);
My benchmark using 2^16 values gave some surprising results:
1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]
The speedup for the NEON version is expected; however, the faster stepped memcpy time is surprising to me, and the time for version 4 even more so.
Why is memcpy doing so well? Does it use NEON under-the-hood? Or are there efficient memory copy instructions I am not aware of?
This question discussed NEON versus memcpy(). However, I don't feel the answers explore sufficiently why the ARM memcpy implementation runs so well.
The full code listing is below:
#include <arm_neon.h>
#include <vector>
#include <cinttypes>
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>
int main(int argc, char *argv[]) {
    int arr_size;
    if (argc == 1)
    {
        std::cout << "Please enter an array size" << std::endl;
        exit(1);
    }
    int size = atoi(argv[1]); // not very C++, sorry
    std::int32_t* orig = new std::int32_t[size];
    std::int32_t* copy = new std::int32_t[size];
    std::int32_t* copy2 = new std::int32_t[size];
    std::int32_t* copy3 = new std::int32_t[size];
    std::int32_t* copy4 = new std::int32_t[size];

    // Non-neon version
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i)
    {
        copy[i] = orig[i];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // NEON version
    begin = std::chrono::steady_clock::now();
    int32x4_t tmp;
    for (int i = 0; i < size; i += 4)
    {
        tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
        vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
    }
    end = std::chrono::steady_clock::now();
    std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // Memcpy example
    begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i += 4)
    {
        memcpy(orig + i, copy3 + i, 4);
    }
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // Memcpy example
    begin = std::chrono::steady_clock::now();
    memcpy(orig, copy4, size);
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    return 0;
}
Note: this code uses memcpy in the wrong direction. It should be memcpy(dest, src, num_bytes). Note also that the size arguments count bytes, not elements: the stepped version copies only 4 bytes (one element) per iteration, and the final call copies size bytes rather than size * sizeof(std::int32_t).
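For reference, corrected versions of those calls would look like this (copy direction fixed and sizes scaled to bytes):
// stepped copy, corrected: copy3 is the destination and the count is in bytes
for (int i = 0; i < size; i += 4)
{
    memcpy(copy3 + i, orig + i, 4 * sizeof(std::int32_t));
}

// full copy, corrected likewise
memcpy(copy4, orig, size * sizeof(std::int32_t));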
Because the "normal memcpy" test happens last, the massive order of magnitude speedup vs. other tests would be explained by dead code elimination. The optimizer saw that orig is not used after the last memcpy, so it eliminated the memcpy.
A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead-code elimination.
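A minimal sketch with that framework (assuming the Google Benchmark library is available; BM_memcpy is just an illustrative name):
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

static void BM_memcpy(benchmark::State& state)
{
    std::vector<char> src(state.range(0), 'x');
    std::vector<char> dst(state.range(0));
    for (auto _ : state)
    {
        std::memcpy(dst.data(), src.data(), src.size());
        benchmark::DoNotOptimize(dst.data()); // keep the result observable
        benchmark::ClobberMemory();           // force the stores to happen
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * int64_t(state.range(0)));
}
BENCHMARK(BM_memcpy)->Arg(1 << 16);
BENCHMARK_MAIN();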

Parallelizing a small array is slower than parallelizing a large array?

I wrote a small program that generates random values for two valarrays, and in a for loop the values of those arrays are added into a new one.
However, when I use a small array size (20 elements), the parallel version takes significantly longer than the serial one, and when I'm using large arrays (200 000 elements), it takes roughly the same amount of time (though the parallel one is always a bit slower).
Why is this?
The only reason I can think of is that with the large array the CPU puts it in the L3 cache and shares it across all cores, whereas with the small one it has to copy it around the lower cache levels. Or am I getting this wrong?
Here is the code:
#include <valarray>
#include <iostream>
#include <ctime>
#include <omp.h>
#include <chrono>
int main()
{
    int size = 2000000;
    std::valarray<double> num1(size), num2(size), result(size);
    std::srand(std::time(nullptr));
    std::chrono::time_point<std::chrono::steady_clock> start, stop;
    std::chrono::microseconds duration;
    for (int i = 0; i < size; ++i) {
        num1[i] = std::rand();
        num2[i] = std::rand();
    }

    // Parallel execution
    start = std::chrono::steady_clock::now();
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::steady_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Parallel for loop executed in: " << duration.count() << " microseconds" << std::endl;

    // Serial execution
    start = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i) {
        result[i] = num1[i] + num2[i];
    }
    stop = std::chrono::steady_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << "Serial for loop executed in: " << duration.count() << " microseconds" << std::endl;
}
Output with size = 200 000
Parallel for loop executed in: 2450 microseconds
Serial for loop executed in: 2726 microseconds
Output with size = 20
Parallel for loop executed in: 4727 microseconds
Serial for loop executed in: 0 microseconds
I'm using a Xeon E3-1230 v5, and I'm compiling with Intel's compiler using maximum optimization and Skylake-specific optimizations as well.
I get identical results with Visual Studio's C++ compiler.

Impact of the prior loop iteration on the execution time of the current iteration

I am trying to measure the performance of concurrent insertion into a folly hashmap. A simplified version of a program for such insertion is shown here:
#include <folly/concurrency/ConcurrentHashMap.h>

#include <chrono>
#include <cmath>  // std::ceil
#include <iostream>
#include <memory> // std::unique_ptr
#include <mutex>
#include <thread>
#include <vector>

const int kNumMutexLocks = 2003;
std::unique_ptr<std::mutex[]> mutices(new std::mutex[kNumMutexLocks]);

__inline__ void
concurrentInsertion(unsigned int threadId, unsigned int numInsertionsPerThread,
                    unsigned int numInsertions, unsigned int numUniqueKeys,
                    folly::ConcurrentHashMap<int, int> &follyMap) {
  int base = threadId * numInsertionsPerThread;
  for (int i = 0; i < numInsertionsPerThread; i++) {
    int idx = base + i;
    if (idx >= numInsertions)
      break;
    int val = idx;
    int key = val % numUniqueKeys;
    mutices[key % kNumMutexLocks].lock();
    auto found = follyMap.find(key);
    if (found != follyMap.end()) {
      int oldVal = found->second;
      if (oldVal < val) {
        follyMap.assign(key, val);
      }
    } else {
      follyMap.insert(key, val);
    }
    mutices[key % kNumMutexLocks].unlock();
  }
}

void func(unsigned int numInsertions, float keyValRatio) {
  const unsigned int numThreads = 12; // simplified just for this post
  unsigned int numUniqueKeys = numInsertions * keyValRatio;
  unsigned int numInsertionsPerThread = std::ceil(numInsertions * 1.0 / numThreads);

  std::vector<std::thread> insertionThreads;
  insertionThreads.reserve(numThreads);

  folly::ConcurrentHashMap<int, int> follyMap;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < numThreads; i++) {
    insertionThreads.emplace_back(std::thread([&, i] {
      concurrentInsertion(i, numInsertionsPerThread, numInsertions,
                          numUniqueKeys, follyMap);
    }));
  }
  for (int i = 0; i < numThreads; i++) {
    insertionThreads[i].join();
  }
  auto end = std::chrono::steady_clock::now();
  auto diff = end - start;
  float insertionTimeMs =
      std::chrono::duration<double, std::milli>(diff).count();
  std::cout << "i: " << numInsertions << "\tj: " << keyValRatio
            << "\ttime: " << insertionTimeMs << std::endl;
}

int main() {
  std::vector<float> js = {0.5, 0.25};
  for (auto j : js) {
    std::cout << "-------------" << std::endl;
    for (int i = 2048; i < 4194304 * 8; i *= 2) {
      func(i, j);
    }
  }
}
The problem is that using this loop in main suddenly increases the measured time in the func function. That is, if I call the function directly from main without any loop (as shown in what follows), the measured time for some cases is suddenly more than 100X smaller.
int main() {
    func(2048, 0.25); // ~100X faster now that the loop is gone
}
Possible Reasons
I allocate a huge amount of memory while building the hashmap. I believe that when I run the code in a loop, the computer is busy freeing the memory of the first iteration while the second iteration of the loop is being executed. Hence, the program becomes much slower. If this is the case, I'd be grateful if someone could suggest a change so that I can get the same results with the loop.
More Details
Please note that if I unroll the loop in main, I have the same issue. That is, the following program has the same problem:
int main() {
    performComputation(input A);
    ...
    performComputation(input Z);
}
Sample Output
The output of the first program is shown here:
i: 2048 j: 0.5 time: 1.39932
...
i: 16777216 j: 0.5 time: 3704.33
-------------
i: 2048 j: 0.25 time: 277.427 <= sudden increase in execution time
i: 4096 j: 0.25 time: 157.236
i: 8192 j: 0.25 time: 50.7963
i: 16384 j: 0.25 time: 133.151
i: 32768 j: 0.25 time: 8.75953
...
i: 2048 j: 0.25 time: 162.663
Running the func alone in main with i=2048 and j=0.25 yields:
i: 2048 j: 0.25 time: 1.01
Any comment/insight is highly appreciated.
If it is the memory allocation that is slowing it down, and the contents of the memory before performComputation(input) are irrelevant, then you could just re-use the allocated memory block:
void performComputation(input, std::vector<char>& memory) {
    /* Note: memory will need to be passed by reference */
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> t;
    for (int i = 0; i < numThreads; i++) {
        t.emplace_back(std::thread([&, i] {
            func(...); // random access to memory
        }));
    }
    for (int i = 0; i < numThreads; i++) {
        t[i].join();
    }
    auto end = std::chrono::steady_clock::now();
    float time = std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    // A. Allocate ~1 GiB of memory here, once, before the loop
    std::vector<char> memory(1024 * 1024 * 1024); // 1 GiB
    for (auto& input : inputs)
        performComputation(input, memory);
}
I can't be too confident on the exact details, but it seems to me to be a result of memory allocation in building the map. I replicated the behaviour you're seeing using a plain unordered_map and a single mutex, and making the map object in func static fixed it entirely. (Actually now it's slightly slower the first time around, since no memory has been allocated for the map yet, and then faster and a consistent time every subsequent run.)
I'm not sure why this makes a difference, since the map has been destructed and the memory should have been freed. For some reason it seems the map's freed memory isn't reused on subsequent calls to func. Perhaps someone else more knowledgeable than I can elaborate on this.
Edit: a reduced minimal reproducible example and its output.
#include <chrono>
#include <iostream>
#include <unordered_map>

void func(int num_insertions)
{
    const auto start = std::chrono::steady_clock::now();
    std::unordered_map<int, int> map;
    for (int i = 0; i < num_insertions; ++i)
    {
        map.emplace(i, i);
    }
    const auto end = std::chrono::steady_clock::now();
    const auto diff = end - start;
    const auto time = std::chrono::duration<double, std::milli>(diff).count();
    std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
}

int main()
{
    func(2048);
    func(16777216);
    func(2048);
}
With non-static map:
i: 2048 time: 0.6035
i: 16777216 time: 4629.03
i: 2048 time: 124.44
With static map:
i: 2048 time: 0.6524
i: 16777216 time: 4828.6
i: 2048 time: 0.3802
Another edit: I should also mention that the static version requires a call to map.clear() at the end, though that's not really relevant to the question of insertion performance.
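For reference, a sketch of the static variant described above (reconstructed from the description, not the exact code used for the timings; same includes as the example above):
void func(int num_insertions)
{
    static std::unordered_map<int, int> map; // allocation persists across calls
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < num_insertions; ++i)
    {
        map.emplace(i, i);
    }
    const auto end = std::chrono::steady_clock::now();
    const auto time = std::chrono::duration<double, std::milli>(end - start).count();
    std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
    map.clear(); // so the next call starts from an empty map
}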
When measuring wall-clock time, use averages!
You are measuring wall-clock time. The actual time jumps seen here are fairly small in this regard, and could in theory be caused by OS delays, other processing, or perhaps thread management (e.g. cleanup) done by your program. Note that this can vary a lot depending on platform/system, and remember that a context switch can easily take ~10-15 ms. There are just too many parameters in play to be sure.
When using the wall clock to measure, it is common practice to average over a loop of some hundreds or thousands of runs, to take spikes and the like into account, as sketched below.
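A minimal sketch of that practice, where work stands for whatever callable wraps the code under test (an illustrative placeholder, not from the original answer):
#include <chrono>

// Averages the wall-clock time of 'work' over 'runs' repetitions so that
// occasional OS-scheduling spikes are smoothed out.
template <typename F>
double average_ms(F&& work, int runs = 1000)
{
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
    {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count() / runs;
}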
Use a profiler
Learn to use a profiler - it can help you quickly see what your program is actually spending its time on, and it will save you precious time again and again.

Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?

If we look at the Visual C++ documentation of omp_set_dynamic, it is literally copy-pasted from the OMP 2.0 standard (section 3.1.7 on page 39):
If [the function argument] evaluates to a nonzero value, the number of threads that are used for executing upcoming parallel regions may be adjusted automatically by the run-time environment to best use system resources. As a consequence, the number of threads specified by the user is the maximum thread count. The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function.
It seems clear that omp_set_dynamic(1) allows the implementation to use fewer than the current maximum number of threads for a parallel region (presumably to prevent oversubscription under high loads). Any reasonable reading of this paragraph would suggest that said reduction should be observable by querying omp_get_num_threads inside parallel regions.
(Both documentations also show the signature as void omp_set_dynamic(int dynamic_threads);. It appears that "the number of threads specified by the user" does not refer to dynamic_threads but instead means "whatever the user specified using the remaining OpenMP interface").
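For reference, the straightforward way to observe such a reduction would be a query along these lines (a minimal sketch of the expectation, not code from the original question):
#include <omp.h>
#include <iostream>

int main()
{
    omp_set_dynamic(1); // allow the runtime to shrink the team

    #pragma omp parallel
    {
        // single: print once per region, not once per thread
        #pragma omp single
        std::cout << "Team size: " << omp_get_num_threads() << '\n';
    }
    return 0;
}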
However, no matter how high I push my system load under omp_set_dynamic(1), the return value of omp_get_num_threads (queried inside the parallel regions) never changes from the maximum in my test program. Yet I can still observe clear performance differences between omp_set_dynamic(1) and omp_set_dynamic(0).
Here is a sample program to reproduce the issue:
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <omp.h>

#define UNDER_LOAD true
const int SET_DYNAMIC_TO = 1;
const int REPEATS = 3000;
const unsigned MAXCOUNT = 1000000;

std::size_t threadNumSum = 0;
std::size_t threadNumCount = 0;

void oneRegion(int i)
{
    // Pseudo-randomize the number of iterations.
    unsigned ui = static_cast<unsigned>(i);
    int count = static_cast<int>(((MAXCOUNT + 37) * (ui + 7) * ui) % MAXCOUNT);

    #pragma omp parallel for schedule(guided, 512)
    for (int j = 0; j < count; ++j)
    {
        if (j == 0)
        {
            threadNumSum += omp_get_num_threads();
            threadNumCount++;
        }
        if ((j + i + count) % 16 != 0)
            continue;

        // Do some floating point math.
        double a = j + i;
        for (int k = 0; k < 10; ++k)
            a = std::sin(i * (std::cos(a) * j + std::log(std::abs(a + count) + 1)));
        volatile double out = a;
    }
}

int main()
{
    omp_set_dynamic(SET_DYNAMIC_TO);

#if UNDER_LOAD
    for (int i = 0; i < 10; ++i)
    {
        std::thread([]()
        {
            unsigned x = 0;
            float y = static_cast<float>(std::sqrt(2));
            while (true)
            {
                //#pragma omp parallel for
                for (int i = 0; i < 100000; ++i)
                {
                    x = x * 7 + 13;
                    y = 4 * y * (1 - y);
                }
                volatile unsigned xx = x;
                volatile float yy = y;
            }
        }).detach();
    }
#endif

    std::chrono::high_resolution_clock clk;
    auto start = clk.now();
    for (int i = 0; i < REPEATS; ++i)
        oneRegion(i);
    std::cout << (clk.now() - start).count() / 1000ull / 1000ull << " ms for " << REPEATS << " iterations" << std::endl;

    double averageThreadNum = double(threadNumSum) / threadNumCount;
    std::cout << "Entered " << threadNumCount << " parallel regions with " << averageThreadNum << " threads each on average." << std::endl;

    std::getchar();
    return 0;
}
Compiler version: Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27024.1 for x64
On e.g. gcc, this program will print a significantly lower averageThreadNum for omp_set_dynamic(1) than for omp_set_dynamic(0). But on MSVC, the same value is shown in both cases, despite a 30% performance difference (170s vs 230s).
How can this be explained?
In Visual C++, the number of threads executing the loop does get reduced with omp_set_dynamic(1) in this example, which explains the performance difference.
However, contrary to any good-faith interpretation of the standard (and Visual C++ docs), omp_get_num_threads does not report this reduction.
The only way to figure out how many threads MSVC actually uses for each parallel region is to inspect omp_get_thread_num on every loop iteration (or parallel task). The following would be one way to do it with little in-loop performance overhead:
// std::hardware_destructive_interference_size is not available in gcc or clang; see also the comments by Peter Cordes:
// https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons
struct alignas(2 * std::hardware_destructive_interference_size) NoFalseSharing
{
    int flagValue = 0;
};

void foo(int count)
{
    std::vector<NoFalseSharing> flags(omp_get_max_threads());

    #pragma omp parallel for
    for (int j = 0; j < count; ++j)
    {
        flags[omp_get_thread_num()].flagValue = 1;
        // Your real loop body
    }

    int realOmpNumThreads = 0;
    for (auto flag : flags)
        realOmpNumThreads += flag.flagValue;
}
Indeed, you will find realOmpNumThreads to yield significantly different values from the omp_get_num_threads() inside the parallel region with omp_set_dynamic(1) on Visual C++.
One could argue that technically
"the number of threads in the team executing a parallel region" and
"the number of threads that are used for executing upcoming parallel regions"
are not literally the same.
This is a nonsensical interpretation of the standard in my view, because the intent is very clear and there is no reason for the standard to say "The number of threads in the team executing a parallel region stays fixed for the duration of that parallel region and is reported by the omp_get_num_threads function" in this section if this number is unrelated to the functionality of omp_set_dynamic.
However, it could be that MSVC decided to keep the number of threads in a team unaffected and just assign no loop iterations for execution to a subset of them under omp_set_dynamic(1) for ease of implementation.
Whatever the case may be: Do not trust omp_get_num_threads in Visual C++.

OpenMP overhead calculation

Given n threads, is there a way that I can calculate the amount of overhead (e.g. the number of cycles) required to implement a specific directive in OpenMP?
For example, given the code below
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < m; i++)
        a[i] = b[i] + c[i];
}
Can I calculate somehow how much overhead is required to create these threads?
I think the way to measure the overhead is to time both the serial and parallel versions, and then see how far off the parallel version is from its 'ideal' running time for your number of threads.
So for example, if your serial version takes 10 seconds and you have 4 threads on 4 cores, then your ideal running time is 2.5 seconds. If your OpenMP version takes 4 seconds, then your 'overhead' is 1.5 seconds. I put overhead in quotes because some of that will be thread creation and memory sharing (actual threading overhead), and some of that will just be unparallelized sections of code. I'm trying to think here in terms of Amdahl's Law.
For demonstration, here are two examples. They don't measure thread creation overhead, but they might show the difference between expected and achieved improvement. And while Mystical was right that the only real way to measure is to time it, even trivial examples like your for loop aren't necessarily memory bound. OpenMP does a lot of work that we don't see.
Serial (speedtest.cpp)
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    for (int i = 0; i < SIZE; i++) {
        a[i] = b[i] * c[i] * 2;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    for (int i = 0; i < SIZE; i++) {
        a[i] = b[i] + c[i] + 1;
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
Parallel (omp_speedtest.cpp)
#include <omp.h>
#include <iostream>

int main(int argc, char** argv) {
    const int SIZE = 100000000;
    int* a = new int[SIZE];
    int* b = new int[SIZE];
    int* c = new int[SIZE];

    std::cout << "There are " << omp_get_num_procs() << " procs." << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < SIZE; i++) {
            a[i] = b[i] * c[i];
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < SIZE; i++) {
            a[i] = b[i] + c[i] + 1;
        }
    }
    std::cout << "a[" << (SIZE-1) << "]=" << a[SIZE-1] << std::endl;

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
So I compiled these with
g++ -O3 -o speedtest.exe speedtest.cpp
g++ -fopenmp -O3 -o omp_speedtest.exe omp_speedtest.cpp
And when I ran them
$ time ./speedtest.exe
a[99999999]=0
a[99999999]=1
real 0m1.379s
user 0m0.015s
sys 0m0.000s
$ time ./omp_speedtest.exe
There are 4 procs.
a[99999999]=0
a[99999999]=1
real 0m0.854s
user 0m0.015s
sys 0m0.015s
Yes, you can. Please take a look at the EPCC benchmark. Although this code is a bit older, it measures the various overheads of OpenMP's constructs, including omp parallel for and omp critical.
The basic approach is very simple and straightforward: you measure a baseline serial time without any OpenMP, then include just the OpenMP pragma you want to measure, and subtract the two elapsed times. This is exactly how the EPCC benchmark measures the overhead; see its source file 'syncbench.c'.
Please note that the overhead is expressed as time rather than as a number of cycles. I also tried to measure the number of cycles, but an OpenMP parallel construct's overhead may include blocked time due to synchronization, so cycle counts may not reflect the real overhead of OpenMP. A minimal sketch of the subtraction approach is given below.
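The sketch (my illustration, in the spirit of syncbench.c; the volatile sink merely keeps the loops from being optimized away):
#include <omp.h>
#include <iostream>

int main()
{
    const int repetitions = 1000;
    volatile int sink = 0; // prevents the loop bodies from being optimized out

    // Baseline: the same trivial work without any OpenMP construct.
    double t0 = omp_get_wtime();
    for (int r = 0; r < repetitions; ++r)
    {
        sink = sink + 1;
    }
    double serial = omp_get_wtime() - t0;

    // Same work, but entering (and tearing down) a parallel region each time.
    t0 = omp_get_wtime();
    for (int r = 0; r < repetitions; ++r)
    {
        #pragma omp parallel
        {
            #pragma omp master
            sink = sink + 1;
        }
    }
    double parallel = omp_get_wtime() - t0;

    std::cout << "Approximate overhead per 'omp parallel': "
              << (parallel - serial) / repetitions * 1e6 << " us\n";
    return 0;
}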