OpenMP task directive slower multithreaded than single-threaded - C++

I've encountered a problem where the task directive seems to slow down the execution time of the code the more threads I have. I have removed everything from my code that isn't related to the problem, since the issue still occurs even for this slimmed-down piece of code that doesn't really do anything. The general idea behind this code is that the master thread generates tasks for all the other worker threads to execute.
#ifndef _REENTRANT
#define _REENTRANT
#endif
#include <vector>
#include <iostream>
#include <random>
#include <sched.h>
#include <semaphore.h>
#include <time.h>
#include <bits/stdc++.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>
#include <chrono>
#define MAXWORKERS 16
using namespace std;
int nbrThreads = MAXWORKERS; //Number of threads
void busyWait() {
    for (int i = 0; i < 999; i++) {}
}

void generatePlacements() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            int j = 0;
            while (j < 8*7*6*5*4*3*2) {
                #pragma omp task
                {
                    busyWait();
                }
                j++;
            }
        }
    }
}

int main(int argc, char const *argv[])
{
    for (int i = 1; i <= MAXWORKERS; i++) {
        int nbrThreads = i;
        omp_set_num_threads(nbrThreads);
        auto begin = omp_get_wtime();
        generatePlacements();
        auto end = omp_get_wtime();
        auto diff = end - begin;
        cout << "Time taken for " << nbrThreads << " threads to execute was " << diff << endl;
    }
    return 0;
}
And I get the following output from running the program:
Time taken for 1 threads to execute was 0.0707005
Time taken for 2 threads to execute was 0.0375168
Time taken for 3 threads to execute was 0.0257982
Time taken for 4 threads to execute was 0.0234329
Time taken for 5 threads to execute was 0.0208451
Time taken for 6 threads to execute was 0.0288127
Time taken for 7 threads to execute was 0.0380352
Time taken for 8 threads to execute was 0.0403016
Time taken for 9 threads to execute was 0.0470985
Time taken for 10 threads to execute was 0.0539719
Time taken for 11 threads to execute was 0.0582986
Time taken for 12 threads to execute was 0.051923
Time taken for 13 threads to execute was 0.571846
Time taken for 14 threads to execute was 0.569011
Time taken for 15 threads to execute was 0.562491
Time taken for 16 threads to execute was 0.562118
Most notable is that from 6 threads on, the time seems to get slower, and going from 12 threads to 13 threads has the biggest performance hit, becoming a whopping 10 times slower. Now I know that this issue revolves around the OpenMP task directive, since if I remove the busyWait() function the performance stays the same as seen above. But if I also remove the #pragma omp task header along with the busyWait() call, I don't get any slowdown whatsoever, so the slowdown can't depend on thread creation. I have no clue what the problem is here.

First of all, the for (int i=0; i < 999; i++){} loop can be optimized away by the compiler when optimization flags like -O2 or -O3 are enabled. In fact, mainstream compilers like Clang and GCC remove it at -O2. Profiling a non-optimized build is a waste of time and should never be done unless you have a very good reason to do so.
Assuming you enabled optimizations, the created tasks will be empty, which means you are measuring the time to create many tasks. The thing is, creating tasks is slow, and creating many tasks that do nothing causes contention, making the creation even slower. Task granularity should be carefully tuned so as not to put too much pressure on the OpenMP runtime. Assuming you did not enable optimizations, then even a loop of 999 iterations is not enough to keep the runtime from being under pressure (such a loop should last less than 1 µs on mainstream machines). Tasks should last at least a few microseconds for the overhead not to be the main bottleneck. On mainstream servers with a lot of cores, it should be at least dozens of microseconds. For the overhead to be negligible, tasks should last even longer. Task scheduling is powerful but expensive.
Due to the use of shared data structures protected with atomics and locks in OpenMP runtimes, the contention tends to grow with the number of cores. On NUMA systems, it can be significantly higher when using multiple NUMA nodes due to NUMA effects. AMD processors with 16 cores are typically processors having multiple NUMA nodes. Using SMT (multiple hardware threads per physical core) does not significantly speed up this operation and adds more pressure on the OpenMP scheduler and the OS scheduler, so it is generally not a good idea to use more threads than cores in this case (it can be worth it when the tasks' computational work benefits from SMT, for latency-bound tasks for example, and when the overhead is small).
For more information about the overhead of mainstream OpenMP runtimes please consider reading On the Impact of OpenMP Task Granularity.


Is lock-free multithreading slower than a single-threaded program?

I have considered parallelizing a program so that in the first phase it sorts items into buckets modulo the number of parallel workers, avoiding collisions in the second phase. Each thread of the parallel program uses std::atomic::fetch_add to reserve a place in the output array, and then uses std::atomic::compare_exchange_weak to update the current bucket head pointer. So it's lock-free. However, I had doubts about the performance of multiple threads contending for a single atomic (the one we fetch_add on; the number of bucket heads equals the number of threads, so on average there is not much contention), so I decided to measure this. Here is the code:
#include <atomic>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>
std::atomic<int64_t> gCounter(0);
const int64_t gnAtomicIterations = 10 * 1000 * 1000;

void CountingThread() {
    for (int64_t i = 0; i < gnAtomicIterations; i++) {
        gCounter.fetch_add(1, std::memory_order_acq_rel);
    }
}

void BenchmarkAtomic() {
    const uint32_t maxThreads = std::thread::hardware_concurrency();
    std::vector<std::thread> thrs;
    thrs.reserve(maxThreads + 1);
    for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
        auto start = std::chrono::high_resolution_clock::now();
        for (uint32_t i = 0; i < nThreads; i++) {
            thrs.emplace_back(CountingThread);
        }
        for (uint32_t i = 0; i < nThreads; i++) {
            thrs[i].join();
        }
        auto elapsed = std::chrono::high_resolution_clock::now() - start;
        double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        printf("%d threads: %.3lf Ops/sec, counter=%lld\n", (int)nThreads, (nThreads * gnAtomicIterations) / nSec,
               (long long)gCounter.load(std::memory_order_acquire));
        thrs.clear();
        gCounter.store(0, std::memory_order_release);
    }
}

int __cdecl main() {
    BenchmarkAtomic();
    return 0;
}
And here is the output:
1 threads: 150836387.770 Ops/sec, counter=10000000
2 threads: 91198022.827 Ops/sec, counter=20000000
3 threads: 78989357.501 Ops/sec, counter=30000000
4 threads: 66808858.187 Ops/sec, counter=40000000
5 threads: 68732962.817 Ops/sec, counter=50000000
6 threads: 64296828.452 Ops/sec, counter=60000000
7 threads: 66575046.721 Ops/sec, counter=70000000
8 threads: 64487317.763 Ops/sec, counter=80000000
9 threads: 63598622.030 Ops/sec, counter=90000000
10 threads: 62666457.778 Ops/sec, counter=100000000
11 threads: 62341701.668 Ops/sec, counter=110000000
12 threads: 62043591.828 Ops/sec, counter=120000000
13 threads: 61933752.800 Ops/sec, counter=130000000
14 threads: 62063367.585 Ops/sec, counter=140000000
15 threads: 61994384.135 Ops/sec, counter=150000000
16 threads: 61760299.784 Ops/sec, counter=160000000
The CPU is 8-core, 16-thread (Ryzen 1800X @ 3.9 GHz).
So the total, over all threads, of operations per second decreases dramatically until 4 threads are used. Then it decreases slowly and fluctuates a bit.
So is this phenomenon common to other CPUs and compilers? Is there any workaround (except resorting to a single thread)?
A lock-free multithreaded program is not slower than a single-threaded program. What makes it slow is data contention. The example you provided is in fact a highly contentious artificial program. In a real program you will do a lot of work between accesses to shared data, so it will have fewer cache invalidations and so on.
This CppCon talk by Jeff Preshing can explain some of your questions better than I did.
Added: Try modifying CountingThread to add a sleep once in a while, to pretend you are busy with something other than incrementing the atomic variable gCounter. Then play with the value in the if statement to see how it influences the results of your program.
void CountingThread() {
    for (int64_t i = 0; i < gnAtomicIterations; i++) {
        // take a nap every 10000th iteration to simulate work on something
        // unrelated to access to shared resource
        if (i % 10000 == 0) {
            std::chrono::milliseconds timespan(1);
            std::this_thread::sleep_for(timespan);
        }
        gCounter.fetch_add(1, std::memory_order_acq_rel);
    }
}
In general, every time you call gCounter.fetch_add it means marking that data invalid in every other core's cache. That forces them to fetch the data from a cache further from the core. This effect is a major contributor to the performance slowdown in your program.
local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )
local DRAM ~60 ns
remote DRAM ~100 ns
Above table taken from Approximate cost to access various caches and main memory?
Lock-free doesn't mean you can exchange data between threads without cost. Lock-free means that you don't wait for other threads to unlock a mutex before you read shared data. In fact, even lock-free programs rely on hardware synchronization (atomic read-modify-write operations) to prevent data corruption, and those are not free.
Just follow a simple rule: access shared data as little as possible to gain more from multicore programming.
It depends on the concrete workload.
See Amdahl's law:

    speedup = 100% (whole workload) / ((sequential workload in %) + (parallel workload in %) / (number of workers))

The parallel workload in your program is 0%, so the speedup is 1, i.e. no speedup. (You are synchronizing on increments of the same memory cell, so only one thread can increment the cell at any given time.)
A rough explanation of why it even performs worse than speedup = 1:
The cache line containing gCounter stays in the CPU cache with only one thread.
With multiple threads, which are scheduled onto different CPUs or cores, the cache line containing gCounter bounces around the different caches of those CPUs or cores.
So the difference is somewhat comparable to incrementing a register with only one thread versus accessing memory for each increment operation. (Sometimes it is faster than a memory access, as there are cache-to-cache transfers in modern CPU architectures.)
Like most very broad "which is faster" questions, the only completely general answer is: it depends.
A good mental model is that, when parallelizing an existing task, the runtime of the parallel version over N threads is composed of roughly three contributions:
A still-serial part common to both the serial and parallel algorithms, i.e. work that wasn't parallelized, such as setup or teardown work, or work that didn't run in parallel because the task was inexactly partitioned1.
A parallel part which was effectively parallelized among the N workers.
An overhead component that represents extra work done in the parallel algorithm that doesn't exist in the serial version. There is almost invariably some small amount of overhead to partition the work, delegate to worker threads, and combine the results, but in some cases the overhead can swamp the real work.
So in general you have these three contributions; let's call them T1p, T2p and T3p respectively. Now the T1p component exists and takes the same time in both the serial and parallel algorithms, so we can ignore it, since it cancels out for the purposes of determining which is slower.
Of course, if you used coarser-grained synchronization, e.g., incrementing a local variable on each thread and only periodically (perhaps only once at the very end) updating the shared variable, the situation would reverse.
1 This also includes the case where the workload was well partitioned, but some threads did more work per unit time, which is common on modern CPUs and in modern OSes.

Why does this delay-loop start to run faster after several iterations with no sleep?

Consider:
#include <time.h>
#include <unistd.h>
#include <iostream>
using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
    for (int j = 0; j < N; j++) {
    }
}

int main() {
    clock_t main_start = clock();
    for (int i = 0; i < times; i++) {
        clock_t start = clock();
        run();
        cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
        //usleep(1000);
    }
    cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}
Here is the example code. In the first 26 iterations of the timing loop, the run function costs about 0.4 ms, but then the cost reduces to 0.2 ms.
When the usleep is uncommented, the delay-loop takes 0.4 ms for all runs, never speeding up. Why?
The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It's run on an Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, with 3.13.0-32-generic Ubuntu 14.04.1 LTS (Trusty Tahr).
After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.
If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
If you tested on a Skylake with kernel support for the new power-management mode (where the hardware takes full control of the clock speed), ramp-up would happen much faster.
If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)
Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
comments on other theories:
re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.
Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!
In more detail for x86:
The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.
The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.
gcc -O0 will compile the OP's code to something like this (Godbolt Compiler explorer): keeping the loop counter in memory on the stack.
The empty loop keeps the loop counter in stack memory, so on a typical Intel x86 CPU the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple cache misses (~200 cycles each for code-fetch misses which prevent further instructions from issuing until they're resolved).
Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).
Even if the loop-counter was kept in a register, 100k cycles is a lot.
A call to usleep may or may not result in a context switch. If it does, it will take longer than if it doesn't.

Multithread superlinear performance implementation with CPU boost mode consideration

I am studying classes in C++11, using the MinGW 4.8.1 library on a Windows 7 64-bit OS.
The CPU is an Intel® Core™ i7-820QM processor, which has four physical cores with 8 MB of cache and supports a maximum of eight threads. This CPU has a base operating frequency of 1.73 GHz if four cores are used simultaneously, and can be boosted to 3.08 GHz if only one core is used.
The main target of my study is to implement a multithreaded test program demonstrating super-linear performance increases as the number of threads increases.
Here, the SUPER-linear term means exactly 4 times speedup (maybe 3.8 times is acceptable) when employing four threads compared to a single thread, not 3.2 or 3.5 times.
The codes and results are pasted here,
// These codes are extracted from a real application, except that the count
// function does some "meaningful" job; they have an identical speedup ratio
// to my real application.
inline void count(int workNum) // some work to do
{
    int s = 0;
    for (int i = 0; i < workNum; ++i)
        ++s;
}

inline void devide(int numThread) // create multiple threads [1,7] to do the same amount of work
{
    int max = 100000000;
    typedef std::vector<std::thread> threadList;
    threadList list;
    for (int i = 1; i <= numThread; ++i) {
        list.push_back(std::thread(count, max / numThread));
    }
    std::for_each(list.begin(), list.end(), std::mem_fun_ref(&std::thread::join));
}

inline void thread_efficiency_load() // start the test
{
    for (int i = 7; i > 0; --i)
    {
        std::cout << "*****************************************" << std::endl;
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        devide(i); // this is the measured workload; i is the number of threads
        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "thread num=#" << i << " time=" << elapsed_seconds.count() << std::endl;
    }
}
The output is:
The time unit is seconds,
*****************************************
thread num=#7 time=0.101006
*****************************************
thread num=#6 time=0.0950055
*****************************************
thread num=#5 time=0.0910052
*****************************************
thread num=#4 time=0.0910052
*****************************************
thread num=#3 time=0.102006
*****************************************
thread num=#2 time=0.127007
*****************************************
thread num=#1 time=0.229013
It is very clear that I do not obtain super-linear performance increases as the number of threads increases. I would like to know why I do not get it. Why? Why? Why?
Some basic thoughts from my mind:
Since there are only 4 physical cores, the maximum speedup should show up when there are four active threads (more threads do not really help a lot). There is only a 2.4 times speedup using four cores compared to a single one, where a 4 times speedup is expected. I hope the above implementation is not blocked from the 4 times speedup by memory issues (cache-line contention), because all variables are local variables.
Considering the CPU boost mode, the CPU increases its operating frequency to 3.08 GHz when only one core is busy, which is a ratio of about 1.7 (the base operating frequency of the cores is 1.73 GHz); 2.4 * 1.7 is about 4, as expected. Does that really mean a 2.4 times speedup is the maximum that can be achieved compared to the boosted single-thread mode?
I would be very appreciative if you could answer:
1) In the above implementation, are there some variables located on the same cache line, resulting in a lot of cache-line bouncing between threads that reduces performance?
2) How can the above code be modified to achieve super-linear performance (4 times speedup compared to a single thread) as the number of threads increases?
Thank you very much for your help.
Just as a warning up front: arguing about the actual performance numbers of a multithreaded program on a modern x86/x64 system without an RTOS is always a lot of speculation; there are just too many layers between your C/C++ code and the actual operations performed on the processor.
As a rough upper-bound estimate: yes, for an ALU-bound (not memory-bound) workload you won't get much more than a 1.73 * 4 / 3.08 = 2.24 times speedup factor for 4 threads on 4 cores vs. 1 thread on one core, even in the ideal case. Aside from that, I'd argue that your test's "workload" is too small to get meaningful results. As mentioned in the comments, a compiler would be allowed to completely replace your workload function with a NOP, leaving you with only the overhead of creating and joining the threads and your measurement (although I don't think that happened here).

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any g++ option I can use to make the resulting executable use multiple cores, i.e. make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];
    float *joe = new float[50102133];
    int i, j, k, l;
    //cout << "Starting test...";
    for (i = 0; i < 50102133; i++)
        bob[i] = sin(i);
    for (j = 0; j < 50102133; j++)
        bob[j] = sin(j*j);
    for (k = 0; k < 50102133; k++)
        bob[k] = sin(sqrt(k));
    for (l = 0; l < 50102133; l++)
        bob[l] = cos(l*l);
    cout << "finished test.";
    cout << "the 100120 element is," << bob[1001200];
    return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
    total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>

static const int size = 1024 * 1024 * 128;

int main() {
    double total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
    std::cout << total << "\n";
}
The compiler has no way to tell whether the code inside your loop can safely be executed on multiple cores. If you want to use all your cores, use threads.
Use threads or processes; you may want to look at OpenMP.
C++11 has support for threading, but C++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation-parallel, using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on previous iterations, and the operations in the loop are linear.
On some ARM Cortex-A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.