How to make this OpenMP run faster?


I am new to OpenMP and I have this code for a sparse matrix-vector multiplication. It runs in 40 to 50 seconds and reaches 4237 MFlop/s in total. Is there any way to make it faster?
I have edited the post to include the complete code. As input I have 2 matrices, one with 50000 elements and the second with 400000.
The main problem is that whenever I try something different, the time gets even worse.
#pragma omp parallel for schedule (static,50)
for (int i=0; i< (tInput->stNumRows); ++i) {
    y[i] = 0.0;
    for (int j=Arow[i]; j<Arow[i+1]; ++j)
        y[i] += Aval[j]*x[Acol[j]];
}

One thing you can do to improve the performance of the code is to use vectorization (thanks to SIMD instructions). Here is the resulting code:
for (int i=0; i< (tInput->stNumRows); ++i) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int j=Arow[i]; j<Arow[i+1]; ++j)
        s += Aval[j] * x[Acol[j]];
    y[i] = s;
}
Note that y[i] is no longer read and written repeatedly inside the inner loop, which enables further compiler optimizations. Please take care to compile the code with -O3 (or /O2 for MSVC) so that it can be vectorized effectively. However, this is probably not enough for this code to be vectorized.
Indeed, one issue with this code is the memory indirection x[Acol[j]], which is very hard to vectorize efficiently. Recent x86-64 processors (the ones with AVX2) and very recent ARM processors (the ones with SVE) have SIMD gather instructions to do that (although they are still not great due to the memory access pattern). Without these instructions, a compiler is unlikely to vectorize the code. Thus, you should tell your compiler that it can use these instructions (assuming the target processor is actually recent). For GCC/Clang, one way is to use the non-portable -march=native. Another way is to use -mavx2 combined with -mfma on x86-64 processors (although this does not seem to be as good as -march=native in this case, for very complex reasons).
Another way to improve the code is to mitigate possible load balancing issues and unwanted overheads. Indeed, load balancing issues can appear in your code if the number of nonzeros per row, Arow[i+1]-Arow[i], is very different for many i values. In that case, you can use a guided schedule or a dynamic one. However, keep in mind that using a non-static schedule may introduce a significant overhead (especially if the loop is very small or the gap between values is huge). Finally, you can move the omp parallel directive outside the timing loop body, since opening a parallel region on every iteration can introduce a significant overhead (due to thread creation, depending on the target OpenMP runtime).
Note that the above solutions assume the input matrices are big enough for parallelism to be useful. Moreover, if x is huge, the code will likely be bound by the memory hierarchy and there is not much you can do. Sparse matrix computations are often slow because of such issues.
Here is the final code:
#pragma omp parallel
{
    // Timing loop
    // [...]
    #pragma omp for schedule(guided)
    for (int i=0; i< (tInput->stNumRows); ++i) {
        double s = 0.0;
        #pragma omp simd reduction(+:s)
        for (int j=Arow[i]; j<Arow[i+1]; ++j)
            s += Aval[j] * x[Acol[j]];
        y[i] = s;
    }
    // [...]
}
EDIT: with your input data, the best solution on my machine (with Clang/IOMP) is not to use multiple threads at all since 400000 elements can be computed in roughly 0.3 ms and the overhead of sharing the work between threads is bigger.
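For readers who want something they can compile on its own, the kernel discussed above can be wrapped in a small function. This is only a sketch: the function name spmv_csr and the std::vector-based CSR layout (row_ptr, col_idx, vals) are assumptions made for illustration and are not the asker's tInput structure. Without -fopenmp the pragmas are simply ignored and the code runs serially.

```cpp
#include <vector>

// Computes y = A * x for a matrix A stored in CSR form.
// row_ptr has one entry per row plus one; col_idx and vals hold the nonzeros.
// (Hypothetical helper: names and layout are assumptions for this sketch.)
std::vector<double> spmv_csr(const std::vector<int>& row_ptr,
                             const std::vector<int>& col_idx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
    const int num_rows = static_cast<int>(row_ptr.size()) - 1;
    std::vector<double> y(num_rows > 0 ? num_rows : 0);

    // Rows are independent, so they can be shared between threads.
    // guided scheduling is one option when rows have very different lengths.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < num_rows; ++i) {
        double s = 0.0;  // local accumulator instead of updating y[i] repeatedly
        #pragma omp simd reduction(+:s)
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            s += vals[j] * x[col_idx[j]];
        y[i] = s;
    }
    return y;
}
```

Whether schedule(guided), schedule(static) or no threading at all is fastest depends on the matrix and the machine, so it is worth timing each variant.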

Related

How can I multithread this code snippet in C++ with Eigen

I'm trying to implement a faster version of the following code fragment:
Eigen::VectorXd dTX = (( (XPSF.array() - x0).square() + (ZPSF.array() - z0).square() ).sqrt() + txShift)*fs/c + t0*fs;
Eigen::VectorXd Zsq = ZPSF.array().square();
Eigen::MatrixXd idxt(XPSF.size(),nc);
for (int i = 0; i < nc; i++) {
idxt.col(i) = ((XPSF.array() - xe(i)).square() + Zsq.array()).sqrt()*fs/c + dTX.array();
idxt.col(i) = (abs(XPSF.array()-xe(i)) <= ZPSF.array()*0.5/fnumber).select(idxt.col(i),-1);
}
The sample array sizes I'm working with right now are:
XPSF: Column Vector of 591*192 coefficients (113,472 total values in the column vector)
ZPSF: Same size as XPSF
xe: RowVector of 192 coefficients
idxt: Matrix of 113,472x192 size
Current runs with gcc, -msse2 and -O3 optimization yield an average time of ~0.08 seconds for the first line of the loop and ~0.03 seconds for the second line of the loop. I know that runtimes are platform dependent, but I believe that this can still be much faster. A commercial software package performs the operations I'm trying to do here in roughly two orders of magnitude less time. Also, I suspect my code is a bit amateurish right now!
I've tried reading over Eigen documentation to understand how vectorization works, where it is implemented and how much of this code might be "implicitly" parallelized by Eigen, but I've struggled to keep track of the details. I'm also a bit new to C++ in general, but I've seen the documentation and other resources regarding std::thread and have tried to combine it with this code, but without much success.
Any advice would be appreciated.
Update 2
I would upvote Soleil's answer, because it contains helpful information, if I had the reputation score for it. However, I should clarify that I would like to first figure out what optimizations I can do without a GPU. I'm convinced (albeit without OpenMP) that Eigen's inherent multithreading and vectorization won't speed it up any further (unless there are unnecessary temporaries being generated). How could I use something like std::thread to explicitly parallelize this? I'm struggling to combine both std::thread and Eigen to this end.

OpenMP
If your CPU has enough cores and threads, a simple and quick first step is usually to invoke OpenMP by adding the pragma:
#pragma omp parallel for
for (int i = 0; i < nc; i++)
and compile with /openmp (cl) or -fopenmp (gcc), or just use -ftree-parallelize-loops=n with gcc in order to auto-parallelize the loops.
This will distribute the loop iterations over the number of parallel threads your CPU can handle (8 threads with the 7700HQ).
In general you also can set a clause num_threads(n) where n is the desired number of threads:
#pragma omp parallel num_threads(8)
Where I used 8 since the 7700HQ can handle 8 concurrent threads.
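As an illustration of parallelizing over the columns, here is a minimal sketch that uses plain std::vector storage instead of Eigen so that it is self-contained. The function name delay_table and all parameter names are hypothetical; they only mirror the variables in the question (XPSF, ZPSF, xe, dTX, fs, c, fnumber). It also fuses the two statements of the loop body into a single pass over each column, which is a change relative to the question's code rather than something the answer proposed.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns an n-by-nc table in column-major order (hypothetical stand-in for idxt).
// Entries outside the aperture |x - xe(i)| <= z * 0.5 / fnumber are set to -1.
std::vector<double> delay_table(const std::vector<double>& xpsf,
                                const std::vector<double>& zpsf,
                                const std::vector<double>& xe,
                                const std::vector<double>& dtx,
                                double fs, double c, double fnumber) {
    const std::size_t n = xpsf.size();
    const int nc = static_cast<int>(xe.size());
    std::vector<double> idxt(n * xe.size());

    // Each column is written by exactly one thread, so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < nc; ++i) {
        double* col = idxt.data() + static_cast<std::size_t>(i) * n;
        for (std::size_t k = 0; k < n; ++k) {
            const double dx = xpsf[k] - xe[i];
            const bool inside = std::abs(dx) <= zpsf[k] * 0.5 / fnumber;
            col[k] = inside
                ? std::sqrt(dx * dx + zpsf[k] * zpsf[k]) * fs / c + dtx[k]
                : -1.0;
        }
    }
    return idxt;
}
```

With Eigen, the same idea would be to keep the column expressions as they are and put the pragma on the loop over i; whether that helps depends on memory bandwidth, so it should be benchmarked.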
TBB
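You can also parallelize your loop with TBB (for example with tbb::parallel_for). Note that the pragma below is only a compiler unrolling hint and does not by itself create threads: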
#pragma unroll
for (int i = 0; i < nc; i++)
threading integrated with eigen
With Eigen you can set the number of threads in any of the following ways:
OMP_NUM_THREADS=n ./my_program
omp_set_num_threads(n);
Eigen::setNbThreads(n);
remarks with multithreading with eigen
However, in the FAQ:
"currently Eigen parallelizes only general matrix-matrix products (bench), so it doesn't by itself take much advantage of parallel hardware."
In general, OpenMP does not always bring an improvement, so benchmark the release build. Another way is to make sure that you're using vectorized instructions.
Again, from the FAQ/vectorization:
How can I enable vectorization?
You just need to tell your compiler to enable the corresponding
instruction set, and Eigen will then detect it. If it is enabled by
default, then you don't need to do anything. On GCC and clang you can
simply pass -march=native to let the compiler enables all instruction
set that are supported by your CPU.
On the x86 architecture, SSE is not enabled by default by most
compilers. You need to enable SSE2 (or newer) manually. For example,
with GCC, you would pass the -msse2 command-line option.
On the x86-64 architecture, SSE2 is generally enabled by default, but
you can enable AVX and FMA for better performance
On PowerPC, you have to use the following flags: -maltivec
-mabi=altivec, for AltiVec, or -mvsx for VSX-capable systems.
On 32-bit ARM NEON, the following: -mfpu=neon -mfloat-abi=softfp|hard,
depending if you are on a softfp/hardfp system. Most current
distributions are using a hard floating-point ABI, so go for the
latter, or just leave the default and just pass -mfpu=neon.
On 64-bit ARM, SIMD is enabled by default, you don't have to do
anything extra.
On S390X SIMD (ZVector), you have to use a recent gcc (version >5.2.1)
compiler, and add the following flags: -march=z13 -mzvector.
multithreading with cuda
Given the size of your arrays, you may want to try offloading to a GPU to reach the microsecond range; in that case you would (typically) have as many threads as the number of elements in your array.
For a simple start, if you have an NVIDIA card, you may want to look at cuBLAS, which also allows you to use the tensor cores (fused multiply-add, etc.) of the latest generations, unlike a regular kernel.
Since Eigen is a header-only library, it makes sense that you could use it in a CUDA kernel.
You may also implement everything "by hand" (i.e., without Eigen) with regular kernels. This makes little sense in terms of engineering, but it is common practice in an education/university project, in order to understand everything.
multithreading with OneAPI and Intel GPU
Since you have a skylake architecture, you also can unroll your loop on your CPU's GPU with OneAPI:
// Unroll loop as specified by the unroll factor.
#pragma unroll unroll_factor
for (int i = 0; i < nc; i++)
(from the sample).

How can I improve the performance of my OpenMP code?

I am currently trying to improve the parallel performance of my code and I am still new to OpenMP. I have to iterate over a large container, in each iteration reading from multiple entries and writing a result to a single entry. Below is a very minimal code example of what I am trying to do.
data is a pointer to an array where a lot of data points are stored. Before the parallel region I create an array newData, so I can use data as read-only and newData as write-only. Afterwards I throw the old data away and use newData for further calculations.
To my understanding, data and newData are shared between threads and everything declared inside the parallel region is private.
Can reading from data by multiple threads cause performance issues?
I am using #pragma omp critical for assigning a new value to an element of newData to avoid race conditions. Is this necessary, since I access every element of newData only once and never by multiple threads?
Also I am not sure about scheduling. Do I have to specify whether I want a static or dynamic schedule? Can I use nowait, since all threads are independent of each other?
array *newData = new array;
omp_set_num_threads (threads);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < range; i++)
    {
        double middle = (*data)[i];
        double previous = (*data)[i-1];
        double next = (*data)[i+1];
        double new_value = (previous + middle + next) / 3.0;
        #pragma omp critical(assignment)
        (*newData)[i] = new_value;
    }
}
delete data;
data = newData;
I am aware that in the first and last iteration previous and next can not be read from data, in the real code this is taken care of but for this minimal example you get the idea of reading multiple times from data.

First of all, get rid of all unnecessary dependencies. #pragma omp critical(assignment) is not necessary because each index of (*newData) is only written to once per loop, so there's no race condition.
Your code could now look like this:
#pragma omp parallel for
for (int i = 0; i < range; i++)
(*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;
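A self-contained version of this loop might look like the sketch below. The function name smooth3 and the use of std::vector are assumptions for illustration; the boundary elements are simply copied unchanged here, which is a placeholder for whatever the real code does at the edges.

```cpp
#include <vector>

// Three-point average without any critical section: every out[i] is written by
// exactly one iteration, and data is only read, so there is no race condition.
// (Hypothetical helper; boundary handling is an assumption for this sketch.)
std::vector<double> smooth3(const std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    std::vector<double> out(data);  // first and last element stay as they are

    #pragma omp parallel for
    for (long i = 1; i < n - 1; ++i)
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0;

    return out;
}
```

After the call, the caller can swap or move the result into the old container instead of using new and delete.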
Now we're looking for bottlenecks. The list of potential candidates I came up with is this:
Slow division
Cache thrashing
ILP (Instruction level parallelism)
Memory bandwidth limitations
Hidden dependencies
So let's analyze them further.
Slow division:
It takes some CPUs forever to calculate double/double. To know the latency and throughput of division on your CPU, you have to look at its specs. Replacing /3.0 with *0.3333... might help; your compiler will generally only do this by itself with relaxed floating-point options such as -ffast-math, because 1/3 is not exactly representable and the result can differ slightly. Using extended instruction sets (like SSE/AVX) you might schedule several divisions/multiplications at once.
Cache thrashing:
Because your CPU has to load/store a whole cache line at a time, there can be conflicts. Imagine that thread 1 tries to write to (*newData)[1] and thread 2 to (*newData)[2], and both elements are on the same cache line. Now one of them has to wait for the other (this is usually called false sharing). You could mitigate this with #pragma omp parallel for schedule(static, 64), so that each thread works on chunks of 64 consecutive elements.
ILP:
CPUs can schedule multiple operations into a pipeline if the operations are independent. For this to happen you have to unroll your loop. This could look like this:
assert(range % 4 == 0);
#pragma omp parallel for
for (int i = 0; i < range/4; i++) {
(*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) / 3.0;
(*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) / 3.0;
(*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) / 3.0;
(*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) / 3.0;
}
Memory bandwidth limitations:
For your very simple loop, think about this: how much memory do you have to load, and how long will your CPU be busy processing it? You're loading about 1 cache line and computing some dereferences, some pointer additions, two additions and one division. Which limit you hit depends on your CPU specs.
Now consider cache locality. Can you modify your code to make better use of the cache? If one thread gets i=3 in one loop iteration and i=7 in the next, you have to reload 3 elements of (*data). But if you go from i=3 to i=4, you might not have to load anything, because (*data)[i+1] was in the cache line loaded previously. You save some RAM bandwidth. To make use of this, unroll the loop. Using float instead of double also increases this chance.
Hidden dependencies:
Now this part I personally find very tricky. Sometimes your compiler isn't sure it can reuse some data, because it doesn't know whether the data has changed. Using const helps the compiler. But sometimes you need restrict to give the compiler the right hint. But I don't understand this well enough to explain it.
So here is what I would try:
const double ONETHIRD = 1.0 / 3.0;
assert(range % 4 == 0);
#pragma omp parallel for schedule(static, 1024)
for (int i = 0; i < range/4; i++) {
(*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) * ONETHIRD;
(*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) * ONETHIRD;
(*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) * ONETHIRD;
(*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) * ONETHIRD;
}
And then benchmark. Benchmark some more, and benchmark some more. Only benchmarks will show you which tricks help.
PS: One more thing to consider: if you see your program hitting the memory bandwidth hard, you could consider changing the algorithm, maybe fusing two steps into one, like going from
b[i] := (a[i-1] + a[i] + a[i+1]) / 3.0
to
d[i] := (b[i-1] + b[i] + b[i+1]) / 3.0 = (a[i-2] + 2.0 * a[i-1] + 3.0 * a[i] + 2.0 * a[i+1] + a[i+2]) / 9.0. I think you will find out the reason for this yourself.
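To check the fused formula, one can expand the two passes: each b term contributes three a terms divided by 3, and the outer average divides by 3 again, which gives the weights 1, 2, 3, 2, 1 and the divisor 9. The sketch below compares both variants. The function names and the choice to copy boundary elements unchanged are assumptions for illustration, so the two variants are only expected to agree for indices 2 <= i < n-2, and only up to rounding.

```cpp
#include <vector>

// One smoothing pass: b[i] = (a[i-1] + a[i] + a[i+1]) / 3 on interior points.
// Boundary entries are copied unchanged (an assumption for this sketch).
std::vector<double> smooth_once(const std::vector<double>& a) {
    const long n = static_cast<long>(a.size());
    std::vector<double> b(a);
    #pragma omp parallel for
    for (long i = 1; i < n - 1; ++i)
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
    return b;
}

// Two smoothing passes fused into a single sweep over a.
// The array is read once instead of twice, which saves memory traffic.
std::vector<double> smooth_twice_fused(const std::vector<double>& a) {
    const long n = static_cast<long>(a.size());
    std::vector<double> d(a);
    #pragma omp parallel for
    for (long i = 2; i < n - 2; ++i)
        d[i] = (a[i - 2] + 2.0 * a[i - 1] + 3.0 * a[i]
                + 2.0 * a[i + 1] + a[i + 2]) / 9.0;
    return d;
}
```

The fused version does more arithmetic per element but touches memory only once, which is the trade-off that matters when the loop is bandwidth bound.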
Have fun optimizing ;-)

Reading an array by multiple threads usually does no harm.
You only need a critical section if multiple threads work on the exact same piece of data. Here each thread accesses a different part of the array, so you don't need it. Critical sections are very bad for performance, so only use them if absolutely necessary. Often they can be replaced by atomic operations:
openMP, atomic vs critical?
Like a critical section, they don't make sense if each thread accesses different data.
For the scheduler, it's best to test each option and measure the performance, as predictions about performance are often wrong. Also try different chunk sizes.
Some other things that might help:
Performance measurements are often disturbed by other tasks on your PC, so take multiple measurements and use their minimum (unless the input is different each time; then take the average and do more measurements).
Do you really need double precision? Floats are often faster, since twice as many fit in a cache line or SIMD register.
edit: nowait is for multiple independent for loops: https://msdn.microsoft.com/en-us/library/ek5st0e3.aspx
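To make the nowait remark concrete, here is a small sketch with two independent loops inside one parallel region. The function name scale_both is hypothetical. The nowait clause removes the implicit barrier at the end of the first loop, which is only safe because the second loop does not read anything the first one writes.

```cpp
#include <vector>

// Scales two unrelated arrays inside a single parallel region.
// (Hypothetical helper for illustration.)
void scale_both(std::vector<double>& a, std::vector<double>& b, double s) {
    const long na = static_cast<long>(a.size());
    const long nb = static_cast<long>(b.size());

    #pragma omp parallel
    {
        // No barrier after this loop: threads that finish early
        // can start on the second loop right away.
        #pragma omp for nowait
        for (long i = 0; i < na; ++i)
            a[i] *= s;

        // The implicit barrier at the end of this loop is kept.
        #pragma omp for
        for (long i = 0; i < nb; ++i)
            b[i] *= s;
    }
}
```

For a single loop, as in the question, nowait gains nothing, because the end of the parallel region is a barrier anyway.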

I assume you are trying to do some kind of convolution or mean blur on a 1D array. The short answer is: stick to the default schedule strategy, and get rid of critical entirely.
As far as I can tell, you are quite new to parallelism, and it is a little confusing to deal with OpenMP directives and clauses like nowait/private/reduction/critical/atomic/single, etc. I think what you need is a well-written textbook to clarify the various concepts. With sound background knowledge, an hour of learning OpenMP could be enough to deal with most daily programming.

How to optimize a simple loop?

The loop is simple
void loop(int n, double* a, double const* b)
{
#pragma ivdep
for (int i = 0; i < n; ++i, ++a, ++b)
*a *= *b;
}
I am using the Intel C++ compiler and currently use #pragma ivdep for optimization. Is there any way to make it perform better, such as using multicore and vectorization together, or other techniques?

This loop is absolutely vectorizable by the compiler. But make sure that the loop was actually vectorized (using the compiler's -qopt-report5, the assembly output, Intel (Vectorization) Advisor, or any other technique). One more, somewhat overkill, way to do that is to create a performance baseline using the -no-vec option (which disables both ivdep-driven and auto-vectorization) and then compare the execution time against it. This is not a good way to check for the presence of vectorization, but it is useful for the general performance analysis in the next bullets.
If the loop hasn't actually been vectorized, make sure you push the compiler to auto-vectorize it; see the next bullet for how. Note that the next bullet can be useful even if the loop was successfully auto-vectorized.
To push the compiler to vectorize it, use: (a) the restrict keyword to "disambiguate" the a and b pointers (someone has already suggested it to you). (b) #pragma omp simd (which has the extra bonus of being more portable and much more flexible than ivdep, but also has the drawbacks of being unsupported in old compilers before Intel compiler version 14 and of being more "dangerous" for other loops). To re-emphasize: this bullet may seem to do the same thing as ivdep, but depending on various circumstances it can be a better and more powerful option.
The given loop has fine-grained iterations (too little computation per single iteration) and overall is not purely compute-bound (so the effort/cycles spent by the CPU to load/store data from/to cache/memory is comparable to, if not bigger than, the effort/cycles spent to perform the multiplication). Unrolling is often a good way to slightly mitigate such disadvantages. But I would recommend explicitly asking the compiler to unroll it by using #pragma unroll. In fact, for certain compiler versions the unrolling will happen automatically. Again, you can check whether the compiler did it by using -qopt-report5, the loop assembly or Intel (Vectorization) Advisor.
In the given loop you deal with a "streaming" access pattern, i.e. you are contiguously loading/storing data from/to memory (and the cache subsystem will not help a lot for big "n" values). So, depending on the target hardware, the use of multi-threading (on top of SIMD), etc., your loop will likely become memory bandwidth bound in the end. Once you become memory bandwidth bound, you could use techniques like loop blocking, non-temporal stores and aggressive prefetching. All of these techniques are worth a separate article, although for prefetching/NT-stores you have some pragmas in the Intel Compiler to play with.
If n is huge, and you are already prepared for memory bandwidth troubles, you could use things like #pragma omp parallel for simd, which will simultaneously thread-parallelize and vectorize the loop. However, the quality of this feature has become decent only in very recent compiler versions AFAIK, so maybe you'd prefer to split n semi-manually, i.e. n = n1 x n2 x n3, where n1 is the number of iterations to distribute among threads, n2 is for cache blocking and n3 is for vectorization. Rewrite the given loop as a nest of 3 loops, where the outer loop has n1 iterations (and #pragma omp parallel for is applied), the next-level loop has n2 iterations, and the innermost loop has n3 iterations (where #pragma omp simd is applied).
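As a rough illustration of the last two bullets, the sketch below shows the combined directive and a simplified, two-level version of the manual split (threads over blocks, SIMD inside a block). The function names and the block size of 4096 are assumptions, and __restrict is a compiler extension accepted by GCC, Clang, ICC and MSVC rather than standard C++. Without OpenMP enabled the pragmas are ignored and both functions still compute the same result.

```cpp
// Variant 1: let the compiler thread-parallelize and vectorize the loop at once.
// __restrict promises that a and b do not overlap (compiler extension).
void multiply_combined(int n, double* __restrict a, const double* __restrict b) {
    #pragma omp parallel for simd schedule(static)
    for (int i = 0; i < n; ++i)
        a[i] *= b[i];
}

// Variant 2: split the range manually. The outer loop hands whole blocks to
// threads; the inner loop over one block is vectorized.
void multiply_blocked(int n, double* __restrict a, const double* __restrict b) {
    const int block = 4096;  // hypothetical block size, to be tuned by benchmarking
    const int num_blocks = n / block + (n % block != 0 ? 1 : 0);

    #pragma omp parallel for schedule(static)
    for (int blk = 0; blk < num_blocks; ++blk) {
        const int start = blk * block;
        const int end = (n - start < block) ? n : start + block;
        #pragma omp simd
        for (int i = start; i < end; ++i)
            a[i] *= b[i];
    }
}
```

Which variant is faster, if either, depends on the compiler version and on how close the loop already is to the memory bandwidth limit.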
Some up to date links with syntax examples and more info:
unroll: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
OpenMP SIMD pragma (not so fresh and detailed, but still relevant): https://software.intel.com/en-us/articles/enabling-simd-in-program-using-openmp40
restrict vs. ivdep
NT-stores and prefetching : https://software.intel.com/sites/default/files/managed/22/a3/mtaap2013-prefetch-streaming-stores.pdf
Note1: I apologize that I don't provide code snippets here. There are at least 2 justifiable reasons for not providing them: 1. My 5 bullets are applicable to very many kernels, not just to yours. 2. On the other hand, the specific combination of pragmas/manual rewriting techniques and the corresponding performance results will vary depending on the target platform, ISA and compiler version.
Note2: A last comment regarding your GPU question. Think of your loop vs. simple industry benchmarks like LINPACK or STREAM. In fact, your loop could become very similar to some of them in the end. Now think of the characteristics of x86 CPUs, and especially the Intel Xeon Phi platform, for LINPACK/STREAM. They are very good indeed and will become even better with High Bandwidth Memory platforms (like Xeon Phi 2nd gen). So theoretically there is no reason to think that your given loop does not map well to at least some variants of x86 hardware (note that I didn't say a similar thing for an arbitrary kernel in the universe).

Assuming the data pointed to by a can't overlap the data pointed to by b the most important information to give the compiler to let it optimize the code is that fact.
In older ICC version "restrict" was the only clean way to provide that key information to the compiler. In newer versions there are a few cleaner ways to give a much stronger guarantee than ivdep gives (in fact ivdep is a weaker promise to the optimizer than it appears and generally doesn't have the intended effect).
But if n is large, the whole thing will be dominated by the cache misses, so no local optimization can help.

Loop unrolling manually is a simple way to optimize your code, and following is my code. Original loop costs 618.48 ms, while loop2 costs 381.10 ms in my PC, the compiler is GCC with option '-O2'. I don't have Intel ICC to verify the code, but I think the optimization principles are the same.
Similarly, I did some experiments that compare the execution time of two programs to XOR two blocks of memories, and one program is vectorized with the help of SIMD instructions, while the other is manually loop-unrolled. If you are interested, see here.
P.S. Of course loop2 only works when n is even.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#define LEN (512*1024)
#define times 1000
void loop(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; ++i, ++a, ++b)
        *a *= *b;
}
/* Unrolled by two; n must be even. The braces are required: without them
   the second statement runs only once, after the loop has finished. */
void loop2(int n, double* a, double const* b){
    int i;
    for(i = 0; i < n; i=i+2, a=a+2, b=b+2){
        *a *= *b;
        *(a+1) *= *(b+1);
    }
}
int main(void){
    double *la, *lb;
    struct timeval begin, end;
    int i;
    la = (double *)malloc(LEN*sizeof(double));
    lb = (double *)malloc(LEN*sizeof(double));
    if(la == NULL || lb == NULL)
        return 1;
    /* Initialize the inputs so the loops do not read indeterminate values. */
    for(i = 0; i < LEN; ++i){
        la[i] = 1.0;
        lb[i] = 1.0;
    }
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
    +(end.tv_usec-begin.tv_usec)/1000.0);
    gettimeofday(&begin, NULL);
    for(i = 0; i < times; ++i){
        loop2(LEN, la, lb);
    }
    gettimeofday(&end, NULL);
    printf("Time cost : %.2f ms\n",(end.tv_sec-begin.tv_sec)*1000.0\
    +(end.tv_usec-begin.tv_usec)/1000.0);
    free(la);
    free(lb);
    return 0;
}

I assume that n is large. You can distribute the workload over k CPUs by starting k threads and giving each of them n/k elements. Use big chunks of consecutive data for each thread; don't do fine-grained interleaving. Try to align the chunks with cache lines.
If you plan to scale to more than one NUMA node, consider explicitly copying each chunk of the workload to the node its thread runs on, and copying the results back. In this case it might not really help, because the workload for each step is very simple. You'll have to run tests for that.

Openmp performance with omp_get_max_threads greater than number of cores

I am a novice in parallel programming. I am running my own Gibbs sampler written in C++. The overview of the program looks something like this.
for(int iter=0; iter <=itermax; iter++){ //loop1
    #pragma omp parallel for schedule(dynamic)
    for(int jobs= 0; jobs<=1000; jobs++){ // loop2
        small_job();
        #pragma omp critical(dataupdate)
        {
            data_updates();
        }
    }
    jobs_that_cannot_be_parallelized();
}
I am running on a machine with 64 cores. Since the small_job calls are small and of variable length, I set the number of threads (the value of omp_get_max_threads) to 128. The number of cores used seems to be correct (see the figure "load last hour"). Each of the peaks belongs to loop2.
However, when I look at the actual CPU usage (see the figure), it seems a lot of the CPU time is used by the system and only 20% is used by the user. Is it because I am spawning lots of threads in loop2? What are the best practices for deciding on the value of omp_get_max_threads? I know I have not given enough information, but I would really appreciate any other recommendation to make the program faster.

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ so that the resulting .exe uses multiple cores, i.e. makes the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}

The most obvious choice would be to use OpenMP. Assuming your loop is one where it's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}
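One caveat about the listing above: with size = 1024 * 1024 * 128, the expression i*i overflows a 32-bit int once i exceeds 46340, which is undefined behavior in C++. A variant that does the arithmetic in double could look like the sketch below. The function name trig_sum is hypothetical, and the sum it produces will differ from whatever the original listing printed, so the timings quoted above do not apply to it.

```cpp
#include <cmath>

// Sums the same four terms as the listing above, but converts i to double
// before squaring so that no signed integer overflow can occur.
// (Hypothetical helper for illustration.)
double trig_sum(int size) {
    double total = 0.0;

    // Each thread accumulates a private partial sum; OpenMP adds them at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++) {
        const double x = static_cast<double>(i);
        total += std::sin(x) + std::sin(x * x)
               + std::sin(std::sqrt(x)) + std::cos(x * x);
    }
    return total;
}
```

Because the threads add their partial sums in a different order than a serial loop would, the last few digits of the result can vary between runs and thread counts.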

The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.

Use threads or processes; you may want to look at OpenMP.

C++11 got support for threading but c++ compilers won't/can't do any threading on their own.

As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.
