OpenMP - Poor performance when solving system of linear equations - C++

I am trying to use OpenMP to parallelize a simple C++ code that solves a system of linear equations by Gauss elimination.
The relevant part of my code is:
#include <iostream>
#include <time.h>
using namespace std;
#define nl "\n"

void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads){
    //...
    for (int i=0;i<N;i++){
        #pragma omp parallel for num_threads(threads) if(parallel)
        for (int j=i+1;j<N;j++){
            double aux, * Mi=M[i], * Mj=M[j];
            aux=Mj[i]/Mi[i];
            Mj[i]=0;
            for (int k=i+1;k<N;k++) {
                Mj[k]-=Mi[k]*aux;
            }
            V[j]-=V[i]*aux;
        }
    }
    //...
}
class Time {
    clock_t startC, endC;
    time_t startT, endT;
public:
    void start() {startC=clock(); time(&startT);}
    void end() {endC=clock(); time(&endT);}
    double timedifCPU() {return double(endC-startC)/CLOCKS_PER_SEC;}
    int timedif() {return int(difftime(endT,startT));}
};
int main (){
    Time t;
    double ** M, * V;
    int N=5000;
    cout<<"number of equations "<<N<<nl<<nl;
    M=new double * [N];
    V=new double [N];
    for (int i=0;i<N;i++){
        M[i]=new double [N];
    }
    for (int m=1;m<=16;m=2*m){
        cout<<m<<" threads"<<nl;
        for (int i=0;i<N;i++){
            V[i]=i+1.5*i*i;
            for (int j=0;j<N;j++){
                M[i][j]=(j+2.3)/(i-0.2)+(i+2)/(j+3); //some function to get regular matrix
            }
        }
        t.start();
        LinearSolve(M,V,N,m!=1,m);
        t.end();
        cout<<"time "<<t.timedif()<<", CPU time "<<t.timedifCPU()<<nl<<nl;
    }
}
Since the code is extremely simple, I would expect the time to be inversely proportional to the number of threads. However, the typical result I get is (the code is compiled with gcc on Linux):
number of equations 5000
1 threads
time 217, CPU time 215.89
2 threads
time 125, CPU time 245.18
4 threads
time 80, CPU time 302.72
8 threads
time 67, CPU time 458.55
16 threads
time 55, CPU time 634.41
There is a decrease in time, but much less than I would like, and the CPU time mysteriously grows.
I suspect the problem is in memory sharing, but I have been unable to identify it. Access to row M[j] should not be a problem, since each thread writes to a different row of the matrix. There could be a problem in reading from row M[i], so I also tried to make a separate copy of this row for each thread by replacing the parallel loop with:
#pragma omp parallel num_threads(threads) if(parallel)
{
    double Mi[N]; // per-thread copy of row i (a variable-length array; a GCC extension, not standard C++)
    for (int j=i;j<N;j++) Mi[j]=M[i][j];
    #pragma omp for
    for (int j=i+1;j<N;j++){
        double aux, * Mj=M[j];
        aux=Mj[i]/Mi[i];
        Mj[i]=0;
        for (int k=i+1;k<N;k++) {
            Mj[k]-=Mi[k]*aux;
        }
        V[j]-=V[i]*aux;
    }
}
Unfortunately it does not help at all.
I would very much appreciate any help.

Your problem is excessive OpenMP synchronization.
Having the #pragma omp parallel inside the outer loop means that each of the N iterations of the outer loop pays the whole fork/join and synchronization overhead of a parallel region.
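A minimal sketch of the usual fix, assuming the rest of LinearSolve is unchanged: create the thread team once, outside the outer loop, and keep only the worksharing omp for inside. Every thread then executes the outer loop, the inner iterations are divided among them, and the implicit barrier at the end of each omp for still keeps the elimination steps in order; what is saved is the repeated team startup:

void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads){
    #pragma omp parallel num_threads(threads) if(parallel)
    for (int i=0;i<N;i++){              // every thread runs the outer loop
        #pragma omp for                 // inner iterations are shared out
        for (int j=i+1;j<N;j++){
            double * Mi=M[i], * Mj=M[j];
            double aux=Mj[i]/Mi[i];
            Mj[i]=0;
            for (int k=i+1;k<N;k++)
                Mj[k]-=Mi[k]*aux;
            V[j]-=V[i]*aux;
        }                               // implicit barrier here orders the steps
    }
}

This does not remove the per-step barrier (the algorithm needs it), so it reduces, rather than eliminates, the synchronization cost.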
Take a look at the top-most chart in the image here (more detail can be found in the Allinea MAP OpenMP profiler introduction). The top line is application activity - dark gray means "OpenMP synchronization" and green means "doing compute".
You can see a lot of dark gray on the right-hand side of that top chart - that is when the 16 threads are running. You're spending a lot of time synchronizing.
I also see a lot of time being spent in memory access (more than in compute) - so it's probably this that is making what should be a balanced workload actually highly unbalanced, and causing the synchronization delay.
As the other respondent suggested, it's worth reading the literature for ideas here.

I think the underlying problem may be that traditional Gaussian elimination is not well suited to parallelization.
Gaussian elimination is a process whereby each subsequent step relies on the result of the previous one: each iteration of your linear-solve loop depends on the results of the previous iteration, so the outer loop must be done serially. Try searching the literature for "parallel row reduction algorithms".
Also, glancing at your code, it looks like you will have a race condition.

Related

C++ multithread algorithm creating several threads running on the same CPU thread

I am new to multithreading and I will try to be as clear as possible. I am creating a multithreaded algorithm in C++ with the standard library, std::thread. My code compiles and runs with no errors. I have included the code for two threads. The two threads are created with different ids than the main one (I checked with getid() and the thread window in Visual Studio). The problem is that I have no gain in time, and the % utilization of my CPU is the same, so it seems that they run on the same CPU thread.
Does this mean that threading doesn't automatically run on different CPU cores and threads to gain time?
Do I need to add instructions or combine this code with a multithreading library? Or is there just a mistake?
Any help would be really appreciated, thanks
double internalenergy=0;
unsigned int NUM_THREADS = std::thread::hardware_concurrency();
NUM_THREADS=2;
vtkIdType num_cells = referencemesh_->GetNumberOfCells();
int start_node[8];
int end_node[8];
int intervalle=num_cells/NUM_THREADS;
for( unsigned int i=0; i < NUM_THREADS; ++i )
{
    start_node[i] = (float)num_cells*i/NUM_THREADS;
    end_node[i] = (float)num_cells*(i+1)/NUM_THREADS;
}
const double* pointeurpara;
pointeurpara=&params(0);
double value1=0;
double* P_value1;
P_value1=&value1;
double value2=0;
double* P_value2;
P_value2=&value2;
std::thread first(threadinternalenergy,activemesh,referencemesh_,start_node[0],P_value1,pointeurpara);
std::thread second(threadinternalenergy,activemesh,referencemesh_,start_node[1],P_value2,pointeurpara);
first.join();
second.join();
double displacementneighpoint=*P_value1+*P_value2;
Here is the source code of threadinternalenergy
void threadinternalenergy(vtkSmartPointer<vtkPolyData> activemesh,
                          vtkSmartPointer<vtkPolyData> referencemesh, int startcell,
                          double* displacementneighpoint, const double* p)
{
    unsigned long processnumber;
    processnumber=GetCurrentProcessorNumber();
    cout<<"process "<<processnumber<<endl;
    unsigned int NUM_THREADS = std::thread::hardware_concurrency();
    NUM_THREADS=2;
    vtkIdType num_cells = referencemesh->GetNumberOfCells();
    int intervalle=num_cells/NUM_THREADS;
    int endcell=startcell+intervalle;
    //cout<<"start cell "<<startcell<<endl;
    //cout<<"end cell "<<endcell<<endl;
    int numcell_th=startcell-endcell; // note: this is negative; endcell-startcell was probably intended
    vtkSmartPointer<vtkEdgeTable> vtk_edge_tableT =
        vtkSmartPointer<vtkEdgeTable>::New();
    vtk_edge_tableT->InitEdgeInsertion(numcell_th*3);
    // Initialize edge table
    vtkSmartPointer<vtkEdgeTable> vtk_edge_table =
        vtkSmartPointer<vtkEdgeTable>::New();
    vtk_edge_table->InitEdgeInsertion( numcell_th*3 );
    cout<<"start cell "<<startcell<<endl;
    for( vtkIdType i=startcell; i < endcell; ++i )
    {
        vtkCell* cell = referencemesh->GetCell( i );
        // cout<<"cell"<<endl;
        // Traverse edges in cell -- assuming a linear cell (line, triangle; NOT rectangle)
        for( vtkIdType j=0; j < cell->GetNumberOfEdges(); ++j )
        {
            // cout<<"edge"<<endl;
            vtkCell* edge = cell->GetEdge( j );
            vtkIdType pt0 = edge->GetPointId(0);
            vtkIdType pt1 = edge->GetPointId(1);
            // consider edge if a displacement has been made by at least one point
            //if(params(pt0)!=0 || params(pt1)!=0){
            if( vtk_edge_table->IsEdge( pt0, pt1 ) == -1 ){ // If this edge is not in the edge table
                vtk_edge_table->InsertEdge( pt0, pt1 );
                // access point coordinates
                // vtkPoints* listpoints=edge->GetPoints();
                double p0 [3];
                double p1 [3];
                referencemesh->GetPoint(pt0,p0);
                referencemesh->GetPoint(pt1,p1);
                // 2nd mesh (transformed mesh)
                double p0T [3];
                double p1T [3];
                activemesh->GetPoint(pt0,p0T);
                activemesh->GetPoint(pt1,p1T);
                vtk_edge_tableT->InsertEdge( pt0, pt1 );
                // find displacement difference for the 2 points sharing an edge
                double squaredDistancep0 = vtkMath::Distance2BetweenPoints(p0, p0T);
                double distancep0 = sqrt(squaredDistancep0);
                double squaredDistancep1 = vtkMath::Distance2BetweenPoints(p1, p1T);
                double distancep1 = sqrt(squaredDistancep1);
                double difference = abs(distancep0 - distancep1);
                difference=((difference+1)*(difference+1))-1;
                //if(difference>0.25){
                //    cout<<"grosse difference "<<difference<<endl;
                //}
                *displacementneighpoint+=difference;
                //displacementpointneigh.push_back(difference);
            }
            //}
        }
    }
    cout<<"Fin du thread "<<endl;
}
The threads probably run on different cores. That does not necessarily mean the program will be faster. It could possibly run much slower than with one thread. The threads must do enough work concurrently to justify the overhead of thread-creation and synchronization. There are lots of pitfalls. A likely problem is that the threads are sharing (or "false sharing") memory too often.
Yes, one main idea behind threading is to "do concurrency." But that can be done with a CPU that only has one core! Or it can fail to do it well with lots of cores.
I wrote a system for controlling a robot that handled computer wafers. It ran on a CPU with one core, yet it depended strongly on concurrency. The concurrency resulted from having devices other than the CPU running concurrently with it, e.g. a servo amplifier and a network card. One high-priority thread dealt exclusively with managing the servo via the network. Other threads used the slack time to plan trajectories, communicate with a host computer, etc. When the servo-control thread woke up because the network card had a response from the servo, it immediately took over the CPU and did its thing.
There was a talk this year at CppCon titled "When a Microsecond Is an Eternity". The speaker programs one of those sneaky stock-trading systems. All that matters to him is how fast he can deal with the network that connects him to the market. To get the throughput he needs, he uses a 4-core machine with three cores disabled. Why? Because he can't find a good enough 1-core machine.
I also wrote a class that used multi-threading to do some calculations on a multi-core machine. It failed miserably. Some tasks are just not suited to divide-and-conquer.
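The question's code has a concrete candidate for that false sharing: value1 and value2 are adjacent doubles on the caller's stack, and both threads repeatedly do *displacementneighpoint += difference; through pointers to them, so the two accumulators very likely sit on the same cache line. A minimal sketch of the usual remedy - accumulate into a thread-local variable and write the shared slot once at the end (worker and edge_cost are hypothetical stand-ins for the question's functions):

#include <thread>

double edge_cost(int i) { return 0.5 * i; }   // hypothetical stand-in for the per-edge work

void worker(int begin, int end, double* out)
{
    double local = 0.0;                 // private to this thread's stack frame
    for (int i = begin; i < end; ++i)
        local += edge_cost(i);          // no shared writes in the hot loop
    *out = local;                       // one shared write at the very end
}

int main()
{
    double value1 = 0.0, value2 = 0.0;  // still adjacent, but each written only once
    std::thread first(worker, 0, 500000, &value1);
    std::thread second(worker, 500000, 1000000, &value2);
    first.join();
    second.join();
    double total = value1 + value2;
    return total > 0 ? 0 : 1;           // use the result so it isn't optimized away
}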

Optimal way to handle array indexing in parallel?

I have the following situation: I have a list of particles in a box of size L, where L is the length of one of the sides.
Next, I split the box into cells, where L/cell_dim = 7. So there are 7*7*7 cells.
Finally, I read through all the particles, note their position, and calculate which cell they are in.
I accomplish the above in an OpenMP parallel for loop. However, I need to capture the information in a thread-safe fashion such that I don't have to loop through all the particles for each cell. So I need some way to record an arbitrary subset of the particles into each cell, in parallel.
The method I have right now makes use of the OpenMP critical code block. I have an array of size [7][7][7][max_particles], where max_particles is the highest number of particles per cell (but much less than the total number of particles). I record the index of the last particle added in a counter array of size [7][7][7], and update the cell array according to the latest count in my parallel loop:
int cube[7][7][7][10];
int cube_counts[7][7][7]={0};
#pragma omp parallel for num_threads(a lot)
for (int i = 0; i < num_particles; i++){
    cell_x = //cell calculation;
    cell_y = //ditto;
    cell_z = //...;
    #pragma omp critical
    {
        cube_counts[cell_x][cell_y][cell_z] += 1;
        // for readability
        int index = cube_counts[cell_x][cell_y][cell_z];
        cube[cell_x][cell_y][cell_z][index] = i;
    }
}
// rest in pseudo code:
foreach cell:
    adjacent_cell = cell2
    particle_countA = cube_counts[cellx][celly][cellz]
    particle_countB = cube_counts[cell2x][cell2y][cell2z]
    // these two for loops will cover ~2-4 particles,
    // so super small...as a result of the cell analysis above.
    for particle in cell:
        for particle in cell2:
            ...do stuff
Although this works, the code speeds up by a factor of more than 2 when I am able to eliminate the critical block (I am on an Intel coprocessor with 60 physical cores, 240 logical threads).
How would I accomplish this without the need for the critical block? I thought of doing one big array... but then I lose everything I gained when I iterate through the 7*7*7*257 (where 257 is the particle count) array. Linked lists still have race conditions.
Maybe some kind of unordered, thread-safe list...?
Using a lock instead of the critical section can be driven further:
You may use atomic increment and atomic assignment pseudo-calls ("intrinsics") that the compiler will translate to the corresponding x86-specific assembler instructions. This is, however, platform or even compiler dependent.
If you use a modern C++ compiler (C++11), then std::atomic_* might be the best way to do it.
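A minimal sketch of that idea using OpenMP's own atomics rather than raw intrinsics (atomic capture requires OpenMP 3.1 or later; the cell-calculation placeholders are copied from the question). Each thread atomically reserves a unique slot, so the write into cube needs no critical section. Note that, unlike the pre-increment in the question, this fills slots starting at index 0:

#pragma omp parallel for
for (int i = 0; i < num_particles; i++){
    cell_x = //cell calculation;
    cell_y = //ditto;
    cell_z = //...;
    int index;
    #pragma omp atomic capture
    index = cube_counts[cell_x][cell_y][cell_z]++;   // atomically reserve a unique slot
    cube[cell_x][cell_y][cell_z][index] = i;         // race-free: no other thread got this index
}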

Function calls in OpenMP loop?

I have been trying to parallelize a particle simulation code I wrote. But in my parallelization, I came away with no increase in performance when moving from 1 processor to 12, and even worse, the code no longer returns accurate results. I have been banging my head against the wall and can't figure this out. Below is the loop being parallelized:
#pragma omp parallel
{
    omp_set_dynamic(1);
    omp_set_num_threads(12);
    #pragma omp for
    // Loop over azimuth ejection angle, from 0-360.
    for(int i=0; i<360; i++)
    {
        // Declare temporary variables
        double *y = new double[12];
        vector<double> ejecVel(3);
        vector<double> colLoc(7);
        double azimuth, inc;
        bool collision;
        // Loop over inclination ejection angle from 1-max_angle, increasing by 1 degree.
        for(int j=1; j<=15; j++)
        {
            // Update azimuth and inclination angle and get velocity direction vector.
            azimuth = (double) i;
            inc = (double) j;
            ejecVel = Jet::GetEjecVelocity(azimuth,inc);
            collision = false;
            // Update initial conditions.
            y[0] = m_parPos[0];
            y[1] = m_parPos[1];
            ... (define pointer values)
            // Simulate particle
            systemSolver.ParticleSim(simSteps,dt,y,collision,colLoc);
            if(collision == true)
            {
                cout << "Collision! " << endl;
            }
        }
        delete [] y;
    }
}
The goal is to loop through, simulating particles for different initial conditions over the loops, and store where they have gone and their state vector upon collision in master variables densCount and collisionStates. The simulation takes place in a function from another class (systemSolver.ParticleSim() ), and it seems like each solve from a different thread is not independent. Everything I've read suggests that it should be, but I can't figure out why else the result would not be right only if I have Open MP implemented. Any thoughts are greatly appreciated.
-ben
SOLUTION: The simulation was modifying a member variable of a separate (systemSolver) class. Since I provided a single class object to all threads, they were all simultaneously modifying an important member variable. Thought I would post this in case any other n00bs encounter a similar problem.
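A hedged sketch of that fix, assuming the solver class (called SystemSolver here, a hypothetical type name) is copyable and that a copy is cheap relative to the simulation: give each thread its own instance so ParticleSim's writes to member state no longer race.

#pragma omp parallel
{
    SystemSolver localSolver = systemSolver;  // private copy per thread
    #pragma omp for
    for(int i=0; i<360; i++)
    {
        // ... same per-iteration setup of y, ejecVel, colLoc as above ...
        localSolver.ParticleSim(simSteps,dt,y,collision,colLoc);
    }
}

Declaring the solver inside the loop body, or marking it private/firstprivate on the directive, achieves the same isolation.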
I believe one mistake is the call to the omp_set_* functions inside the parallel region. In the best case, they take effect on subsequent regions only. Try to reorder as follows:
omp_set_dynamic(1);
omp_set_num_threads(12);
#pragma omp parallel

What is the fastest way to do this in C++ (using OpenMP)?

I have an algorithm that I can write in pseudocode as follows:
for(int frame=0; frame<1000; frame++)
{
    Image *img = ReadFrame();
    mat processedImage = processImage(img);
    addtompeg(processedImage);
}
processImage is time consuming and takes around 30 s. ReadFrame and addtompeg are not slow, but they need to be done sequentially (otherwise, frame 2 may be added to the output before frame 1).
How can I parallelize it using OpenMP?
I am using OpenCV for ReadFrame and addtompeg.
Technically, in OpenMP you may execute a portion of a for loop in the same order as if the program were sequential, using the ordered clause (see section 2.8.7 here). Anyhow, I would not suggest using this clause, for two reasons:
- a thread must not execute more than one ordered region in the same loop (which seems not to be your case)
- in many implementations an ordered loop behaves much like a sequential loop, with detrimental effects on performance
Therefore what I would suggest in your case is to unroll the loop:
Image * img [chunk];
mat processedImage[chunk];
/* ... inside an enclosing #pragma omp parallel region ... */
for(int frame = 0; frame < nframes; frame += chunk) {
    #pragma omp single
    { /* Frames are read in sequential order */
        for( int ii = frame; ii < frame + chunk; ii++) {
            img[ii%chunk] = ReadFrame();
        }
    } /* Implicit barrier here */
    #pragma omp for
    for( int ii = frame; ii < frame + chunk; ii++) {
        processedImage[ii%chunk] = processImage(img[ii%chunk]); /* Images are processed in parallel */
    } /* Implicit barrier here */
    #pragma omp single
    { /* Frames are added to the mpeg in sequential order */
        for( int ii = frame; ii < frame + chunk; ii++) {
            addtompeg(processedImage[ii%chunk]);
        }
    } /* Implicit barrier here */
}
The value of chunk depends mainly on considerations about memory. If you think that memory will not be a problem, then you can completely remove the outer loop and let the inner one go from 0 to nframes.
Of course, care must be taken to correctly manage the remainder of the outer loop when nframes is not a multiple of chunk (which I have not shown in the snippet; one possible way is sketched below).
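One possible way to handle that remainder (a sketch, not part of the original answer): bound each inner loop with std::min(frame + chunk, nframes), so the last, possibly short, chunk is processed correctly:

for(int frame = 0; frame < nframes; frame += chunk) {
    int end = std::min(frame + chunk, nframes); // last chunk may be short (needs <algorithm>)
    #pragma omp single
    {
        for( int ii = frame; ii < end; ii++)
            img[ii%chunk] = ReadFrame();
    }
    #pragma omp for
    for( int ii = frame; ii < end; ii++)
        processedImage[ii%chunk] = processImage(img[ii%chunk]);
    #pragma omp single
    {
        for( int ii = frame; ii < end; ii++)
            addtompeg(processedImage[ii%chunk]);
    }
}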
Building on the chunking idea of Massimiliano, a more elegant solution is to use the explicit tasking mechanism of OpenMP 3.0 and later (which means that it would not work with the C++ compiler from Visual Studio):
const int nchunks = 10;
#pragma omp parallel
{
    #pragma omp single
    {
        mat processedImage[nchunks];
        for (int frame = 0; frame < nframes; frame++)
        {
            Image *img = ReadFrame();
            #pragma omp task shared(processedImage)
            {
                processedImage[frame % nchunks] = processImage(img);
                disposeImage(img);
            }
            // nchunks frames read or the last frame reached
            if ((1 + frame) % nchunks == 0 || frame == nframes-1)
            {
                #pragma omp taskwait
                int chunks = 1 + frame % nchunks;
                for (int i = 0; i < chunks; i++)
                    addtompeg(processedImage[i]);
            }
        }
    }
}
The code might look awkward, but it is conceptually very simple. If it weren't for the OpenMP constructs, it would be just like serial code that buffers up to nchunks processed frames before adding them, in a loop, to the output MPEG file. The magic happens in this block of code:
#pragma omp task shared(processedImage)
{
    processedImage[frame % nchunks] = processImage(img);
    disposeImage(img);
}
This creates a new OpenMP task that executes the two lines of code in the block. img and frame are captured by value, i.e. they are firstprivate, therefore it is not necessary for img to be an array of pointers. The producer gives ownership of img to the task, and the task therefore has to take care of disposing of the image object. It is important here that ReadFrame() allocates each frame in a separate buffer and does not reuse some internal memory each time (I've never used OpenCV and I don't know whether this is the case). Tasks are queued and executed by idle threads waiting at some task scheduling point. The implicit barrier at the end of the single construct is such a scheduling point, therefore the remaining threads will start executing the tasks. Once nchunks frames have been read or the end of the input has been reached, the producer thread waits for all queued tasks to be processed (that's what the taskwait is for) and then simply writes the completed chunk to the output.
Selecting the proper value of nchunks is important, otherwise some of the threads might end up idling. If ReadFrame and addtompeg are relatively fast, i.e. reading and writing num_threads frames takes less time than processImage, then nchunks should be an exact multiple of the number of threads. If processImage can take a varying amount of time, then you would need to set a really large value of nchunks in order to prevent load imbalance. In that case I would rather try to parallelise processImage instead and keep the processing loop serial.

CUDA Thrust slow when operating large vectors on my machine

I'm a CUDA beginner reading some Thrust tutorials. I wrote a simple but terribly organized piece of code and tried to figure out the speedup from Thrust (is this idea correct?). I try to add two vectors (with 10000000 ints each) into another vector, by adding arrays on the CPU and adding device_vectors on the GPU.
Here is the thing:
#include <iostream>
#include <ctime>  // for clock()
#include <cstdio> // for getchar()
#include "cuda.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#define N 10000000

int main(void)
{
    float time_cpu;
    float time_gpu;
    int *a = new int[N];
    int *b = new int[N];
    int *c = new int[N];
    for(int i=0;i<N;i++)
    {
        a[i]=i;
        b[i]=i*i;
    }
    clock_t start_cpu,stop_cpu;
    start_cpu=clock();
    for(int i=0;i<N;i++)
    {
        c[i]=a[i]+b[i];
    }
    stop_cpu=clock();
    time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
    std::cout<<"Time to generate (CPU):"<<time_cpu<<std::endl;
    thrust::device_vector<int> X(N);
    thrust::device_vector<int> Y(N);
    thrust::device_vector<int> Z(N);
    for(int i=0;i<N;i++)
    {
        X[i]=i;
        Y[i]=i*i;
    }
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start,0);
    thrust::transform(X.begin(), X.end(),
                      Y.begin(),
                      Z.begin(),
                      thrust::plus<int>());
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime,start,stop);
    std::cout<<"Time to generate (thrust):"<<elapsedTime<<std::endl;
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    getchar();
    return 0;
}
The CPU results appear really fast, but the GPU runs REALLY slowly on my machine (i5-2320, 4G, GTX 560 Ti): the CPU time is about 26, while the GPU time is around 30! Did I just do the Thrust wrong with stupid errors in my code, or is there a deeper reason?
As a C++ rookie, I checked my code over and over and still got a slower time on the GPU with Thrust, so I did some experiments to show the difference in calculating vectorAdd with five different approaches.
I use the Windows API QueryPerformanceCounter()/QueryPerformanceFrequency() as a unified time-measurement method.
Each of the experiments looks like this:
f = large_interger.QuadPart; // the frequency, presumably from an earlier QueryPerformanceFrequency() call
QueryPerformanceCounter(&large_interger);
c1 = large_interger.QuadPart;
for(int j=0;j<10;j++)
{
    for(int i=0;i<N;i++) // CPU array adding
    {
        c[i]=a[i]+b[i];
    }
}
QueryPerformanceCounter(&large_interger);
c2 = large_interger.QuadPart;
printf("Time to generate (CPU array adding) %lf ms\n", (c2 - c1) * 1000 / f);
and here is my simple __global__ function for GPU array adding:
__global__ void add(int *a, int *b, int *c)
{
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    while(tid<N)
    {
        c[tid]=a[tid]+b[tid];
        tid+=blockDim.x*gridDim.x;
    }
}
and the function is called as:
for(int j=0;j<10;j++)
{
    add<<<(N+127)/128,128>>>(dev_a,dev_b,dev_c); // GPU array adding
}
I add vectors a[N] and b[N] into vector c[N] in a loop of 10 iterations by:
- adding arrays on the CPU
- adding std::vector on the CPU
- adding thrust::host_vector on the CPU
- adding thrust::device_vector on the GPU
- adding arrays on the GPU
With N=10000000, I get these results:
CPU array adding 268.992968ms
CPU std::vector adding 1908.013595ms
CPU Thrust::host_vector adding 10776.456803ms
GPU Thrust::device_vector adding 297.156610ms
GPU array adding 5.210573ms
And this confused me; I'm not familiar with the implementation of template libraries. Does the performance really differ so much between containers and raw data structures?
Most of the execution time is being spent in your loop that is initializing X[i] and Y[i]. While this is legal, it's a very slow way to initialize large device vectors. It would be better to create host vectors, initialize them, then copy those to the device. As a test, modify your code like this (right after the loop where you are initializing the device vectors X[i] and Y[i]):
} // this is your line of code
std::cout<< "Starting GPU run" <<std::endl; //add this line
cudaEvent_t start, stop; //this is your line of code
You will then see that the GPU timing results appear almost immediately after that added line prints out. So all of the time you're waiting is spent in initializing those device vectors directly from host code.
When I run this on my laptop, I get a CPU time of about 40 and a GPU time of about 5, so the GPU is running about 8 times faster than the CPU for the sections of code you are actually timing.
If you create X and Y as host vectors, and then create analogous d_X and d_Y device vectors, the overall execution time will be shorter, like so:
thrust::host_vector<int> X(N);
thrust::host_vector<int> Y(N);
thrust::device_vector<int> Z(N);
for(int i=0;i<N;i++)
{
    X[i]=i;
    Y[i]=i*i;
}
thrust::device_vector<int> d_X = X;
thrust::device_vector<int> d_Y = Y;
and change your transform call to:
thrust::transform(d_X.begin(), d_X.end(),
                  d_Y.begin(),
                  Z.begin(),
                  thrust::plus<int>());
OK, so you've now indicated that the CPU run measurement is faster than the GPU measurement. Sorry, I jumped to conclusions. My laptop is an HP laptop with a 2.6GHz Core i7 and a Quadro 1000M GPU, running CentOS 6.2 Linux. A few comments: if you're running any heavy display tasks on your GPU, that can detract from performance. Also, when benchmarking these things it's common practice to use the same mechanism for comparison; you can use cudaEvents for both if you want, since it can time CPU code the same as GPU code. Also, it's common practice with Thrust to do an untimed warm-up run, then repeat the test for a measurement, and likewise it's common practice to run the test 10 times or more in a loop, then divide to get an average. In my case, I can tell the clock() measurement is pretty coarse because successive runs will give me 30, 40 or 50. On the GPU measurement I get something like 5.18256. Some of these things may help, but I can't say exactly why your results and mine differ so much (on the GPU side).
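A minimal sketch of that warm-up-and-average practice, reusing the event and vector names from the corrected code above (start, stop, d_X, d_Y, Z):

// untimed warm-up run (absorbs one-time CUDA/Thrust setup costs)
thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
cudaEventRecord(start,0);
for(int r=0;r<10;r++) // repeat the timed region, then average
    thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(), thrust::plus<int>());
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime,start,stop);
std::cout<<"Average time per transform (thrust): "<<elapsedTime/10<<" ms"<<std::endl;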
OK, I did another experiment. The compiler makes a big difference on the CPU side. I compiled with the -O3 switch and the CPU time dropped to 0. Then I converted the CPU timing measurement from the clock() method to cudaEvents, and I got a CPU measured time of 12.4 (with -O3 optimization) and still 5.1 on the GPU side.
Your mileage will vary based on the timing method and which compiler you are using on the CPU side.
First, Y[i]=i*i; does not fit in an integer for 10M elements: a 32-bit int holds values up to roughly 2e9, and your code needs values up to about 1e14.
Second, it looks like the timing of transform is correct and it should be faster than the CPU, regardless of which library you're using. Robert's suggestion to initialize the vectors on the CPU and then transfer them to the GPU is a good one for this case.
Third, since we can't do the integer multiply, below is some simpler CUDA library code (using ArrayFire, which I work on) to do something similar with floats, for your benchmarking:
int n = 10e6;
array x = array(seq(n));
array y = x * x;
timer t = timer::tic();
array z = x + y;
af::eval(z); af::sync();
printf("elapsed seconds: %g\n", timer::toc( t));
Good luck!
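(A hedged aside, not part of the answer above: if you want to keep the integer workload in Thrust, a 64-bit element type also avoids the i*i overflow, since 1e14 fits comfortably in a long long:)

thrust::host_vector<long long> X(N), Y(N);
for (long long i = 0; i < N; i++) { X[i] = i; Y[i] = i*i; } // i*i fits in 64 bits
thrust::device_vector<long long> d_X = X, d_Y = Y, Z(N);
thrust::transform(d_X.begin(), d_X.end(), d_Y.begin(), Z.begin(),
                  thrust::plus<long long>());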
I ran a similar test recently using CUDA Thrust on my Quadro 1000M. I used thrust::sort_by_key as a benchmark to test its performance, and the result was too good to convince my boss: it takes 100+ ms to sort 512MB of pairs.
For your problem, I am confused about 2 things.
(1) Why do you multiply time_cpu by 1000? Without the 1000, it would already be in seconds.
time_cpu=(double)(stop_cpu-start_cpu)/CLOCKS_PER_SEC*1000;
(2) And, by mentioning 26, 30, 40, do you mean seconds or ms? cudaEvent reports elapsed time in ms, not s.