Thrust is very slow for array reduction - c++

I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:
#include<chrono>
#include<iostream>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>
int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M, 5.0);
    int sum;
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
The issue is, thrust::reduce alone takes about 4ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?
EDIT 1:
Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2ms to run. Updated code is as follows:
#include<chrono>
#include<iostream>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/reduce.h>
int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M, 5.0);
    int sum;
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double, std::ratio<1,1000>>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
Additional information :
I am running CUDA C++ on Ubuntu in WSL2
CUDA version - 11.4
I am using the nvcc compiler to compile:
nvcc -o reduction reduction.cu
To run:
./reduction

Am I doing something wrong here?
I would not say you are doing anything wrong here. However that might be a matter of opinion. Let's unpack it a bit, using a profiler. I'm not using the exact same setup as you (I'm using a different GPU - Tesla V100, on Linux, CUDA 11.4). In my case the measurement spit out by the code is ~0.5ms, not 2ms.
The profiler tells me that the thrust::reduce is accomplished under the hood via a call to cub::DeviceReduceKernel followed by cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that together, these two calls account for ~40us of the ~500us overall time. This is the time that would be most comparable to the measurement you made of your implementation of Mark Harris' reduction code, assuming you are timing the kernels only. If we multiply by 4 to account for the overall perf ratio, it is pretty close to your 150us measurement of that.
The profiler tells me that the big contributors to the ~500us reported time in my case are a call to cudaMalloc (~200us) and a call to cudaFree (~200us). This isn't surprising because if you study the cub::DeviceReduce methodology that is evidently being used by thrust, it requires an initial call to do a temporary allocation. Since thrust provides a self-contained call for thrust::reduce, it has to perform that call, as well as do a cudaMalloc and cudaFree operation for the indicated temporary storage.
So is there anything that can be done?
The thrust designers were aware of this situation. To get a (closer to) apples-to-apples comparison between just measuring the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.
One way to do this would be to switch from thrust to cub.
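For illustration, here is a minimal sketch of the cub route (CUDA 11.x ships cub with the toolkit; the variable names are just mine). The point is that the first DeviceReduce::Sum call only reports the required temporary-storage size, so the cudaMalloc/cudaFree can be hoisted out of the timed region:
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <iostream>
int main()
{
    thrust::device_vector<float> D(1000*1000, 5.0f);
    thrust::device_vector<float> out(1);
    float* d_in  = thrust::raw_pointer_cast(D.data());
    float* d_out = thrust::raw_pointer_cast(out.data());
    int num_items = static_cast<int>(D.size());
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items); // size query only
    cudaMalloc(&d_temp, temp_bytes);                                    // done once, up front
    // the timed region would wrap just this call
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaDeviceSynchronize();
    std::cout << out[0] << std::endl;
    cudaFree(d_temp);
}
Compiled the same way with nvcc, the timed region now contains only the two reduction kernels (plus a device-to-host copy if you read the result back).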
The thrust way to do it is to use a thrust custom allocator.
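A sketch of that approach, using a deliberately simplified allocator that just reuses one device buffer (a single thrust::reduce requests one temporary allocation at a time; thrust's custom_temporary_allocation example shows a general-purpose version):
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>
struct reuse_allocator
{
    typedef char value_type;
    char*  buf   = nullptr;
    size_t bytes = 0;
    char* allocate(std::ptrdiff_t n)
    {
        if (static_cast<size_t>(n) > bytes) {   // only allocates on first use (or growth)
            cudaFree(buf);
            cudaMalloc(&buf, n);
            bytes = static_cast<size_t>(n);
        }
        return buf;
    }
    void deallocate(char*, size_t) {}           // keep the buffer for the next call
    ~reuse_allocator() { cudaFree(buf); }
};
int main()
{
    thrust::device_vector<float> D(1000*1000, 5.0f);
    reuse_allocator alloc;
    // first call creates the scratch buffer; time the second call
    float sum = thrust::reduce(thrust::cuda::par(alloc), D.begin(), D.end(), 0.0f, thrust::plus<float>());
    sum = thrust::reduce(thrust::cuda::par(alloc), D.begin(), D.end(), 0.0f, thrust::plus<float>());
    std::cout << sum << std::endl;
}
With the cudaMalloc/cudaFree out of the timed region, the remaining time should be much closer to the kernel-only number you measured for the hand-written reduction.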
There may be a few other detail differences in methodology that are impacting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach which you haven't shown. But according to my profiler measurement, that only accounts for a few microseconds.

Related

How to evaluate a program's runtime?

I've developed a simple program and want to evaluate its runtime performance on a real machine, e.g. my MacBook.
The source code goes:
#include <stdio.h>
#include <vector>
#include <ctime>
int main() {
    auto beg = std::clock();
    for (int i = 0; i < 1e8; ++i) {
    }
    auto end = std::clock();
    printf("CPU time used: %lf ms\n", 1000.0 * (end - beg) / CLOCKS_PER_SEC);
}
It's compiled with gcc and the optimization flag is set to the default.
With the help of a bash script, I ran it 1000 times and recorded the runtimes on my MacBook, as follows:
[130.000000, 136.000000): 0
[136.000000, 142.000000): 1
[142.000000, 148.000000): 234
[148.000000, 154.000000): 116
[154.000000, 160.000000): 138
[160.000000, 166.000000): 318
[166.000000, 172.000000): 139
[172.000000, 178.000000): 40
[178.000000, 184.000000): 11
[184.000000, 190.000000): 3
"[a, b): n" means that the actual runtime of the same program is between a ms and b ms for n times.
It's clear that the real runtime varies greatly, and it does not seem to follow a normal distribution. Could someone kindly tell me what causes this and how I can evaluate the runtime correctly?
Thanks for responding to this question.
Benchmarking is hard!
Short answer: use google benchmark
Long answer:
There are many things that will interfere with timings.
Scheduling (the OS running other things instead of you)
CPU Scaling (the OS deciding it can save energy by running slower)
Memory contention (Something else using the memory when you want to)
Bus contention (Something else talking to a device you want to talk to)
Cache (The CPU holding on to a value to avoid having to use memory)
CPU migration. (The OS moving you from one CPU to another)
Inaccurate clocks (Only CPU clocks are accurate to any degree, but they change if you migrate)
The only way to avoid these effects is to disable CPU scaling, to do "cache-flush" functions (normally just touching a lot of memory before starting), to run at high priority, and to lock yourself to a single CPU. Even after all that, your timings will still be noisy, so the last thing is simply to repeat a lot and use the average.
This is why tools like google benchmark are probably your best bet.
video from CPPCon
Also available live online
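For example, a minimal Google Benchmark version of a loop like yours could look roughly like this (assuming the library is installed; link with -lbenchmark -lpthread). The framework chooses the iteration count, repeats the measurement, and reports statistics, which deals with most of the noise sources listed above:
#include <benchmark/benchmark.h>
static void BM_EmptyLoop(benchmark::State& state) {
    for (auto _ : state) {                 // the framework decides how many times to run this
        for (int i = 0; i < 100000000; ++i)
            benchmark::DoNotOptimize(i);   // stop the compiler from deleting the loop
    }
}
BENCHMARK(BM_EmptyLoop);
BENCHMARK_MAIN();
Build with something like: g++ -O2 -std=c++11 bench.cpp -lbenchmark -lpthread -o bench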

C++ Eigen for solving linear systems fast

So I wanted to test the speed of C++ vs Matlab for solving a linear system of equations. For this purpose I create a random system and measure the time required to solve it using Eigen on Visual Studio:
#include <Eigen/Core>
#include <Eigen/Dense>
#include <chrono>
using namespace Eigen;
using namespace std;
int main()
{
    chrono::steady_clock sc;   // create an object of `steady_clock` class
    int n;
    n = 5000;
    MatrixXf m = MatrixXf::Random(n, n);
    VectorXf b = VectorXf::Random(n);
    auto start = sc.now();     // start timer
    VectorXf x = m.lu().solve(b);
    auto end = sc.now();
    // measure time span between start & end
    auto time_span = static_cast<chrono::duration<double>>(end - start);
    cout << "Operation took: " << time_span.count() << " seconds !!!";
}
Solving this 5000 x 5000 system takes 6.4 seconds on average. Doing the same in Matlab takes 0.9 seconds. The Matlab code is as follows:
a = rand(5000); b = rand(5000,1);
tic
x = a\b;
toc
According to this flowchart of the backslash operator:
given that a random matrix is not triangular, permuted triangular, Hermitian or upper Hessenberg, the backslash operator in Matlab uses an LU solver, which I believe is the same solver that I'm using in the C++ code, that is, lu().solve
Probably there is something that I'm missing, because I thought C++ was faster.
I am running it with release mode active on the Configuration Manager
Project Properties - C/C++ - Optimization - /O2 is active
Tried using Enhanced Instructions (SSE and SSE2). SSE actually made it slower and SSE2 barely made any difference.
I am using Community version of Visual Studio, if that makes any difference
First of all, for this kind of operation Eigen is very unlikely to beat MatLab because the latter will directly call Intel's MKL, which is heavily optimized and multi-threaded. Note that you can also configure Eigen to fall back to MKL, see how. If you do so, you'll end up with similar performance.
Nonetheless, 6.4s is way too much. The Eigen documentation reports 0.7s for factorizing a 4k x 4k matrix. Running your example on my computer (Haswell laptop @2.6GHz) I got 1.6s (clang 7, -O3 -march=native), and 1s with multithreading enabled (-fopenmp). So make sure you enable all your CPU's features (AVX, FMA) and OpenMP. With OpenMP you might need to explicitly reduce the number of OpenMP threads to the number of physical cores.
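For reference, a minimal sketch of where those knobs live (EIGEN_USE_MKL_ALL is optional and requires an MKL installation; the thread count of 4 is just an example):
// #define EIGEN_USE_MKL_ALL   // optional: forward decompositions to Intel MKL (define before including Eigen)
#include <Eigen/Dense>
int main()
{
    Eigen::setNbThreads(4);    // e.g. the number of physical cores (effective when built with OpenMP)
    // ... build m and b and call m.lu().solve(b) exactly as in the question ...
}
On the command line that corresponds to something like clang++ -O3 -march=native -fopenmp solve.cpp; the closest MSVC equivalents are /O2, /arch:AVX2 and /openmp in the project's C/C++ settings.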

basic openmp program runs slower [duplicate]

This question already has answers here:
No performance gain after using openMP on a program optimize for sequential running
(3 answers)
Closed 7 years ago.
I am trying to make my program run faster, so I will use parallel computing. Before that, I tried it on a simple for loop, but it runs slower.
Before OpenMP:
int a[100000] = { 0 };
clock_t begin = clock();
for (int i = 0; i < 100000; i++)
{
    a[i] = i;
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("%lf", elapsed_secs);
After OpenMP:
int a[100000] = { 0 };
clock_t begin = clock();
#pragma omp parallel for
for (int i = 0; i < 100000; i++)
{
    a[i] = i;
}
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("%lf", elapsed_secs);
You say that your code runs slower, but you actually don't know that. The reason is that you use clock() for measuring the time, and this function counts the CPU time of the current thread, and possibly that of all threads it spawns. For evaluating speed-ups, what you need to measure is elapsed wall-clock time. And for this purpose, OpenMP offers you omp_get_wtime(). Try using it on your code and then you'll really know whether or not your code gets any sort of benefit from OpenMP.
Now, let's be clear, your code does nothing more than write to memory. So there is a strong likelihood that you'll saturate your memory bandwidth pretty quickly. Therefore, unless you have multiple memory controllers, it is unlikely you'll gain much from adding threads in this case. Please have a look at this answer to convince yourself.
And finally, make sure you do something with your data before exiting the code; otherwise the compiler is likely to just optimise it out, leaving you with code that does pretty much nothing (but does it very fast).
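Putting those three points together, a corrected version of your test might look like this (omp_get_wtime() returns wall-clock seconds; printing an element keeps the loop from being optimized out):
#include <omp.h>
#include <stdio.h>
int main()
{
    static int a[100000];
    double begin = omp_get_wtime();        // wall-clock time, not CPU time
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
    {
        a[i] = i;
    }
    double end = omp_get_wtime();
    printf("%lf seconds, a[99999] = %d\n", end - begin, a[99999]);
    return 0;
}
Compile with something like gcc -O2 -fopenmp. Even then, expect little or no speed-up here, for the memory-bandwidth reason above.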
To be successful with your first OpenMP parallel (multi-threaded) code examples, you need to improve your test cases from the following two perspectives:
Make your examples testable. To do that:
make sure that your code is complex enough not to give the compiler any chance to "optimize" the whole loop out (i.e. to prevent the compiler from effectively replacing the whole loop with a single expression)
you may need to introduce a function wrapping your loop and pass an argument to it at runtime (via argc/argv), so that the compiler cannot pre-compute the result, while keeping the code very simple
make sure you use proper compilation flags (-O2 -fopenmp for GCC, some other flags for other compilers)
make sure your loop takes enough time and that you use a proper way to measure the time spent in the loop (other respondents, including Gilles, have already pointed this out very well)
Make sure that your loop is doing enough (ideally computational) work (i.e. additions, multiplications, etc.) in every iteration, so that the various overheads of the under-the-hood work done inside the OpenMP runtime library (required to "schedule"/plan/distribute iterations between threads) are not bigger than the amount of useful work done in a bunch of your loop iterations; see the sketch after this list.
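A sketch along those lines (the loop bound comes from argv so the compiler cannot pre-compute the result, every iteration does some real arithmetic, and the timer measures only the loop):
#include <omp.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
double work(int n)                                      // function wrapping the loop
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += sin(i * 0.001) * cos(i * 0.002);       // enough work per iteration
    return total;
}
int main(int argc, char** argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 100000000;     // bound chosen at runtime
    double t0 = omp_get_wtime();
    double total = work(n);
    double t1 = omp_get_wtime();
    printf("%f seconds, total = %f\n", t1 - t0, total);
    return 0;
}
Build with e.g. gcc -O2 -fopenmp and compare the timing with and without the pragma.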
The second and third Wikipedia OpenMP parallel-for examples are already good enough to mostly satisfy these criteria (while your example is not). You are at the point where just following the Wikipedia examples will help you gain some basic understanding.
After you learn these basics, your next steps would be (a) understanding "Data Races" / "Race Conditions" / "Loop-Carried Dependencies" and (b) understanding the "difference" between #pragma omp parallel and #pragma omp for (again, you will need to find simple examples from books or basic OpenMP courses).
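As a tiny illustration of (b), the following two loops do the same thing; "parallel" creates the team of threads, "for" divides the loop among an existing team, and "parallel for" is the combined shorthand:
#include <stdio.h>
int main()
{
    int a[100000];
    #pragma omp parallel      // creates a team of threads; each thread runs this block
    {
        #pragma omp for       // splits the loop's iterations among that team
        for (int i = 0; i < 100000; i++)
            a[i] = i;
    }
    #pragma omp parallel for  // shorthand combining the two directives above
    for (int i = 0; i < 100000; i++)
        a[i] = i;
    printf("%d\n", a[99999]); // use the result so it is not optimized away
    return 0;
}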
(To be honest, all other topics, like OpenMP imbalance, dynamic vs. static scheduling, or memory bandwidth, will make sense only after you spend at least a couple of days reading about and practicing the simpler notions.)

Make g++ produce a program that can use multiple cores?

I have a C++ program with multiple for loops; each one runs about 5 million iterations. Is there any command I can use with g++ so that the resulting executable will use multiple cores, i.e. make the first for loop run on the first core and the second for loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases my CPU usage still only hovers at around 25%.
EDIT:
Here is my code, in case it helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
    float *bob = new float[50102133];
    float *jim = new float[50102133];
    float *joe = new float[50102133];
    int i, j, k, l;
    //cout << "Starting test...";
    for (i = 0; i < 50102133; i++)
        bob[i] = sin(i);
    for (j = 0; j < 50102133; j++)
        bob[j] = sin(j*j);
    for (k = 0; k < 50102133; k++)
        bob[k] = sin(sqrt(k));
    for (l = 0; l < 50102133; l++)
        bob[l] = cos(l*l);
    cout << "finished test.";
    cout << "the 100120 element is," << bob[1001200];
    return 0;
}
The most obvious choice would be to use OpenMP. Assuming your loop is one whose iterations can easily be executed in parallel, you might be able to just add:
#pragma omp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
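For example, the whole build line with g++ would look something like this (the source file name is just a placeholder):
g++ -O2 -fopenmp myloops.cpp -o myloops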
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
    total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main() {
    double total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < size; i++)
        total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
    std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use Threads or Processes, you may want to look to OpenMp
C++11 got support for threading but c++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems, as the body of each loop very obviously has no dependencies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

C++ , Timer, Milliseconds

#include <iostream>
#include <conio.h>
#include <ctime>
using namespace std;
double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks) / (CLOCKS_PER_SEC / 1000);
    return diffms;
}
int main()
{
    clock_t start = clock();
    for (int i = 0; ; i++)
    {
        if (i == 10000) break;
    }
    clock_t end = clock();
    cout << diffclock(start, end) << endl;
    getch();
    return 0;
}
So my problem comes down to it returning 0; to be straight, I want to check how much time my program takes to run...
I found tons of stuff on the internet, but mostly it comes down to the same point of getting 0, because the start and the end are the same.
This problem is about C++, remember. :<
There are a few problems in here. The first is that you obviously switched the start and stop times when passing them to the diffclock() function. The second problem is optimization. Any reasonably smart compiler with optimizations enabled would simply throw the entire loop away, as it does not have any side effects. But even if you fix the above problems, the program would most likely still print 0. Modern CPUs do billions of operations per second and, with sophisticated out-of-order execution, prediction and tons of other technologies, even the CPU itself may effectively make your loop disappear. But even if it doesn't, you'd need a lot more than 10K iterations in order to make it run longer. You'd probably need your program to run for a second or two in order to get clock() to reflect anything.
But the most important problem is clock() itself. That function is not suitable for any kind of performance measurement whatsoever. What it does is give you an approximation of the processor time used by the program. Aside from the vague nature of the approximation method that might be used by any given implementation (since the standard doesn't require anything specific of it), the POSIX standard also requires CLOCKS_PER_SEC to be equal to 1000000 independent of the actual resolution. In other words — it doesn't matter how precise the clock is, it doesn't matter at what frequency your CPU is running. To put it simply — it is a totally useless number and therefore a totally useless function. The only reason why it still exists is probably for historical reasons. So, please do not use it.
To achieve what you are looking for, people used to read the CPU time stamp, also known as "RDTSC" after the name of the corresponding CPU instruction used to read it. These days, however, this is also mostly useless because:
Modern operating systems can easily migrate the program from one CPU to another. You can imagine that reading the time stamp on one CPU after running for a second on another doesn't make a lot of sense. Only in the latest Intel CPUs is the counter synchronized across CPU cores. All in all, it is still possible to do this, but a lot of extra care must be taken (i.e. one can set up the affinity for the process, etc.).
Measuring CPU instructions of the program oftentimes does not give an accurate picture of how much time it is actually using. This is because in real programs there could be some system calls where the work is performed by the OS kernel on behalf of the process. In that case, that time is not included.
It could also happen that the OS suspends execution of the process for a long time; even though it took only a few instructions to execute, to the user it seemed like a second. Such a performance measurement would therefore be useless.
So what to do?
When it comes to profiling, a tool like perf must be used. It can track the number of CPU clocks, cache misses, branches taken, branches missed, the number of times the process was moved from one CPU to another, and so on. It can be used as a standalone tool, or be embedded into your application (with something like PAPI).
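For example, on Linux a quick first look with perf could be something like this (./your_program stands for your binary; the exact events available depend on the CPU):
perf stat -e cycles,instructions,cache-misses,cpu-migrations,context-switches ./your_program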
And if the question is about actual time spent, people use a wall clock. Preferably a high-precision one that is also not subject to NTP adjustments (i.e. monotonic). That shows exactly how much time elapsed, no matter what was going on. For that purpose clock_gettime() can be used. It is part of SUSv2 and POSIX.1-2001. Given that you use getch() to keep the terminal open, I'd assume you are using Windows. There, unfortunately, you don't have clock_gettime() and the closest thing would be the performance counter API:
BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency);
BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount);
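A minimal usage sketch (Windows only; the 10000-iteration loop is just the one from your question):
#include <windows.h>
#include <iostream>
int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // counter ticks per second
    QueryPerformanceCounter(&t0);
    volatile int i = 0;                 // the work being measured
    while (i < 10000) ++i;
    QueryPerformanceCounter(&t1);
    double us = 1e6 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    std::cout << us << " microseconds" << std::endl;
}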
For a portable solution, the best bet is std::chrono::high_resolution_clock. It was introduced in C++11 and is supported by most industrial-grade compilers (GCC, Clang, MSVC).
Below is an example of how to use it. Please note that since I know my CPU will do 10000 increments of an integer way faster than a millisecond, I have changed the measurement to microseconds. I've also declared the counter as volatile in the hope that the compiler won't optimize it away.
#include <ctime>
#include <chrono>
#include <iostream>
int main()
{
    volatile int i = 0; // "volatile" is to ask compiler not to optimize the loop away.
    auto start = std::chrono::steady_clock::now();
    while (i < 10000) {
        ++i;
    }
    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "It took me " << elapsed.count() << " microseconds." << std::endl;
}
When I compile and run it, it prints:
$ g++ -std=c++11 -Wall -o test ./test.cpp && ./test
It took me 23 microseconds.
Hope it helps. Good Luck!
At a glance, it seems like you are subtracting the larger value from the smaller value. You call:
diffclock( start, end );
But then diffclock is defined as:
double diffclock( clock_t clock1, clock_t clock2 ) {
    double diffticks = clock1 - clock2;
    double diffms = diffticks / ( CLOCKS_PER_SEC / 1000 );
    return diffms;
}
Apart from that, it may have something to do with the way you are converting units. The use of 1000 to convert to milliseconds is different on this page:
http://en.cppreference.com/w/cpp/chrono/c/clock
The problem appears to be that the loop is just too short. I tried it on my system and it gave 0 ticks. I checked what diffticks was and it was 0. Increasing the loop size to 100000000 gave a noticeable time lag, and I got -290 as output (a bug -- I think diffticks should be clock2 - clock1, so we should get 290 and not -290). I also tried changing "1000" to "1000.0" in the division and that didn't work.
Compiling with optimization does remove the loop, so you either have to compile without optimization or make the loop "do something", e.g. increment a counter other than the loop counter in the loop body. At least that's what GCC does.
Note: this is available since C++11.
You can use std::chrono library.
std::chrono has two distinct concepts: time points and durations. A time point represents a point in time, and a duration, as the term suggests, represents an interval or span of time.
This C++ library allows us to subtract two time points to get the duration of the time passed in the interval. So you can set a starting point and a stopping point, and, using its functions, convert the result into appropriate units.
Example using high_resolution_clock (which is one of the three clocks this library provides):
#include <chrono>
using namespace std::chrono;
//before running function
auto start = high_resolution_clock::now();
//after calling function
auto stop = high_resolution_clock::now();
Subtract the start time point from the stop time point and cast the result into the required units using the duration_cast() function. Predefined units are nanoseconds, microseconds, milliseconds, seconds, minutes, and hours.
auto duration = duration_cast<microseconds>(stop - start);
cout << duration.count() << endl;
First of all, you should compute end - start, not vice versa.
The documentation says that clock() returns -1 if the value is not available; did you check for that?
What optimization level do you use when compiling your program? If optimization is enabled, the compiler can effectively eliminate your loop entirely.