Boost.Compute slower than plain CPU? - c++

I just started playing with Boost.Compute to see how much speed it can bring us, so I wrote a simple program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>   // std::accumulate
#include <cmath>     // std::sqrt
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>
namespace compute = boost::compute;
int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;

            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                               device_vector.end(),
                               device_vector.begin(),
                               compute::sqrt<float>(), queue);
            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);

            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }

    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                   host_vector.end(),
                   host_vector.begin(),
                   [](float v){ return std::sqrt(v); });
    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;
    return 0;
}
And here's the sample output on my machine (win7 64-bit):
====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms
My question is: why is the plain (non-opencl) version faster?

As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).
To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first run, as it will be far slower because it includes the time to compile and store the kernels).
Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:
float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;
You can also compile code that uses Boost.Compute with the -DBOOST_COMPUTE_USE_OFFLINE_CACHE flag, which enables the offline kernel cache (this requires linking with boost_filesystem). The kernels you use will then be stored in your file system and compiled only the very first time you run your application (the NVIDIA driver on Linux already does this by default).
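A minimal sketch of how the re-timing advice might look, reusing the device_vector and queue from the question. The warm-up run (which includes kernel compilation) is discarded and only the later runs are averaged; the run count is an arbitrary choice, not something prescribed by Boost.Compute:
float ans = 0;
const int runs = 10;  // arbitrary number of timed repetitions
boost::chrono::milliseconds total(0);

for (int run = 0; run < runs + 1; ++run)
{
    auto start = boost::chrono::high_resolution_clock::now();
    compute::transform_reduce(
        device_vector.begin(),
        device_vector.end(),
        &ans,
        compute::sqrt<float>(),
        compute::plus<float>(),
        queue
    );
    auto elapsed = boost::chrono::duration_cast<boost::chrono::milliseconds>(
        boost::chrono::high_resolution_clock::now() - start);

    if (run == 0)
        continue;  // discard the first run: it includes kernel compilation
    total += elapsed;
}

std::cout << "ans: " << ans << "  avg time: " << total.count() / runs << " ms" << std::endl;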

I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:-
CPU                  GPU
                     copy data to GPU
                     set up compute code
calculate sqrt       calculate sqrt
sum                  sum
                     copy data from GPU
Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVIDIA card is probably suffering from the extra data copying and the setup needed to run the calculation on the GPU.
You should try the same program but with a much more complex operation; sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes).
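For the plain CPU version in the question, that single-pass variant might look like this minimal sketch (assuming <numeric> and <cmath> are included; a float initial value is used here to avoid the truncation caused by the original int 0):
auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0.0f,
                           [](float acc, float v) { return acc + std::sqrt(v); });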

You're getting bad results because you're measuring time incorrectly.
An OpenCL device has its own time counters, which are not related to host counters. Every OpenCL task has 4 states whose timers can be queried (from the Khronos web site):
CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.
Keep in mind that these timers are device-side. So, to measure kernel and command-queue performance, you can query these timers; in your case, the last two are the ones you need.
In your sample code, you are measuring host time, which includes data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.
So, to learn the actual kernel performance, you need either to attach a cl_event to your kernel (no idea how to do that in boost::compute) and query that event for the performance counters, or make your kernel really huge and complicated to hide all the overheads.
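With the raw OpenCL C API, querying those last two counters might look like this minimal sketch (the kernel and queue are assumed to exist already, and the queue must have been created with CL_QUEUE_PROFILING_ENABLE):
#include <CL/cl.h>
#include <iostream>

// Enqueue a kernel, wait for it, and print its device-side execution time.
void print_kernel_time(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    cl_event event;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                           &global_size, nullptr, 0, nullptr, &event);
    clWaitForEvents(1, &event);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, nullptr);

    // The profiling counters are reported in nanoseconds on the device clock.
    std::cout << "kernel time: " << (end - start) / 1e6 << " ms" << std::endl;
    clReleaseEvent(event);
}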

Related

Why is my thread execution jumping between CPU cores?

I recently started experimenting with std::thread, and I tried running a small program that displays the webcam feed in a separate thread, using OpenCV. I am just doing this for "educational" purposes. What I noticed was that the thread seemed to keep jumping between cores, which struck me as odd, since I thought the overhead of this change would not be worth it from an efficiency/performance point of view. Does anybody know the root cause/reason for such behavior?
Short disclaimer --> I am new to StackOverflow so if I missed something, please let me know.
A snapshot of my system monitor - Ubuntu
#include <stdio.h>
#include <iostream>             // std::cout / std::endl
#include <opencv2/opencv.hpp>   // OpenCV functionality
#include <time.h>               // timing functionality
#include <thread>

using namespace cv;
using namespace std;

void webcam_func(){
    Mat image;
    namedWindow("Display window");
    VideoCapture cap(0);

    if (!cap.set(CAP_PROP_AUTO_EXPOSURE, 10)){
        std::cout << "Exposure could not be set!" << std::endl;
        //return -1;
    }
    if (!cap.isOpened()) {
        cout << "cannot open camera";
    }

    int i = 0;
    while (i < 1000000) {
        cap >> image;
        Size s = image.size();
        int rows = s.height;
        int cols = s.width;
        imshow("Display window", image);
        double fps = cap.get(CAP_PROP_FPS);
        //cout << "Frames per second using video.get(CAP_PROP_FPS) : " << fps << endl;
        //cout << "The height of the video is " << rows << endl;
        //cout << "The width of the video is " << cols << endl;
        std::thread::id this_id = std::this_thread::get_id();
        std::cout << "thread id --> " << this_id << std::endl;
        waitKey(25);
        i++;
        std::cout << "Counter value " << i << std::endl;
    }
}

int main() {
    std::thread t1(webcam_func);
    while(true){
    }
    return 0;
}
The default Linux scheduler schedules tasks (e.g. threads) for a given quantum (time slice) on the available processing units (e.g. cores or hardware threads). This quantum can be interrupted if a task enters sleeping mode or waits for something (inputs, locks, etc.). waitKey(25) does exactly that: it causes your thread to wait for a short period of time. The thread's execution is interrupted and a context switch is done. The OS can execute other tasks during this time.

When the computing thread is ready again (because more than 25 ms have elapsed), the scheduler can schedule it again. It tries to execute the task on the same processing unit so as to reduce overheads (e.g. cache misses), but the previous processing unit may still be in use by another thread when the computing task is scheduled back. This is unlikely when there are not many ready tasks, or only greedy ones, though. Additionally, some processors support SMT (aka hyper-threading). For example, many x86-64 Intel processors support 2 hardware threads per core sharing the same caches. Context switches between 2 hardware threads on the same core are significantly cheaper (e.g. far fewer cache misses).

Also note that the Linux scheduler is not perfect, like most other schedulers. In fact, it was buggy a few years ago and not even able to fill all available cores when that was possible (see: The Linux Scheduler: a Decade of Wasted Cores). Finally, note that the (direct) overhead of a context switch is no more than a few dozen microseconds on a mainstream Linux PC, so having one every few dozen milliseconds is fine (<1% overhead).
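If you do want to keep the worker on one core, you can pin the thread yourself. A minimal sketch using pthread_setaffinity_np (Linux/glibc-specific; the core number 2 is an arbitrary choice for illustration, not something the answer prescribes):
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <iostream>

int main()
{
    std::thread t([]{ /* webcam loop would go here */ });

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(2, &cpuset);   // pin the worker thread to core 2

    // native_handle() returns the underlying pthread_t on Linux.
    int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
        std::cerr << "pthread_setaffinity_np failed: " << rc << std::endl;

    t.join();
    return 0;
}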

Simple division of labour over threads is not reducing the time taken

I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.
What I have attempted to do is:
do a task X times in one thread - check the time taken.
do a task X / Y times in Y threads - check the time taken.
So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:
2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16 GB), doing nothing else at the time.
1 Threads took: ~6.5s
2 Threads took: ~6.7s
3 Threads took: ~13.3s
8 Threads took: ~16.2s
So clearly something is not right here!
I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
#include <iostream>
#include <math.h>
#include <vector>
#include <thread>
#include <chrono>    // std::chrono clocks and durations
#include <cstdint>   // uint32_t
#include <cstdlib>   // std::rand, std::srand, RAND_MAX

#define randf() ((double) rand()) / ((double) (RAND_MAX))

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;

    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();

    // do some work for the required number of iterations
    for (auto i = 0u; i < iterations; i++)
    {
        double value = randf();
        double calc = std::atan(value);
        (void) calc;
    }

    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;

    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << std::endl;
}

int main()
{
    // Note: these numbers vary by about 1s, probably due to godbolt server load (?)
    // 1 Threads takes: ~8s
    // 2 Threads takes: ~8s
    // 3 Threads takes: ~8s
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Seed rand
    std::srand(static_cast<unsigned long>(std::chrono::steady_clock::now().time_since_epoch().count()));

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}
Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.
I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:
CPU bound (needs lots of CPU resources): I have 8 cores
Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
I/O bound (Network and/or hard drive resources): No network traffic involved
There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
Why might this be happening?
What am I doing wrong?
How can I improve this?
The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe with a lock or mutex, so multiple threads trying to generate a random number have to take turns. As your loop is mostly just the call to rand, the performance suffers with multiple threads.
You can use the facilities of the <random> header and have each thread use its own engine to generate the random numbers.
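A minimal sketch of what that per-thread engine could look like, reusing the thread_func signature from the question (the mt19937 engine and uniform distribution are one reasonable choice, not a requirement):
#include <random>
#include <cmath>
#include <cstdint>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    (void) thread_id;  // kept only for signature parity with the question

    // thread_local: every thread gets its own engine, seeded once per thread,
    // so there is no shared state and no locking between threads.
    thread_local std::mt19937 engine{std::random_device{}()};
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (uint32_t i = 0; i < iterations; ++i)
    {
        double value = dist(engine);   // replaces randf()
        double calc = std::atan(value);
        (void) calc;
    }
}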
Never mind whether rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.
What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.
No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.
It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache were doing.

MPI: time for output increases when the number of processors increase

I have a problem printing a sparse matrix in a C++/MPI program, and I hope you can help me solve it.
Problem: I need to print a sparse matrix as a list of triples (x, y, v_xy) to a .txt file in a program that has been parallelized with MPI. Since I am new to MPI, I decided not to deal with the parallel I/O facilities provided by the library and instead let the master processor (0 in my case) print the output. However, the time for printing the matrix increases when I increase the number of processors:
1 processor: 11.7 secs
2 processors: 26.4 secs
4 processors: 25.4 secs
I have already verified that the output is exactly the same in the three cases. Here is the relevant section of the code:
if (rank == 0)
{
    sw.start();
    std::ofstream ofs_output(output_file);
    targets.print(ofs_output);
    ofs_output.close();
    sw.stop();
    time_output = sw.get_duration();
    std::cout << time_output << std::endl;
}
My stopwatch sw is measuring wall clock time using the gettimeofday function.
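For reference, a stopwatch of that kind might look like this minimal sketch (a hypothetical reconstruction, assuming a POSIX platform; the real sw class may differ):
#include <sys/time.h>

class stopwatch
{
    timeval _start{}, _stop{};
public:
    void start() { gettimeofday(&_start, nullptr); }
    void stop()  { gettimeofday(&_stop, nullptr); }

    // elapsed wall-clock time in seconds
    double get_duration() const
    {
        return (_stop.tv_sec - _start.tv_sec)
             + (_stop.tv_usec - _start.tv_usec) / 1e6;
    }
};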
The print method for the targets matrix is the following:
void sparse_matrix::print(std::ofstream &ofs)
{
    int temp_row;
    for (const_iterator iter_row = _matrix.begin(); iter_row != _matrix.end(); ++iter_row)
    {
        temp_row = (*iter_row).get_key();
        for (value_type::const_iterator iter_col = (*iter_row).get_value().begin();
             iter_col != (*iter_row).get_value().end(); ++iter_col)
        {
            ofs << temp_row << "," << (*iter_col).get_key() << "," << (*iter_col).get_value() << std::endl;
        }
    }
}
I do not understand what is causing the slowdown, since only processor 0 does the output and this is the very last operation of the program: all the other processors are done while processor 0 prints the output. Do you have any idea?
Well, I finally understood what was causing the problem. Running my MPI-parallelized program on a Linux virtual machine drastically increased the time needed to print a large amount of data to a .txt file as the number of cores increased. The problem is caused by the virtual machine, which does not behave correctly when using MPI. I tested the same program on a physical 8-core machine, and there the time for printing the output does not increase with the number of cores used.

boost chrono endTime - startTime returns negative in boost 1.51

The Boost.Chrono library v1.51 on my MacBook Pro returns negative times when I subtract endTime - startTime. If you print the time points, you see that the end time is earlier than the start time. How can this happen?
typedef boost::chrono::steady_clock clock_t;
clock_t clock;

// Start time measurement
boost::chrono::time_point<clock_t> startTime = clock.now();

short test_times = 7;
// Spend some time...
for ( int i=0; i<test_times; ++i )
{
    xnodeptr spResultDoc=parser.parse(inputSrc);
    xstring sXmlResult = spResultDoc->str();
    const char16_t* szDbg = sXmlResult.c_str();
    BOOST_CHECK(spResultDoc->getNodeType()==xnode::DOCUMENT_NODE && sXmlResult == sXml);
}

// Stop time measurement
boost::chrono::time_point<clock_t> endTime = clock.now();
clock_t::duration elapsed( endTime - startTime);

std::cout << std::endl;
std::cout << "Now time: " << clock.now() << std::endl;
std::cout << "Start time: " << startTime << std::endl;
std::cout << "End time: " << endTime << std::endl;
std::cout << std::endl << "Total Parse time: " << elapsed << std::endl;
std::cout << "Avarage Parse time per iteration: " << (boost::chrono::duration_cast<boost::chrono::milliseconds>(elapsed) / test_times) << std::endl;
I tried different clocks but no difference.
Any help would be appreciated!
EDIT: Forgot to add the output:
Now time: 1 nanosecond since boot
Start time: 140734799802912 nanoseconds since boot
End time: 140734799802480 nanoseconds since boot
Total Parse time: -432 nanoseconds
Avarage Parse time per iteration: 0 milliseconds
Hyper-threading or just scheduling interference; the Boost implementation punts monotonic support to the OS:
POSIX: clock_gettime(CLOCK_MONOTONIC), although it may still fail due to kernel errors handling hyper-threading when calibrating the system.
Win32: QueryPerformanceCounter(), which on anything but the Nehalem architecture or newer is not going to be monotonic across cores and threads.
OSX: mach_absolute_time(), i.e. the steady and high-resolution clocks are the same. The source code shows that it uses RDTSC, thus a strict dependency upon hardware stability: i.e. no guarantees.
Disabling hyper-threading is a recommended way to go, but, say, on Windows you are really limited. Aside from dropping timer resolution, the only available method is direct access to the underlying hardware timers while ensuring thread affinity.
It looks like a good time to submit a bug to Boost, I would recommend:
Win32: Use GetTickCount64(), as discussed here.
OSX: Use clock_get_time (SYSTEM_CLOCK) according to this question.
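For the Windows case, ensuring thread affinity around the timing calls might look like this minimal sketch (Win32 API; an illustration of the suggestion above, not a fix for Boost itself):
#include <windows.h>
#include <iostream>

int main()
{
    // Keep this thread on core 0 so successive counter reads come from one core's timer.
    SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    // ... work to be measured ...
    QueryPerformanceCounter(&t1);

    double ms = 1000.0 * static_cast<double>(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    std::cout << "elapsed: " << ms << " ms" << std::endl;
    return 0;
}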

Timing woes with clock_gettime() CUDA

I wanted to write some CUDA code where I could see first-hand the benefits that CUDA offers for speeding up applications.
Here is a CUDA program I have written using Thrust ( http://code.google.com/p/thrust/ )
Briefly, all the code does is create two integer vectors of length 2^23, one on the host and one on the device, identical to each other, and sort them. It also (attempts to) measure the time for each.
On the host vector I use std::sort. On the device vector I use thrust::sort.
For compilation I used
nvcc sortcompare.cu -lrt
The output of the program at the terminal is
Desktop: ./a.out
Host Time taken is: 19 . 224622882 seconds
Device Time taken is: 19 . 321644143 seconds
Desktop:
The first std::cout statement is produced after 19.224 seconds, as stated. Yet the second std::cout statement (even though it says 19.32 seconds) is produced immediately after the first std::cout statement. Note that I have used different timespec structs for the measurements in clock_gettime(), viz. ts_host and ts_device.
I am using CUDA 4.0 and an NVIDIA GTX 570 (compute capability 2.0).
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdlib.h>

// For timings
#include <time.h>

// Necessary thrust headers
#include <thrust/sort.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(int argc, char *argv[])
{
    int N = 23;
    thrust::host_vector<int> H(1<<N);      // create a vector of 2^N elements on the host
    thrust::device_vector<int> D(1<<N);    // the same on the device
    thrust::host_vector<int> dummy(1<<N);  // copy D to dummy from the GPU after sorting

    // Set the host_vector elements.
    for (int i = 0; i < H.size(); ++i) {
        H[i] = rand();  // set the host vector element to a pseudo-random number
    }

    // Sort the host_vector. Measure time.
    // Reset the clock
    timespec ts_host;
    ts_host.tv_sec = 0;
    ts_host.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // start clock
    thrust::sort(H.begin(), H.end());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // stop clock
    std::cout << "\nHost Time taken is: " << ts_host.tv_sec << " . " << ts_host.tv_nsec << " seconds" << std::endl;

    D = H;  // set the device vector elements equal to the host_vector

    // Sort the device vector. Measure time.
    timespec ts_device;
    ts_device.tv_sec = 0;
    ts_device.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // start clock
    thrust::sort(D.begin(), D.end());
    thrust::copy(D.begin(), D.end(), dummy.begin());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // stop clock
    std::cout << "\nDevice Time taken is: " << ts_device.tv_sec << " . " << ts_device.tv_nsec << " seconds" << std::endl;

    return 0;
}
You are not checking the return value of clock_settime. I would guess it is failing, probably with errno set to EPERM or EINVAL. Read the documentation and always check your return values!
If I'm right, you are not resetting the clock as you think you are, hence the second timing is cumulative with the first, plus some extra stuff you don't intend to count at all.
The right way to do this is to call clock_gettime only, storing the result first, doing the computation, then subtracting the original time from the end time.
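A minimal sketch of that pattern (CLOCK_MONOTONIC is used here for wall-clock time; CLOCK_PROCESS_CPUTIME_ID, as in the question, would measure CPU time only, and the work placeholder is obviously hypothetical):
#include <time.h>
#include <iostream>

int main()
{
    timespec t0, t1;

    // Read the clock twice and subtract; no clock_settime needed.
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // ... work to be measured, e.g. the thrust::sort calls from the question ...
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    std::cout << "elapsed: " << seconds << " seconds" << std::endl;
    return 0;
}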