I'm having a problem writing some C++ AMP code; I have included a sample below.
It runs fine on emulated accelerators but crashes the display driver on my hardware (Windows 7, NVIDIA GeForce GTX 660, latest drivers), yet I can see nothing wrong with my code.
Is there a problem with my code, or is this a hardware/driver/compiler issue?
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <amp.h>
int _tmain(int argc, _TCHAR* argv[])
{
// Prints "NVIDIA GeForce GTX 660"
concurrency::accelerator_view target_view = concurrency::accelerator().create_view();
std::wcout << target_view.accelerator.description << std::endl;
// lower numbers do not cause the issue
const int x = 2000;
const int y = 30000;
// 1d array for storing result
std::vector<unsigned int> resultVector(y);
Concurrency::array_view<unsigned int, 1> resultsArrayView(resultVector.size(), resultVector);
// 2d array for data for processing
std::vector<unsigned int> dataVector(x * y);
concurrency::array_view<unsigned int, 2> dataArrayView(y, x, dataVector);
parallel_for_each(
// Define the compute domain, which is the set of threads that are created.
resultsArrayView.extent,
// Define the code to run on each thread on the accelerator.
[=](concurrency::index<1> idx) restrict(amp)
{
concurrency::array_view<unsigned int, 1> buffer = dataArrayView[idx[0]];
unsigned int bufferSize = buffer.get_extent().size();
// needs both loops to cause crash
for (unsigned int outer = 0; outer < bufferSize; outer++)
{
for (unsigned int i = 0; i < bufferSize; i++)
{
// works without this line, also if I change to buffer[0] it works?
dataArrayView[idx[0]][0] = 0;
}
}
// works without this line
resultsArrayView[0] = 0;
});
std::cout << "chash on next line" << std::endl;
resultsArrayView.synchronize();
std::cout << "will never reach me" << std::endl;
system("PAUSE");
return 0;
}
It is very likely that your computation exceeds the permitted quantum time (2 seconds by default). After that time the operating system steps in and forcefully restarts the GPU; this is called Timeout Detection and Recovery (TDR). The software adapter (reference device) does not have TDR enabled, which is why the computation can exceed the permitted quantum time there.
Does your computation really require 30,000 threads (variable y), each performing 2000 * 2000 (x * x) loop iterations? You can chunk your computation so that each chunk takes less than 2 seconds to compute, for example by launching the kernel over smaller pieces of the index space (see the sketch below). You can also consider disabling TDR or extending the permitted quantum time to fit your needs.
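As a rough illustration (my own sketch, reusing the names from your sample and untested on your exact setup), the launch over the y rows could be split into smaller kernels, each of which only has to finish within the 2 second window on its own; chunkSize is an assumed tuning parameter:
const int chunkSize = 1000; // tune so one chunk finishes well under 2 seconds
for (int start = 0; start < y; start += chunkSize)
{
    const int count = (y - start < chunkSize) ? (y - start) : chunkSize;
    concurrency::parallel_for_each(
        concurrency::extent<1>(count),
        [=](concurrency::index<1> idx) restrict(amp)
        {
            const int row = start + idx[0];
            // ... per-row work on dataArrayView[row] goes here ...
            resultsArrayView[row] = 0;
        });
    target_view.wait(); // ensure this chunk has completed before launching the next
}
Each parallel_for_each call is a separate kernel launch, so only an individual chunk has to stay under the TDR limit, not the whole computation.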
I highly recommend reading a blog post on how to handle TDRs in C++ AMP, which explains TDR in details: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/07/handling-tdrs-in-c-amp.aspx
Additionally, here is the separate blog post on how to disable the TDR on Windows 8:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/disabling-tdr-on-windows-8-for-your-c-amp-algorithms.aspx
We are profiling a complex C++ program that performs multiple iterations of an algorithm. Timing is critical and we want to minimise the execution time of each iteration. The algorithm is such that the execution time should be very similar for each iteration, but we find that the execution time decreases with successive iterations. We suspect that the cache is responsible for this but can't fully explain what we see based on our understanding of caches.
We are running the code on an Intel Xeon processor with Centos 7.6, compiled by g++ 7.3.1.
We managed to demonstrate the behaviour using the simple program shown below:
#include <vector>
#include <fstream>
#include <array>
#include <chrono>
#include <iostream>
int main()
{
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
const unsigned NUM_BUFFERS = 200;
const unsigned BUFFER_SIZE_BYTES = 1024 * 1024;
const unsigned NUM_TRIALS = 50;
std::vector<uint8_t*> buffers;
for (int buff=0; buff<NUM_BUFFERS; ++buff)
buffers.push_back(new uint8_t[BUFFER_SIZE_BYTES]);
std::vector<double> tAll; // Records execution time for each buffer write
tAll.resize(NUM_TRIALS*NUM_BUFFERS);
unsigned indt = 0;
// For each trial
for ( unsigned indTrial=0; indTrial<NUM_TRIALS; indTrial++ )
{
// For all buffers
for ( unsigned indBuffer=0; indBuffer<NUM_BUFFERS; indBuffer++ )
{
t1 = std::chrono::high_resolution_clock::now();
// Increment contents of entire buffer
uint8_t* p_buff = buffers.at(indBuffer);
for ( unsigned ind=0; ind<BUFFER_SIZE_BYTES; ind++ )
{
p_buff[ind]++;
}
t2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
tAll.at(indt++) = duration.count();
}
}
// Write execution times to a file
std::ofstream fp;
fp.open("TEST_ARTEMIS.TXT");
for ( unsigned ind=0; ind<tAll.size(); ind++ )
{
fp << tAll[ind] << std::endl;
}
}
This program increments every byte of each of 200 1 MB buffers. The process is repeated 50 times.
The time for each complete pass over each buffer is written to a file. If we plot those times and zoom in on the first 250 buffer writes, we see the following: the first buffer write takes ~10 ms, the next few take ~3 ms, the next 200 take ~2.5 ms, and the time then drops to ~2 ms.
We don't think this behaviour can be explained by simple cache behaviour alone, as the L2/L3 caches are not large enough to contain all the buffers, so cache misses and write-backs should be happening throughout the experiment. It's as though the memory gets 'warmed up' and becomes faster over time.
Can anyone suggest an explanation for what we are seeing please?
I have to apologize for my poor English first.
I'm learning about hardware transactional memory and I'm using spin_rw_mutex.h from TBB to implement a transactional block in C++. speculative_spin_rw_mutex, a class declared in spin_rw_mutex.h, is a mutex that already implements the RTM interface of Intel TSX.
The example I used to test RTM is very simple. I created an Account class and I transfer money from one account to another at random. All accounts are in an accounts array of size 100. The random number generation comes from Boost (I think the STL has equivalent facilities). The transfer function is protected with the speculative_spin_rw_mutex. I used tbb::parallel_for and tbb::task_scheduler_init to control concurrency. All transfer calls are made in the lambda of parallel_for. The total number of transfers is 1 million. The strange thing is that when task_scheduler_init is set to 2, the program is fastest (8 seconds). In fact my CPU is an i7-6700K, which has 8 threads. In the range from 8 to 50,000, the performance barely changes (11 to 12 seconds). When I increase task_scheduler_init to 100,000, the run time increases to about 18 seconds.
I tried to use a profiler to analyze the program and found that the hotspot is the mutex. However, I don't think the transaction roll-back rate is very high. I don't know why the program is so slow.
Somebody said that false sharing slows down the performance, so I tried to use
std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE, Account(1000));
to replace the original array
Account* accounts[AccountsSIZE];
in order to avoid false sharing. It seems nothing changed.
Here is my new code.
#include <tbb/spin_rw_mutex.h>
#include <iostream>
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"
#include "boost/random.hpp"
#include <ctime>
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#include <tbb/cache_aligned_allocator.h>
#include <vector>
using namespace tbb;
tbb::speculative_spin_rw_mutex mu;
class Account {
private:
int balance;
public:
Account(int ba) {
balance = ba;
}
int getBalance() {
return balance;
}
void setBalance(int ba) {
balance = ba;
}
};
//Transfer function. Using speculative_spin_rw_mutex to protect the critical section
void transfer(Account &from, Account &to, int amount) {
speculative_spin_rw_mutex::scoped_lock lock(mu);
if ((from.getBalance())<amount)
{
throw std::invalid_argument("Illegal amount!");
}
else {
from.setBalance((from.getBalance()) - amount);
to.setBalance((to.getBalance()) + amount);
}
}
const int AccountsSIZE = 100;
//Random number generater and distributer
boost::random::mt19937 gener(time(0));
boost::random::uniform_int_distribution<> distIndex(0, AccountsSIZE - 1);
boost::random::uniform_int_distribution<> distAmount(1, 1000);
/*
Function of transfer money
*/
void all_transfer_task() {
task_scheduler_init init(10000);// Set the maximum number of threads TBB may use
/*
Initial accounts, using cache_aligned_allocator to avoid false sharing
*/
std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE,Account(1000));
const int TransferTIMES = 10000000;
//All transfer tasks
parallel_for(0, TransferTIMES, 1, [&](int i) {
try {
transfer(cache_aligned_accounts[distIndex(gener)], cache_aligned_accounts[distIndex(gener)], distAmount(gener));
}
catch (const std::exception& e)
{
//cerr << e.what() << endl;
}
//std::cout << distIndex(gener) << std::endl;
});
std::cout << cache_aligned_accounts[0].getBalance() << std::endl;
int total_balance = 0;
for (size_t i = 0; i < AccountsSIZE; i++)
{
total_balance += (cache_aligned_accounts[i].getBalance());
}
std::cout << total_balance << std::endl;
}
As Intel TSX works at cache-line granularity, false sharing is definitely the thing to start with. Unfortunately, cache_aligned_allocator does not do what you are probably expecting: it aligns the whole std::vector, but you need each individual Account to occupy a whole cache line to prevent false sharing.
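A minimal sketch of that idea (my own illustration, assuming a 64-byte cache line; this is not code from the original post):
// Pad each account to a full (assumed 64-byte) cache line so that two accounts
// never share a line; transactional aborts then come only from real data conflicts.
struct alignas(64) PaddedAccount {
    int balance = 0;
};
static_assert(sizeof(PaddedAccount) == 64, "one account per cache line");
You can still combine this with cache_aligned_allocator, which then only has to make sure the vector's storage itself starts on a cache-line boundary.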
While I can't reproduce your benchmark, I see here two possible causes for this behavior:
"Too many cooks boil the soup": you use a single spin_rw_mutex that is locked by all the transfers by all the threads. Seems to me that your transfers execute sequentially. This would explain why the profile sees a hot point there. The Intel page warns against performance degradation in such case.
Throughput vs. speed: On an i7, in a couple of benchmarks, I could notice that when you use more cores, each core runs a little bit slower, so that overall time of fixed siez loops runs longer. However, counting the overall throughput (i.e. the total number of transactions that happen in all these parallel loops) the throughput is much higher (although not fully proportinally to the number of cores).
I'd rather opt for the first case, but the second is not to eliminate.
I'm writing a console application that uses multithreading. Each thread processes a set of images using OpenCV functions.
If the function that uses the OpenCV calls is executed in a single thread, I get a reference computation time. If I execute this function from multiple threads, the function (individually, in each thread) is much slower (nearly double), when it should take nearly the same time.
Does OpenCV itself parallelize, serialize, or block execution?
I have tested the application using OpenCV libraries compiled WITH_TBB and without TBB, and the result is almost the same. I don't know if it has any influence, but I have also seen that some functions like cv::threshold or cv::findContours create 12 additional threads when executed. If the OpenCV calls are commented out, the time is the same for all threads and matches the single-threaded execution, so in that case the multithreading is working well. The question is whether there is an OpenCV compilation option or a function call that makes the multithreaded time match the single-threaded time.
EDIT
This is the result of increasing the number of threads (cores) on a 4-core CPU, executing the same function with 1, 2, 3 and 4 cores. Each core processes 768 images at 1600x1200 resolution in a for loop, and inside the loop the function causing the increasing delay is called. I would expect the time to be roughly the same as for a single thread (35000 ms), or maybe 10% more, independently of the number of cores, but as can be seen the time goes up as the number of threads increases, and I cannot find out why...
TIMES: (Sorry, the system does not allow me to upload images to posts.)
time in File No. 3 --> 35463
Mean time using 1 cores is: 47ms
time in File No. 3 --> 42747
time in File No. 3 --> 42709
Mean time using 2 cores is: 28ms
time in File No. 3 --> 54587
time in File No. 3 --> 54595
time in File No. 3 --> 54437
Mean time using 3 cores is: 24ms
time in File No. 3 --> 68751
time in File No. 3 --> 68865
time in File No. 3 --> 68878
time in File No. 3 --> 68622
Mean time using 4 cores is: 22ms
If no OpenCV code is used inside the function, the time, as expected, is similar for 1, 2, 3 or 4 threads, but when an OpenCV function is used, for example just a simple call to:
img.convertTo(img,CV_32F);
with img being a cv::Mat, the time increases as the number of threads increases. I have also tested with the hyper-threading option disabled in the CPU BIOS. In that case all the times decrease, with the single-thread time at 25,000 ms, but the problem of increasing time is still present (33 s with 2 threads, 43 s with 3, 57 s with 4)... I don't know if this tells you anything.
Edit 2
An MCVE:
#include "stdafx.h"
#include <future>
#include <chrono>
#include "Filter.h"
#include <iostream>
#include <future>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
long long Ticks();
int WithOpencv(cv::Mat img);
int With_OUT_Opencv(cv::Mat img);
int TestThreads (char *buffer,std::string file);
#define Blur3x3(matrix,f,c) ((matrix[(f-1)*1600+(c-1)] + matrix[(f-1)*1600+c] + matrix[(f-1)*1600+(c+1)] + matrix[f*1600+(c-1)] + matrix[f*1600+c] + matrix[f*1600+(c+1)] + matrix[(f+1)*1600+(c-1)] + matrix[(f+1)*1600+c] + matrix[(f+1)*1600+(c+1)])/9)
int _tmain(int argc, _TCHAR* argv[])
{
std::string file="Test.bmp";
auto function = [&](char *buffer){return TestThreads(buffer,file);};
char *buffers[12];
std::future<int> frames[12];
DWORD tid;
int i,j;
int nframes = 0;
int ncores;
cv::setNumThreads(8);
for (i=0;i<8;i++) buffers[i] = new char[1000*1024*1024];
for (j=1;j<9;j++)
{
ncores = j;
long long t = Ticks();
for (i=0;i<ncores;i++) frames[i] = std::async(std::launch::async,function,buffers[i]);
for (i=0;i<ncores;i++) nframes += frames[i].get();
t = Ticks() - t;
std::cout << "Mean time using " << ncores << " cores is: " << t/nframes << "ms" << std::endl << std::endl;
nframes = 0;
Sleep(2000);
}
for (int i=0;i<8;i++) delete[] buffers[i];
return 0;
}
int TestThreads (char *buffer,std::string file)
{
long long ta;
int res;
char *ruta=new char[file.length() + 1];
strcpy(ruta,file.c_str());
cv::Mat img (1200, 1600, CV_8UC1);
img=cv::imread(file);
ta = Ticks();
for (int i=0;i<15;i++) {
//Uncomment this and comment next line to test without opencv calls. With_OUT_Opencv implements simple filters with direct operations over mat data
//res = With_OUT_Opencv(img);
res = WithOpencv(img);
}
ta = Ticks() - ta;
std::cout << "Time in file No. 3--> " << ta << std::endl;
return 15;
}
int WithOpencv(cv::Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
return 0;
}
int With_OUT_Opencv(cv::Mat img){
unsigned char *baux1 = new unsigned char[1600*1200];
unsigned short *baux2 = new unsigned short[1600*1200];
unsigned char max=0;
int f,c,i;
unsigned char threshold = 177;
for (f=1;f<1199;f++) // Bad Blur filters
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = Blur3x3(img.data,f,c);
baux1[f*1600+c] = baux1[f*1600+c] * baux1[f*1600+c];
baux2[f*1600+c] = img.data[f*1600+c] * img.data[f*1600+c];
}
}
for (f=1;f<1199;f++)
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = sqrt(Blur3x3(baux2,f,c) - baux1[f*1600+c]);
if (baux1[f*1600+c] > max) max = baux1[f*1600+c];
}
}
threshold = threshold * ((float)max/255.0); // Bad Norm/Bin
for (i=0;i<1600*1200;i++)
{
if (baux1[i]>threshold) baux1[i] = 1;
else baux1[i] = 0;
}
delete []baux1;
delete []baux2;
return 0;
}
long long Ticks()
{
static long long last = 0;
static unsigned ticksPerMS = 0;
LARGE_INTEGER largo;
if (last==0)
{
QueryPerformanceFrequency(&largo);
ticksPerMS = (unsigned)(largo.QuadPart/1000);
QueryPerformanceCounter(&largo);
last = largo.QuadPart;
return 0;
}
QueryPerformanceCounter(&largo);
return (largo.QuadPart-last)/ticksPerMS;
}
I'm confused as to what your question is.
Your initial question suggested that running some number of iterations of the same target function in serial is considerably faster than running them in parallel, and you're wondering why the same target function is considerably slower in a multithreaded scenario.
However, I now see that your example is comparing the performance of OpenCV with some other custom code. Is that what your question is about?
Regarding the question as I initially understood it, the answer is: no, running the target function in serial is not considerably faster than running it in parallel. See the results and code below.
Results
eight threads took 4104.38 ms
single thread took 7272.68 ms
four threads took 3687 ms
two threads took 4500.15 ms
(on an Apple MBA 2012 i5 & OpenCV 3)
Test code
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace std::chrono;
using namespace cv;
class benchmark {
time_point<steady_clock> start = steady_clock::now();
string title;
public:
benchmark(const string& title) : title(title) {}
~benchmark() {
auto diff = steady_clock::now() - start;
cout << title << " took " << duration <double, milli> (diff).count() << " ms" << endl;
}
};
template <typename F>
void repeat(unsigned n, F f) {
while (n--) f();
};
int targetFunction(Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
//imshow("WithOpencv", img_bin);
return 0;
}
void runTargetFunction(int nIterations, int nThreads, const Mat& img) {
int nIterationsPerThread = nIterations / nThreads;
vector<thread> threads;
auto targetFunctionFn = [&img]() {
targetFunction(img);
};
setNumThreads(nThreads);
repeat(nThreads, [&] {
threads.push_back(thread([=]() {
repeat(nIterationsPerThread, targetFunctionFn);
}));
});
for(auto& thread : threads)
thread.join();
}
int main(int argc, const char * argv[]) {
string file = "../../opencv-test/Test.bmp";
auto img = imread(file);
const int nIterations = 64;
// let's run using eight threads
{
benchmark b("eight threads");
runTargetFunction(nIterations, 8, img);
}
// let's run using a single thread
{
benchmark b("single thread");
runTargetFunction(nIterations, 1, img);
}
// let's run using four threads
{
benchmark b("four threads");
runTargetFunction(nIterations, 4, img);
}
// let's run using a two threads
{
benchmark b("two threads");
runTargetFunction(nIterations, 2, img);
}
return 0;
}
You are measuring three things:
The time that all threads need to complete the whole task divided by the size of the whole task.
The time required by each individual thread to complete its part of the task.
The time required to complete the whole task.
You are observing that the first time goes down from 47 ms to 22 ms as you increase the number of threads. That is good! At the same time you see that the time required by an individual thread increases from 35463 to about 68751 (whatever the units are). Finally, you see that the overall execution time goes up.
Regarding the second measurement: when increasing the number of threads, the individual threads need longer to perform their respective operations. Two possible explanations:
Your threads are competing for memory bus bandwidth.
Your threads are triggering computations that are multi-threaded by themselves, so effectively they are competing with each other for CPU time (see the sketch below).
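One quick way to test the second explanation (my own suggestion, not part of the original answer) is to switch off OpenCV's internal parallelism before starting your own worker threads, so that only your threads compete for the cores:
#include <opencv2/core/core.hpp>

int main()
{
    // Ask OpenCV to run its functions sequentially: passing 0 disables its
    // internal threading, so cv::blur, cv::threshold, etc. no longer spawn
    // their own worker threads inside each of your application threads.
    cv::setNumThreads(0);
    // ... launch the std::async / std::thread workers and time them here ...
    return 0;
}
If the per-thread times stop growing with the thread count after this change, the slowdown was oversubscription from OpenCV's own thread pool; if they still grow, memory bandwidth is the more likely culprit.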
Now for the question of why the overall working time increases. The reason is simple: you are not only increasing the number of threads, you are increasing the workload at the same rate. If your threads were not competing with each other at all and there were no overhead involved, N threads would need the same time to do N times the work. They do compete, so you notice a slowdown.
I'm working on a C++ project using OpenCL. I'm using the CPU as an OpenCL device with the Intel OpenCL runtime.
I noticed a weird side effect in calling OpenCL functions. Here is a simple test:
#include <iostream>
#include <cstdio>
#include <vector>
#include <CL/cl.hpp>
int main(int argc, char* argv[])
{
/*
cl_int status;
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
status = queue.finish();
printf("Status: %d\n", status);
*/
int ch;
int b = 0;
int sum = 0;
FILE* f1;
f1 = fopen(argv[1], "r");
while((ch = fgetc(f1)) != EOF)
{
sum += ch;
b++;
if(b % 1000000 == 0)
printf("Char %d read\n", b);
}
printf("Sum: %d\n", sum);
}
It's a simple loop that reads a file char by char and adds them so the compiler doesn't try to optimize it out.
My system is a Core i7-4770K, 2TB HDD 16GB DDR3 running Ubuntu 14.10. The program above, with a 100MB file as input, takes around 770ms. This is consistent with my HDD speed. So far so good.
If you now invert the comments and run only the OpenCL calls region, it takes around 200ms. Again, so far, so good.
But if you uncomment everything, the program takes more than 2000 ms. I would expect 770 ms + 200 ms, but it is 2000 ms. You can even notice an increased delay between the output messages in the read loop. The two regions (the OpenCL calls and the char reading) are supposed to be independent.
I don't understand why using OpenCL interferes with a simple C++ for loop performance. It's not a simple OpenCL initialization delay.
I'm compiling this example with:
g++ weird.cpp -O2 -lOpenCL -o weird
I also tried using Clang++, but it happens the same.
This was an interesting one. It's because getc switches to its thread-safe version at the point when the command queue is instantiated, so the time increase is the grab/release cycle of the locks. I'm not sure why or how this occurs, but that is the decisive point on the AMD OpenCL SDK with Intel CPUs. I was quite amazed that I got essentially the same times as the OP.
https://software.intel.com/en-us/forums/topic/337984
You can try a remedy for this specific problem by simply changing the getc/fgetc call to getc_unlocked.
It brought the time back down to 930 ms for me; the remaining increase over 750 ms is mainly spent in the platform and context creation lines.
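For reference, a minimal sketch of that change (my own illustration; getc_unlocked is POSIX, and it is only safe here because a single thread reads this FILE*):
#include <cstdio>

// Sum the bytes of a file with the non-locking getc variant, so each read
// skips the per-call stream lock that the thread-safe getc/fgetc acquires.
long long sum_chars(const char* path)
{
    std::FILE* f = std::fopen(path, "r");
    if (!f) return -1;
    long long sum = 0;
    int ch;
    while ((ch = getc_unlocked(f)) != EOF)
        sum += ch;
    std::fclose(f);
    return sum;
}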
I believe the effect is caused by the OpenCL objects still being in scope, and therefore not being destroyed before the read loop; while they are alive they may be affecting the other computation. For example, running the example as you gave it yields the following times on my system (g++ 4.2.1 with -O2 on Mac OS X):
CL: 0.012s
Loop: 14.447s
Both: 14.874s
But putting the OpenCL code into its own anonymous scope, so that the destructors run automatically before the loop, seems to get rid of the problem. Using the code:
#include <iostream>
#include <cstdio>
#include <vector>
#include "cl.hpp"
int main(int argc, char* argv[])
{
{
cl_int status;
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
std::vector<cl::Device> devices;
platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
cl::Context context(devices);
cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
status = queue.finish();
printf("Status: %d\n", status);
}
int ch;
int b = 0;
int sum = 0;
FILE* f1;
f1 = fopen(argv[1], "r");
while((ch = fgetc(f1)) != EOF)
{
sum += ch;
b++;
if(b % 1000000 == 0)
printf("Char %d read\n", b);
}
printf("Sum: %d\n", sum);
}
I get the timings:
CL: 0.012s
Loop: 14.635s
Both: 14.648s
Which seems to add up linearly. The effect is pretty small compared to other effects on the system, such as CPU load from other processes, but it seems to be gone when the anonymous scope is added. I'll do some profiling and add it as an edit if it produces anything of interest.
I have a program that starts up and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux and uses Boost, the GNU C++ library, and various other 3rd-party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS holds steady at around 5 GB.
I can't just run it in a debugger, because about 30 threads are started at startup, each of which runs its own code that does various allocations, so stepping through and checking virtual memory at each breakpoint is not feasible.
I thought of changing the program to start the threads one at a time to make it easier to track memory allocation, but before doing this, are there any good tools?
Valgrind is fairly slow, maybe tcmalloc could provide the info?
I would use valgrind (perhaps running it for an entire night) or else the Boehm GC.
Alternatively, use the proc(5) filesystem (e.g. through /proc/$pid/statm & /proc/$pid/maps) to understand when a lot of memory gets allocated.
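For instance, a small helper along these lines (my own sketch, assuming Linux, where the first two fields of /proc/self/statm are the total program size and the resident set size in pages) can be called at interesting points of the program:
#include <cstdio>
#include <unistd.h>

// Log the current virtual size and RSS of this process, read from
// /proc/self/statm (values are reported in pages, usually 4 KiB each).
void logMemoryUsage(const char* tag)
{
    long vmPages = 0, rssPages = 0;
    if (std::FILE* f = std::fopen("/proc/self/statm", "r")) {
        std::fscanf(f, "%ld %ld", &vmPages, &rssPages);
        std::fclose(f);
    }
    const long pageKiB = sysconf(_SC_PAGESIZE) / 1024;
    std::fprintf(stderr, "[%s] vsize=%ld MiB rss=%ld MiB\n",
                 tag, vmPages * pageKiB / 1024, rssPages * pageKiB / 1024);
}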
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers, or mutexes to serialize them); see the sketch below.
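A minimal sketch of such a counter (my own illustration; Foo is a placeholder class name):
#include <atomic>
#include <cstdio>

// Every constructor bumps an atomic counter and every destructor decrements it,
// so you can periodically print how many live instances of each class exist.
class Foo {
    static std::atomic<long> live_;
public:
    Foo()           { ++live_; }
    Foo(const Foo&) { ++live_; }
    ~Foo()          { --live_; }
    static long liveCount() { return live_.load(); }
};
std::atomic<long> Foo::live_{0};

int main()
{
    Foo a, b;
    std::printf("live Foo instances: %ld\n", Foo::liveCount()); // prints 2
    return 0;
}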
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set based upon millions of rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>
class MyElem
{
int _n;
char _s[16-sizeof(_n)];
public:
MyElem(int k) : _n(k)
{
snprintf (_s, sizeof(_s), "%d", k);
};
~MyElem()
{
_n=0;
memset(_s, 0, sizeof(_s));
};
int n() const
{
return _n;
};
std::string str() const
{
return std::string(_s);
};
bool less(const MyElem&x) const
{
return _n < x._n;
};
};
bool operator < (const MyElem& l, const MyElem& r)
{
return l.less(r);
}
typedef std::set<MyElem> MySet;
void bench (int cnt, MySet& set)
{
for (long i=0; i<(long)cnt*1024; i++)
set.insert(MyElem(i));
time_t now = 0;
time (&now);
set.insert (((now) & 0xfffffff) * 100);
}
int main (int argc, char** argv)
{
MySet s;
clock_t cstart, cend;
int c = argc>1?atoi(argv[1]):256;
if (c<16) c=16;
printf ("c=%d Kiter\n", c);
cstart = clock();
bench (c, s);
cend = clock();
int x = getpid();
char cmdbuf[64];
snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
printf ("running %s\n", cmdbuf);
fflush (NULL);
system(cmdbuf);
putchar('\n');
printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
if (s.find(x) != s.end())
printf("set has %d\n", x);
else
printf("set don't contain %d\n", x);
return 0;
}
Notice the 16-byte sizeof(MyElem). On Debian/Sid/AMD64 with GCC 4.8.1 (Intel i3770K processor, 16 GB RAM), compiling that benchmark with g++ -Wall -O1 tset.cc -o ./tset-01
With 32768 thousand iterations, i.e. 32M elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the time implicitly reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including the set's per-member overhead (since sizeof(MyElem) == 16, the set seems to have a non-negligible cost of perhaps 6 words per element).
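As a back-of-the-envelope check (my own arithmetic, assuming a typical 64-bit libstdc++ red-black tree node with roughly four machine words of header plus some per-allocation overhead in malloc):
#include <cstdio>

int main()
{
    // pmap reported ~2109592 KiB for 32768 * 1024 inserted elements.
    const double totalBytes = 2109592.0 * 1024;
    const double elements   = 32768.0 * 1024;
    std::printf("bytes per element: %.1f\n", totalBytes / elements); // ~64.4
    // 16 bytes of MyElem payload + ~32 bytes of tree-node header leaves ~16 bytes,
    // plausibly malloc bookkeeping and rounding per allocation.
    return 0;
}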