Weird OpenCL calls side effect on C++ for loop performance

I'm working on a C++ project using OpenCL. I'm using the CPU as an OpenCL device with the Intel OpenCL runtime.
I noticed a weird side effect in calling OpenCL functions. Here is a simple test:
#include <iostream>
#include <cstdio>
#include <vector>
#include <CL/cl.hpp>

int main(int argc, char* argv[])
{
    /*
    cl_int status;
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
    status = queue.finish();
    printf("Status: %d\n", status);
    */
    int ch;
    int b = 0;
    int sum = 0;
    FILE* f1;
    f1 = fopen(argv[1], "r");
    while((ch = fgetc(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }
    printf("Sum: %d\n", sum);
}
It's a simple loop that reads a file char by char and sums the characters so the compiler doesn't optimize it away.
My system is a Core i7-4770K with a 2 TB HDD and 16 GB of DDR3, running Ubuntu 14.10. The program above, with a 100 MB file as input, takes around 770 ms. This is consistent with my HDD speed. So far so good.
If you now invert the comments and run only the OpenCL region, it takes around 200 ms. Again, so far so good.
But if you uncomment everything and run both regions, the program takes more than 2000 ms. I would expect 770 ms + 200 ms, but I get 2000 ms. You can even notice an increased delay between the progress messages printed inside the loop. The two regions (the OpenCL calls and the character reading) are supposed to be independent.
I don't understand why making a few OpenCL calls interferes with the performance of a simple C++ loop. It's not a simple OpenCL initialization delay.
I'm compiling this example with:
g++ weird.cpp -O2 -lOpenCL -o weird
I also tried clang++, and the same thing happens.

This was an interesting one. It's because getc switches to its thread-safe version at the point when the command queue is instantiated, so the time increase is the grab/release cycle of the stream locks - I'm not sure why or how this happens, but that is the decisive point on the AMD OpenCL SDK with Intel CPUs. I was quite amazed that I got essentially the same times as the OP.
https://software.intel.com/en-us/forums/topic/337984
You can try a remedy for this specific problem by just changing getc (fgetc in the question's code) to getc_unlocked.
That brought it back down to 930 ms for me - the remaining increase over 750 ms is mainly spent in the platform and context creation lines.
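A minimal sketch of that change applied to the question's loop (getc_unlocked is POSIX, not standard C++; it skips the per-call stream lock that the thread-safe getc takes):
// Same read loop as the question, but with the non-locking stdio call,
// so the extra threads the OpenCL runtime spins up no longer force a
// lock/unlock round-trip on every character.
FILE* f1 = fopen(argv[1], "r");
int ch;
int b = 0;
int sum = 0;
while ((ch = getc_unlocked(f1)) != EOF)
{
    sum += ch;
    b++;
    if (b % 1000000 == 0)
        printf("Char %d read\n", b);
}
printf("Sum: %d\n", sum);
fclose(f1);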

I believe that the effect is caused by the OpenCL objects still being in scope and therefore not being destroyed before the loop. While they are alive, the runtime may still be doing work on their behalf, which can affect the other computation. For example, running the example as you gave it yields the following times on my system (g++ 4.2.1 with -O2 on Mac OS X):
CL: 0.012s
Loop: 14.447s
Both: 14.874s
But putting the OpenCL code into its own anonymous scope, thereby automatically calling the destructors before the loop, seems to get rid of the problem. Using the code:
#include <iostream>
#include <cstdio>
#include <vector>
#include "cl.hpp"

int main(int argc, char* argv[])
{
    {
        cl_int status;
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        std::vector<cl::Device> devices;
        platforms[1].getDevices(CL_DEVICE_TYPE_CPU, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);
        status = queue.finish();
        printf("Status: %d\n", status);
    }
    int ch;
    int b = 0;
    int sum = 0;
    FILE* f1;
    f1 = fopen(argv[1], "r");
    while((ch = fgetc(f1)) != EOF)
    {
        sum += ch;
        b++;
        if(b % 1000000 == 0)
            printf("Char %d read\n", b);
    }
    printf("Sum: %d\n", sum);
}
I get the timings:
CL: 0.012s
Loop: 14.635s
Both: 14.648s
These seem to add up linearly. The effect is pretty small compared to other effects on the system, such as CPU load from other processes, but it seems to be gone after adding the anonymous scope. I'll do some profiling and add it as an edit if it produces anything of interest.

Related

Multithreaded C++ program using 30% CPU in Windows (compiled with MinGW), but 100% in Linux

I have written a C++ program for solving a difficult optimization problem using multiple processors. Its basic structure can be seen in the snippet below. The parallelization is done in a simple way using GLib, by spawning threads with g_thread_new.
The program was originally developed on Linux, where htop shows that it uses 100% of all cores. But on Windows the CPU usage peaks at around 30-40% on a quad-core computer with 4 physical + 4 virtual (hyper-threaded) processors. I compiled it on Windows using MinGW and g++.
Why is the performance so degraded under Windows? Is this caused by the fact that I compiled the program using MinGW?
#include <gtk/gtk.h>
#include <thread>

using namespace std;

void intensive_function() {
    //... heavy computations
    return;
}

static gpointer worker(gpointer data) {
    intensive_function();
    return NULL;
}

int main(int argc, char *argv[]) {
    int processors = thread::hardware_concurrency();
    for(int i = 0; i < processors; i++) {
        GThread *thread;
        thread = g_thread_new("worker", worker, NULL);
        g_thread_unref(thread);
    }
}
Try checking the value of:
int processors = thread::hardware_concurrency();
It can return something other than the number of processors/cores; the standard even allows it to return 0 when the value is not well defined or not computable.
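A minimal sketch of guarding against that (the fallback of 4 workers is just an assumption; pick whatever suits your workload):
#include <cstdio>
#include <thread>

int main() {
    unsigned workers = std::thread::hardware_concurrency();
    if (workers == 0)   // the standard allows 0 when the count is unknown
        workers = 4;    // hypothetical fallback
    std::printf("spawning %u worker threads\n", workers);
    return 0;
}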

How do you find out what parts of code are creating the most virtual memory?

I have a program that starts up, and within about 5 minutes the virtual size of the process is about 13 GB. It runs on Linux and uses Boost, the GNU C++ library and various other 3rd-party libraries.
After 5 minutes the virtual size stays at 13 GB and the RSS stays steady at around 5 GB.
I can't just run it in a debugger because at startup about 30 threads are started, each of which starts running its own code, that does various allocations. So stepping through and checking virtual memory at different parts of code at each breakpoint is not feasible.
I thought of changing program to start each thread one at a time to make it easier to track allocation of memory, but before doing this are there any good tools?
Valgrind is fairly slow, maybe tcmalloc could provide the info?
I would use valgrind (perhaps running it for an entire night) or else use the Boehm GC.
Alternatively, use the proc(5) filesystem (e.g. through /proc/$pid/statm & /proc/$pid/maps) to understand when a lot of memory gets allocated.
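For instance, a minimal sketch that makes the process report its own virtual size by reading /proc/self/statm (the first field is the total program size in pages):
#include <cstdio>
#include <unistd.h>

// Print this process's virtual size: the first field of /proc/self/statm
// is the total program size in pages, so multiply by the page size.
int main() {
    long pages = 0;
    FILE *f = std::fopen("/proc/self/statm", "r");
    if (f && std::fscanf(f, "%ld", &pages) == 1)
        std::printf("virtual size: %ld MB\n",
                    pages * sysconf(_SC_PAGESIZE) / (1024 * 1024));
    if (f)
        std::fclose(f);
    return 0;
}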
The most important thing is to find memory leaks. If the memory doesn't grow after startup, it is less of an issue.
Perhaps adding instance counters to each class might help (use atomic integers or mutexes to serialize them).
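A minimal sketch of such an instance counter, assuming C++11 (the class name is hypothetical; the idea is to bump a per-class atomic in every constructor and drop it in the destructor, then print the counters at checkpoints to see which class balloons):
#include <atomic>
#include <cstdio>

struct MyRow {                            // hypothetical suspect class
    static std::atomic<long> live;        // instances currently alive
    MyRow()             { ++live; }
    MyRow(const MyRow&) { ++live; }
    ~MyRow()            { --live; }
};
std::atomic<long> MyRow::live{0};

int main() {
    MyRow a, b;                           // two live instances
    std::printf("MyRow instances alive: %ld\n", MyRow::live.load());
    return 0;
}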
If the program's source code is big (e.g. a million source lines), so that spending several days or weeks is worth the effort, perhaps customizing the GCC compiler (e.g. with MELT) might be relevant.
a std::set minibenchmark
You mentioned a big std::set holding millions of rows.
#include <set>
#include <string>
#include <string.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <time.h>

class MyElem
{
    int _n;
    char _s[16-sizeof(_n)];
public:
    MyElem(int k) : _n(k)
    {
        snprintf (_s, sizeof(_s), "%d", k);
    };
    ~MyElem()
    {
        _n=0;
        memset(_s, 0, sizeof(_s));
    };
    int n() const
    {
        return _n;
    };
    std::string str() const
    {
        return std::string(_s);
    };
    bool less(const MyElem&x) const
    {
        return _n < x._n;
    };
};

bool operator < (const MyElem& l, const MyElem& r)
{
    return l.less(r);
}

typedef std::set<MyElem> MySet;

void bench (int cnt, MySet& set)
{
    for (long i=0; i<(long)cnt*1024; i++)
        set.insert(MyElem(i));
    time_t now = 0;
    time (&now);
    set.insert (((now) & 0xfffffff) * 100);
}

int main (int argc, char** argv)
{
    MySet s;
    clock_t cstart, cend;
    int c = argc>1?atoi(argv[1]):256;
    if (c<16) c=16;
    printf ("c=%d Kiter\n", c);
    cstart = clock();
    bench (c, s);
    cend = clock();
    int x = getpid();
    char cmdbuf[64];
    snprintf(cmdbuf, sizeof(cmdbuf), "pmap %d", x);
    printf ("running %s\n", cmdbuf);
    fflush (NULL);
    system(cmdbuf);
    putchar('\n');
    printf ("at end c=%d Kiter clockdiff=%.2f millisec = %.f µs/Kiter\n",
            c, (cend-cstart)*1.0e-3, (double)(cend-cstart)/c);
    if (s.find(x) != s.end())
        printf("set has %d\n", x);
    else
        printf("set don't contain %d\n", x);
    return 0;
}
Notice that sizeof(MyElem) is 16 bytes. This was run on Debian/Sid/AMD64 with GCC 4.8.1 (Intel i7 3770K processor, 16 GB RAM), compiling the benchmark with g++ -Wall -O1 tset.cc -o ./tset-01
With 32768 thousand iterations, i.e. 32 Mi (about 33.5 million) elements:
total 2109592K
(last line above given by pmap)
at end c=32768 Kiter clockdiff=16470.00 millisec = 503 µs/Kiter
Then the time implicitly reported by my zsh:
./tset-01 32768 16.77s user 0.54s system 99% cpu 17.343 total
This is about 2.1 GB, so perhaps 64.3 bytes per element including the set's overhead (since sizeof(MyElem) == 16, the set seems to add a non-negligible cost of perhaps 6 words per element).
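For reference, the arithmetic behind that estimate: 2109592 KiB ≈ 2.16 GB spread over 32768 × 1024 ≈ 33.5 million elements gives about 64 bytes per element; subtracting the 16-byte payload leaves roughly 48 bytes ≈ 6 eight-byte words per node, which is plausible for a red-black-tree node (color plus parent/left/right pointers) together with allocator bookkeeping.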

C++ AMP crashing on hardware (GeForce GTX 660)

I’m having a problem writing some C++ AMP code. I have included a sample.
It runs fine on emulated accelerators but crashes the display driver on my hardware (Windows 7, NVIDIA GeForce GTX 660, latest drivers), and I can see nothing wrong with my code.
Is there a problem with my code, or is this a hardware/driver/compiler issue?
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <amp.h>
int _tmain(int argc, _TCHAR* argv[])
{
// Prints "NVIDIA GeForce GTX 660"
concurrency::accelerator_view target_view = concurrency::accelerator().create_view();
std::wcout << target_view.accelerator.description << std::endl;
// lower numbers do not cause the issue
const int x = 2000;
const int y = 30000;
// 1d array for storing result
std::vector<unsigned int> resultVector(y);
Concurrency::array_view<unsigned int, 1> resultsArrayView(resultVector.size(), resultVector);
// 2d array for data for processing
std::vector<unsigned int> dataVector(x * y);
concurrency::array_view<unsigned int, 2> dataArrayView(y, x, dataVector);
parallel_for_each(
// Define the compute domain, which is the set of threads that are created.
resultsArrayView.extent,
// Define the code to run on each thread on the accelerator.
[=](concurrency::index<1> idx) restrict(amp)
{
concurrency::array_view<unsigned int, 1> buffer = dataArrayView[idx[0]];
unsigned int bufferSize = buffer.get_extent().size();
// needs both loops to cause crash
for (unsigned int outer = 0; outer < bufferSize; outer++)
{
for (unsigned int i = 0; i < bufferSize; i++)
{
// works without this line, also if I change to buffer[0] it works?
dataArrayView[idx[0]][0] = 0;
}
}
// works without this line
resultsArrayView[0] = 0;
});
std::cout << "chash on next line" << std::endl;
resultsArrayView.synchronize();
std::cout << "will never reach me" << std::endl;
system("PAUSE");
return 0;
}
It is very likely that your computation exceeds the permitted quantum time (2 seconds by default). After that time the operating system steps in and forcefully restarts the GPU; this is called Timeout Detection and Recovery (TDR). The software adapter (reference device) does not have TDR enabled, which is why the computation can exceed the permitted quantum time there.
Does your computation really require 30000 threads (variable y), each performing 2000 * 2000 (x * x) loop iterations? You can chunk your computation such that each chunk takes less than 2 seconds to compute. You can also consider disabling TDR or extending the permitted quantum time to fit your needs.
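A rough sketch of the chunking idea, reusing the array_views and sizes from the question (the chunk size of 1000 rows is an assumption; tune it until one submission reliably finishes well under the 2-second quantum):
// Submit the y dimension in slices so each parallel_for_each stays short.
const int chunk = 1000;                                    // assumed; tune per device
for (int start = 0; start < y; start += chunk)
{
    const int count = (start + chunk <= y) ? chunk : (y - start);
    concurrency::parallel_for_each(
        concurrency::extent<1>(count),
        [=](concurrency::index<1> idx) restrict(amp)
        {
            const int row = start + idx[0];
            // ... the same per-row work as in the question, using `row`
            //     instead of idx[0] to address dataArrayView / resultsArrayView
            resultsArrayView[row] = 0;
        });
    resultsArrayView.synchronize();                        // finish this slice before the next
}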
I highly recommend reading a blog post on how to handle TDRs in C++ AMP, which explains TDR in detail: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/07/handling-tdrs-in-c-amp.aspx
Additionally, here is a separate blog post on how to disable TDR on Windows 8:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/disabling-tdr-on-windows-8-for-your-c-amp-algorithms.aspx

clock_gettime fails in chrooted Debian etch with CLOCK_PROCESS_CPUTIME_ID

I have set up a chrooted Debian Etch (32-bit) under Ubuntu 12.04 (64-bit), and it appears that clock_gettime() works with CLOCK_MONOTONIC but fails with both CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID. errno is set to EINVAL, which according to the man page means that "The clk_id specified is not supported on this system."
All three clocks work fine outside the chroot and in a 64-bit chrooted Debian Etch.
Can someone explain to me why this is the case and how to fix it?
Much appreciated.
I don't know the cause yet, but I have ideas that won't fit in the comment box.
First, you can make the test program simpler by compiling it as C instead of C++ and not linking it to libpthread. -lrt should be good enough to get clock_gettime. Also, compiling it with -static could make tracing easier since the dynamic linker startup stuff won't be there.
Static linking might even change the behavior of clock_gettime. It's worth trying just to find out whether it works around the bug.
Another thing I'd like to see is the output of this vdso-bypassing test program:
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    struct timespec ts;
    if(syscall(SYS_clock_gettime, CLOCK_PROCESS_CPUTIME_ID, &ts)) {
        perror("clock_gettime");
        return 1;
    }
    printf("CLOCK_PROCESS_CPUTIME_ID: %lu.%09ld\n",
           (unsigned long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
with and without -static, and if it fails, add strace.
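For instance (the file name cputime.c is just an assumed example):
gcc -Wall -O0 cputime.c -o cputime -lrt
gcc -Wall -O0 -static cputime.c -o cputime-static -lrt
strace -f ./cputime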
Update (actually, skip this; go to the second update)
A couple more simple test ideas:
compile and run a 32-bit test program in the Ubuntu host system, by adding -m32 to the gcc command. It's possible that the kernel's 32-bit compatibility mode is causing the error. If that's the case, then the 32-bit version will fail no matter which libc it gets linked to.
take the non-static test programs you compiled under Debian, copy them to the Ubuntu host system and try to run them there. Change in behavior will point to libc as the cause.
Then it's time for the hard stuff: looking at disassembled code and maybe single-stepping it in gdb. Instead of having you do that on your own, I'd like to get a copy of the code you're running. Upload a statically compiled failing test program somewhere I can get it. A copy of the 32-bit vdso provided by your kernel might also be interesting. To extract the vdso, run the following program (compiled in the 32-bit chroot), which will create a file called vdso.dump, and upload that too.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int getvseg(const char *which, const char *outfn)
{
    FILE *maps, *outfile;
    char buf[1024];
    void *start, *end;
    size_t sz;
    void *copy;
    int ret;
    char search[strlen(which)+4];

    maps = fopen("/proc/self/maps", "r");
    if(!maps) {
        perror("/proc/self/maps");
        return 1;
    }
    outfile = fopen(outfn, "w");
    if(!outfile) {
        perror(outfn);
        fclose(maps);
        return 1;
    }
    sprintf(search, "[%s]\n", which);
    while(fgets(buf, sizeof buf, maps)) {
        if(strlen(buf)<strlen(search) ||
           strcmp(buf+strlen(buf)-strlen(search),search))
            continue;
        if(sscanf(buf, "%p-%p", &start, &end)!=2) {
            fprintf(stderr, "weird line in /proc/self/maps: %s", buf);
            continue;
        }
        sz = (char *)end - (char *)start;
        /* copy because I got an EFAULT trying to write directly from vsyscall */
        copy = malloc(sz);
        if(!copy) {
            perror("malloc");
            goto fail;
        }
        memcpy(copy, start, sz);
        if(fwrite(copy, 1, sz, outfile)!=sz) {
            if(ferror(outfile))
                perror(outfn);
            else
                fprintf(stderr, "%s: short write", outfn);
            free(copy);
            goto fail;
        }
        free(copy);
        goto success;
    }
    fprintf(stderr, "%s not found\n", which);
fail:
    ret = 1;
    goto out;
success:
    ret = 0;
out:
    fclose(maps);
    fclose(outfile);
    return ret;
}

int main(void)
{
    int ret = 1;
    if(!getvseg("vdso", "vdso.dump")) {
        printf("vdso dumped to vdso.dump\n");
        ret = 0;
    }
    if(!getvseg("vsyscall", "vsyscall.dump")) {
        printf("vsyscall dumped to vsyscall.dump\n");
        ret = 0;
    }
    return ret;
}
Update 2
I reproduced this by downloading an Etch libc. It's definitely caused by glibc stupidity. Instead of a simple syscall wrapper for clock_gettime, it has a big wad of preprocessor spaghetti culminating in "you can't use clock IDs that we didn't pre-approve". You're not going to get it to work with that old glibc. Which brings us to the question I didn't want to ask: why are you trying to use an obsolete version of Debian anyway?

How to get a "bus error"?

I am trying very hard to get a bus error.
One way is misaligned access and I have tried the examples given here and here, but no error for me - the programs execute just fine.
Is there some situation which is sure to produce a bus error?
This should reliably result in a SIGBUS on a POSIX-compliant system.
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>

int main() {
    FILE *f = tmpfile();
    int *m = mmap(0, 4, PROT_WRITE, MAP_PRIVATE, fileno(f), 0);
    *m = 0;
    return 0;
}
From the Single Unix Specification, mmap:
References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal.
Bus errors can only be invoked on hardware platforms that:
Require aligned access, and
Don't compensate for an unaligned access by performing two aligned accesses and combining the results.
You probably do not have access to such a system.
Try something along the lines of:
#include <signal.h>

int main(void)
{
    raise(SIGBUS);
    return 0;
}
(I know, probably not the answer you want, but it's almost sure to get you a "bus error"!)
As others have mentioned this is very platform specific. On the ARM system I'm working with (which doesn't have virtual memory) there are large portions of the address space which have no memory or peripheral assigned. If I read or write one of those addresses, I get a bus error.
You can also get a bus error if there's actually a hardware problem on the bus.
If you're running on a platform with virtual memory, you might not be able to intentionally generate a bus error with your program unless it's a device driver or other kernel mode software. An invalid memory access would likely be trapped as an access violation or similar by the memory manager (and it never even has a chance to hit the bus).
On Linux with an Intel CPU, try this:
int main(int argc, char **argv)
{
#if defined i386
    /* enable alignment check (AC) */
    asm("pushf; "
        "orl $(1<<18), (%esp); "
        "popf;");
#endif
    char d[] = "12345678"; /* yep! - causes SIGBUS even on Linux-i386 */
    return 0;
}
The trick here is to set the "alignment check" bit (bit 18, AC, in the EFLAGS register), one of the CPU's "special" registers.
I am sure that you must be using an x86 machine.
An x86 CPU does not generate a bus error unless the AC flag in its EFLAGS register is set.
Try this code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *p;

    __asm__("pushf\n"
            "orl $0x40000, (%rsp)\n"
            "popf");

    /*
     * malloc() always provides aligned memory.
     * Do not use stack variable like a[9], depending on the compiler you use,
     * a may not be aligned properly.
     */
    p = malloc(sizeof(int) + 1);
    memset(p, 0, sizeof(int) + 1);

    /* making p unaligned */
    p++;

    printf("%d\n", *(int *)p);
    return 0;
}
More about this can be found at http://orchistro.tistory.com/206
Also keep in mind that some operating systems report "bus error" for errors other than misaligned access. You didn't mention in your question what it was you were actually trying to achieve. Maybe try this:
int *x = 0;
*x=1;
The Wikipedia page you linked to mentions that access to non-existent memory can also result in a bus error. You might have better luck loading a known-invalid address into a pointer and dereferencing that.
How about this? Untested.
#include <stdio.h>

typedef struct
{
    int a;
    int b;
} busErr;

int main()
{
    busErr err;
    char * cPtr;
    int *iPtr;

    cPtr = (char *)&err;
    cPtr++;
    iPtr = (int *)cPtr;
    *iPtr = 10;
}
int main(int argc, char **argv)
{
    char *bus_error = new char[1];
    for (int i = 0; i < 1000000000; i++) {
        bus_error += 0xFFFFFFFFFFFFFFF;
        *(bus_error + 0xFFFFFFFFFFFFFF) = 'X';
    }
}
Bus error: 10 (core dumped)
Simple, write to memory that isn't yours:
int main()
{
    char *bus_error = 0;
    *bus_error = 'X';
}
Instant bus error on my PowerPC Mac [OS X 10.4, dual 1 GHz PPC7455s], though not necessarily on your hardware and/or operating system.
There's even a Wikipedia article about bus errors, including a program to make one.
For the x86 architecture:
#include <stdio.h>

int main()
{
#if defined(__GNUC__)
# if defined(__i386__)
    /* Enable Alignment Checking on x86 */
    __asm__("pushf\norl $0x40000,(%esp)\npopf");
# elif defined(__x86_64__)
    /* Enable Alignment Checking on x86_64 */
    __asm__("pushf\norl $0x40000,(%rsp)\npopf");
# endif
#endif

    int b = 0;
    int a = 0xffffff;
    char *c = (char*)&a;
    c++;
    int *p = (int*)c;
    *p = 10;   // Bus error, as the memory accessed through p is not 4- or 8-byte aligned

    printf("%zu\n", sizeof(a));
    printf("%x\n", *p);
    printf("%p\n", (void*)p);
    printf("%p\n", (void*)&a);
    return 0;
}
Note: if the asm instructions are removed, the code won't generate the SIGBUS error, as others have suggested.
SIGBUS can occur for other reasons too.
Bus errors occur if you try to access memory that your computer cannot address. For example, suppose your computer's memory has an address range of 0x00 to 0xFF, but you try to access a memory element at 0x0100 or greater.
In reality, your computer will have a much greater range than 0x00 to 0xFF.
To answer your original post:
Tell me some situation which is sure to produce a bus error.
In your code, index into memory way beyond the maximum memory limit. I dunno ... use some kind of giant hex value 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF indexed into a char* ...