C++: STXXL and VS runtime error in simple code - c++

I have the following code, which is a pretty simple test, but VS refuses to run it:
stxxl::syscall_file OutputFile("Data/test.bin", stxxl::file::RDWR | stxxl::file::CREAT | stxxl::file::DIRECT);
typedef stxxl::VECTOR_GENERATOR<struct Rectangle, 8, 2, 524288>::result vector_type;
vector_type rects(&OutputFile);
the program produces a runtime error at a memory location on the 3rd line. What am I doing wrong? I'm compiling the program for 64-bit platforms. In Debug mode, if I press Continue, the program resumes and executes without problems.

Consider the following example:
#include <stxxl/io>
#include <stxxl/vector>
#include <iostream>
struct Rectangle {
    int x;
    Rectangle() = default;
};
int main() {
    stxxl::syscall_file OutputFile("/tmp/test.bin", stxxl::file::RDWR |
                                   stxxl::file::CREAT | stxxl::file::DIRECT);
    typedef stxxl::VECTOR_GENERATOR<Rectangle, 8, 2, 524288>::result vector_type;
    vector_type rects(&OutputFile);
    Rectangle my_rectangle;
    for (std::size_t i = 0; i < 1024 * 1024 * 1024; ++i)
        rects.push_back(my_rectangle);
    return 0;
}
An error can easily be provoked when there is not enough space left on the device. Can you post your runtime error?

Related

OpenAcc error from simple loop: illegal address during kernel execution

I am getting a "call to cuMemcpyDtoHsync returned error 700: Illegal address during kernel execution" error when I try to parallelize this simple loop.
#include <vector>
#include <iostream>
using namespace std;
int main() {
    vector<float> xF = {0, 1, 2, 3};
    #pragma acc parallel loop
    for (int i = 0; i < 4; ++i) {
        xF[i] = 0.0;
    }
    return 0;
}
Compiled with: $ pgc++ -fast -acc -std=c++11 -Minfo=accel -o test test.cpp
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
T3 std::__copy_move_a2<(bool)0, const float *, decltype((std::allocator_traits<std::allocator<float>>::_S_pointer_helper<std::allocator<float>>((std::allocator<float>*)0)))>(T2, T2, T3):
1, include "vector"
64, include "stl_vector.h"
/usr/bin/ld: error in /tmp/pgc++cAUEgAXViQSY.o(.eh_frame); no .eh_frame_hdr table will be created.
$ ./test
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
The code runs normally without the #pragma, but I would like to make it parallel. What am I doing wrong?
Try compiling with "-ta=tesla:managed".
The problem here is that you aren't explicitly managing the data movement between the host and the device, and the compiler can't implicitly manage it for you: since a std::vector is just a class containing pointers, the compiler can't tell the size of the data. Hence the device ends up using host addresses, causing the illegal memory accesses.
While you can manage the data yourself by grabbing the vector's raw pointers and then using a data clause to copy the vector's data as well as copying the vector itself to the device, it's much easier to use CUDA Unified Memory (i.e. the "managed" flag) and have the CUDA runtime manage the data movement for you.
As Jerry notes, it's generally not recommended to use vectors in parallel code since they are not thread-safe. In this case it's fine, but you may encounter other issues, especially if you try to push or pop data. It's better to use arrays, which are also easier to manage between the host and device copies.
% cat test.cpp
#include <vector>
#include <iostream>
using namespace std;
int main() {
    vector<float> xF = {0, 1, 2, 3};
    #pragma acc parallel loop
    for (int i = 0; i < 4; ++i) {
        xF[i] = 0.0;
    }
    for (int i = 0; i < 4; ++i) {
        std::cout << xF[i] << std::endl;
    }
    return 0;
}
% pgc++ -ta=tesla:cc70,managed -Minfo=accel test.cpp --c++11 ; a.out
main:
6, Accelerator kernel generated
Generating Tesla code
9, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.x */
6, Generating implicit copy(xF)
std::vector<float, std::allocator<float>>::operator [](unsigned long):
1, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating Tesla code
0
0
0
0

delete and memory management

I've found unexpected results about memory management while running the following (sample) code:
#include <stdint.h>
#include <iostream>
#include <vector>
#define BIGNUM 100000000
// sample struct
struct Coordinate {
    uint64_t x;
    uint64_t y;
    uint64_t z;
    Coordinate() {
        x = 1ULL;
        y = 2ULL;
        z = 3ULL;
    }
};
int main() {
    std::vector<Coordinate*>* coordinates = new std::vector<Coordinate*>();
    for (int i = 0; i < BIGNUM; ++i)
        coordinates->push_back(new Coordinate());
    // block 1
    for (std::vector<Coordinate*>::iterator it = coordinates->begin(); it != coordinates->end(); ++it)
        delete(*it);
    // block 2
    delete(coordinates);
    std::cout << "end\n";
    std::cin.get();
    return 0;
}
On my Ubuntu 14.04:
The command ps aux --sort -rss was run while the program waited at std::cin.get(), 4 times, with small differences:
1) program as is
2) with block 1 commented (basically no delete on every vector's element)
3) with block 2 commented (so no delete on vector)
4) with both blocks 1 and 2 commented.
To my (big) surprise, tests 1) and 2) have almost the same RSS / VSZ results. In simple words, it seems that delete(*it); doesn't work properly (doesn't free memory). The same conclusion can be drawn from 3) and 4).
On Windows XP (running in VirtualBox) everything is fine and memory usage is 0-2 MB when running the program as is.
Just because delete frees memory doesn't mean that the memory is immediately released back to the operating system for general use. Memory management on modern OSes is just not that simple.
There's nothing wrong here other than your assumptions!

C++/RAII: Could this cause a memory leak?

I have a weird problem. I have written some MEX/MATLAB functions using C++. On my computer everything works fine. However, on the institute's cluster, the code sometimes simply stops running without any error (a core file is created which says "CPU time limit exceeded"). Unfortunately, on the cluster I cannot really debug my code, and I also cannot reproduce the error.
What I do know is that the error only occurs for very large runs, i.e., when a lot of memory is required. My assumption is therefore that my code has some memory leaks.
The only real part I could think of is the following bit:
#include <vector>
using std::vector;
vector<int> createVec(int length) {
    vector<int> out(length);
    for (int i = 0; i < length; ++i)
        out[i] = 2.0 + i; // the real vector is obviously filled differently, but it's just simple computations
    return out;
}
void someFunction() {
    int numUser = 5;
    int numStages = 3;
    // do some stuff
    for (int user = 0; user < numUser; ++user) {
        vector< vector<int> > userData(numStages);
        for (int stage = 0; stage < numStages; ++stage) {
            userData[stage] = createVec(42);
            // use the vector for some computations
        }
    }
}
My question now is: could this bit produce memory leaks, or is it safe due to RAII (which I would think it is)? A question for the MATLAB experts: does this behave any differently when run as a MEX file?
Thanks

C++ AMP crashing on hardware (GeForce GTX 660)

I’m having a problem writing some C++ AMP code. I have included a sample.
It runs fine on emulated accelerators but crashes the display driver on my hardware (Windows 7, NVIDIA GeForce GTX 660, latest drivers), and I can see nothing wrong with my code.
Is there a problem with my code, or is this a hardware/driver/compiler issue?
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <amp.h>
int _tmain(int argc, _TCHAR* argv[])
{
    // Prints "NVIDIA GeForce GTX 660"
    concurrency::accelerator_view target_view = concurrency::accelerator().create_view();
    std::wcout << target_view.accelerator.description << std::endl;
    // lower numbers do not cause the issue
    const int x = 2000;
    const int y = 30000;
    // 1d array for storing result
    std::vector<unsigned int> resultVector(y);
    concurrency::array_view<unsigned int, 1> resultsArrayView(resultVector.size(), resultVector);
    // 2d array for data for processing
    std::vector<unsigned int> dataVector(x * y);
    concurrency::array_view<unsigned int, 2> dataArrayView(y, x, dataVector);
    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        resultsArrayView.extent,
        // Define the code to run on each thread on the accelerator.
        [=](concurrency::index<1> idx) restrict(amp)
        {
            concurrency::array_view<unsigned int, 1> buffer = dataArrayView[idx[0]];
            unsigned int bufferSize = buffer.get_extent().size();
            // needs both loops to cause crash
            for (unsigned int outer = 0; outer < bufferSize; outer++)
            {
                for (unsigned int i = 0; i < bufferSize; i++)
                {
                    // works without this line, also if I change to buffer[0] it works?
                    dataArrayView[idx[0]][0] = 0;
                }
            }
            // works without this line
            resultsArrayView[0] = 0;
        });
    std::cout << "crash on next line" << std::endl;
    resultsArrayView.synchronize();
    std::cout << "will never reach me" << std::endl;
    system("PAUSE");
    return 0;
}
It is very likely that your computation exceeds the permitted quantum time (2 seconds by default). After that time the operating system steps in and restarts the GPU forcefully; this is called Timeout Detection and Recovery (TDR). The software adapter (reference device) does not have TDR enabled, which is why the computation can exceed the permitted quantum time there.
Does your computation really require 30000 threads (variable y), each performing 2000 * 2000 (x * x) loop iterations? You can chunk your computation so that each chunk takes less than 2 seconds to compute. You can also consider disabling TDR or extending the permitted quantum time to fit your needs.
I highly recommend reading a blog post on how to handle TDRs in C++ AMP, which explains TDR in detail: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/07/handling-tdrs-in-c-amp.aspx
Additionally, here is a separate blog post on how to disable TDR on Windows 8:
http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/disabling-tdr-on-windows-8-for-your-c-amp-algorithms.aspx

XCode Debugging

#include <iostream>
#include <vector>
using namespace std;
typedef struct Record
{
    std::string name;
    bool isVisible;
    int index;
} Record;
vector<Record> recordVector;
int main (int argc, char * const argv[])
{
    Record tmpRecord = {"c++", true, 1};
    for (int i = 0; i < 15; ++i) {
        recordVector.push_back(tmpRecord);
    }
    return 0;
}
When I am debugging this and hover my cursor over the recordVector variable to see its entire contents, it shows only 10 elements (0-9), and the memory browser doesn't show the full contents either, even though the vector holds 15 elements.
Any clue to work around this would be greatly appreciated.
Be sure you are using the "Debug" build configuration. Debug builds generate debug symbols and disable code optimization; otherwise the information shown by the debugger may be inaccurate.
You can find more info about this topic in the Mac OS X Reference Library.