I have a weird problem. I have written some MEX/MATLAB functions using C++. On my computer everything works fine. However, on the institute's cluster the code sometimes simply stops running without any error message (a core file is created which says "CPU time limit exceeded"). Unfortunately, I cannot really debug my code on the cluster, and I also cannot reproduce the error.
What I do know is that the error only occurs for very large runs, i.e., when a lot of memory is required. My assumption is therefore that my code has some memory leaks.
The only part I can really think of is the following bit:
#include <vector>
using std::vector;

vector<int> createVec(int length) {
    vector<int> out(length);
    for (int i = 0; i < length; ++i)
        out[i] = 2.0 + i; // the real vector is obviously filled differently, but it's just simple computations
    return out;
}

void someFunction() {
    int numUser = 5;
    int numStages = 3;
    // do some stuff
    for (int user = 0; user < numUser; ++user) {
        vector< vector<int> > userData(numStages);
        for (int stage = 0; stage < numStages; ++stage) {
            userData[stage] = createVec(42);
            // use the vector for some computations
        }
    }
}
My question now is: could this bit produce memory leaks, or is it safe due to RAII (which I would think it is)? A question for the MATLAB experts: does this behave any differently when run as a MEX file?
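For reference, here is a minimal standalone harness (my own sketch, not part of the MEX file) that drives the same functions outside MATLAB, so the code paths can be checked with a leak checker such as valgrind --leak-check=full:

// Hypothetical leak-check harness; createVec() and someFunction() are the
// functions shown above, compiled into a plain executable instead of a MEX file.
#include <vector>
using std::vector;

vector<int> createVec(int length);   // as defined above
void someFunction();                 // as defined above

int main() {
    for (int run = 0; run < 1000; ++run)
        someFunction();              // RAII should free every vector at the end of each call
    return 0;
}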
Thanks
I have always been confused and never understood how the alloc map-type of the map clause of the target (or target data) construct works.
My application is this: I would like to have a temporary array that lives only on the device; it is initialized on the device, read on the device, everything happens on the device. The host does not touch its contents at all. For the sake of simplicity, I have the following code, which copies one array to another via a temporary array (using just a single team and thread, but that does not matter):
#include <cstdio>

int main()
{
    const int count = 10;
    int * src = new int[count];
    int * tmp = new int[count];
    int * dst = new int[count];
    for (int i = 0; i < count; i++) src[i] = i;
    for (int i = 0; i < count; i++) printf(" %3d", src[i]);
    printf("\n");

    #pragma omp target map(to:src[0:count]) map(from:dst[0:count]) map(alloc:tmp[0:count])
    {
        for (int i = 0; i < count; i++) tmp[i] = src[i];
        for (int i = 0; i < count; i++) dst[i] = tmp[i];
    }

    for (int i = 0; i < count; i++) printf(" %3d", dst[i]);
    printf("\n");
    delete[] src;
    delete[] tmp;
    delete[] dst;
    return 0;
}
This code works on an Nvidia GPU with pgc++ -mp=gpu and on an Intel GPU with icpx -fiopenmp -fopenmp-targets=spir64.
But the thing is, I don't want to allocate the tmp array on the host. If I just use int * tmp = nullptr, the code fails on Nvidia (on Intel it still works). If I leave tmp uninitialized (using just int * tmp; and removing the delete), the execution fails on Intel too. If I do not even declare the tmp variable, compilation fails (which kind of makes sense). I made sure it really runs on the device (actually offloads the code and does not fall back to the CPU) by using OMP_TARGET_OFFLOAD=MANDATORY.
This was weird to me, since I don't use the tmp array on the host at all. As I understand it, the tmp array is allocated on the device and then in the kernel the device array is used. Is that right? Why do I have to allocate and/or initialize the pointer on the host if I don't use it on the host?
So my question is: what are the exact requirements to use map(alloc) in OpenMP offloading? How does it work? How should I use it? I would appreciate an example and references from tutorials/documentation.
I wasn't able to find any useful information regarding this. The standard was not helpful at all, and the tutorials I attended and watched did not go into such depth.
I understand that the code should work even without OpenMP enabled (as if the pragmas were just ignored), so let's assume there is an #ifdef to actually allocate the tmp array if OpenMP is disabled.
I am also aware of manual memory management via omp_target_alloc(), omp_target_memcpy() and omp_target_free(), but I wanted to use the target map(alloc).
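For comparison, here is a sketch of that manual route (my own example, not from any tutorial): the temporary is allocated directly on the device with omp_target_alloc() and handed to the target region via is_device_ptr, so the host never owns a tmp buffer at all.

#include <cstdio>
#include <omp.h>

int main() {
    const int count = 10;
    int * src = new int[count];
    int * dst = new int[count];
    for (int i = 0; i < count; i++) src[i] = i;

    int dev = omp_get_default_device();
    // device-only buffer; the host never dereferences this pointer
    int * tmp = (int *) omp_target_alloc(count * sizeof(int), dev);

    #pragma omp target map(to:src[0:count]) map(from:dst[0:count]) is_device_ptr(tmp)
    {
        for (int i = 0; i < count; i++) tmp[i] = src[i];
        for (int i = 0; i < count; i++) dst[i] = tmp[i];
    }

    omp_target_free(tmp, dev);
    for (int i = 0; i < count; i++) printf(" %3d", dst[i]);
    printf("\n");
    delete[] src;
    delete[] dst;
    return 0;
}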
I am reading version 5.2 of the standard, and I am using pgc++ 22.2-0 and icpx 2022.0.0.20211123.
After learning from a previous question that my VS 2017 C++ AMP project was basically sound, that the error messages, while correct, were masking the real problem, and that the issue was with certain lines of code, I rewrote the code as below. By commenting out various lines at a time, I learned that
extent<2> e(M,N);
index<2> idx(0,0);
will build and execute, that code like
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++)
will build but will throw an exception if run, and that code like
c[idx] = a[idx] + b[idx];
will not even build. Note that I have not as yet invoked any parallel functions.
This leads me to ask: does Concurrency Runtime or C++ AMP require that GPU hardware be installed to build and/or execute properly?
My machine has two multi-core CPU processors, but the GPU hardware hasn't been installed yet. Still, I thought I would be able to use the parallelism constructs to take advantage of the processors I do have.
#include "pch.h"
#include <iostream>
#include "amp.h"
#include <vector>
using namespace Concurrency;
int main() {
const int M = 1024; const int N = 1024; //row, col for vector
std::vector<int> vA(M*N); std::vector<int> vB(M*N); //vectors to add
std::vector<int> vC(M*N); //vector for result
for (int i = 0; i < M; i++) { vA[i] = i; } //populate vectors
for (int j = N - 1; j >= 0; j--) { vB[j] = j; }
extent<2> e(M, N); //uses AMP constructs but
index<2> idx(0, 0); //no parallel functions invoked
array_view<int, 2> a(e, vA), b(e, vB);
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++) {
for (idx[1] = 0; idx[1] < e[1]; idx[1]++) {
c[idx] = a[idx] + b[idx];
c(idx[0], idx[1]) = a(idx[0], idx[1]) + b(idx[0], idx[1]);
}
}
}
No, GPU hardware is not required. After starting the successfully compiled program without GPU hardware, the system created a "software" GPU, as shown in the output when debugging:
'Amp2.exe' (Win32): Loaded 'C:\Windows\SysWOW64\d3d11ref.dll'. [...]
GPU Device Created.
I used the available GPU diagnostics tool to look at performance.
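For completeness, a small sketch (not from the original answer) that lists the accelerators C++ AMP can see; on a machine without GPU hardware it shows the emulated REF/WARP "software" devices:

#include <amp.h>
#include <iostream>
using namespace Concurrency;

int main() {
    // Enumerate every accelerator the runtime knows about and print whether
    // it is an emulated (software) device.
    for (accelerator acc : accelerator::get_all()) {
        std::wcout << acc.description
                   << L"  path=" << acc.device_path
                   << L"  emulated=" << (acc.is_emulated ? L"yes" : L"no")
                   << std::endl;
    }
    return 0;
}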
I am new to C++ programming and Stack Overflow, but I have some experience with core Java. I wanted to participate in programming Olympiads, and I chose C++ because C++ code is generally faster than equivalent Java code.
I was solving some zonal-level problems involving recursion and DP, and I came across this question called Sequence game.
Unfortunately, my code doesn't seem to work. It exits with exit code 3221225477, but I can't make anything of it. I remember Java doing a much better job of pointing out my mistakes, but here in C++ I don't have a clue what's happening. Here's the code:
#include <iostream>
#include <fstream>
#include <cstdio>
#include <algorithm>
#include <vector>
#include <set>
using namespace std;

int N, minimum, maximum;
set<unsigned int> result;
vector<unsigned int> integers;
bool status = true;

void score(unsigned int b, unsigned int step)
{
    if (step < N)
    {
        unsigned int subtracted;
        unsigned int added = b + integers[step];
        bool add_gate = (added <= maximum);
        bool subtract_gate = (b <= integers[step]);
        if (subtract_gate)
            subtracted = b - integers[step];
        subtract_gate = subtract_gate && (subtracted >= minimum);
        if (add_gate && subtract_gate)
        {
            result.insert(added);
            result.insert(subtracted);
            score(added, step++);
            score(subtracted, step++);
        }
        else if (!(add_gate) && !(subtract_gate))
        {
            status = false;
            return;
        }
        else if (add_gate)
        {
            result.insert(added);
            score(added, step++);
        }
        else if (subtract_gate)
        {
            result.insert(subtracted);
            score(subtracted, step++);
        }
    }
    else return;
}

int main()
{
    ios_base::sync_with_stdio(false);
    ifstream input("input.txt");         // attach to input file
    streambuf *cinbuf = cin.rdbuf();     // save old cin buffer
    cin.rdbuf(input.rdbuf());            // redirect cin to input.txt
    ofstream output("output.txt");       // attach to output file
    streambuf *coutbuf = cout.rdbuf();   // save old cout buffer
    cout.rdbuf(output.rdbuf());          // redirect cout to output.txt

    unsigned int b;
    cin >> N >> b >> minimum >> maximum;
    for (unsigned int i = 0; i < N; ++i)
        cin >> integers[i];
    score(b, 0);

    set<unsigned int>::iterator iter = result.begin();
    if (status)
        cout << *iter << endl;
    else
        cout << -1 << endl;

    cin.rdbuf(cinbuf);
    cout.rdbuf(coutbuf);
    return 0;
}
(Note: I intentionally did not use typedef).
I compiled this code with mingw-w64 on a Windows machine, and here is the output:
[Finished in 19.8s with exit code 3221225477] ...
Although I have an Intel i5-8600, compilation took a long time; much of that time was spent by the antivirus scanning my exe file, and sometimes it keeps compiling for a long time even without any intervention from the antivirus.
(Note: I did not use the command line; instead I used Sublime Text to compile it.)
I also tried TDM-GCC, and again some other peculiar exit code came up. I tried running it on an Ubuntu machine, but unfortunately it couldn't find the output file. When I ran it on the CodeChef online IDE it did not run properly either, but the error message was less scary than MinGW's.
It said that there was a run-time error and "SIGSEGV" was displayed as an error code. Codechef states that
A SIGSEGV is an error(signal) caused by an invalid memory reference or
a segmentation fault. You are probably trying to access an array
element out of bounds or trying to use too much memory. Some of the
other causes of a segmentation fault are : Using uninitialized
pointers, dereference of NULL pointers, accessing memory that the
program doesn’t own.
I have been trying to solve this for a few days now, and I am really frustrated. When I started solving this problem I first used C arrays, then changed to vectors, and finally to std::set, hoping that it would solve the problem, but nothing worked. I tried another DP problem, and the same thing happened.
It would be great if someone could help me figure out what's wrong with my code.
Thanks in advance.
3221225477 converted to hex is 0xC0000005, which stands for STATUS_ACCESS_VIOLATION, which means you tried to access (read, write or execute) invalid memory.
I remember Java did a much better job of pointing out my mistakes, but here in c++ I don't have a clue of what's happening.
When your program crashes, you should run it under a debugger. Since you're running your code on Windows, I highly recommend Visual Studio 2017 Community Edition. If you ran your code under it, it would point out the exact line where the crash happens.
As for the crash itself, as PaulMcKenzie points out in the comments, you're indexing an empty vector, which makes std::cin write to out-of-bounds memory.
integers is a vector, which is a dynamic contiguous array whose size is not known at compile time here. So when it is first defined, it is empty. You need to insert elements into the vector. Change the following:
for (unsigned int i = 0; i < N; ++i)
    cin >> integers[i];
to this:
unsigned int j;
for (unsigned int i = 0; i < N; ++i) {
    cin >> j;
    integers.push_back(j);
}
P.W's answer is correct, but an alternative to using push_back is to pre-allocate the vector after N is known. Then you can read from cin straight into the vector elements as before.
integers = vector<unsigned int>(N);
for (unsigned int i = 0; i < N; i++)
    cin >> integers[i];
This method has the added advantage of only allocating memory for the vector once. The push_back method will reallocate if the underlying buffer fills up.
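For what it's worth, a third variant (my own note, not from either answer) keeps push_back but reserves the capacity up front, which also avoids reallocation:

integers.reserve(N);                      // one allocation; size stays 0
for (unsigned int i = 0; i < N; ++i) {
    unsigned int j;
    cin >> j;
    integers.push_back(j);                // grows into the reserved capacity
}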
Solution: Apparently the culprit was the use of floor(), the performance of which turns out to be OS-dependent in glibc.
This is a followup question to an earlier one: Same program faster on Linux than Windows -- why?
I have a small C++ program, that, when compiled with nuwen gcc 4.6.1, runs much faster on Wine than Windows XP (on the same computer). The question: why does this happen?
The timings are ~15.8 and 25.9 seconds, for Wine and Windows respectively. Note that I'm talking about the same executable, not only the same C++ program.
The source code is at the end of the post. The compiled executable is here (if you trust me enough).
This particular program does nothing useful, it is just a minimal example boiled down from a larger program I have. Please see this other question for some more precise benchmarking of the original program (important!!) and the most common possibilities ruled out (such as other programs hogging the CPU on Windows, process startup penalty, difference in system calls such as memory allocation). Also note that while here I used rand() for simplicity, in the original I used my own RNG which I know does no heap-allocation.
The reason I opened a new question on the topic is that now I can post an actual simplified code example for reproducing the phenomenon.
The code:
#include <cstdlib>
#include <cmath>

int irand(int top) {
    return int(std::floor((std::rand() / (RAND_MAX + 1.0)) * top));
}

template<typename T>
class Vector {
    T *vec;
    const int sz;
public:
    Vector(int n) : sz(n) {
        vec = new T[sz];
    }
    ~Vector() {
        delete [] vec;
    }
    int size() const { return sz; }
    const T & operator [] (int i) const { return vec[i]; }
    T & operator [] (int i) { return vec[i]; }
};

int main() {
    const int tmax = 20000; // increase this to make it run longer
    const int m = 10000;

    Vector<int> vec(150);
    for (int i = 0; i < vec.size(); ++i)
        vec[i] = 0;

    // main loop
    for (int t = 0; t < tmax; ++t)
        for (int j = 0; j < m; ++j) {
            int s = irand(100) + 1;
            vec[s] += 1;
        }

    return 0;
}
UPDATE
It seems that if I replace irand() above with something deterministic such as
int irand(int top) {
    static int c = 0;
    return (c++) % top;
}
then the timing difference disappears. I'd like to note though that in my original program I used a different RNG, not the system rand(). I'm digging into the source of that now.
UPDATE 2
Now I replaced the irand() function with an equivalent of what I had in the original program. It is a bit lengthy (the algorithm is from Numerical Recipes), but the point is to show that no system libraries are being called explicitly (except possibly through floor()). Yet the timing difference is still there!
Perhaps floor() could be to blame? Or the compiler generates calls to something else?
class ran1 {
    static const int table_len = 32;
    static const int int_max = (1u << 31) - 1;
    int idum;
    int next;
    int *shuffle_table;

    void propagate() {
        const int int_quo = 1277731;
        int k = idum / int_quo;
        idum = 16807*(idum - k*int_quo) - 2836*k;
        if (idum < 0)
            idum += int_max;
    }

public:
    ran1() {
        shuffle_table = new int[table_len];
        seedrand(54321);
    }
    ~ran1() {
        delete [] shuffle_table;
    }
    void seedrand(int seed) {
        idum = seed;
        for (int i = table_len-1; i >= 0; i--) {
            propagate();
            shuffle_table[i] = idum;
        }
        next = idum;
    }
    double frand() {
        int i = next/(1 + (int_max-1)/table_len);
        next = shuffle_table[i];
        propagate();
        shuffle_table[i] = idum;
        return next/(int_max + 1.0);
    }
} rng;

int irand(int top) {
    return int(std::floor(rng.frand() * top));
}
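One simple way to test the floor() suspicion (just a sketch, not something from the original program): frand() never returns a negative value here, so a plain integer cast gives the same result and removes the library call entirely.

int irand(int top) {
    // for non-negative values, truncation by int() equals std::floor()
    return int(rng.frand() * top);
}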
Edit: it turned out that the culprit was floor() and not rand() as I suspected; see the update at the top of the OP's question.
The run time of your program is dominated by the calls to rand().
I therefore think that rand() is the culprit. I suspect that the underlying function is provided by the WINE/Windows runtime, and the two implementations have different performance characteristics.
The easiest way to test this hypothesis would be to simply call rand() in a loop, and time the same executable in both environments.
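A minimal timing loop along those lines might look like this (a sketch; the iteration count and output format are arbitrary), built once and then run under both Windows and Wine:

#include <cstdlib>
#include <cstdio>
#include <ctime>

int main() {
    std::clock_t start = std::clock();
    unsigned long long sum = 0;
    for (long i = 0; i < 200000000L; ++i)
        sum += std::rand();    // keep the result live so the loop isn't optimized away
    double secs = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("sum=%llu  time=%.2f s\n", sum, secs);
    return 0;
}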
Edit: I've had a look at the Wine source code, and here is its implementation of rand():
/*********************************************************************
 *      rand (MSVCRT.#)
 */
int CDECL MSVCRT_rand(void)
{
    thread_data_t *data = msvcrt_get_thread_data();

    /* this is the algorithm used by MSVC, according to
     * http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators */
    data->random_seed = data->random_seed * 214013 + 2531011;
    return (data->random_seed >> 16) & MSVCRT_RAND_MAX;
}
I don't have access to Microsoft's source code to compare, but it wouldn't surprise me if the difference in performance was in the getting of thread-local data rather than in the RNG itself.
Wikipedia says:
Wine is a compatibility layer not an emulator. It duplicates functions of a Windows computer by providing alternative implementations of the DLLs that Windows programs call,[citation needed] and a process to substitute for the Windows NT kernel. This method of duplication differs from other methods that might also be considered emulation, where Windows programs run in a virtual machine.[2] Wine is predominantly written using black-box testing reverse-engineering, to avoid copyright issues.
This implies that the developers of Wine could replace an API call with anything at all, as long as the end result was the same as you would get with a native Windows call. And I suppose they weren't constrained by needing to make it compatible with the rest of Windows.
From what I can tell, the C standard libraries used WILL be different in the two different scenarios. This affects the rand() call as well as floor().
From the mingw site... MinGW compilers provide access to the functionality of the Microsoft C runtime and some language-specific runtimes. Running under XP, this will use the Microsoft libraries. Seems straightforward.
However, the model under Wine is much more complex. According to this diagram, the operating system's libc comes into play. This could be the difference between the two.
While Wine is basically Windows, you're still comparing apples to oranges. As well, not only is it apples/oranges, the underlying vehicles hauling those apples and oranges around are completely different.
In short, your question could trivially be rephrased as "this code runs faster on Mac OSX than it does on Windows" and get the same answer.
I have a procedure that calculates the outcome of a game by random sampling. It is passed a number of iterations, runs a loop of that size storing the outcomes in a local variable (subHits), and then, after the loop is done, adds the totals from the local variable into a class-level member variable (m_Hits), to wit:
void Game::LogOutcomes(long periodSize) {
    int subHits[11];
    for (int i = 0; i < 11; ++i) {
        subHits[i] = 0;
    }
    for (int iters = 0; iters < periodSize; ++iters) {
        // ... snipped out code calculating rankIndex by random sample.
        ++subHits[rankIndex];
    }
    for (int i = 0; i < 11; ++i) {
        m_Hits[i] += subHits[i];
    }
}
.. of course, it uses a local variable as temporary storage for purposes of running the procedure in parallel, which I invoke with:
dispatch_queue_t globalQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(m_BatchSize / m_PeriodSize, globalQ, ^(size_t periodCount) {
    LogOutcomes(m_PeriodSize);
});
.. and it seems to work perfectly (all results are sufficiently close to the statistically expected values). I can't help but think there's something wrong, because nowhere am I specifically 'locking' the class-level variable when updating it with the contents of the local variable, and I suspect I'm getting the right results through sheer good fortune.
Is there something I'm missing?
You're getting lucky. You should either have a dedicated (serial) queue for updating the shared state, or use an atomic add (for example OSAtomicAdd32 or OSAtomicAdd64) to update it. Without this you'll be losing updates occasionally.
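A minimal sketch of the serial-queue option (m_MergeQueue is a hypothetical member here, created once elsewhere with dispatch_queue_create("game.merge", DISPATCH_QUEUE_SERIAL)):

#include <dispatch/dispatch.h>

void Game::LogOutcomes(long periodSize) {
    int subHits[11] = {0};
    for (long iters = 0; iters < periodSize; ++iters) {
        // ... compute rankIndex by random sample and ++subHits[rankIndex] ...
    }
    int *local = subHits;              // blocks cannot capture a C array directly
    dispatch_sync(m_MergeQueue, ^{     // dispatch_sync is synchronous, so subHits is still alive
        for (int i = 0; i < 11; ++i)
            m_Hits[i] += local[i];     // only one period merges at a time
    });
}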