DPC++ & MPI, buffer, shared memory, variable declare - c++

I am new to DPC++, and I am trying to develop an MPI-based DPC++ Poisson solver. I read the book and am very confused about the difference between a buffer and a pointer to shared or host memory. What is the difference between those two things, and which should I use when I develop the code?
Right now I use a buffer initialized from a std::array with a constant size for the serial code, and it works well. However, when I couple the DPC++ code with MPI, I have to declare a local length for each device, and I fail to do that. Here is my code:
#define nx 359
#define ny 359
constexpr int local_len[2];
global_len[0] = nx + 1;
global_len[1] = ny + 1;
for (int i = 1; i < process; i++)
{
if (process % i == 0)
{
px = i;
py = process / i;
config_e = 1. / (2. * (global_len[1] * (px - 1) / py + global_len[0] * (py - 1) / px));
}
if (config_e >= cmax)
{
cmax = config_e;
cart_num_proc[0] = px;
cart_num_proc[1] = py;
}
}
local_len[0] = global_len[0] / cart_num_proc[0];
local_len[1] = global_len[1] / cart_num_proc[1];
constexpr int lx = local_len[0];
constexpr int ly = local_len[1];
queue Q{};
double *m_cellValue = malloc_shared<double>(size, Q);
I got the error
error: default initialization of an object of const type 'const int[2]'
error: cannot assign to variable 'local_len' with const-qualified type 'const int[2]'
main.cpp:52:18: error: cannot assign to variable 'local_len' with const-qualified type 'const int[2]'
Is there any way to just define a variable size array to do the parallel for in DPC++?

You are too eager in using constexpr. Remove all three occurrences in this code, and it should compile. So this has nothing to do with DPC++.

Related

Function started with std::async crashes after quite a few iterations

I am trying to develop a simple evolution algorithm in C++. To make my calculations faster I decided to use async functions to run multiple calculations at once:
std::vector<std::future<int> > compute(8);
unsigned nptr = 0;
int syncp = 0;
while(nptr != network::networks.size()){
compute.at(syncp) = std::async(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
syncp++;
if(syncp == 8){
syncp = 0;
for(unsigned i = 0; i < 8; i++){
compute.at(i).get();
}
}
nptr++;
}
This is how I start my calculating function. The function is called analyse, and for each "network" it assigns a score depending on how well it identifies the image.
This is part of the analyse function:
for(unsigned i = 0; i < entry.size(); i++){
double sum = 0;
data * d = &entry.at(i);
pattern * p = &pattern::patterns.at(d->patNo);
int sx = iWidth;
int sy = iHeight;
if(d->xPercentage*iWidth + d->xSpan*iWidth < sx) sx = d->xPercentage*iWidth + d->xSpan*iWidth;
if(d->yPercentage*iHeight + d->xSpan*iWidth < sy) sy = d->yPercentage*iHeight + d->xSpan*iWidth;
int xdisp = sx-d->xPercentage*iWidth;
int ydisp = sy-d->yPercentage*iHeight;
for(int x = d->xPercentage*iWidth; x < sx; x++){
for(int y = d->yPercentage*iHeight; y < sy; y++){
double xpl = x-d->xPercentage*iWidth;
double ypl = y-d->yPercentage*iHeight;
xpl /= xdisp;
ypl /= ydisp;
unsigned idx = (unsigned)(xpl*(p->width) + ypl*(p->height)*(p->width));
if(idx >= p->lweight.size()) idx = p->lweight.size()-1;
double weight = p->lweight.at(idx) - 5;
if(imageData[y*iWidth+x])
sum += weight;
else
sum -= 2*weight;
}
}
digitWeight[d->digit-1] += sum;
}
}
Now, there is no need to analyse the function itself - I'm sure it works, I have tested it on a single thread, and it runs just fine. The only problem is, after some time of execution, I get errors like segmentation fault, or vector range check error.
They mostly happen at this line:
digitWeight[d->digit-1] += sum;
Now, you can be sure that d->digit-1 is a valid range for this array.
The problem is that the value of the d pointer is different than it was here:
data * d = &entry.at(i);
It magically changes during the execution of the function, and starts pointing to different data, leading to errors. I have tried saving the value of d->digit to some variable and later use this variable, and it worked fine for just a while longer, before crashing on another shared resource, imageData this time.
I'm thinking this might be related to data sharing: all async functions share the same array of data, which is a static vector. But this data is only read, not written anywhere, so why would it stop working? I know of something called mutex locking, but it would make no sense to lock these async functions, as the program would then run just as slowly as a single-threaded one.
I have also tried running the functions like this:
std::vector<std::thread*> threads(8);
unsigned nptr = 0;
int threadp = 0;
while(nptr != network::networks.size()){
threads.at(threadp) = new std::thread(&network::analyse, &network::networks.at(nptr), data, width, height, sw, dFnum.at(idx));
threadp++;
if(threadp == 8){
threadp = 0;
for(unsigned i = 0; i < 8; i++){
if(threads.at(i)->joinable()) threads.at(i)->join();
delete threads.at(i);
}
}
nptr++;
}
and it did work for a second, but after some time a very similar error appeared.
Data is a structure containing 7 integers, one of which is an ID of pattern, and pattern is a class that contains two integers (width and height) and a vector of chars.
Why does it happen on read-only data and how can I prevent it?
Here is an example of what happens:

Vector.push_back Issue

The following question concerns the use of a vector and memcpy. The vector functions used are .push_back(), .data(), and .size().
Information about the msg.
a)
#define BUFFERSIZE 8<<20
char* msg = new char[(BUFFERSIZE / 4)];
Question: Why doesn't code block b) work?
When I run the code below, using vector.push_back in a for loop, it causes the software I'm working with to stop working. I'm not sending the "msg" nor am I reading it, I'm just creating it.
b)
mVertex vertex;
vector<mVertex>mVertices;
for (int i = 0; i < 35; i++)
{
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
}
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size() * sizeof(mVertex))); // size
[Screenshot of the error message from the software]
By adding +1 to mVertices.size() at the very last row, the software works fine. See the example code below.
c)
mVertex vertex;
vector<mVertex>mVertices;
for (int i = 0; i < 35; i++)
{
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
}
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size()+1 * sizeof(mVertex))); // size
The code also works if I remove the for loop.
d)
mVertex vertex;
vector<mVertex>mVertices;
vertex.posX = 2.0;
vertex.posY = 2.0;
vertex.posZ = 2.0;
vertex.norX = 2.0;
vertex.norY = 2.0;
vertex.norZ = 2.0;
vertex.U = 0.5;
vertex.V = 0.5;
mVertices.push_back(vertex);
memcpy(msg, // destination
mVertices.data(), // content
(mVertices.size() * sizeof(mVertex))); // size
The problem is a basic macro issue: macros define text replacements, not logical or arithmetic expressions.
#define BUFFERSIZE 8<<20
creates a text macro. When you use it in this expression (I've removed the redundant parentheses):
char* msg = new char[BUFFERSIZE / 4];
the preprocessor replaces BUFFERSIZE with 8 << 20, so it's as if you had written
char* msg = new char[8 << 20 / 4];
and the problem is that 8 << 20 / 4 is 256. That's because the expression is evaluated as 8 << (20/4), where presumably you intended it to be (8 << 20) / 4. To fix that (and you should always do this with macros), put parentheses around the expression in the macro itself:
#define BUFFERSIZE (8<<20)
Incidentally, that's why using a named variable (whether constexpr or otherwise) makes the problem go away: the variable gets the value 8 << 20, not the text, so all is good.
The define doesn't do what you think.
#define BUFFERSIZE 8<<20
yields BUFFERSIZE / 4 == 8 << 20 / 4 == 8 << 5 == 256. So the memory allocated for msg is too small to hold mVertices, and
memcpy(msg, mVertices.data(), (mVertices.size() * sizeof(mVertex)));
writes past the end of the allocation. This can produce runtime errors.
You should use constexpr instead of define to avoid such problems.

Assignment of local variables causes Audio to stop processing in JUCE

EDIT: This turned out to be an uninitialized variable creating chaotic behavior. See this post about getting more compiler warnings for JUCE
I was attempting to create a basic synthesizer and I quickly ran into an absurd problem when simply attempting to assign a value to a newly declared variable.
After following along with the JUCE simple sine synthesis tutorial I ran into the problem. This is the basic code of my getNextAudioBlock() function when it is producing white noise. Note how there are four integers declared and assigned throughout:
const int numChannels = bufferToFill.buffer->getNumChannels();
const int numSamples = bufferToFill.numSamples;
for (int channel = 0; channel < numChannels; channel++){
float* const buffer = bufferToFill.buffer -> getWritePointer(channel, bufferToFill.startSample);
for (int sample; sample < numSamples; sample++){
buffer[sample] = (randomGen.nextFloat() * 2.0f - 1.0f);
}
}
However, as soon as I attempt to add another int I no longer get sound. Just simply adding the line int unusedVariable = 0; anywhere in the getNextAudioBlock() function but before the buffer[sample] assignment immediately returns from the function and it therefore produces no audio.
If I simply declare the new variable (int unusedVariable;) then it still works. It is only specifically the assignment part that causes the error. Also, if I declare the variable as a global member then the assignment within the function works just fine.
To reiterate, this works:
buffer[sample] = (randomGen.nextFloat() * 2.0f - 1.0f);
This works:
int unusedVariable;
buffer[sample] = (randomGen.nextFloat() * 2.0f - 1.0f);
But this doesn't:
int unusedVariable = 0;
buffer[sample] = (randomGen.nextFloat() * 2.0f - 1.0f);
My only idea was that allocating new memory on the Audio thread causes the error but I have seen declaration and assignment done in other online sources and even in my exact same function with numChannels, numSamples, channel, and sample all allocated and assigned just fine. I also considered that it has something to do with using the Random class, but I get the same problem even when it is generating sine waves.
EDIT: Here is the exact code copied from the project. Right here nextSample is declared globally, as the buffer does not get filled when it is declared locally
void MainContentComponent::getNextAudioBlock (const AudioSourceChannelInfo& bufferToFill)
{
const int numChannels = bufferToFill.buffer->getNumChannels();
const int numSamples = bufferToFill.numSamples;
for (int channel = 0; channel < numChannels; channel++){
float* const buffer = bufferToFill.buffer -> getWritePointer (channel, bufferToFill.startSample);
for (int sample; sample < numSamples; sample++){
// nextSample = (randomGen.nextFloat() * 2.0f - 1.0f); // For Randomly generated White Noise
nextSample = (float) std::sin (currentAngle);
currentAngle += angleDelta;
buffer[sample] = nextSample * volumeLevel;
}
}
}
I created a new AudioApplication project in the Projucer and pasted this block of code into the getNextAudioBlock() method (adding sensible member variables as you're referencing them here).
The compiler pointed at the problem right away -- the loop variable sample below isn't initialized (and C++ won't default init it for you), so if the memory used by that variable happened to have contained a value that's less than the buffer size, you'll generate some audio; if not, the buffer passed into this function is unaffected because the loop never runs.
for (int sample; sample < numSamples; sample++){
nextSample = (randomGen.nextFloat() * 2.0f - 1.0f); // For Randomly generated White Noise
//nextSample = (float) std::sin (currentAngle);
//currentAngle += angleDelta;
buffer[sample] = nextSample * volumeLevel;
}
See if changing that to for (int sample = 0; ...) fixes things for you.

Function to initialise a dynamic array inside a class

as an exercise, i'm translating my master's thesis finite-difference time-domain code for simulation of wave propagation from matlab to c++ and i've come across the following problem.
i would like to create a class that corresponds to a non-physical absorbing layer called cpml. the size of the layer depends on the desired parameters of the simulation, so the arrays that define the absorbing layer have to be dynamic.
#ifndef fdtd_h
#define fdtd_h
#include <cmath>
#include <iostream>
#include <sstream>
using namespace std;
class cpml {
public:
int thickness;
int n_1, n_2, n_3;
double cut_off_freq;
double kappa_x_max, sigma_x_1_max, sigma_x_2_max, alpha_x_max;
double *kappa_x_tau_xy, *sigma_x_tau_xy, *alpha_x_tau_xy;
void set_cpml_parameters_tau_xy();
};
void cpml::set_cpml_parameters_tau_xy(){
double temp1[thickness], temp2[thickness], temp3[thickness];
for(int j = 1; j < thickness; j++){
temp1[j] = 1 + kappa_x_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1);
temp2[j] = sigma_x_1_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1 + n_2);
temp3[j] = alpha_x_max * pow((double)(j - 0.5) / (double)(thickness - 1), n_3);
}
kappa_x_tau_xy = temp1;
sigma_x_tau_xy = temp2;
for(int i = 1; i < thickness; i++){
cout << sigma_x_tau_xy[i] << endl;
}
alpha_x_tau_xy = temp3;
}
#endif /* fdtd_h */
when i call the function cpml::set_cpml_parameters_tau_xy() in my main function, the first value of the array sigma_x_tau_xy is correct. however, the further values aren't.
#include "fdtd.h"
using namespace std;
int main() {
cpml cpml;
int cpml_thickness = 10;
cpml.thickness = cpml_thickness;
int n_1 = 3, n_2 = 0, n_3 = 3;
cpml.n_1 = n_1; cpml.n_2 = n_2; cpml.n_3 = n_3;
double cut_off_freq = 1;
cpml.cut_off_freq = cut_off_freq;
double kappa_x_max = 0;
double sigma_x_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x), sigma_x_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x);
double alpha_x_max = 2 * PI * cpml.cut_off_freq;
double kappa_y_max = 0;
double sigma_y_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y), sigma_y_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y);
double alpha_y_max = 2 * PI * cpml.cut_off_freq;
cpml.kappa_x_max = kappa_x_max; cpml.sigma_x_1_max = sigma_x_1_max; cpml.sigma_x_2_max = sigma_x_2_max; cpml.alpha_x_max = alpha_x_max;
cpml.kappa_y_max = kappa_y_max; cpml.sigma_y_1_max = sigma_y_1_max; cpml.sigma_y_2_max = sigma_y_2_max; cpml.alpha_y_max = alpha_y_max;
cpml.set_cpml_parameters_tau_xy();
for(int j = 1; j < cpml.thickness; j++){
cout << *(cpml.sigma_x_tau_xy + j) << endl;
}
}
what am i doing wrong and how do i make the dynamic array members of the class cpml contain the correct values when called in the main function?
Two problems: The lesser of them is that your program is technically not a valid C++ program, since C++ doesn't have variable-length arrays (which your arrays temp1, temp2 and temp3 are).
The more serious problem is that you save pointers to local variables. When a function returns, local variables go out of scope and no longer exist. Pointers to them will become invalid, and using those pointers will lead to undefined behavior.
Both problems are easily solved by using std::vector instead of arrays and pointers.
You cannot declare an array in C++ without a "constant" expression for its size (the bounds must be known at compile time). That means this code is invalid:
double temp1[thickness], temp2[thickness], temp3[thickness];
What you should instead do is the following:
class cpml
{
//...
std::vector<double> kappa_x_tau_xy, sigma_x_tau_xy, alpha_x_tau_xy;
// ...
};
void cpml::set_cpml_parameters_tau_xy(){
alpha_x_tau_xy.resize(thickness);
kappa_x_tau_xy.resize(thickness);
sigma_x_tau_xy.resize(thickness);
//...
}
std::vector will handle all the dynamic allocation under the hood for you. If your code compiled, it was because you were using a nonstandard GCC extension for variable-length arrays. Turn your warnings up (-Wall -pedantic -Werror) when you compile and it should complain more.
Note that you also have issues in array bounds. Whereas Matlab is 1-indexed, C++ is 0-indexed, so you'll need to do this, too:
for(int j = 0; j < thickness; j++){
kappa_x_tau_xy[j] = 1 + kappa_x_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1);
sigma_x_tau_xy[j] = sigma_x_1_max * pow((double)(thickness - j - 0.5) / (double)(thickness - 1), n_1 + n_2);
alpha_x_tau_xy[j] = alpha_x_max * pow((double)(j - 0.5) / (double)(thickness - 1), n_3);
}
You have a similar issue in main:
for(int j = 1; j < cpml.thickness; j++){
cout << *(cpml.sigma_x_tau_xy + j) << endl;
}
Should become:
for(int j = 0; j < cpml.thickness; j++){
cout << cpml.sigma_x_tau_xy[j] << endl;
}
Additional Notes:
Your code is very unstructured. Consider putting all of the cpml-related getting and setting into the cpml class (encapsulation: https://en.wikipedia.org/wiki/Encapsulation_(computer_programming)). This will make it easier for the client (you, in this case) to interact with the object.
This will include hiding your class data as protected or private and exposing functions to get and set those variables (don't forget const where appropriate).
Add a constructor to initialize all of the fields at once. As it stands now, your class consists of mostly uninitialized garbage for much of its lifetime. If someone were to prematurely try to access a field, you're in Undefined Behavior territory.
std::endl is good for printing newline characters, but restrict that to Debug-only code. The reason is that it flushes the buffer every time it's called, which can make your code slower overall if it's printing a lot. Use a newline character "\n" instead for Release.
An additional benefit of std::vector is that it makes copying and assigning to a cpml well behaved. Otherwise, the compiler will generate a copy constructor and copy assignment, which when used will be a shallow copy instead of the deep copy that you'd want.
After restructuring your class, your main might look something like this:
int main() {
int cpml_thickness = 10;
int n_1 = 3, n_2 = 0, n_3 = 3;
double cut_off_freq = 1;
double kappa_x_max = 0;
double sigma_x_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x), sigma_x_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_x);
double alpha_x_max = 2 * PI * cut_off_freq;
double kappa_y_max = 0;
double sigma_y_1_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y), sigma_y_2_max = 0.8 * (n_1 + 1) / (sqrt(simulation_medium.mu/simulation_medium.rho) * simulation_grid.big_delta_y);
double alpha_y_max = 2 * PI * cut_off_freq;
cpml cpml(cpml_thickness, n_1, n_2, n_3, cut_off_freq, kappa_x_max, kappa_y_max, sigma_x_1_max, sigma_x_2_max, alpha_x_max, alpha_y_max);
cpml.set_cpml_parameters_tau_xy();
cpml.PrintSigmaTauXY(std::cout);
}
Which is arguably better. (You might use a getter to get sigma_tau_xy from the class and then print it yourself, though.) And then you can think about how to simplify things even further by creating objects that represent the logical groupings of alpha_x_max and alpha_y_max, etc. This could be a std::pair or a full-on struct with its own getters and setters. Now their own logic is grouped together and is easy to pass around, reference, and think about. Your constructor for cpml also becomes simpler, since you accept a single parameter that represents both x and y instead of separate ones for each.
Matlab doesn't really encourage an Object-Oriented approach in my (admittedly brief) experience, but in C++ it's easy.

thrust::sort_by_key system_error at memory location

I am writing a program using cuda.
The problem is the following:
I have two arrays in a *.cu file:
particle* particles;
int* grid_ind;
Memory on the GPU is allocated for them:
void mallocCUDA(int particlesNumber) {
cudaMalloc((void**)&particles, particlesNumber * sizeof(particle));
cudaMalloc((void**)&grid_ind, particlesNumber * sizeof(int));
}
Both arrays are filled (confirmed): particles in its own init method, and grid_ind in this kernel:
__global__ void findParticleCell(particle* particles, int particlesNumber, int* grid_ind) {
int index = blockDim.x*blockIdx.x + threadIdx.x;
if (index < particlesNumber) {
int x, y, z;
x = (int)(2.0f * (particles[index].predicted_p.x + 2));
y = (int)(2.0f * (particles[index].predicted_p.y + 2));
z = (int)(2.0f * (particles[index].predicted_p.z + 2));
int grid_index = (BOX_X + 2) * 2 * (BOX_Y + 2) * 2 * z + y * 2 * (BOX_X + 2) + x;
grid_ind[index] = grid_index;
}
}
It is called in the following method:
void findNeighbors(int particlesNumber) {
dim3 blocks = dim3((particlesNumber + threadsPerBlock - 1) / threadsPerBlock); // threadsPerBlock = 128 if that matters at all
dim3 threads = dim3(threadsPerBlock);
findParticleCell<<<blocks, threads>>>(particles, particlesNumber, grid_ind);
thrust::device_ptr<int> t_grid_ind = thrust::device_pointer_cast(grid_ind);
thrust::device_ptr<particle> t_particles = thrust::device_pointer_cast(particles);
thrust::sort_by_key(t_grid_ind, t_grid_ind + particlesNumber, t_particles);
}
The problem is that the sort method is causing
Microsoft C++ exception: thrust::system::system_error at memory location
for some reason. I have tried to resolve this for a couple of days now without any luck. Why does that exception occur?
So I tried this code on another PC and it worked without any problem.
Other people suggested that the problem on my PC might be the CUDA version, the video driver, or something similar.
Anyway, that's crazy...
I am sorry that I won't reinstall those things to check, since I found out that a custom sort method does not take much longer than thrust::sort.