I have a program that was originally executed sequentially, and now I'm trying to parallelize it via OpenMP offloading. The thing is that when I use the update clause, depending on the case, including the size of the array I want to move sometimes returns an incorrect result, while other times it works. For example, this pragma:
#pragma omp target update from(image[:bands])
Is not the same as:
#pragma omp target update from(image)
What I want to do is move the whole thing. Suppose the variable was originally declared in the host as follows:
double* image = (double*)malloc(bands*sizeof(double));
And that these update pragmas are being called inside a target data region where the variable image has been mapped like this:
#pragma omp target data map(to: image[:bands])
{
// the code
}
I want to move it to the host to do some work that cannot be done in the device. Note: The same thing may happen with the "to" update pragmas, not only the "from".
Well, I don't know why no one from OpenMP answered this question, as the answer was pretty simple (I say this because they no longer have a forum and this is supposed to be the best place to ask questions about OpenMP...). If you want to copy dynamically allocated data through pointers, you have to use the omp_target_memcpy() function.
I'm trying to use openMP to parallelize some sections of a relatively complex simulation model of a car I have been programming in C++.
The whole model is comprised of several nested classes. Each instance of the class "Vehicle" has four instances of a class "Suspension", and each of them has one instance of the class Tyre. There's quite a bit more to it but it shouldn't be relevant to the problem.
I'm trying to parallelize the update of the suspensions on every integration step with code that looks like the following. This code is part of another class containing other simulation data, including one or several cars.
for (int iCar = 0; iCar < this->numberOfCars; iCar++) {
    omp_set_num_threads(4);
    #pragma omp parallel for schedule(static, 1)
    for (int iSuspension = 0; iSuspension < 4; iSuspension++) {
        this->cars[iCar].suspensions[iSuspension].update();
    }
}
I've actually simplified it a bit and changed the variable names, hoping to make it more understandable (and not to mask the problem by doing so!).
The method "update" just computes some data for the corresponding suspension on each time step and saves it in several properties of its own instance of the Suspension class. All instances of the Suspension class are independent of each other, so every call to "update" accesses only data contained in the same instance of "Suspension".
The behaviour that I'm getting using the debugger can be described as follows:
The first time the loop is run (at the first time step of the simulation) it runs ok. Always. All four suspensions are updated correctly.
The second time the loop is run, or at the latest the third, at least one of the suspensions gets updated with corrupted data. Quite commonly, two of the suspensions end up with exactly the same (corrupted) data, which shouldn't be possible, as they are configured from the start with slightly different parameters.
If I run it with one thread instead of four (omp_set_num_threads(1)) it works flawlessly. Needless to say, the same applies when I run it without any OpenMP preprocessor directives.
I'm aware it may not be possible to figure out a solution without knowing how the rest of the program works, but I hope somebody can at least tell me whether there's any reason why you simply can't access properties and methods of a class within a parallel OpenMP loop the way I'm trying to do it.
I'm using Windows 10 and Visual Studio 2017 Community. I tried to compile the project with and without optimizations, with no difference.
Thanks a lot in advance!
Please excuse me if this question has been answered before; I cannot figure out the right keywords.
I want to run a lot of calls to Linux commands in parallel using OpenMP. I need to guarantee somehow that each worker waits until its command finishes, and the commands can take different amounts of time. To simplify the issue, I am trying to generate the names of the files the command will run on, but each file name is being generated more than once, even though the file names are supposed to be unique. How can I modify the following lines of code to achieve a unique call per file name (and therefore a unique call to the command) using OpenMP?
omp_set_num_threads(8);
#pragma omp parallel for private(command, dirname) shared(i_traj) schedule(dynamic)
for(i_traj=0; i_traj<G.size(); i_traj++)
{
    // command will contain the command line.
    snprintf(dirname1, sizeof(dirname1), "ID%i_Trajectory_%i", ID, G[i_traj].ID);
    dirname = string(dirname1);
    /* Initializing the trajectories */
    cout << "Going to: " << G[i_traj].folder_addr << endl;
}
This section of the code will be executed in a function, not in the main program. Is it possible to do the same using MPICH2?
UPDATE:
The problem has to do with my computer rather than with the code because the code works properly using another machine. Any suggestion?
UPGRADE:
Trying to follow the recommendations of Gilles, I updated the code as follows:
#include <iostream>
#include <string>
#include <unistd.h>   // chdir
#include <cstdio>     // snprintf, printf, puts
#include <cstdlib>    // system, exit
using namespace std;
#define LARGE_NUMBER 100
double item[LARGE_NUMBER];
// "nucleus" is a project type defined elsewhere
void process(int ID, nucleus &tr)
{
    char dirname1[40];
    string command;
    string script_folder;
    snprintf(dirname1, sizeof(dirname1), "ID%i_Trajectory_%i", ID, tr.ID);
    string dirname;
    dirname = string(dirname1);
    /* Initializing the trajectories */
    cout << "Running: " << dirname << endl;
    script_folder = "./" + dirname;
    chdir(script_folder.c_str());
    //command = "qsub " + dirname + "_PBS" + ".sh";
    command = "gamess-2013 " + dirname + ".inp 01 1 ";
    printf("Checking if processor is available...");
    if (system(NULL)) puts("Ok");
    else exit(EXIT_FAILURE);
    if (!tr.runned)
    {
        int fail = system(command.c_str());
        tr.runned = true;
    }
    chdir("../");
    return;
}
int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            int i;
            for (i = 0; i < LARGE_NUMBER; i++)
                #pragma omp task
                // i is firstprivate, item is shared
                process(i, trajectories[i]); // "trajectories" is a hypothetical
                                             // container holding the nucleus objects
        }
    }
    return 0;
}
But the problem of guaranteeing that each file is processed only once remains. How can I be sure that each task works on a unique file and waits until the command execution is finished?
Sorry, but I really don't understand either the question you ask or its context. This sentence especially puzzles me:
To simplify the issue, I am trying to generate the names of the files on which the command will run but each file name is been generated more than once, but the names of the file are unique.
Anyway, all that to say that my answer is likely to miss the point. However, I can still report that your code snippet has a major issue: you explicitly declare the index i_traj of the loop you are trying to parallelise as shared. This makes no sense, since if there is one variable you want to be private in an OpenMP parallel loop, it is the loop index. Moreover, the OpenMP standard explicitly forbids it in section 2.14.1.1 (emphasis is mine):
The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private.
[...]
Variables with predetermined data-sharing attributes may not be listed in data-sharing attribute clauses, except for the cases listed below. For these exceptions only, listing a predetermined variable in a data-sharing attribute clause is allowed and overrides the variable's predetermined data-sharing attributes.
A list of exceptions follows, and making the "loop iteration variable(s)" shared is not among them.
So again, my answer might completely miss the point, but you definitely have a problem here, which you'd better fix before trying to go any deeper.
I want to place a function void loadableSW(void) at a specific location: 0x3FF802. In another function, residentMain(), I will jump to this location using a pointer to function. How should I declare loadableSW to accomplish this? I have attached the skeleton of residentMain for clarity.
Update: The target hardware is a TMS320C620x DSP. Since this is an aerospace project, deterministic behaviour is a desirable design objective. Ideally, they would like to know what portion of memory contains what at any particular time. The solution, as I just got to know, is to define a section in memory in the linker file. The section shall start at 0x3FF802 (the location where the function is to be placed). Since the size of the loadableSW function is known, the size of the memory section can also be determined. The directive #pragma CODE_SECTION(function_name, "section_name") can then place that function in the specified section.
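That approach might look roughly like this (TI C6000 toolchain; the section name, length, and file names are assumptions, only the start address 0x3FF802 comes from the question):

```
/* C source: pin loadableSW into its own code section */
#pragma CODE_SECTION(loadableSW, ".loadable")
void loadableSW(void)
{
    /* ... */
}

/* linker command file (e.g. link.cmd) */
MEMORY
{
    LOADSW : origin = 0x3FF802, length = 0x400   /* length is an assumption */
}
SECTIONS
{
    .loadable : > LOADSW
}
```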
Since pragma directives are not permissible in test scripts, I am wondering if there is any other way to do this without using any linker directives.
Besides I am curious. Is there any placement syntax for functions in C++? I know there is one for objects, but functions?
void residentMain (void)
{
    void (*loadable_p) (void) = (void (*) (void)) 0x3FF802;
    int hardwareOK = 0;

    /* Code to check hardware integrity. hardwareOK = 1 if success */
    if (hardwareOK)
    {
        loadable_p (); /* Jump to Loadable Software */
    }
    else
    {
        dspHalt ();
    }
}
I'm not sure about your OS/toolchain/IDE, but the following answer should work:
How to specify a memory location at which function will get stored?
There is just one way I know of and it is shown in the first answer.
UPDATE
How to define sections in gcc:
variables:
http://mcuoneclipse.com/2012/11/01/defining-variables-at-absolute-addresses-with-gcc/
methods (section ("section-name")): http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Function-Attributes.html#Function%20Attributes
How to place a function at a particular address in C?
Since pragma directives are not permissible in test scripts, I am wondering if there is any other way to do this without using any linker directives.
If your target supports PC-relative addressing and you can ensure it is pure, then you can use a memcpy() to relocate the routine.
How to run code from RAM... has some hints on this. If you cannot generate PC-relative/relocatable code, then you absolutely cannot do this without the help of the linker. That is the definition of a linker/loader: to fix up addresses.
That takes you to a different concept: do not fully link your code. Instead, defer the address fixup until load time. Then you must write a loader to place the code at run time; but from your aerospace project comment, I think that complexity and the analysis burden would not be acceptable. You also need double the storage, etc.
I'm working on a number crunching app using the CUDA framework. I have some static data that should be accessible to all threads, so I've put it in constant memory like this:
__device__ __constant__ CaseParams deviceCaseParams;
I use the call cudaMemcpyToSymbol to transfer these params from the host to the device:
void copyMetaData(CaseParams* caseParams)
{
cudaMemcpyToSymbol("deviceCaseParams", caseParams, sizeof(CaseParams));
}
which works.
Anyway, it seems (by trial and error, and also from reading posts on the net) that for some sick reason, the declaration of deviceCaseParams and the copy operation on it (the call to cudaMemcpyToSymbol) must be in the same file. At the moment I have these two in a .cu file, but I really want to have the parameter struct in a .cuh file so that any implementation can see it if it wants to. That means that I would also have to have the copyMetaData function in a header file, but this messes up linking (symbol already defined), since both .cpp and .cu files include this header (and thus both the MS C++ compiler and nvcc compile it).
Does anyone have any advice on design here?
Update: See the comments
With an up-to-date CUDA (e.g. 3.2) you should be able to do the memcpy from within a different translation unit if you're looking up the symbol at runtime (i.e. by passing a string as the first arg to cudaMemcpyToSymbol as you are in your example).
Also, with Fermi-class devices you can just malloc the memory (cudaMalloc), copy to the device memory, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and if so will use the constant cache. See the CUDA Programming Guide for more info. Note: you would need to compile with -arch=sm_20.
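A rough sketch of that alternative (illustrative only; CaseParams comes from the question, the kernel name and field are hypothetical):

```
// Requires -arch=sm_20 (Fermi or later).
__global__ void crunch(const CaseParams* params, double* out)
{
    // Warp-uniform reads through a const pointer can be served
    // by the constant cache on Fermi-class hardware.
    out[threadIdx.x] = params->someField;   // someField is hypothetical
}

// Host side:
CaseParams* dParams;
cudaMalloc(&dParams, sizeof(CaseParams));
cudaMemcpy(dParams, hostParams, sizeof(CaseParams), cudaMemcpyHostToDevice);
crunch<<<1, 128>>>(dParams, dOut);
```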
If you're using pre-Fermi CUDA, you will have found out by now that this problem doesn't just apply to constant memory, it applies to anything you want on the CUDA side of things. The only two ways I have found around this are to either:
Write everything CUDA in a single file (.cu), or
If you need to break out code into separate files, restrict yourself to headers which your single .cu file then includes.
If you need to share code between CUDA and C/C++, or have some common code you share between projects, option 2 is the only choice. It seems very unnatural to start with, but it solves the problem. You still get to structure your code, just not in a typically C like way. The main overhead is that every time you do a build you compile everything. The plus side of this (which I think is possibly why it works this way) is that the CUDA compiler has access to all the source code in one hit which is good for optimisation.
I have a very difficult problem I'm trying to solve: Let's say I have an arbitrary instruction pointer. I need to find out if that instruction pointer resides in a specific function (let's call it "Foo").
One approach to this would be to try to find the start and ending bounds of the function and see if the IP resides in it. The starting bound is easy to find:
void *start = &Foo;
The problem is, I don't know how to get the ending address of the function (or how "long" the function is, in bytes of assembly).
Does anyone have any ideas how you would get the "length" of a function, or a completely different way of doing this?
Let's assume that there is no SEH or C++ exception handling in the function. Also note that I am on a win32 platform, and have full access to the win32 api.
This won't work. You're presuming functions are contiguous in memory and that one address maps to one function. The optimizer has a lot of leeway here and can move code from functions around the image.
If you have PDB files, you can use something like the dbghelp or DIA APIs to figure this out, for instance SymFromAddr. There may be some ambiguity here, as a single address can map to multiple functions.
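A hedged sketch of the SymFromAddr route (Win32 only; link against dbghelp.lib, error handling trimmed):

```
#include <windows.h>
#include <dbghelp.h>
#include <string.h>

/* Returns nonzero if the given instruction pointer resolves to "Foo". */
BOOL AddressIsInFoo(DWORD64 ip)
{
    HANDLE proc = GetCurrentProcess();
    SymInitialize(proc, NULL, TRUE);   /* load symbols for loaded modules */

    char buffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME];
    SYMBOL_INFO* sym = (SYMBOL_INFO*)buffer;
    sym->SizeOfStruct = sizeof(SYMBOL_INFO);
    sym->MaxNameLen = MAX_SYM_NAME;

    DWORD64 displacement = 0;
    if (!SymFromAddr(proc, ip, &displacement, sym))
        return FALSE;
    return strcmp(sym->Name, "Foo") == 0;
}
```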
I've seen code that tries to do this before with something like:
#pragma optimize("", off)
void Foo()
{
}
void FooEnd()
{
}
#pragma optimize("", on)
And then FooEnd-Foo was used to compute the length of function Foo. This approach is incredibly error prone and still makes a lot of assumptions about exactly how the code is generated.
Look at the *.map file which can optionally be generated by the linker when it links the program, or at the program's debug (*.pdb) file.
OK, I haven't done assembly in about 15 years. Back then, I didn't do very much. Also, it was 680x0 asm. BUT...
Don't you just need to put a label before and after the function, take their addresses, subtract them for the function length, and then just compare the IP? I've seen the former done. The latter seems obvious.
If you're doing this in C, look first for debugging support --- ChrisW is spot on with map files, but also see if your C compiler's standard library provides anything for this low-level stuff; most compilers provide tools for analysing the stack etc., for instance, even though they're not standard. Otherwise, try just using inline assembly, or wrapping the C function with an assembly file and an empty wrapper function with those labels.
The simplest solution is maintaining a state variable:
volatile int FOO_is_running = 0;

int Foo( int par ){
    FOO_is_running = 1;
    /* do the work */
    FOO_is_running = 0;
    return 0;
}
Here's how I do it, but it's using gcc/gdb.
$ gdb ImageWithSymbols
gdb> info line * 0xYourEIPhere
Edit: Formatting is giving me fits. Time for another beer.