How to implement user-defined reduction with OpenACC? - c++

Is there a way to implement a user-defined reduction with OpenACC similar to declare reduction in OpenMP?
So that I could write something like
#pragma acc loop reduction(my_function:my_result)
Or, failing that, what would be the appropriate way to implement an efficient reduction without the predefined operators?

User-defined reductions aren't yet part of the OpenACC standard. While I'm not on the OpenACC technical committee, I believe they have received requests for this feature, though I'm not sure whether it is being considered for the 3.0 standard.
Since the OpenACC standard is largely user driven, I'd suggest you send a note to the OpenACC folks requesting this support. The more folks that request it, the more likely it is to be adopted in the standard.
Contact info for OpenACC can be found at the bottom of https://www.openacc.org/about

Related

Is combining std::execution and OpenMP advisable?

I have used OpenMP for some time now. Recently, on a new project, I chose C++17 for some of its features.
Because of that, I have become interested in std::execution, which allows algorithms to be parallelized. It seems really powerful and elegant, but there are a lot of very useful OpenMP features that are not easy to use with the standard algorithms (barriers, SIMD, critical sections, etc.).
So I am thinking of mixing std::execution::par (or par_unseq) with OpenMP. Is that a good idea, or should I stick with OpenMP alone?
Unfortunately this is not officially supported. It may or may not work, depending on the implementation, but it is not portable.
Only the most recent version, OpenMP 5.0, even defines the interaction with C++11. In general, using anything from C++11 onward "may result in unspecified behavior". Quoting the specification: "While future versions of the OpenMP specification are expected to address the following features, currently their use may result in unspecified behavior."
Alignment support
Standard layout types
Allowing move constructs to throw
Defining move special member functions
Concurrency
Data-dependency ordering: atomics and memory model
Additions to the standard library
Thread-local storage
Dynamic initialization and destruction with concurrency
C++11 library
While C++17 and its specific high-level parallelism support are not mentioned, it is clear from this list that they are unsupported.

Portable random number generation with OpenACC

Is there a portable way to generate random numbers with OpenACC? I know that it is possible to directly use cuRand but then I am restricted to Nvidia GPUs.
Another option seems to be generating numbers on the host and then moving them to the device, but that does not seem like the best solution performance-wise.
Is there a better way?
Using a random number generator in parallel is an inherently tricky operation. RNGs carry a state variable which needs to be private to each parallel thread. Using the system's "rand" is problematic in that it uses a global state variable, which gives undefined behavior (a race condition) when used in a parallel context. Reentrant RNGs such as "rand_r" or Boost's PRNGs should be used instead.
A second issue is that RNGs are not always portable, since different platforms may implement "rand" in different ways. As you determined, there isn't a "rand" call available on NVIDIA devices, so instead you need to use cuRAND calls.
OpenACC is designed to help exploit parallelism in your code in a platform-agnostic way. Hence, platform-specific operations such as a parallel RNG are difficult to define within the standard itself. Perhaps something can be done, especially for something as useful as a PRNG, and I would suggest you contact the OpenACC standards committee to request this support (feedback_at_openacc_dot_org).
The only truly portable way to do this right now is to write your own parallel RNG and include it as part of your code. Offhand, I don't have an OpenACC example written up, and I'm a bit swamped, so I don't know if I'll have time, but I will do my best to pull one together.
I did write a CUDA C version of a Mersenne Twister algorithm as part of an article I wrote about 8 years ago that might be helpful (though it predates OpenACC). The article uses the PGI Accelerator Model at the beginning and CUDA Fortran in the middle, so don't worry too much about that content, just the MT source.
https://www.pgroup.com/blogs/posts/tune-gpu-monte-carlo.htm
Source package: https://www.pgroup.com/lit/samples/pginsider/pgi_mc_example.tar.gz

ReentrantLock in OpenMP

I recently heard about ReentrantLock, which is available in Java. I am trying to implement parallel data structures, such as priority queues, using OpenMP and C++.
I am curious whether a similar equivalent exists in OpenMP and C++, or whether one can be implemented using pthreads. If such an equivalent exists, please explain how to use it.
See the description of omp_nest_lock on page 270 (PDF page 279) in the OpenMP 4.5 standard.
A meta-question is "Why are you doing this?"
Why aren't you simply using something like TBB's Concurrent Priority Queue?
Do you need to be using OpenMP for other reasons?
Is this is for your own education?
If not, then TBB might be a simpler approach (it is now Apache Licensed).
(FWIW I work for Intel, who wrote TBB, but I work on OpenMP, not TBB :-))

Can C++ attributes be used to replace OpenMP pragmas?

C++ attributes provide a convenient and standardized way to markup code with extra information to give to the compiler and/or other tools.
Using OpenMP involves adding a lot of #pragma omp... lines into the source (such as to mark a loop for parallel processing). These #pragma lines seem to be excellent candidates for a facility such as generalized attributes.
For example, #pragma omp parallel for might become [[omp::parallel(for)]].
The often inaccurate cppreference.com uses such an attribute as an example here, which confirms it has at least been considered (by someone).
Is there a mapping of OpenMP pragmas to C++ attributes currently available and supported by any/all of the major compilers? If not, are there any plans underway to create one?
This is definitely a possibility and it's even something the OpenMP language committee is looking at. Take a look at OpenMP Technical Report 8 (https://www.openmp.org/wp-content/uploads/openmp-TR8.pdf) page 36, where a syntax for using OpenMP via attributes is proposed. Inclusion in TR8 doesn't guarantee its inclusion in version 5.1, but it shows that it's being discussed. This syntax is largely based on the work done in the original proposal for C++ attributes.
If you have specific feedback on this, I'd encourage you to provide feedback on this via the OpenMP forum (http://forum.openmp.org/forum/viewforum.php?f=26).

C++1z Coroutines a language feature?

Why will coroutines (as of now in the newest drafts for C++1z) be implemented as a core language feature (fancy keywords and all) as opposed to a library extension?
There already exist a couple of implementations of them (Boost.Coroutine, etc.), some of which can be made platform-independent, from what I have read. Why has the committee decided to bake it into the core language itself?
I'm not saying they shouldn't but Bjarne Stroustrup himself mentioned in some talk (don't know which one any more) that new features should be implemented in libraries as far as possible instead of touching the core language.
So is there a good reason to do so? What are the benefits?
While there are library implementations of coroutines, these tend to have specific restrictions. For example, a library implementation cannot detect which variables need to be maintained when a coroutine is suspended. It is possible to work around this need, e.g., by making the used variables explicit in some form. However, if coroutines are to behave like normal functions as much as possible, it should be possible to define local variables.
I don't think any of the implementers of Boost coroutines considers their respective library interface ideal. While it is the best that can be achieved in the current language, the overall usability can be improved.
At CppCon 2015, Gor Nishanov from Microsoft made the argument that C++ Coroutines can be a negative overhead abstraction. The paper from his talk is here.
If you take a look at his example, the ability to use a coroutine simplified the control flow of the network code, and when implemented at the compiler level gives you smaller code that has twice the throughput of the original. He makes the argument that really the ability to yield should be a feature of a C++ function.
They have an initial implementation in Visual Studio 2015, so you can give it a try for your use case and see how it compares to the Boost implementation. It looks like they are still trying to hash out whether they will use the async/yield keywords, though, so keep an eye on where the standard goes.
The resumable functions proposal for C++ can be found here and the update here. Unfortunately, it didn't make it into C++17, but it is now a Technical Specification (see P0057R2). On the upside, there is support in Clang behind the -fcoroutines-ts flag and in Visual Studio 2015 Update 2. The keywords also have co_ prepended to them: co_await, co_yield, etc.
Coroutines are a built-in feature in Go, D, Python, and C#, and will be in the new JavaScript standard (ECMAScript 6). If C++ comes up with a more efficient implementation, I wonder whether it would displace some Go adoption.
Resumable functions from C++1z support stackless context switching, while Boost.Coroutine(2) provides stackful context switching.
The difference is that with stackful context switching, the stack frames of functions called within the coroutine remain intact when the context is suspended, while the stack frames of subroutines are removed when a resumable function (C++1z) is suspended.