Why would clEnqueueNDRangeKernel crash with a floating-point exception? - c++

I'm trying to launch a certain kernel using clEnqueueNDRangeKernel(), from within a C++ program. But instead of the kernel being enqueued, or an error code being returned, my process receives a floating-point exception signal (SIGFPE).
For IP reasons which I can't go into, it's difficult for me to provide an example triggering this signal. But - there doesn't seem to be any legitimate reason for this to occur. Are there known cases of that function itself actually performing an invalid floating-point operation?

tl;dr: It's a divide-by-zero due to a problem with your work dimensions.
With NVIDIA CUDA's OpenCL library (at least with v11.2.152), passing workgroup dimensions and overall grid dimensions of different dimensionality may indeed trigger such a situation. OpenCL users have reported similar behavior in the past.
NVIDIA has deplorably chosen to provide its libraries as closed-source only, so I can only speculate about the reason, but it seems to be the following: when you construct a cl::NDRange with two dimensions, the unused third value in its internal representation is initialized to 0 (always, or at least in some cases). Now, if you read the documentation for clEnqueueNDRangeKernel() carefully, you'll notice that the global dimensions and the local dimensions must have the same dimensionality, i.e. the same number of dimensions. clEnqueueNDRangeKernel() probably assumes this is the case and calculates the number of grid blocks (number of workgroups) in each dimension as global_dims_it_got[i] / local_dims_it_got[i], for every dimension of the global range it received. So when the global dimensionality is higher than the local one, the extra local entries are 0 and it ends up dividing by 0. That triggers SIGFPE, which, despite its name, is raised not only for invalid floating-point operations but for any arithmetic error, including integer division by zero.
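To illustrate (a minimal sketch, not the code from the question; the header name and the queue/kernel objects are placeholders): this is how such a mismatch can reach the driver, and a cheap guard that catches it before clEnqueueNDRangeKernel() is ever invoked.
#include <CL/opencl.hpp>   // or <CL/cl2.hpp> / <CL/cl.hpp>, depending on your setup
#include <stdexcept>

void launch_2d(cl::CommandQueue& queue, cl::Kernel& kernel)
{
    cl::NDRange global(1024, 1024);   // 2 dimensions; the unused third slot is zero internally
    cl::NDRange local(16, 16, 1);     // 3 dimensions - mismatched on purpose

    // Guard: both ranges must have the same dimensionality (local may also be NullRange).
    if (local.dimensions() != 0 && local.dimensions() != global.dimensions())
        throw std::invalid_argument("global/local NDRange dimensionality mismatch");

    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);
}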

Related

FORTRAN 77 Divide By Zero Behavior

I have been working on re-engineering an old FORTRAN 77 program in Python for a while now. I'm running into an issue, though: when dividing by zero, the FORTRAN program appears to just continue processing the data without complaint, whereas, predictably, it causes an exception in Python. I'm not able to find a discussion about this in any official F77 documentation, and I only have an old version of the source code for the program I am translating, which I can't get to compile.
TL;DR: How does F77 handle division by zero in the following cases?
REAL division
INT division
The numerator is also zero (e.g. 0./0.)
Yes, I too have code that does nothing when a divide-by-zero is encountered. Usually it is the programmer's responsibility to ensure that the results are either expected (the target variable's value is unchanged) or that an error is thrown, etc. In other words, you need to inspect any division operation for a possible zero divisor. Modern operating systems raise an exception on divide by zero (and assign NaN to the target variable where execution continues rather than halting); most older Fortran code is written such that divide by zero doesn't matter.
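For comparison, here is a small C++ sketch of what IEEE 754 arithmetic does with the three cases listed above; many F77 runtimes on IEEE hardware behave the same way for REAL division, which is why the old program appears to just keep going, while INTEGER division by zero is a different story.
#include <iostream>

int main()
{
    double x = 1.0, zero = 0.0;

    std::cout << x / zero    << '\n';   // REAL division by zero:  inf
    std::cout << -x / zero   << '\n';   // signed:                 -inf
    std::cout << zero / zero << '\n';   // 0./0.:                  nan

    // INTEGER division by zero is different: it is undefined behaviour in C++
    // and on x86 typically raises SIGFPE - it does NOT quietly continue.
    // int i = 1, j = 0;
    // std::cout << i / j << '\n';      // don't do this
    return 0;
}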

Can I make OpenMP to revert to ideal # of threads after using omp_set_num_threads?

Is there a way to make OpenMP revert the number of threads (for the next time it's used) back to the default after the application has already called omp_set_num_threads() with a specific number?
For example, is there a special code (e.g. 0 or -1) I supply to omp_set_num_threads?
Or should I just try doing something like omp_set_num_threads(omp_get_max_threads())?
I am assuming that the default number is whatever the OpenMP implementation deems "optimal". But I don't know what, if anything, the default is guaranteed to be, or even what it should be. All I know is that I have an application that calls omp_set_num_threads(4) for one specific OpenMP block, which I must not edit (for now), but I'd like to prevent that one setting from affecting the other OpenMP blocks in my code.
I've had this problem before. (Disclaimer: I work with MSVC, which currently only implements the OpenMP 2.0 standard). To the best of my knowledge, there is nothing in the OpenMP 2.0 standard that allows you to find out this default value. omp_get_max_threads() is not required to return it (all subsequent emphasis mine):
The omp_get_max_threads function returns an integer that is guaranteed to be at least as large as the number of threads that would be used to form a team if a parallel region without a num_threads clause were to be encountered at that point in the code.
In other words, it might return a number that is larger than the currently set (or default) value.
There is no special value for omp_set_num_threads either:
The omp_set_num_threads function sets the default number of threads to use for subsequent parallel regions that do not specify a num_threads clause. [...] The value of the parameter num_threads must be a positive integer.
And if you get it wrong, it's up to the implementation what will happen:
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads requested for the parallel region exceeds the number that the run-time system can supply, the behavior of the program is implementation-defined. An implementation may, for example, interrupt the execution of the program, or it may serialize the parallel region.
You might find more precise (and less unsettling) information in the documentation of your OpenMP implementation. However, in the case of MSVC, that documentation is just a verbatim copy of the OpenMP 2.0 standard...
Since you are in the business of modifying the number of threads this way, I would like to preemptively caution about the interaction of omp_set_dynamic with omp_get_num_threads within MSVC:
Why does omp_set_dynamic(1) never adjust the number of threads (in Visual C++)?
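That said, since OpenMP 2.0 gives you no way to query the true default, a pragmatic (best-effort, not guaranteed-portable) workaround is to snapshot omp_get_max_threads() before the block you cannot edit and restore it afterwards. A minimal sketch, with the legacy block reduced to a placeholder function:
#include <omp.h>

// Placeholder for the block you are not allowed to edit; assume it calls
// omp_set_num_threads(4) internally.
void legacy_block();

void run_everything()
{
    int saved = omp_get_max_threads();   // best-effort snapshot of the current setting

    legacy_block();                      // leaves the thread count at 4

    // Restore the snapshot so later parallel regions are not stuck at 4 threads.
    // Per the 2.0 wording quoted above, omp_get_max_threads() is only guaranteed
    // to be "at least as large", so this is pragmatic rather than strictly portable.
    omp_set_num_threads(saved);

    #pragma omp parallel
    {
        // subsequent regions now use the restored setting
    }
}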

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now, which should I prefer, and when: warpSize or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile-time constant and shouldn't be treated as one. It is an architecture-specific, runtime immediate constant (its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, but that changed at least 6 years ago, if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile-time constant (the compiler emits the PTX WARP_SZ variable in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant.
From the runtime API, as a device property (the warpSize field of cudaDeviceProp).
Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to some multiple of the warp size is dynamically allocated shared memory, and you can read the warp size from the host API at runtime to size it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instantiation at runtime (a sketch follows below). The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
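A minimal sketch of that template approach, in CUDA C++ (the kernel name and body are purely illustrative):
#include <cuda_runtime.h>

template <int WarpSize>
__global__ void scale_kernel(float* data, int n)
{
    __shared__ float tile[WarpSize];              // size is a genuine compile-time constant
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x < WarpSize) tile[threadIdx.x] = 0.0f;   // placeholder use of the tile
    if (i < n) data[i] *= 2.0f;
}

void launch_scale(float* data, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // ask the hardware for its warp size
    dim3 block(256), grid((n + 255) / 256);

    switch (prop.warpSize) {
    case 32: scale_kernel<32><<<grid, block>>>(data, n); break;
    // case 64: scale_kernel<64><<<grid, block>>>(data, n); break;   // hypothetical future hardware
    default: break;                               // unsupported warp size
    }
}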
Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language; on the contrary, it is still a quite low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signatures adjusted, or new versions produced, e.g. int __ballot - and while int does not need to be 32-bit, it is most commonly so!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
The computation of warpIdx and laneIdx from threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize, this suddenly becomes quite an expensive operation!
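For illustration, here is the shift/mask form that the literal divisor 32 allows (a hypothetical helper, not code from the question); with warpSize the compiler cannot do this folding, which is the expense mentioned above:
// With the literal 32 these are single shift/mask operations; with warpSize,
// which is not a compile-time constant in CUDA C, a real division and modulo
// would have to be emitted instead.
__device__ void decompose(unsigned tid, unsigned& warpIdx, unsigned& laneIdx)
{
    warpIdx = tid >> 5;   // tid / 32
    laneIdx = tid & 31;   // tid % 32
}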
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on warpSize, this forces you to use dynamically allocated shared memory (as per talonmies's answer). However, the syntax for that is inconvenient, especially when you have several arrays -- it forces you to do pointer arithmetic yourself and to manually compute the total memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ <= 600
// host pass, and all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.
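For illustration, once warp_size lives in such a header, device code can use it like any other compile-time constant (the kernel below is made up, just to show the shape):
__global__ void warp_touch(int* out)
{
    __shared__ int per_warp[warp_size];          // compile-time size, no dynamic shared memory needed
    int warpIdx = threadIdx.x / warp_size;       // folds to a shift
    int laneIdx = threadIdx.x % warp_size;       // folds to a mask
    if (laneIdx == 0) per_warp[warpIdx] = warpIdx;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = per_warp[warpIdx];
}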

SciPy Quad Integration: Accuracy Warning

I am currently trying to compute an integral with scipy.integrate.quad, and for certain values I get the following warning:
/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/scipy/integrate/quadrature.py:616: AccuracyWarning: divmax (20) exceeded. Latest difference = 2.005732e-02
AccuracyWarning)
Looking at the documentation, it isn't entirely clear to me why this warning is raised, what it means, and why it only occurs for certain values.
In scipy.integrate.quad there's a lower-level call to a difference function that is iterated over, up to divmax times; in your case the default is divmax=20. For some functions you can override this default -- for example, scipy.integrate.quadrature.romberg lets you set divmax (the default there is 10) as a keyword argument. The warning is raised when the tolerances aren't met for the difference function within divmax iterations. Setting divmax to a higher value will run longer and, hopefully, meet the tolerance requirements.

Signaling or catching 'nan' as they occur in computations in numerical code base in c++

We have numerical code written in C++. Rarely, but under certain specific inputs, some of the computations result in a NaN value.
Is there a standard or recommended method by which we can stop and alert the user when a certain numerical calculation produces a NaN (under debug mode)? Checking each individual result for NaN seems impractical given the huge sizes of the matrices and vectors.
How do standard numerical libraries handle this situation? Could you throw some light on this?
NaN propagates through numeric operations, so it is enough to check the final result for being a NaN. As for how to do it: if building for >= C++11, there is std::isnan, as Goz noticed. For < C++11, if you want to be bulletproof, I would personally do bit-checking (especially if aggressive optimizations may be involved). The bit pattern for a NaN is:
  ?     11.......1    xx.......x
(sign)  (exponent)    (fraction)
where ? may be anything, and at least one x must be 1.
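A minimal C++ sketch of that bit check for double, assuming IEEE 754 binary64 (the function name is mine, not from any library):
#include <cstdint>
#include <cstring>

bool is_nan_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);               // well-defined type pun
    const std::uint64_t exponent = (bits >> 52) & 0x7FFULL;
    const std::uint64_t fraction =  bits        & 0xFFFFFFFFFFFFFULL;
    return exponent == 0x7FF && fraction != 0;         // all-ones exponent, non-zero fraction
}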
For a platform-dependent solution, there seems to be yet another possibility. There is the function feenableexcept in glibc (probably combined with the signal function and the compiler option -fnon-call-exceptions), which turns on generation of SIGFPE signals when an invalid floating-point operation occurs. And there is the function _control87 (probably combined with the _set_se_translator function and the compiler option /EHa), which allows pretty much the same thing in VC.
Although this is a nonstandard extension originally from glibc, on many systems you can use the feenableexcept routine declared in <fenv.h> to request that the machine trap particular floating-point exceptions and deliver SIGFPE to your process. You can use fedisableexcept to mask trapping, and fegetexcept to query the set of exceptions that are unmasked. By default they are all masked.
On older BSD systems without these routines, you can use fpsetmask and fpgetmask from <ieeefp.h> instead, but the world seems to be converging on the glibc API.
Warning: glibc currently has a bug whereby (the C99 standard routine) fegetenv has the unintended side effect of masking all exception traps on x86, so you have to call fesetenv to restore them afterward. (Shows you how heavily anyone relies on this stuff...)
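A minimal glibc-only sketch of that approach (intended for debug builds; the handler is illustrative and not strictly async-signal-safe):
#define _GNU_SOURCE              // feenableexcept is a glibc extension
#include <fenv.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>

extern "C" void fpe_handler(int)
{
    std::fprintf(stderr, "invalid floating-point operation trapped\n");
    std::abort();                // returning would re-execute the faulting instruction
}

int main()
{
    std::signal(SIGFPE, fpe_handler);
    feenableexcept(FE_INVALID | FE_DIVBYZERO);   // unmask the traps

    volatile double zero = 0.0;
    double bad = zero / zero;                    // 0/0 raises FE_INVALID -> SIGFPE here
    std::printf("never reached: %f\n", bad);
    return 0;
}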
On many architectures, you can unmask the invalid exception, which will cause an interrupt when a NaN would ordinarily be generated by a computation such as 0*infinity. Running in the debugger, you will break on this interrupt and can examine the computation that led to that point. Outside of a debugger, you can install a trap handler to log information about the state of the computation that produced the invalid operation.
On x86, for example, you would clear the Invalid Operation Mask bit in FPCR (bit 0) and MXCSR (bit 7) to enable trapping for invalid operations from x87 and SSE operations, respectively.
Some individual platforms provide a means to write to these control registers from C, but there's no portable interface that works cross-platform.
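For example, on x86 with SSE you can flip the MXCSR mask bit mentioned above through the <xmmintrin.h> helpers; a minimal sketch (the x87 control word would need separate handling):
#include <xmmintrin.h>

void trap_invalid_sse()
{
    // Clearing a mask bit *unmasks* (i.e. enables) the corresponding trap.
    _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK() & ~_MM_MASK_INVALID);
}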
Testing f != f might give problems when using g++ with the -ffast-math optimization enabled: Checking if a double (or float) is NaN in C++
The only foolproof way is to check the bitpattern.
As to where to implement the checks, this really depends on the specifics of your calculation and on how frequent NaN errors are, i.e. on the performance penalty of continuing tainted calculations versus checking at certain stages.