Selective nvidia #pragma optionNV(unroll all)


I'm playing around with nvidia's unroll loops directive, but haven't seen a way to turn it on selectively.
Let's say I have this:
void testUnroll()
{
#pragma optionNV(unroll all)
for (...)
...
}
void testNoUnroll()
{
for (...)
...
}
Here, I'm assuming both loops end up being unrolled. To stop this I think the solution will involve resetting the directive after the block I want affected, for example:
#pragma optionNV(unroll all)
for (...)
...
#pragma optionNV(unroll default) //??
However I don't know the keyword to reset the unroll behaviour to the initial/default setting. How can this be done? If anyone could also point to some official docs for nvidia's compiler directives that'd be even better.
Currently, it seems that only the last #pragma optionNV(unroll *) directive found in the program is used (e.g. throw one in on the last line and it overrides everything above it).

According to this post on the NVidia forums, having no keyword afterwards will set it to default behavior:
#pragma unroll 1 will prevent the compiler from ever unrolling a loop.
If no number is specified after #pragma unroll, the loop is completely unrolled if its trip count is constant, otherwise it is not unrolled at all.
I'm not sure if it works on GLSL, but you can maybe try:
#pragma optionNV(unroll)
If anyone tries this, let us know if it works!

I don't remember where I found this, but I can confirm that this works on an Nvidia 1070 with the 435 driver on Linux with OpenGL 4.6:
#pragma optionNV(inline 0)
In my case link time is reduced by almost 20x and performance drops by around 50%, which is very useful for making small tweaks to shaders during development.

Related

How to properly implement a cross-platform spinlock in c++

Essentially, my question is:
What does a "good" implementation of a spinlock look like in c++ which works on the "usual" CPU/OS/Compiler combinations (x86 & arm, Windows & Linux, msvc & clang & g++ (maybe also icc) )?
Explanation:
As I wrote in the answer to a different question, it is fairly easy to write a working spinlock in c++11. However, as pointed out (in the comments as well as in e.g. spinlock-vs-stdmutextry-lock), such an implementation comes with some performance problems in case of congestion, which imho can only be solved by using platform specific instructions (intrinsics / os primitives / assembly?).
I'm not looking for a super optimized version (I expect that would only make sense if you have very precise knowledge about the exact platform and workload and need every last bit of efficiency) but something that lives around the mythical 20/80 tradeoff point i.e. I want to avoid the most important pitfalls in most cases while still keeping the solution as simple and understandable as possible.
In general, I'd expect the result to look something like this:
#include <atomic>
#ifdef _MSC_VER
#include <Windows.h>
#define YIELD_CPU YieldProcessor();
#elif defined(...)
#define YIELD_CPU ...
...
#endif
class SpinLock {
std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
void lock() {
while (locked.test_and_set(std::memory_order_acquire)) {
YIELD_CPU;
}
}
void unlock() {
locked.clear(std::memory_order_release);
}
};
But I don't know
if a YIELD_CPU macro inside the loop is all that's needed or if there are any other problematic aspects (e.g. can/should we indicate if we expect the test_and_set to succeed most of the time)
what the appropriate mapping for YIELD_CPU on the different CPU/OS/Compiler combinations is (and if possible I'd like to avoid dragging in a heavy weight header like Windows.h)
Note: I'm also interested in answers that only cover a subset of the mentioned platforms, but might not mark them as the accepted answer and/or merge them into a separate community answer.
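For reference, here is a minimal sketch of one common refinement, a "test and test-and-set" loop: after a failed exchange, the waiter spins on a plain relaxed load and only retries the exchange once the lock looks free. The CPU_RELAX name and its per-platform mapping are my own assumptions and not a vetted portability layer; in particular the ARM branch assumes a core and assembler that accept the yield instruction. It uses std::atomic<bool> rather than std::atomic_flag because a plain load on atomic_flag is only available from C++20.

```cpp
#include <atomic>

// Hypothetical mapping of a "relax the CPU while spinning" hint.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define CPU_RELAX() _mm_pause()
#elif defined(__i386__) || defined(__x86_64__)
  #define CPU_RELAX() __builtin_ia32_pause()
#elif defined(__arm__) || defined(__aarch64__)
  #define CPU_RELAX() __asm__ __volatile__("yield" ::: "memory")
#else
  #define CPU_RELAX() ((void)0)  // unknown platform: plain busy loop
#endif

class SpinLock {
    std::atomic<bool> locked{false};
public:
    // Single attempt; returns true if the lock was acquired.
    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }
    void lock() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire)) {
                return;  // we flipped false -> true, so we own the lock
            }
            // Wait on a read-only load so contended waiters do not keep
            // writing to the cache line while the lock is held.
            while (locked.load(std::memory_order_relaxed)) {
                CPU_RELAX();
            }
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

Since it provides lock(), unlock() and try_lock(), it can be used with std::lock_guard. Whether this is enough under heavy contention (versus adding backoff or falling back to an OS primitive) depends on the workload.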

Are comparison between macro values bad in embedded programming?

I am building a program that needs to run on an ARM.
The processor has plenty of resources to run the program, so this question is not directly related to this type of processor, but is related to non powerful ones, where resources and computing power are 'limited'.
To print debug information (or even to activate portions of code) I am using a header file where I define macros that I set to true or false, like this:
#define DEBUG_ADCS_OBC true
and in the main program:
if (DEBUG_ADCS_OBC == true) {
printf("O2A ");
for (j = 0; j < 50; j++) {
printf("%x ", buffer_obc[j]);
}
}
Is this a bad habit? Are there better ways to do this?
In addition, will having these IF checks affect performance in a measurable way?
Or is it safe to assume that when the code is compiled the IFs are somehow removed from the flow, as the comparison is made between two values that cannot change?

Since the expression DEBUG_ADCS_OBC == true can be evaluated at compile time, optimizing compilers will figure out that the branch is either always taken or is always bypassed, and eliminate the condition altogether. Therefore, there is zero runtime cost to the expression when you use an optimized compiler.
If you are compiling with all optimization turned off, use conditional compilation instead. This will do the same thing an optimizing compiler does with a constant expression, but at the preprocessor stage. Hence the compiler will not "see" the conditional even with optimization turned off.
Note 1: Since DEBUG_ADCS_OBC acts as a boolean value, use DEBUG_ADCS_OBC on its own, without == true, for a somewhat cleaner look.
Note 2: Rather than defining the value in the body of your program, consider passing a value on the command line, for example -DDEBUG_ADCS_OBC=true. This lets you change the debug setting without modifying your source code, simply by manipulating the make file or one of its options.
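A minimal sketch of the conditional-compilation variant combined with a command-line define. The dump_obc helper and its return value are hypothetical, added only so the effect is observable. It uses 0/1 rather than true/false because, in C without <stdbool.h>, #if treats an unknown identifier such as true as 0.

```cpp
#include <cstdio>

// Default used when nothing is passed on the command line,
// e.g. -DDEBUG_ADCS_OBC=1 enables the dump.
#ifndef DEBUG_ADCS_OBC
#define DEBUG_ADCS_OBC 0
#endif

// Hypothetical helper: dumps the buffer in debug builds only and
// returns the number of bytes it printed.
int dump_obc(const unsigned char *buffer_obc, int len) {
#if DEBUG_ADCS_OBC
    std::printf("O2A ");
    for (int j = 0; j < len; j++) {
        std::printf("%x ", buffer_obc[j]);
    }
    std::printf("\n");
    return len;
#else
    // The debug code above is never seen by the compiler in this build.
    (void)buffer_obc;
    (void)len;
    return 0;
#endif
}
```

Built without the define, the function body reduces to the #else branch, even with optimization turned off.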

The code you are using is evaluated every time your program reaches this line (unless the optimizer removes the check). Since every change to DEBUG_ADCS_OBC requires recompiling your code anyway, you should use #ifdef/#ifndef instead. Their advantage is that they are evaluated only once, at compile time.
Your code segment could look like the following:
Header:
//Remove this line if debugging should be disabled
#define DEBUG_DCS_OBS
Source:
#ifdef DEBUG_DCS_OBS
printf("O2A ");
for (j = 0; j < 50; j++) {
printf("%x ", buffer_obc[j]);
}
#endif

The problem with leaving this to the compiler is the unnecessary run-time test of a constant expression. An optimising compiler will remove it, but it may equally issue warnings about constant expressions or, when the debug macro is disabled, about unreachable code.
It is not a matter of "bad in embedded programming", it bears little merit in any programming domain.
The following is the more usual idiom. It does not include unreachable code in the final build, and an appropriately configured syntax-highlighting editor or IDE will generally show you which code sections are active and which are not.
#define DEBUG_ADCS_OBC
...
#if defined DEBUG_ADCS_OBC
printf("O2A ");
for (j = 0; j < 50; j++)
{
printf("%x ", buffer_obc[j]);
}
#endif

I'll add one thing that I didn't see mentioned.
If optimizations are disabled on debug builds, the code is still included, even if the runtime performance impact is insignificant. As a result, debug builds are usually bigger than release builds.
If you have very limited memory, you can run into a situation where the release build fits in the device memory and the debug build does not.
For this reason I prefer a compile-time #if over a runtime if. I can keep the memory usage of debug and release builds closer to each other, and it's easier to keep using the debugger at the end of the project.

The optimizer will solve the extra resources problem, as mentioned in the other replies, but I want to add another point. From a readability point of view, this code will be repeated many times, so consider creating your own printing macros. Those macros are what should be enclosed by the debug enable/disable macros.
#ifdef DEBUG_DCS_OBS
myCustomPrint //your custom printing code
#else
myCustomPrint //No code here
#endif
This also reduces the chance of the debug guard being forgotten in some file, which would leave debug code in the build and cause a real optimization problem.
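A minimal sketch of what such a print macro could look like. DBG_PRINTF and dump_buffer are hypothetical names; DEBUG_DCS_OBS is the switch from the snippet above. Note that when the switch is off the macro arguments are not evaluated at all, so they should not have side effects.

```cpp
#include <cstdio>

// Hypothetical debug print macro, guarded once in a shared header.
#ifdef DEBUG_DCS_OBS
#define DBG_PRINTF(...) std::printf(__VA_ARGS__)
#else
#define DBG_PRINTF(...) ((void)0)  // expands to a no-op statement
#endif

// Call sites stay free of #ifdef blocks.
int dump_buffer(const unsigned char *buffer_obc, int len) {
    (void)buffer_obc;  // avoids an unused-parameter warning when disabled
    DBG_PRINTF("O2A ");
    for (int j = 0; j < len; j++) {
        DBG_PRINTF("%x ", buffer_obc[j]);
    }
    DBG_PRINTF("\n");
    return len;
}
```

The loop itself still exists in a non-debug build, but with an empty body, so an optimizing compiler can typically drop it.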

Floating Point introspection in VS2010 - how do I check without breaking?

I've sort of gone around the houses here, and I'd thought I'd found a solution. It certainly seems to correctly identify the problems I know about, but also leads to unexplained crashes in about half of all system test cases.
The problem is that our code needs to call client code as a dll. We have control over our code, but not the clients', and experience has shown that their code isn't always flawless. I have protected against segmentation faults by exiting the program with a clear message of what might have been going wrong, but I've also had a few divide-by-zero exceptions coming from the clients' code, which I would like to identify and then exit.
What I've been wanting to do is:
Just before running the clients' dll, switch on floating point introspection.
Run client code.
Check for any problems.
Switch off introspection for speed.
There are theoretically a number of ways of doing this, but many don't seem to work for VS2010.
I have been trying to use the float_control pragma:
#pragma float_control(except, on, push)
// run client code
#pragma float_control(pop)
__asm fwait; // This forces the floating point unit to synchronise
if (_statusfp() & _SW_ZERODIVIDE)
{
// abort the program
}
This should be OK in theory, and in practice it works well 50% of the time.
I'm thinking that the problem might be that the float_control setting stays on and causes problems elsewhere in the code.
According to microsoft.com:
"The /fp:precise, /fp:fast, /fp:strict and /fp:except switches control floating-point semantics on a file-by-file basis. The float_control pragma provides such control on a function-by-function basis."
However, during compilation I get the warning:
warning C4177: #pragma 'float_control' should only be used at global scope or namespace scope
Which on the face of it is a direct contradiction.
So my question is:
Is the documentation correct, or is the warning (I'm betting on the warning)?
Is there a reliable and safe way of doing this?
Should I be doing this at all, or is it just too dangerous?

You tried
#pragma float_control(except, on, push)
// run client code
#pragma float_control(pop)
That's not how it works. It's a compiler directive, and it means
#pragma float_control(except, on, push)
// This entire function is compiled with float_control exceptions on.
// Therefore, the pragma has to appear outside the function, at global scope.
#pragma float_control(pop)
And of course, this setting affects only the function(s) being compiled, not any functions that they may call, such as your clients'. There is no way that a #pragma can change already compiled code.
So, the answers:
Both are correct
Yes, _controlfp_s
You're missing the SSE2 status, so it's at least incomplete
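As an illustration of checking the status flags at run time instead of relying on a pragma, here is a minimal sketch using the standard <cfenv> functions. This is a swapped-in portable technique, not the _controlfp_s route named above, and it assumes a toolchain that ships <cfenv> (as far as I know VS2010 does not; there the MSVC-specific _clearfp() and _statusfp() from <float.h> play the same role). client_divide is a hypothetical stand-in for the client's entry point.

```cpp
#include <cfenv>

// Hypothetical stand-in for a function exported by the client's DLL.
double client_divide(double a, double b) {
    volatile double d = b;  // volatile keeps the division at run time
    return a / d;
}

// Clears the floating-point status flags, runs the client function,
// then reports whether a divide-by-zero was flagged.
bool raises_div_by_zero(double (*fn)(double, double), double a, double b) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double result = fn(a, b);
    (void)result;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

This only inspects sticky status flags after the call, so it does not change how the client code was compiled and does not unmask any hardware exceptions.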

Danger in using nested comments for quickly (de)activating code blocks in C++

I'm currently using nested comments to quickly activate/deactivate code during testing, the way I'm doing it is like this :
//* First Case, Activated
DoSomething();
/**/
/* Second Case, De-Activated
DoSomethingElse();
/**/
I can activate or deactivate the code blocks by simply adding or deleting a '/'.
The compiler warns me about this, as nested comments are bad, but in reality, is it dangerous to use these comments?

This is how people normally do this:
#if 0
//...
#endif
or
#define TESTWITH
#ifdef TESTWITH
//..
#endif

Yes, you will often end up over-commenting or under-commenting, leaving code active that you did not expect, and this will make debugging very confusing. Using // is much more reliable for that: you'll have to type more, but it's more predictable.

Not a direct answer, but have you considered #ifdef instead?
#define DOSOMETHING
#ifdef DOSOMETHING
DoSomething();
#else
DoSomethingElse();
#endif

Why do you feel the need to toggle code blocks on and off so often anyway? Most likely, you're doing something wrong at a higher level.
Perhaps you should be using source control, and simply create two (or more) branches, to test out different versions of your code.
Or perhaps you should refactor your code so that, instead of commenting out a whole block of code, you only need to change a single function call.
There is nothing wrong with abusing nested comments like this, but it makes your code harder to read, and it solves a problem which typically should be solved at a whole different level.

Here's how you can get into trouble:
//* First Case, Activated
DoSomething();
/**/
/* Second Case, De-Activated
/* Comment about DoSomethingElse */
DoSomethingElse();
/**/
Now, your second case will execute, because the closing of the regular comment will close your note off, and the compiler won't detect anything is amiss.
Of course, this can be averted by never using the /**/ style comments, and it depends on your environment if that's a reasonable thing to do. And a syntax-highlighting editor (even including the Stack Overflow answer editor) will tip you off to what's going on. But why introduce the possibility?
This also falls into the category of "cute." You're mixing the two different commenting syntaxes. So long as everyone understands what's going on and plays by the rules, you'll be fine. But as soon as somebody doesn't you're in trouble.

Visual Studio 2005 C compiler problem when optimizing a switch statement

General Question which may be of interest to others:
I ran into what I believe is a C++ compiler optimization problem (Visual Studio 2005) with a switch statement. What I'd like to know is whether there is any way to satisfy my curiosity and find out what the compiler is trying, but failing, to do. Is there any log I can spend some time (probably too much time) deciphering?
My specific problem for those curious enough to continue reading - I'd like to hear your thoughts on why I get problems in this specific case.
I've got a tiny program with about 500 lines of code containing a switch statement. Some of its cases contain some assignment of pointers.
double *ptx, *pty, *ptz;
double **ppt = new double*[3];
//some code initializing etc ptx, pty and ptz
ppt[0]=ptx;
ppt[1]=pty; //<----- this statement causes problems
ppt[2]=ptz;
The middle statement seems to hang the compiler. The compilation never ends. OK, I didn't wait for longer than it took to walk down the hall, talk to some people, get a cup of coffee and return to my desk, but this is a tiny program which usually compiles in less than a second. Remove a single line (the one indicated in the code above) and the problem goes away, as it also does when removing the optimization (on the whole program or using #pragma on the function).
Why does this middle line cause a problem? The compiler's optimizer doesn't like pty.
There is no difference in the vectors ptx, pty, and ptz in the program. Everything I do to pty I do to ptx and ptz. I tried swapping their positions in ppt, but pty was still the line causing a problem.
I'm asking about this because I'm curious about what is happening. The code is rewritten and is working fine.
Edit:
Almost two weeks later, I check out the closest version to the code I described above and I can't edit it back to make it crash. This is really annoying, embarrassing and irritating. I'll give it another try, but if I don't get it breaking anytime soon I guess this part of the question is obsolete and I'll remove it. Really sorry for taking your time.

If you need to make this code compilable without changing it too much consider using memcpy where you assign a value to ppt[1]. This should at least compile fine.
However, your problem looks more as if another part of the source code is causing this behaviour.
What you can also try is to put this stuff:
ppt[0]=ptx;
ppt[1]=pty; //<----- this statement causes problems
ppt[2]=ptz;
in another function.
This may also help the compiler avoid the path it is currently taking when compiling your code.
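A minimal sketch of that idea, combined with the per-function #pragma workaround mentioned in the question. The function name assign_points is hypothetical; #pragma optimize is MSVC-specific, so it is guarded here.

```cpp
// Disable optimization for just this function on MSVC.
#ifdef _MSC_VER
#pragma optimize("", off)
#endif
// Hypothetical helper holding the three assignments from the question.
void assign_points(double **ppt, double *ptx, double *pty, double *ptz) {
    ppt[0] = ptx;
    ppt[1] = pty;
    ppt[2] = ptz;
}
// Restore the optimization settings given on the command line.
#ifdef _MSC_VER
#pragma optimize("", on)
#endif
```

This keeps the rest of the translation unit optimized while isolating the statements that seemed to trigger the hang.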

Did you try renaming pty to something else (e.g. pt_y)? A couple of times (e.g. with a variable "rect2") I have run into the problem that some names seem to be "reserved".

It sounds like a compiler bug. Have you tried re-ordering the lines? e.g.,
ppt[1]=pty;
ppt[0]=ptx;
ppt[2]=ptz;
Also, what happens if you juggle the values that are assigned (which will introduce bugs in your code, but may indicate whether it's the pointer or the array that's the issue), e.g.:
ppt[0] = pty;
ppt[1] = ptz;
ppt[2] = ptx;
(or similar).

It's probably due to your declaration of ptx, pty and ptz with them being optimised out to use the same address. Then this action is causing your compiler problems later in your code.
Try
static double *ptx;
static double *pty;
static double *ptz;
