How to prevent FTZ for a single line in CUDA - c++

I am working on a particle code where flush-to-zero (FTZ) is used extensively to extract performance. However, there is a single floating-point comparison that I do not want flushed. One solution is inline PTX, but it introduces unnecessary instructions: PTX has no boolean type, only predicate registers, so the result has to be moved into an integer register first:
C++ code:
float a, b;
if ( a < b ) do_something;
// compiles into SASS:
// FSETP.LT.FTZ.AND P0, PT, A, B, PT;
// @P0 DO_SOMETHING
PTX:
float a, b;
uint p;
asm("{.reg .pred p; setp.lt.f32 p, %1, %2; selp.u32 %0, 1, 0, p;}" : "=r"(p) : "f"(a), "f"(b) );
if (p) do_something;
// compiled into SASS:
// FSETP.LT.AND P0, PT, A, B, PT;
// SEL R2, RZ, 0x1, !P0;
// ISETP.NE.AND P0, PT, R2, RZ, PT;
// @P0 DO_SOMETHING
Is there a way that I can do the non-FTZ comparison with a single instruction without coding the entire thing in PTX/SASS?

Horizontal min on avx2 8 float register and shuffle paired registers alongside

After ray vs triangle intersection test in 8 wide simd, I'm left with updating t, u and v which I've done in scalar below (find lowest t and updating t,u,v if lower than previous t). Is there a way to do this in simd instead of scalar?
int update_tuv(__m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out)
{
alignas(32) float ts[8];_mm256_store_ps(ts, t);
alignas(32) float us[8];_mm256_store_ps(us, u);
alignas(32) float vs[8];_mm256_store_ps(vs, v);
int min_index{0};
for (int i = 1; i < 8; ++i) {
if (ts[i] < ts[min_index]) {
min_index = i;
}
}
if (ts[min_index] >= *t_out) { return -1; }
*t_out = ts[min_index];
*u_out = us[min_index];
*v_out = vs[min_index];
return min_index;
}
I haven't found a solution that finds the horizontal min t and shuffles/permutes its paired u and v along the way, other than permuting and min-testing 8 times.
First find horizontal minimum of the t vector. This alone is enough to reject values with your first test.
Then find index of that first minimum element, extract and store that lane from u and v vectors.
// Horizontal minimum of the vector
inline float horizontalMinimum( __m256 v )
{
__m128 i = _mm256_extractf128_ps( v, 1 );
i = _mm_min_ps( i, _mm256_castps256_ps128( v ) );
i = _mm_min_ps( i, _mm_movehl_ps( i, i ) );
i = _mm_min_ss( i, _mm_movehdup_ps( i ) );
return _mm_cvtss_f32( i );
}
int update_tuv_avx2( __m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out )
{
// Find the minimum t, reject if t_out is larger than that
float current = *t_out;
float ts = horizontalMinimum( t );
if( ts >= current )
return -1;
// Should compile into vbroadcastss
__m256 tMin = _mm256_set1_ps( ts );
*t_out = ts;
// Find the minimum index
uint32_t mask = (uint32_t)_mm256_movemask_ps( _mm256_cmp_ps( t, tMin, _CMP_EQ_OQ ) );
// std::countr_zero needs C++20 and <bit>; before C++20, use the _tzcnt_u32, _BitScanForward or __builtin_ctz intrinsics
int minIndex = std::countr_zero( mask );
// Prepare a permutation vector for the vpermps AVX2 instruction
// We don't care what's in the highest 7 integer lanes in that vector, only need the first lane
__m256i iv = _mm256_castsi128_si256( _mm_cvtsi32_si128( (int)minIndex ) );
// Permute u and v vector, moving that element to the first lane
u = _mm256_permutevar8x32_ps( u, iv );
v = _mm256_permutevar8x32_ps( v, iv );
// Update the outputs with the new numbers
*u_out = _mm256_cvtss_f32( u );
*v_out = _mm256_cvtss_f32( v );
return minIndex;
}
While relatively straightforward and probably faster than your current method with vector stores followed by scalar loads, the performance of the above function is only great when that if branch is well-predicted.
When that branch is unpredictable (statistically, its outcome is random), a completely branchless implementation might be a better fit. It is going to be more complicated, though: load the old values with _mm_load_ss, conditionally update them with _mm_blendv_ps, and store them back with _mm_store_ss.

Step vs. comparison operator in HLSL?

As an HLSL enthusiast, I've been in the habit of using (float)(x>=y), usually for 0/1 multiplications to avoid branches. I just revisited my intrinsic list and saw step(x,y). They sound equivalent in output to me.
Are there any reasons to prefer one of these styles over the other?
I think they're equivalent. This shader:
inline float test1( float x, float y )
{
return (float)( x >= y );
}
inline float test2( float x, float y )
{
return step( x, y );
}
float2 main(float4 c: COLOR0): SV_Target
{
float2 res;
res.x = test1( c.x, c.y );
res.y = test2( c.z, c.w );
return res;
}
It compiles into the following DXBC instructions:
ps_4_0
dcl_input_ps linear v0.xyzw
dcl_output o0.xy
dcl_temps 1
ge r0.xy, v0.xwxx, v0.yzyy // If the comparison is true, then 0xFFFFFFFF is returned for that component.
and o0.xy, r0.xyxx, l(0x3f800000, 0x3f800000, 0, 0) // Component-wise logical AND, 0x3f800000 = 1.0f
ret
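As you can see, the compiler treated both inline functions the same way; it even merged them into a single 2-lane vector comparison. Note the swizzles, though: the second lane computes c.w >= c.z, because step(x, y) returns 1 when y >= x. So (float)(x >= y) corresponds to step(y, x), not step(x, y).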
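To make the argument order concrete, here is a minimal C++ sketch of the two forms. It assumes the documented HLSL definition, step(y, x) = (x >= y) ? 1 : 0; the names hlslStep and castGe are hypothetical stand-ins, not HLSL or library functions.

```cpp
// Hypothetical C++ stand-in for HLSL step(y, x): returns 1 when x >= y, else 0.
inline float hlslStep(float y, float x)
{
    return x >= y ? 1.0f : 0.0f;
}

// Hypothetical stand-in for the cast idiom (float)(x >= y).
inline float castGe(float x, float y)
{
    return static_cast<float>(x >= y);
}
```

Under that assumption, castGe(a, b) and hlslStep(b, a) agree for all ordered inputs, which matches the swapped swizzles in the DXBC listing.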

Segmentation fault using SSE _mm_shuffle_ps

I'm using SSE instructions in my program to improve its performance, but it often crashes on a call to _mm_shuffle_ps.
I know this is most probably due to the 16-byte alignment requirement, but I can't get around the issue.
This is the code I use (my program is compiled as 32-bit with Visual Studio 2017):
#define SHUFFLEMASK(A0,A1,B2,B3) ( (A0) | ((A1)<<2) | ((B2)<<4) | ((B3)<<6) )
inline __m128 RotateVector(const __m128& quaternion, const __m128& vector)
{
const uint32 shuffleMask = SHUFFLEMASK(3, 3, 3, 3);
// THE NEXT LINE IS THE ONE CRASHING
const __m128 qw = _mm_shuffle_ps(quaternion, quaternion, shuffleMask);
// The rest isn't useful since it crashes before even getting there
...
}
inline __m128 MakeVectorRegister(float X, float Y, float Z, float W)
{
return _mm_setr_ps(X, Y, Z, W);
}
class Vertex
{
public:
union
{
float vec[3];
struct
{
float x, y, z;
};
};
// Rest of class (only methods, no other attributes)
...
};
__declspec(align(16)) class X
{
...
__m128 _scale;
__m128 _rotation;
...
Vertex TransformVector(const Vertex& vector) const
{
float __declspec(align(16)) vectorData[3];
memcpy(vectorData, &vector.x, sizeof(float) * 3);
// The next line was originally this: const __m128 inputVectorW0 = MakeVectorRegister(((const float*)(&vector.x))[0], ((const float*)(&vector.x))[1], ((const float*)(&vector.x))[2], 0.0f)
const __m128 inputVectorW0 = MakeVectorRegister(((const float*)(vectorData))[0], ((const float*)(vectorData))[1], ((const float*)(vectorData))[2], 0.0f);
const __m128 scaledVec = _mm_mul_ps(_scale, inputVectorW0);
const __m128 rotatedVec = RotateVector(_rotation, scaledVec);
// The rest isn't useful since it crashes before
...
}
};
// Example of usage
int main(...)
{
Vertex v;
X x;
// This crashes calling _mm_shuffle_ps inside RotateVector
Vertex result = x.TransformVector(v);
}

Plotting Euler Integration using Polyline(), C++

So I'm trying to plot the output of this Euler integration function:
typedef double F(double,double);
using std::vector;
void euler(F f, double y0, double a, double b, double h,vector<POINT> Points)
{
POINT Pt;
double y_n = y0;
double t = a;
for (double t = a; t != b; t += h )
{
y_n += h * f(t, y_n);
Pt.x = t; // assign the x value of the point to t.
Pt.y = y_n; // assign the y value of the point to y_n.
Points.push_back(Pt);
}
}
// Example: Newton's cooling law
double newtonCoolingLaw(double, double t)
{
return t; // return statement ends the function; here, it gives the time derivative y' = -0.07 * (t - 20)
}
I'm trying to use the Polyline() function in a Win32 application, so I do this under the case WM_PAINT:
case WM_PAINT:
{
hdc = BeginPaint(hWnd, &ps);
//Draw lines to screen.
hPen = CreatePen(PS_SOLID, 1, RGB(255, 25, 5));
SelectObject(hdc, hPen);
using std::vector;
vector<POINT> Points(0);
euler(newtonCoolingLaw, 1, 0, 20, 1,Points);
POINT tmp = Points.at(0);
const POINT* elementPoints[1] = { &tmp };
int numberpoints = (int) Points.size() - 1 ;
Polyline(hdc,elementPoints[1],numberpoints);
When I reroute my I/O to console, here are the outputs for the variables:
I'm able to draw the expected lines to the screen using MoveToEx(hdc,0,0,NULL) and LineTo(hdc,20,20), but for some reason none of these functions work with my vector<POINT> Points. Any suggestions?
There are multiple things that seem erroneous to me:
1) You should pass the vector by reference or as a return value:
void euler(/*...*/,vector<POINT>& Points)
Currently you are only passing a copy into the function, so the original vector will not be modified.
2) Don't compare doubles for (in-)equality in your for-loop header. Doubles have limited precision, so with a fractional step the accumulated rounding error can make t skip past b without ever matching it exactly, and the loop never terminates. Compare with "less than" instead:
for (double t = a; t < b; t += h )
3) Why are you declaring elementPoints as an array of pointers of size 1? Wouldn't a simple pointer do:
const POINT* elementPoints = &tmp ; //EDIT: see point 5)
4) You have an off-by-one error when calling Polyline: elementPoints has a single element, so index 1 is out of bounds. If you want to stick with the array at all, use:
Polyline(hdc,elementPoints[0],numberpoints);
EDIT: Sorry, I forgot an important one:
5) In your code, elementPoints[0] points to a single POINT (tmp, a copy of the first element) and not to the array inside the vector. This would probably work if you declared tmp as a reference:
POINT& tmp = Points.at(0); //I'm wondering why this doesn't throw an exception, as the vector should actually be empty here
However, I think what you actually want to do is to get rid of tmp and elementPoints altogether and write in the last line:
Polyline(hdc,&Points[0],(int) Points.size()-1);
//Or probably rather:
Polyline(hdc,&Points[0],(int) Points.size());
Btw.: What is the purpose of the -1?
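Putting points 1) and 2) together, a minimal sketch of the integration routine could look like the following. It is only an illustration under a few assumptions: Point is a hypothetical stand-in for the Win32 POINT struct so the code builds without windows.h, the vector is returned rather than filled through a reference, the step count is derived up front so no doubles are compared in the loop header, and the derivative follows the comment in the question (y' = -0.07 * (y - 20)) rather than its return statement.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stand-in for the Win32 POINT struct (two LONG members).
struct Point { long x; long y; };

using Derivative = double (*)(double t, double y);

// Explicit Euler over [a, b) with step h; returns the points by value.
std::vector<Point> euler(Derivative f, double y0, double a, double b, double h)
{
    std::vector<Point> points;
    // Derive the number of steps once, so the loop runs on an integer counter.
    const int steps = static_cast<int>(std::floor((b - a) / h + 0.5));
    double y = y0;
    for (int i = 0; i < steps; ++i) {
        const double t = a + i * h;  // recomputed from i, so rounding error does not accumulate
        y += h * f(t, y);
        // Round to the nearest integer coordinate (the question's code truncates instead).
        points.push_back({ std::lround(t), std::lround(y) });
    }
    return points;
}

// Newton's cooling law as stated in the question's comment: y' = -0.07 * (y - 20).
double newtonCoolingLaw(double /*t*/, double y)
{
    return -0.07 * (y - 20.0);
}
```

With the real POINT type in place of Point, the result could then be drawn with Polyline(hdc, points.data(), static_cast<int>(points.size())), as in point 5).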

How to use arrays of double[4][4] contained in a vector?

I'd like to ask the community about my problem.
I have a series of double[4][4] arrays in this format:
double T1[4][4] = {
{-0.9827, -0.1811, -0.0388, 0.1234},
{0.0807, -0.2303, -0.9698, 0.1755},
{0.1666, -0.9561, 0.2409, 0.6729},
{0, 0, 0, 1.00000 }};
double T2[4][4] = {
{-0.8524, -0.5029, -0.1432, 0.1963},
{0.1580, 0.0135, -0.9874, 0.1285},
{0.4984, -0.8643, 0.0680, 0.6237},
{0, 0, 0, 1.00000 }};
T3, T4, and so on....
I need to put all of these arrays in a container so that another function can pick them up one at a time. That function needs the arrays in this format, because it does the following computation:
int verifica_punti(punto P, Mat& I, double TC[4][4], const double fc[2],const double KC[5], const double cc[2],const double alpha){
//punto
double P1[4] = {P.x, P.y, P.z, 1.0};
//iniz
double Pc[3] = {TC[0][3], TC[1][3], TC[2][3]};
//calc
for(int i=0; i<3; i++){
for(int j=0; j<3; j++){
Pc[i] += TC[i][j] * P1[j];
}
}
//norm
double PN[2] = { Pc[0]/Pc[2], Pc[1]/Pc[2] };
Searching this site and the internet, I've found some examples of how to do this, but they don't work in my case. Using vector, array, queue... I don't understand any of it.
I'm pasting my code here and asking for help fixing the problem.
This is my code:
//array of TC
typedef array<array<double,4>,4> Matrix;
//single TC
Matrix T1 = {{
{{-1.0000, 0.0000, -0.0000, 0.1531}},
{{0.0000, 0.0000, -1.0000, 0.1502 }},
{{-0.0000, -1.0000, -0.0000, 1.0790}},
{{0 , 0, 0, 1.0000 }}}};
Matrix T2 = {{
{{-1.0000, 0.0009, 0.0019, 0.1500}},
{{-0.0021, -0.4464, -0.8948, 0.1845}},
{{0.0000, -0.8948, 0.4464, 0.8094 }},
{{ 0, 0, 0, 1.0000 }}}};
And so on. Then I declare the container and fill it:
vector <Matrix> TCS;
TCS.push_back(T1);
TCS.push_back(T2);
TCS.push_back(T3);
TCS.push_back(T4);
TCS.push_back(T5);
TCS.push_back(T6);
TCS.push_back(T7);
TCS.push_back(T8);
TCS.push_back(T9);
Now, how can I get a single matrix in double[4][4] format to pass to the function verifica_punti shown above?
I need one TC at a time, in FIFO order (the first one I pushed is the first one I need to pop and use).
How can I do this? I've written
double temp[4][4] = TCS.pop_back()
or double temp[4][4] = TCS[i];
but neither is correct.
I'm using Visual C++ 2010 on 64-bit Windows 7.
Please help me :-( Thanks in advance.
With
typedef array<array<double,4>,4> Matrix;
vector <Matrix> TCS;
You have
//double temp[4][4] = TCS[i]; // Illegal
Matrix m1 = TCS[i]; // legal
const Matrix& m2 = TCS[i]; // legal, and avoid a copy.
Now, you have to change:
int verifica_punti(punto P, Mat& I, double TC[][4], const double fc[], const double KC[], const double cc[], const double alpha);
to
int verifica_punti(punto P, Mat& I, Matrix& TC, const double fc[], const double KC[], const double cc[], const double alpha);
std::array< std::array<double,4>, 4> and double[4][4] are distinct types. The former encapsulates the latter so that it is copyable and can be used in containers, and it has a practically identical interface. But you can't use them interchangeably.
You already have your typedef, so use that:
while (!TCS.empty()) {
// get the last one
Matrix m = TCS.back();
/* do stuff with m */
// pop the last one out
TCS.pop_back();
}
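Note that the loop above takes elements from the back, which is LIFO order. For the FIFO order the question asks for, one option is to store the matrices in a std::deque and consume them from the front. The sketch below is only an illustration: firstTranslation is a hypothetical stand-in for verifica_punti (it takes the Matrix by const reference and just reads TC[0][3]), and drainFifo is a hypothetical helper name.

```cpp
#include <array>
#include <deque>
#include <vector>

typedef std::array<std::array<double, 4>, 4> Matrix;

// Hypothetical stand-in for verifica_punti: reads one entry of the transform.
double firstTranslation(const Matrix& TC)
{
    return TC[0][3];
}

// Consume the container in FIFO order: the oldest matrix is processed first.
std::vector<double> drainFifo(std::deque<Matrix>& TCS)
{
    std::vector<double> results;
    while (!TCS.empty()) {
        const Matrix m = TCS.front();  // copy the oldest matrix before removing it
        TCS.pop_front();               // pop_front returns void, like pop_back
        results.push_back(firstTranslation(m));
    }
    return results;
}
```

If the matrices do not have to be removed as they are used, keeping the std::vector and iterating from index 0 upward gives the same order without any popping.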