I am writing an application that must be very fast. I use Qt 5.5 with Qt Creator, the 64-bit MSVC2013 build of Qt.
I used Very Sleepy CS to profile my application, and I saw that the function that took the most exclusive time was an operator+= overload (which is, as you can guess, called a lot of times).
Here's the piece of code.
struct Coordinate
{
    float x;
    float y;

    Coordinate operator+=(const Coordinate &coord)
    {
        this->x += coord.x;
        this->y += coord.y;
        return (*this);
    }
};
I wondered if there was a way to improve the performance of a function as simple as this one.
operator+= is usually not defined the way you did it. Rather, it should be:
Coordinate& operator+=(const Coordinate &coord);
Note the return value of a reference.
This also has the benefit of not creating another copy.
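For reference, here's a minimal sketch of the struct with the reference-returning operator (same members as in the question):

struct Coordinate
{
    float x;
    float y;

    // Return a reference to *this instead of a copy
    Coordinate& operator+=(const Coordinate &coord)
    {
        x += coord.x;
        y += coord.y;
        return *this;
    }
};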
Check whether you are profiling the Release configuration and compiler optimizations are enabled. Such calls should be inlined by any decent compiler.
I'm trying to use the updated code from Frank Luna's book on DirectX 11, using VS2017 with the Windows 10 SDK. I've read some notes about migration from Frank and did everything he said in the link below:
http://www.d3dcoder.net/Data/Book4/d3d11Win10.htm
but got stuck here. I know there was the same question from #poncho, and it was answered well:
Access floats of XMMatrix - () operator not working
But I have trouble with the type CXMMATRIX instead of XMMATRIX, and I couldn't get results with the solution provided for him.
So I have to access the rows and columns of a CXMMATRIX:
void ExtractFrustumPlanes(XMFLOAT4 planes[6], CXMMATRIX M)
{
    //
    // Left
    //
    planes[0].x = M(0,3) + M(0,0);
    planes[0].y = M(1,3) + M(1,0);
    planes[0].z = M(2,3) + M(2,0);
    planes[0].w = M(3,3) + M(3,0);
    ...
But I get :
call of an object of a class type without appropriate operator() or
conversion functions to pointer-to-function type
and
term does not evaluate to a function taking 2 arguments
It points to the argument M of type CXMMATRIX, which is defined as below in DirectXMath.h:
// Fix-up for (2nd+) XMMATRIX parameters to pass by reference
typedef const XMMATRIX& CXMMATRIX;
What are all these errors about?
Frank Luna's book is overall a great introduction to the Direct3D 11 API, but it unfortunately suffers from heavily utilizing the legacy DirectX SDK, which is deprecated per MSDN. One of those aspects is that he's actually using the xnamath library (a.k.a. xboxmath version 2) instead of the DirectXMath library (a.k.a. xboxmath version 3).
See Book Recommendations and Introducing DirectXMath
I made a number of changes when reworking the library as DirectXMath. First, the types are actually in C++ namespaces instead of the global namespace. In your headers, you should use full name specification:
#include <DirectXMath.h>
void MyFunction(..., DirectX::CXMMATRIX M);
In your cpp source files you should use:
#include <DirectXMath.h>
using namespace DirectX;
Another change was to strongly discourage the use of 'per-element' access on the XMVECTOR and XMMATRIX data types. As discussed in the DirectXMath Programmers Guide, these types are by design proxies for the SIMD register types, which cannot be directly accessed by element. Instead, you convert to the XMFLOAT4X4 representation, which allows per-element access because it is a scalar structure.
You can see this by the fact that the operators you are trying to use are only defined for 'no-intrinsics' mode (i.e. when using scalar instead of SIMD operations like SSE, ARM-NEON, etc.):
#ifdef _XM_NO_INTRINSICS_
float operator() (size_t Row, size_t Column) const { return m[Row][Column]; }
float& operator() (size_t Row, size_t Column) { return m[Row][Column]; }
#endif
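So to read individual elements portably, first store the matrix into an XMFLOAT4X4. For example, the questioner's function could be adapted along these lines (a sketch based on the original snippet; only the 'Left' plane is shown):

#include <DirectXMath.h>
using namespace DirectX;

void ExtractFrustumPlanes(XMFLOAT4 planes[6], CXMMATRIX M)
{
    // Copy the SIMD matrix into a scalar structure that allows per-element access
    XMFLOAT4X4 m;
    XMStoreFloat4x4(&m, M);

    // Left plane (same arithmetic as before, but reading from the scalar copy)
    planes[0].x = m(0,3) + m(0,0);
    planes[0].y = m(1,3) + m(1,0);
    planes[0].z = m(2,3) + m(2,0);
    planes[0].w = m(3,3) + m(3,0);
    // ... remaining planes follow the same pattern
}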
Again, by design, this process is a bit 'verbose' because it lets you know it's not free. Some people find this aspect of DirectXMath a little frustrating to use especially when they are first getting started. In that case, I recommend you take a look at the SimpleMath wrapper in the DirectX Tool Kit. You can use the types Vector3, Vector4, Matrix, etc. and they freely convert (through C++ operators and constructors) as needed to XMVECTOR and XMMATRIX. It's not nearly as efficient, but it's a lot more forgiving to use.
The particular function you wrote is also a bit problematic. First, it's a little odd to mix XMFLOAT4 and XMMATRIX parameters. For 'in-register, SIMD-friendly' calling convention, you'd use:
void XM_CALLCONV ExtractFrustumPlanes(XMVECTOR planes[6], FXMMATRIX M)
For details on why, see MSDN.
If you want entirely scalar math, use either the non-SIMD types:
void ExtractFrustumPlanes(XMFLOAT4 planes[6], const XMFLOAT4X4& M)
or better yet use SimpleMath so you can avoid having to write explicit conversions to/from XMVECTOR or XMMATRIX
#include "SimpleMath.h"
using namespace DirectX::SimpleMath;
void ExtractFrustumPlanes(Vector4 planes[6], const Matrix& M)
Note that the latest version of DirectXMath is on GitHub, NuGet, and vcpkg.
__ieee754_exp_avx from libm*.so is being used intensively by a certain piece of source code, and I would like to replace it with a faster exp(x) implementation.
custom exp(x):
inline
double exp2(double x) {
x = 1.0 + x / 1024;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x;
return x;
}
What gcc flags should I use to make gcc automatically use a custom exp(x) implementation? If it is not possible with gcc, how can I do it then?
https://codingforspeed.com/using-faster-exponential-approximation/
Don't. This function is slower than the native implementation of exp, and is an extremely poor approximation.
First, the speed. My benchmarking indicates that, depending on your compiler and CPU, this implementation of exp2 may be anywhere between 1.5x and 4.5x slower than the native exp. I'm not sure where the web site got their figures -- "360 times faster than the traditional exp" seems absurd, and is completely inconsistent with my tests.
Second, the accuracy. exp2(x) is reasonably close to exp(x) for x ≤ 1, but fails badly for larger values. For instance:
exp(1) = 2.7182818
exp2(1) = 2.7169557 (0.05% too low)
exp(2) = 7.3890561
exp2(2) = 7.3746572 (0.20% too low)
exp(5) = 148.41316
exp2(5) = 146.61829 (1.21% too low)
exp(10) = 22026.466
exp2(10) = 20983.411 (4.74% too low)
exp(20) = 4.851652e+08
exp2(20) = 4.0008755e+08 (17.5% too low)
While the web site you got this function from claims that there is "very good agreement for input smaller than 5", this is simply not true. A 1.21% difference (for x=5) is huge, and is likely to cause significant errors in any calculations using this approximation.
Simply don't. That function looks way slower than the built-in code, and it's definitely not OK with respect to precision.
If you need SIMD (single instruction, multiple data) optimized exp functionality, ie. you're not calculating a single value but a series of those, there's C libraries that do that for you. I'd like to highlight VOLK, the Vector Optimized Library of Kernels, a spin-off of the DSP-intense GNU Radio project.
It implements its own expf (single-precision exponentiation – if you're willing to accept errors, there's certainly no reason to lug double-precision floats around); here's how that compares on my machine:
RUN_VOLK_TESTS: volk_32f_expfast_32f(131071,1987)
a_avx completed in 60.119ms
a_sse4_1 completed in 62.052ms
u_avx completed in 60.376ms
u_sse4_1 completed in 62.131ms
generic completed in 2383.73ms
So, for 1987 iterations over a vector of 131071 elements, all the SIMD-optimized kernels were faster by a factor of 40 – that's pretty OK, but it's far from the audacious 360x claim of the website you quote.
The source code of the expfast functions used can be found here.
In its core, that implementation relies on the floating point representation – which is a pretty good idea.
It admits it has a 7% error bound – that's quite a lot!
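For a feel of how that's used, here's a minimal sketch of my own (assuming VOLK is installed and the program is linked with -lvolk; the input values are arbitrary):

#include <volk/volk.h>
#include <cstdio>

int main()
{
    const unsigned int n = 131071;
    const size_t alignment = volk_get_alignment();
    float* in  = static_cast<float*>(volk_malloc(n * sizeof(float), alignment));
    float* out = static_cast<float*>(volk_malloc(n * sizeof(float), alignment));

    for (unsigned int i = 0; i < n; ++i)
        in[i] = static_cast<float>(i) / n;   // sample inputs in [0, 1)

    // Approximate exp over the whole vector; VOLK dispatches to the fastest available kernel
    volk_32f_expfast_32f(out, in, n);

    std::printf("expfast(%f) ~= %f\n", in[1000], out[1000]);

    volk_free(in);
    volk_free(out);
    return 0;
}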
This is more of a workaround (a hack):
Place exp2 definition in a .h file:
// exp2.h
#if !defined(__EXP2__H__)
#define __EXP2__H__
inline double exp2(double x) {
x = 1.0 + x / 1024;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x;
return x;
}
#endif //__EXP2__H__
Now, this file must end up included (whether directly or indirectly) in all the .c(xx) files that call exp, which might be a painful job if the existing codebase is large.
Then, when compiling the code, pass a -D (preprocessor definition) option to gcc (I don't know the minimum version that supports this form; v5.4.0 does) like this: -D'exp(X)=exp2(X)'.
Note: You no longer need libm.so.* (-lm) at link time (at least not as far as exp is concerned), so you can remove it. Actually, it would be a good idea to remove it (temporarily if you're using other math functions, permanently otherwise), so that if any .c(xx) file doesn't include exp2.h, the linker will spit out an exp-related undefined reference error; otherwise you might end up with a mixture of exp/exp2 calls in the code. (If you are using other math functions, add -lm back once you have resolved all these errors by including exp2.h in the appropriate .c(xx) files.)
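To illustrate how the pieces fit together, a hypothetical call site might look like this (the file name and build line are purely for illustration):

// main.cpp -- build with: g++ -O2 -D'exp(X)=exp2(X)' main.cpp
#include "exp2.h"   // the header above; together with the -D option, calls to exp() become exp2()
#include <cstdio>

int main()
{
    double y = exp(2.0);      // the preprocessor rewrites this into exp2(2.0)
    std::printf("%f\n", y);   // prints the approximation (~7.3747 instead of the exact 7.3891)
    return 0;
}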
I have stumbled upon an interesting case of comparing (==, !=) float types.
I encountered this problem while porting my own software from Windows to Linux. It's a bit of a bummer. The relevant code is the following:
#include <cmath>

template<class T> class PCMVector2 {
public:
    T x, y;
public:
    // Constructors and the vec2f typedef are implied by the test case below
    PCMVector2() : x(0), y(0) {}
    PCMVector2( T ax, T ay ) : x(ax), y(ay) {}

    bool operator == ( const PCMVector2<T>& a ) const {
        return x == a.x && y == a.y;
    }

    bool operator != ( const PCMVector2<T>& a ) const {
        return x != a.x || y != a.y;
    }

    // Mutable normalization
    PCMVector2<T>& Normalize() {
        const T l = 1.0f / Length();
        x *= l;
        y *= l;
        return *this;
    }

    // Immutable normalization
    const PCMVector2<T> Normalized() {
        const T l = 1.0f / Length();
        return PCMVector2<T>(x*l, y*l);
    }

    // Vector length
    T Length() const { return std::sqrt(x*x + y*y); }
};

typedef PCMVector2<float> vec2f;  // as used in the test case
I cleverly designed unit test functions which check all available functionality regarding those classes before porting to Linux. And, in contrast to MSVC, g++ doesn't complain, but it gives incorrect results at runtime.
I was stumped, so I did some additional logging, type-puns, memcmps, etc., and they all showed that the memory is 1:1 the same! Does anyone have any ideas about this?
My flags are: -Wall -O2 -j2
Thanks in advance.
EDIT2: The failed test case is:
vec2f v1 = vec2f(2.0f,3.0f);
v1.Normalize(); // mutable normalization
if( v1 != vec2f(2.0f,3.0f).Normalized() ) //immutable normalization
// report failure
Note: Both normalizations are the same and yield the same results (according to memcmp).
RESOLUTION: It turns out that you should never trust the compiler with floating-point numbers, no matter how sure you are about the memory you compare. Once data goes into the registers, it can change, and you have no control over it. After some digging regarding registers, I found this neat source of information. Hope it's useful to someone in the future.
Floating point CPU registers can be larger than the floating point type you're working with. This is especially true with float which is typically only 32 bits. A calculation will be computed using all the bits, then the result will be rounded to the nearest representable value before being stored in memory.
Depending on inlining and compiler optimization flags, it is possible that the generated code may compare one value from memory with another one from a register. Those may compare as unequal, even though their representation in memory will be bit-for-bit identical.
This is only one of the many reasons why comparing floating-point values for equality is not recommended. Especially when, as in your case, it appears to work some of the time.
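If you do need to compare results like this in a test, the usual workaround is a tolerance-based comparison rather than exact equality. A minimal sketch (the tolerance values are arbitrary and should be tuned to your data):

#include <cmath>
#include <algorithm>

// Treat two floats as equal if they differ by no more than a small absolute
// or relative tolerance, so extra register precision cannot flip the result.
inline bool nearlyEqual(float a, float b, float relTol = 1e-5f, float absTol = 1e-8f)
{
    return std::fabs(a - b) <= std::max(absTol, relTol * std::max(std::fabs(a), std::fabs(b)));
}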
I am reading through the 8th edition of the OpenGL Programming Guide by Shreiner, Sellers, Kessenich, and Licea-Kane, and I keep seeing this "vmath" library being used for vector and matrix work.
I did a Google search for vmath.h but wasn't able to find anything. I did a search on Stack Overflow and found one question where it has been used, but nothing more.
My question is where or how can I install or download it? I assumed it was something that came along with freeglut or whatever other OpenGL stuff I installed with "apt-get install", but apparently not, since g++ can't find vmath.h.
Any ideas on how to get it installed?
#Blastfurnace provides the correct download address, but I still have something to say.
Please use glm instead of vmath.h: http://glm.g-truc.net/0.9.5/index.html
I used vmath.h and found tons of bugs. Some operator definitions cause recursive function calls and stack overflows. Also, the conversion between radians and degrees is inverted.
line 11:
template <typename T>
inline T radians(T angleInRadians)
{
return angleInRadians * static_cast<T>(180.0/M_PI);
}
line 631:
static inline mat4 perspective(float fovy /* in degrees */, float aspect, float n, float f)
{
float top = n * tan(radians(0.5f*fovy)); // bottom = -top
float right = top * aspect; // left = -right
return frustum(-right, right, -top, top, n, f);
}
Obviously the tangent function expects a radian input, but the function 'radians' converts radians to degrees instead.
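A corrected version (my own sketch, not the book's official fix) converts degrees to radians, so the perspective() code above works as intended:

template <typename T>
inline T radians(T angleInDegrees)
{
    return angleInDegrees * static_cast<T>(M_PI / 180.0);
}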
line 137:
inline vecN& operator/=(const vecN& that)
{
assign(*this * that);
return *this;
}
It should be a division instead of a multiplication: assign(*this / that).
line 153:
inline vecN& operator/(const T& that)
{
assign(*this / that);
}
See? Recursive call of operator '/'. At least in Xcode this causes a stack overflow.
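One way to fix it (my own sketch, assuming a working scalar operator/= exists) is to return a new vector by value and delegate to the compound assignment, so there is no self-call:

inline vecN operator/(const T& that) const
{
    vecN result(*this);
    result /= that;    // delegate to the (fixed) compound-assignment operator
    return result;     // return by value; no recursion
}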
These bugs annoy me a lot, while the glm library provides almost the same functions with much more stable code. I STRONGLY RECOMMEND using glm instead of the current buggy vmath.h. Maybe when all these bugs are fixed, a simple vmath.h will be a better choice, but for now you have to give it up.
The web site for the book can be found at The OpenGL Programming Guide. That page has a link to a .zip file with most of the code from the book. The vmath.h file is in the include directory.
I have a strange problem. I have the following piece of code:
template<class index, class policy>
inline int CBase<index,policy>::func(const A& test_in, int* srcPtr, int* dstPtr)
{
    int width = test_in.width();
    int height = test_in.height();
    double d = 0.0; //here is the problem
    for(int y = 0; y < height; y++)
    {
        //Pointer initializations
        //multiplication involving y
        //ex: int z = someBigNumber*y + someOtherBigNumber;
        for(int x = 0; x < width; x++)
        {
            //multiplication involving x
            //ex: int z = someBigNumber*x + someOtherBigNumber;
            if(someCondition)
            {
                // floating point calculations
            }
            *dstPtr++ = array[*srcPtr++];
        }
    }
}
The inner loop gets executed nearly 200,000 times and the entire function takes 100 ms to complete (profiled using AQTimer).
I found an unused variable, double d = 0.0;, outside the outer loop and removed it. After this change, the method suddenly takes 500 ms for the same number of executions (5 times slower).
This behavior is reproducible in different machines with different processor types.
(Core2, dualcore processors).
I am using VC6 compiler with optimization level O2.
Following are the other compiler options used:
-MD -O2 -Z7 -GR -GX -G5 -X -GF -EHa
I suspected compiler optimizations and removed the /O2 optimization. After that, the function became normal again and takes 100 ms, as the old code did.
Could anyone throw some light on this strange behavior?
Why should compiler optimization slow down performance when I remove an unused variable?
Note: The assembly code (before and after the change) looked same.
If the assembly code looks the same before and after the change, the error is somehow connected to how you time the function.
VC6 is buggy as hell. It is known to generate incorrect code in several cases, and its optimizer isn't all that advanced either. The compiler is over a decade old, and hasn't even been supported for many years.
So really, the answer is "you're using a buggy compiler. Expect buggy behavior, especially when optimizations are enabled."
I don't suppose upgrading to a modern compiler (or simply testing the code on one) is an option?
Obviously, the generated assembly cannot be the same, or there would be no performance difference.
The only question is where the difference lies. And with a buggy compiler, it may well be some completely unrelated part of the code that suddenly gets compiled differently and breaks. Most likely though, the assembly code generated for this function is not the same, and the differences are just so subtle you didn't notice them.
Declare width and height as const (unsigned) ints. (The unsigned should be used since heights and widths are never negative.)
const int width = test_in.width();
const int height = test_in.height();
This helps the compiler with optimizing. With the values as const, it can place them in the code or in registers, knowing that they won't change. Also, it relieves the compiler of having to guess whether the variables are changing or not.
I suggest printing out the assembly code of the versions with the unused double and without. This will give you an insight into the compiler's thought process.