I appear to have coded a class that travels backwards in time. Allow me to explain:
I have a function, OrthogonalCamera::project(), that sets a matrix to a certain value. I then print out the value of that matrix, as such.
cam.project();
std::cout << "My Projection Matrix: " << std::endl << ProjectionMatrix::getMatrix() << std::endl;
cam.project() pushes a matrix onto ProjectionMatrix's stack (I am using the std::stack container), and ProjectionMatrix::getMatrix() just returns the stack's top element. If I run just this code, I get the following output:
2 0 0 0
0 7.7957 0 0
0 0 -0.001 0
-1 -1 -0.998 1
But if I run the code with these two lines after the std::cout call:
float *foo = new float[16];
Mat4 fooMatrix = foo;
Then I get this output:
2 0 0 0
0 -2 0 0
0 0 -0.001 0
-1 1 -0.998 1
My question is the following: what could I possibly be doing such that code executed after I print a value changes the value being printed?
Some of the functions I'm using:
static void load(Mat4 &set)
{
if(ProjectionMatrix::matrices.size() > 0)
ProjectionMatrix::matrices.pop();
ProjectionMatrix::matrices.push(set);
}
static Mat4 &getMatrix()
{
return ProjectionMatrix::matrices.top();
}
and
void OrthogonalCamera::project()
{
Mat4 orthProjection = { { 2.0f / (this->r - this->l), 0, 0, -1 * ((this->r + this->l) / (this->r - this->l)) },
{ 0, 2.0f / (this->t - this->b), 0, -1 * ((this->t + this->b) / (this->t - this->b)) },
{ 0, 0, -2.0f / (this->farClip - this->nearClip), -1 * ((this->farClip + this->nearClip) / (this->farClip - this->nearClip)) },
{ 0, 0, 0, 1 } }; //this is apparently the projection matrix for an orthographic projection.
orthProjection = orthProjection.transpose();
ProjectionMatrix::load(orthProjection);
}
EDIT: whoever formatted my code, thank you. I'm not really too good with the formatting here, and it looks much nicer now :)
FURTHER EDIT: I have verified that the initialization of fooMatrix is running after I call std::cout.
UPTEENTH EDIT: Here is the function that initializes fooMatrix:
typedef Matrix<float, 4, 4> Mat4;
template<typename T, unsigned int rows, unsigned int cols>
Matrix<T, rows, cols>::Matrix(T *set)
{
this->matrixData = new T*[rows];
for (unsigned int i = 0; i < rows; i++)
{
this->matrixData[i] = new T[cols];
}
unsigned int counter = 0; //because I was too lazy to use set[(i * cols) + j]
for (unsigned int i = 0; i < rows; i++)
{
for (unsigned int j = 0; j < cols; j++)
{
this->matrixData[i][j] = set[counter];
counter++;
}
}
}
g64th EDIT: This isn't just an output problem. I actually have to use the value of the matrix elsewhere, and its value aligns with the described behaviours (whether or not I print it).
TREE 3rd EDIT: Running it through the debugger gave me a yet again different value:
-7.559 0 0 0
0 -2 0 0
0 0 -0.001 0
1 1 -0.998 1
a(g64, g64)th EDIT: the problem does not exist compiling on linux. Just on Windows with MinGW. Could it be a compiler bug? That would make me sad.
FINAL EDIT: It works now. I don't know what I did, but it works. I've made sure I was using an up-to-date build that didn't have the code that ensures causality still functions, and it works. Thank you for helping me figure this out, stackoverflow community. As always you've been helpful and tolerant of my slowness. I'll be hypervigilant for any undefined behaviours or pointer screw-ups that can cause this unpredictability.
You're not writing your program instruction by instruction. You are describing its behavior to a C++ compiler, which then tries to express the same in machine code.
The compiler is allowed to reorder your code, as long as the observable behavior does not change.
In other words, the compiler is almost certainly reordering your code. So why does the observable behavior change?
Because your code exhibits undefined behavior.
Again, you are writing C++ code. C++ is a standard, a specification saying what the "meaning" of your code is. You're working under a contract that "As long as I, the programmer, write code that can be interpreted according to the C++ standard, then you, the compiler, will generate an executable whose behavior matches that of my source code".
If your code does anything not specified in this standard, then you have violated this contract. You have fed the compiler code whose behavior can not be interpreted according to the C++ standard. And then all bets are off. The compiler trusted you. It believed that you would fulfill the contract. It analyzed your code and generated an executable based on the assumption that you would write code that had a well-defined meaning. You did not, so the compiler was working under a false assumption. And then anything it builds on top of that assumption is also invalid.
Garbage in, garbage out. :)
Sadly, there's no easy way to pinpoint the error. You can carefully study every piece of your code, or you can try stepping through the offending code in the debugger. Or break into the debugger at the point where the "wrong" value is seen, and study the disassembly and how you got there.
It's a pain, but that's undefined behavior for you. :)
Analysis tools may help: Valgrind on Linux (a runtime memory checker), the /analyze switch in Visual Studio (which may or may not be available, depending on your version), or the similar analyzer built into Clang.
What is your compiler? If you are compiling with gcc, try turning on thorough and verbose warnings. If you are using Visual Studio, set your warnings to /W4 and treat all warnings as errors.
Once you have done that and can still compile, if the bug still exists, then run the program through Valgrind. It is likely that at some earlier point in your program you run past the end of some array and then write something. That something you write is overwriting what you're trying to print. Therefore, when you put more things on the stack, running past the end of that array lands you in a completely different location in memory, so you are instead overwriting something else. Valgrind was made to catch stuff like that.
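One pattern worth checking for, purely as a guess: the constructor shown in the question allocates with new, but the question does not show a destructor, copy constructor, or assignment operator. If the class frees matrixData in a destructor while relying on compiler-generated copies, the copy pushed onto the std::stack would share pointers with a local that is later destroyed, and a later new could reuse that memory. Below is a minimal sketch of a value-semantic alternative; ValueMatrix and at are hypothetical names, not the asker's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the question's Matrix: the elements live inside
// the object itself, so the compiler-generated copy is a deep copy and there
// is nothing to delete.
template <typename T, std::size_t Rows, std::size_t Cols>
class ValueMatrix {
public:
    // Mirrors the question's Matrix(T *set): reads Rows * Cols values, row by row.
    explicit ValueMatrix(const T* set) {
        assert(set != nullptr);
        for (std::size_t i = 0; i < Rows * Cols; ++i) {
            data_[i] = set[i];
        }
    }

    // Bounds-checked (in debug builds) element access.
    T& at(std::size_t row, std::size_t col) {
        assert(row < Rows && col < Cols);
        return data_[row * Cols + col];
    }

    const T& at(std::size_t row, std::size_t col) const {
        assert(row < Rows && col < Cols);
        return data_[row * Cols + col];
    }

private:
    std::array<T, Rows * Cols> data_{};
};
```

With storage like this, pushing a matrix onto a std::stack copies the sixteen floats, so nothing the caller does to its own matrix afterwards can change the stored one.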
The code below demonstrates strange behaviour when trying to access an out-of-range index in a vector:
#include <iostream>
#include <vector>
int main()
{
std::vector<int> a_vector(10, 0);
for(int i = 0; i < a_vector.size(); i++)
{
std::cout << a_vector[i] << ", ";
}
for(int j = 0; j <= a_vector.size(); j++)
{
std::cout << a_vector[j] << ", ";
}
return 0;
}
The first for loop produces the expected 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, output, however the second loop produces 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1318834149,.
The last number produced by the second loop changes every time the code is run, and is always large, unless the length of the vector is between three and seven (inclusive), in which case the last number is 0. It also persists for larger indexes - for example modifying the stop value of the second loop to j <= a_vector.size() + 2000 keeps producing large numbers until index 1139, at which point it reverts to 0.
Where does this number come from, does it mean anything, and most importantly, why isn't the code throwing an 'out of range' error, which is what I would expect it to do when asked to access the 11th element of a vector 10 elements long?
Did you mean this?
for(int j = 0; j < a_vector.size(); j++)
{
std::cout << a_vector[j] << ", ";
}
Because you're going out of the vector's range, which is undefined behavior and will return a "random" number every time you run it.
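If an error is what you want, std::vector::at is the checked counterpart of operator[]: it compares the index against size() and throws std::out_of_range when the index is too large. A small sketch (throwsOutOfRange is a made-up helper name for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if reading v.at(index) throws std::out_of_range.
// Unlike operator[], at() checks the index against size() on every call.
bool throwsOutOfRange(const std::vector<int>& v, std::size_t index) {
    try {
        (void)v.at(index);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

For a vector of 10 elements, index 9 is read normally and index 10 throws. The check costs a comparison per access, which is why operator[] does not do it.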
C++ is powerful, and with great power comes great responsibility. If you go out of range, it lets you. 99.99999999% of the time that isn't a good thing, but it still lets you do it.
As for why it changes every time, the computer is treating the hunk of memory after the end of the array as another int, then displaying it. The value of that int depends on what bits are left in that memory from when it was used last. It might have been used for a string allocated and discarded earlier in the program, it might be padding that the compiler inserts in memory allocations to optimize access, it might be active memory being used by another object. If you have no idea (like in this case), you have no way to know and shouldn't expect any sort of consistent behavior.
(What is the 0.00000001% when it is a good thing, you may ask? Once I intentionally reached beyond the range of an array in a subclass to access some private data in the parent class that had no accessor to fix a bug. This was in a library I had no control over, so I had to do this or live with the bug. Perhaps not exactly a good thing, but since I was confident of the layout of memory in that specific case it worked.)
ADDENDUM:
The case I mentioned was a pragmatic solution to a bad situation. Code needed to ship and the vendor wasn't going to fix a bug for months. Although I was confident of the behavior (I knew the exact target platform, the code was statically linked so wasn't going to be replaced by the user, etc.) this introduced code fragility and new responsibility for the future. Namely, the next time the library was updated it would almost certainly break.
So I commented the heck out of the code explaining the exact issue, what I was doing, and when it should be removed. I also used lots of CAPITAL LETTERS in my commit message. And I told all of the other programmers, just in case I got hit by a bus before the bug was fixed. In other words, I exercised the great responsibility needed to wield this great power.
I've been doing some of the LeetCode problems, and I notice that the C solutions are a couple of times faster than the exact same thing in C++. For example:
Updated with a couple of simpler examples:
Given a sorted array and a target value, return the index if the target is found. If not, return the index where it would be if it were inserted in order. You may assume no duplicates in the array. (Link to question on LeetCode)
My solution in C, runs in 3 ms:
int searchInsert(int A[], int n, int target) {
int left = 0;
int right = n;
int mid = 0;
while (left<right) {
mid = (left + right) / 2;
if (A[mid]<target) {
left = mid + 1;
}
else if (A[mid]>target) {
right = mid;
}
else {
return mid;
}
}
return left;
}
My other C++ solution, exactly the same but as a member function of the Solution class runs in 13 ms:
class Solution {
public:
int searchInsert(int A[], int n, int target) {
int left = 0;
int right = n;
int mid = 0;
while (left<right) {
mid = (left + right) / 2;
if (A[mid]<target) {
left = mid + 1;
}
else if (A[mid]>target) {
right = mid;
}
else {
return mid;
}
}
return left;
}
};
Even simpler example:
Reverse the digits of an integer. Return 0 if the result will overflow. (Link to question on LeetCode)
The C version runs in 6 ms:
int reverse(int x) {
long rev = x % 10;
x /= 10;
while (x != 0) {
rev *= 10L;
rev += x % 10;
x /= 10;
if (rev>(-1U >> 1) || rev < (1 << 31)) {
return 0;
}
}
return rev;
}
And the C++ version is exactly the same but as a member function of the Solution class, and runs for 19 ms:
class Solution {
public:
int reverse(int x) {
long rev = x % 10;
x /= 10;
while (x != 0) {
rev *= 10L;
rev += x % 10;
x /= 10;
if (rev>(-1U >> 1) || rev < (1 << 31)) {
return 0;
}
}
return rev;
}
};
I see how there would be considerable overhead from using vector of vector as a 2D array in the original example if the LeetCode testing system doesn't compile the code with optimisation enabled. But the simpler examples above shouldn't suffer that issue because the data structures are pretty raw, especially in the second case where all you have is long or integer arithmetics. That's still slower by a factor of three.
I'm starting to think that there might be something odd happening with the way LeetCode do the benchmarking in general because even in the C version of the integer reversing problem you get a huge bump in running time from just replacing the line
if (rev>(-1U >> 1) || rev < (1 << 31)) {
with
if (rev>INT_MAX || rev < INT_MIN) {
Now, I suppose having to #include<limits.h> might have something to do with that but it seems a bit extreme that this simple change bumps the execution time from just 6 ms to 19 ms.
Lately I've been seeing the vector<vector<int>> suggestion a lot for doing 2d arrays in C++, and I've been pointing out to people why this really isn't a good idea. It's a handy trick to know when slapping together temporary code, but there's (almost) never any reason to ever use it for real code. The right thing to do is to use a class that wraps a contiguous block of memory.
So my first reaction might be to point to this as a possible source for the disparity. However you're also using int** in the C version, which is generally a sign of the exact same problem as vector<vector<int>>.
So instead I decided to just compare the two solutions.
http://coliru.stacked-crooked.com/a/fa8441cc5baa0391
6468424
6588511
That's the time taken by the 'C version' vs the 'C++ version' in nanoseconds.
My results don't show anything like the disparity you describe. Then it occurred to me to check a common mistake people make when benchmarking
http://coliru.stacked-crooked.com/a/e57d791876b9252b
18386695
42400612
Notice that the -O3 flag from the first example has become -O0, which disables optimization.
Conclusion: you're probably comparing unoptimized executables.
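To check this on your own machine, a small timing harness is enough. The sketch below is not the code behind the links above; it reuses the binary search from the question, adapted to take a std::vector, and timeSearchNs is a hypothetical helper written for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Binary search from the question, taking a vector for convenience.
int searchInsert(const std::vector<int>& a, int target) {
    int left = 0;
    int right = static_cast<int>(a.size());
    while (left < right) {
        const int mid = left + (right - left) / 2;
        if (a[mid] < target) {
            left = mid + 1;
        } else if (a[mid] > target) {
            right = mid;
        } else {
            return mid;
        }
    }
    return left;
}

// Hypothetical helper: runs searchInsert `repeats` times and returns the
// elapsed time in nanoseconds. The running sum is written through a volatile
// so the optimizer cannot discard the loop.
std::int64_t timeSearchNs(const std::vector<int>& a, int repeats) {
    assert(!a.empty());  // a.size() is used as a modulus below
    volatile std::int64_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        sink = sink + searchInsert(a, i % static_cast<int>(a.size()));
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Build the same file once with -O0 and once with -O2 or -O3 (GCC and Clang spellings) and compare the numbers; a single run is noisy, so repeat it a few times before drawing conclusions.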
C++ supports building rich abstractions that don't require overhead, but eliminating the overhead does require certain code transformations that play havoc with the 'debuggability' of code.
That means debug builds avoid those transformations and therefore C++ debug builds are often slower than debug builds of C style code because C style code just doesn't use much abstraction. Seeing a 130% slowdown such as the above is not at all surprising when timing, for example, machine code that uses function calls in place of simple store instructions.
Some code really needs optimizations in order to have reasonable performance even for debugging, so compilers often offer a mode that applies some optimizations which don't cause too much trouble for debuggers. Clang and gcc use -O1 for this, and you can see that even this level of optimization essentially eliminates the gap in this program between C style code and the more C++ style code:
http://coliru.stacked-crooked.com/a/13967ebcfcfa4073
8389992
8196935
Update:
In those later examples optimization shouldn't make a difference, since the C++ is not using any abstraction beyond what the C version is doing. I'm guessing that the explanation for this is that the examples are being compiled with different compilers or with some other different compiler options. Without knowing how the compilation is done I would say it makes no sense to compare these runtime numbers; LeetCode is clearly not producing an apples to apples comparison.
You are using a vector of vectors in your C++ code snippet. Vectors are sequence containers in C++ that behave like arrays that can change in size. Statically allocated arrays would be better than vector<vector<int>> here. You may use your own Array class with operator [] overloaded as well, but vector has more overhead because it reallocates when you add more elements than its current capacity. In C++ you can also pass arguments by reference to reduce your time further compared with C. C++ should run even faster if written well.
Recently I was working with an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
// store / reset points
temp = hPoint = 0;
for(int channel = 0; channel < audioData.size(); channel++)
{
if (peakmode) /* fir rms of window size */
{
for (int z = 0; z < sizeFactor; z++)
{
temp += audioData[channel][x * sizeFactor + z + offset];
}
hPoint += temp / sizeFactor;
}
else /* highest sample in window */
{
for (int z = 0; z < sizeFactor; z++)
{
temp = audioData[channel][x * sizeFactor + z + offset];
if (std::fabs(temp) > std::fabs(hPoint))
hPoint = temp;
}
}
.. some other code
}
... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audiodata is sanitized beforehand to not include nans/infs/denormals, and only have a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
e: the layout of the floating point model is conforming to ieee, and sizeof(float) == sizeof(int) == 4
Well, you set the floating-point mode to IEEE conforming. Typically, with switches like -ffast-math the compiler can ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
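The reinterpret_cast above is shorthand for "look at the same bits as an integer"; in real C++ the aliasing-safe way to get those bits is std::memcpy. A sketch of the integer-only magnitude comparison, assuming IEEE-754 binary32 floats and no NaNs (floatBits and magnitudeGreater are names invented for this example):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Copies the object representation of a float into an integer.
// std::memcpy is the aliasing-safe way to do this; compilers typically
// reduce it to a single register move.
std::uint32_t floatBits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// True if |a| > |b|, assuming IEEE-754 binary32 and no NaNs:
// clear the sign bit, then compare the remaining bits as unsigned integers.
bool magnitudeGreater(float a, float b) {
    const std::uint32_t mask = 0x7FFFFFFFu;
    return (floatBits(a) & mask) > (floatBits(b) & mask);
}
```

Whether this beats std::fabs depends on the compiler and flags, so measure it in an optimized build before keeping it.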
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified: handling denormals and other exceptional values is tricky, and the stdlib function needs to take those into account. The other reason is still the undefined behavior.
One (non-)solution if you care about performance:
Instead of casting and pointers, you can use a union. Unfortunately, that only works in C, not in C++, though. That won't result in UB, but it's still not portable (although it will likely work with most, if not all, platforms with IEEE-754).
union {
float f;
unsigned u;
} pun = { .f = -3.14 };
pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
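For C++, where reading the inactive union member is not allowed, the usual substitute is to copy the bytes with std::memcpy. The sketch below assumes a 32-bit IEEE-754 float; bitAbs is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Clears the sign bit of a float without violating strict aliasing:
// copy the bytes out, mask, and copy them back.
float bitAbs(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits &= 0x7FFFFFFFu;  // the sign is the top bit in IEEE-754 binary32
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

This has defined behavior, but the same portability caveat applies as for the union: it relies on the float layout.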
You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.
How the standard library function implements it is implementation dependent, so you may find different implementations of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (from <cstdint>).
For my own knowledge, was std::abs previously inlined? Or is the optimisation you observed mainly due to removing the function call?
Some observations on how refactoring may improve performance:
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
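Several of these points can be combined for the "highest sample in window" branch. The sketch below is an illustration under assumptions, not the application's code: peakSample is a hypothetical function, and it assumes each channel is a std::vector<float>.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for the "highest sample in window" branch only.
// Returns the signed sample with the largest magnitude in
// channel[start, start + count), or 0 for an empty window.
float peakSample(const std::vector<float>& channel, std::size_t start, std::size_t count) {
    assert(start + count <= channel.size());
    const float* window = channel.data() + start;  // base index hoisted out of the loop
    float peak = 0.0f;
    float peakMagnitude = 0.0f;  // |peak|, recomputed only when peak changes
    for (std::size_t z = 0; z < count; ++z) {
        const float sample = window[z];
        const float magnitude = std::fabs(sample);
        if (magnitude > peakMagnitude) {
            peak = sample;
            peakMagnitude = magnitude;
        }
    }
    return peak;
}
```

The caller would pass audioData[channel] and x * sizeFactor + offset as start, so the two-dimensional indexing and the index arithmetic happen once per window rather than once per sample.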
In VS2008, using the following to track the absolute value of hpoint and hIsNeg to remember whether it is positive or negative is about twice as fast as using fabs():
int hIsNeg=0 ;
...
//Inside loop, replacing
// if (std::fabs(temp) > std::fabs(hPoint))
// hPoint = temp;
if( temp < 0 )
{
if( -temp > hpoint )
{
hpoint = -temp ;
hIsNeg = 1 ;
}
}
else
{
if( temp > hpoint )
{
hpoint = temp ;
hIsNeg = 0 ;
}
}
...
//After loop
if( hIsNeg )
hpoint = -hpoint ;
What follows is the part of my kernel that does not behave properly, then an explanation of what I've found while debugging.
__global__ void Mangler(float *matrix, int *map)
{
__shared__ signed int localMap[N];
if(0 == threadIdx.x)
{
for(int i=0; i<N; i++)
localMap[i] = -1;
}
__syncthreads();
int fn = ...; // a lot of code goes into this number, skipped for clarity
int rnumber = threadIdx.x;
int X = atomicCAS(&localMap[fn], -1, rnumber); // Spot of bother 1
if(X == -1) // Spot of bother 2
{
// some code
}
else
{
// other code
}
}
I've found in the documentation that atomicCAS(*address, compare, value) basically returns (and saves to the given address) the result of (old == compare ? value : old), where old is the value at the address before executing the function.
Going with this, I believe that executing int X = atomicCAS(&localMap[fn], -1, rnumber); should have two possible outcomes (according to NVidia Cuda C Programming Guide):
if localMap[fn] == -1 then X should have a value of rnumber and localMap[fn] should have a value of rnumber. This does not happen.
if localMap[fn] != -1 then X should be set to the value of localMap[fn] and said value should be left intact.
What happens instead, as debugging with NSight has shown me, is that X is being assigned -1, while localMap[fn] is being assigned the value of rnumber. I do not understand that, but as you can see in my code, I've changed the if to catch this situation.
Which brings me to spot of bother number 2: though NSight shows the value of X as -1, the if {} is being completely skipped (no breakpoints within hit whatsoever) and execution jumps straight to else.
My questions:
Do I misunderstand atomicCAS completely? yes, I did
What could cause an if which should evaluate as true to jump straight into else in device code?
I'm using NVidia CUDA 5.5, Visual Studio 2012 x64 on Windows 8, NVidia Nsight Monitor Visual Studio Edition 3.1. The GPU for the machine is NVidia GeForce GTX 550 Ti.
I've tried changing the syntax to if(X!=-1); the true branch of the if is still not being executed.
From the doc, atomicCAS returns the old value, that means, that in your list, your two outcomes are wrong. Your X will always be set to the old value of localMap[fn], regardless which value it had. What is set according to the comparison with the -1, is the new value of localMap[fn]. When it is -1, it is set to rnumber, else it is left intact.
So the behaviour you see with the values of X, rnumber and localMap are as expected.
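The same semantics can be reproduced on the host with std::atomic. This is only an analogy in standard C++, not the CUDA intrinsic, and casReturningOld is a name made up for the example:

```cpp
#include <atomic>
#include <cassert>

// Host-side analogy only: this is std::atomic, not the CUDA intrinsic.
// Like atomicCAS(address, compare, value), it always returns the value that
// was stored before the call, and writes `value` only if that old value
// equalled `compare`.
int casReturningOld(std::atomic<int>& slot, int compare, int value) {
    int expected = compare;
    // On failure compare_exchange_strong stores the current value in
    // `expected`; on success `expected` already equals the old value.
    slot.compare_exchange_strong(expected, value);
    return expected;
}
```

Starting from a slot holding -1, the first call with compare -1 returns -1 and stores the new value, so the caller that sees -1 is the one that won the slot. Any later call returns the winner's value and leaves the slot unchanged.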
I cannot help with your second problem, as I don't use NSight, and don't know how it works - according to your code, your true branch should be evaluated (but be careful: your false branch also - as it is multi threaded some threads can have the condition evaluated to true, and some to false - my guess/assumption would be that you must tell somehow your debugger which thread/warp/block you want to debug and you looked at the false).
I'm trying to optimize some C++ code for speed, and not concerned about memory usage. If I have some function that, for example, tells me if a character is a letter:
bool letterQ ( char letter ) {
return (letter>=65 && letter<=90) ||
(letter>=97 && letter<=122);
}
Would it be faster to just create a lookup table, i.e.
int lookupTable[128];
for (int i = 0 ; i < 128 ; i++) {
lookupTable[i] = // some int value that tells what it is
}
and then modifying the letterQ function above to be
bool letterQ ( char letter ) {
return lookupTable[letter]==LETTER_VALUE;
}
I'm trying to optimize for speed in this simple region, because these functions are called a lot, so even a small increase in speed would accumulate into long-term gain.
EDIT:
I did some testing, and it seems like a lookup array performs significantly better than a lookup function if the lookup array is cached. I tested this by trying
for (int i = 0 ; i < size ; i++) {
if ( lookupfunction( text[i] ) )
// do something
}
against
bool lookuptable[128];
for (int i = 0 ; i < 128 ; i++) {
lookuptable[i] = lookupfunction( (char)i );
}
for (int i = 0 ; i < size ; i++) {
if (lookuptable[(int)text[i]])
// do something
}
Turns out that the second one is considerably faster - about a 3:1 speedup.
About the only possible answer is "maybe" -- and you can find out by running a profiler or something else to time the code. At one time, it would have been pretty easy to give "yes" as the answer with little or no qualification. Now, given how much faster CPUs have gotten than memory, it's a lot less certain -- you can do a lot of computation in the time it takes to fill one cache line from main memory.
Edit: I should add that in either C or C++, it's probably best to at least start with the functions (or macros) built into the standard library. These are often fairly carefully optimized for the target and (more importantly for most people) support things like switching locales, so you won't be stuck trying to explain to your German users that 'ß' isn't really a letter (and I doubt many will be much amused by "but that's really two letters, not one!").
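For reference, the standard-library version is a thin wrapper around std::isalpha. A minimal sketch (isLetter is just an illustrative name):

```cpp
#include <cassert>
#include <cctype>

// Wraps std::isalpha. The cast matters: passing a negative char value
// (other than EOF) to the <cctype> functions is undefined behavior.
bool isLetter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

In the default "C" locale this accepts exactly A-Z and a-z; other locales may classify more characters as letters.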
First, I assume you have profiled the code and verified that this particular function is consuming a noticeable amount of CPU time over the runtime of the program?
I wouldn't create a vector as you're dealing with a very fixed data size. In fact, you could just create a regular C++ array and initialize it at program startup. With a really modern compiler that supports array initializers you can even do something like this:
bool lookUpTable[128] = { false, false, false, ..., true, true, ... };
Admittedly I'd probably write a small script that generates the code rather than doing it all manually.
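An alternative to generating the initializer with a script is to fill the table once at first use. The sketch below assumes ASCII and reuses the range test from the question; makeLetterTable is a hypothetical helper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fills a 128-entry table once, using the range test from the question.
std::array<bool, 128> makeLetterTable() {
    std::array<bool, 128> table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = (i >= 65 && i <= 90) || (i >= 97 && i <= 122);
    }
    return table;
}

// Hypothetical lookup-based replacement for letterQ.
bool letterQ(char letter) {
    // Built on the first call; later calls only index into it.
    static const std::array<bool, 128> table = makeLetterTable();
    // Go through unsigned char so a negative char cannot index before the table.
    const unsigned char index = static_cast<unsigned char>(letter);
    return index < table.size() && table[index];
}
```

The bounds check keeps characters above 127 from reading past the table; whether the lookup is actually faster than the comparisons is still something to measure.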
For a simple calculation like this, the memory access (caused by a lookup table) is going to be more expensive than just doing the calculation every time.