Not a very good title, but I didn't know what to name it.
Anyway, I am counting the total frames (so I can calculate an average FPS) in my game with a long int. Just in case the game goes on really long, what should I do to make sure my long int doesn't get incremented past its limit? And what would happen if it did go past its limit?
Thanks.
This problem is present for any kind of counters.
For your specific problem, I wouldn't worry.
A long int counts up to 2 billion (and a bit more) in the worst case (on 32-bit computers/consoles). Supposing your game is doing 1000 frames per second (which is a lot!), it would take 20,000,000 seconds to overflow your counter: more than 5,000 hours, more than 231 days.
I'm pretty sure something else would cause your game to stop, if you try to run it for that long!
I would instead consider using an exponentially-weighted moving average. That approach will kill two birds with one stone: it will avoid the problem of accumulating a large number, and it will also adapt to recent behavior so that an accumulated average of 100fps in the year 2010 would not skew the average so that a 2fps rate would seem acceptable for a month or so in 2011 :).
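For illustration, here is a minimal sketch of that idea (my own addition, not code from the answer; the smoothing factor and names are arbitrary choices):

// Exponentially-weighted moving average of the frame time.
// Assumption: dt is the duration of the last frame, in seconds.
float smoothedFrameTime = 1.0f / 60.0f;   // seed with a plausible frame time

// Call once per frame; returns the current smoothed FPS.
float updateAverageFps(float dt)
{
    const float alpha = 0.05f;   // weight given to the newest sample (arbitrary choice)
    smoothedFrameTime = alpha * dt + (1.0f - alpha) * smoothedFrameTime;
    return 1.0f / smoothedFrameTime;   // nothing accumulates without bound
}

Smaller values of alpha make the average smoother but slower to react to changes.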
Average FPS throughout the length of an entire game doesn't seem to be a very useful statistic. Typically you will wish to measure peaks and valleys, such as the highest/lowest fps and the number of frames spent above or below threshold values.
In reality though, I would not worry. Even if you were to just use a 32 bit unsigned int, your game could run at 60fps for 19884 hours before it would overflow. You should be fine.
EDIT:
The best way to detect overflow in this case is to check and see if the integer decreased in value after being incremented. If so, you could just keep another counter around which is the number of times you have overflowed.
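A minimal sketch of that check (my own addition; it assumes the counter is unsigned, so wraparound is well-defined behaviour rather than undefined):

#include <cstdint>

uint32_t frameCount = 0;
uint32_t overflowCount = 0;        // how many times frameCount has wrapped

void countFrame()
{
    uint32_t previous = frameCount;
    ++frameCount;
    if (frameCount < previous)     // the counter decreased, so it wrapped around
        ++overflowCount;
}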
You could actively check for an overflow in your arithmetic operations, e.g. with SafeInt, which can do that for you. Of course, the performance is worse than for a plain i++.
However, it is unlikely that a 32 bit integer will overflow if you always increment by one.
If long int is 32-bits, the maximum value is 2^31-1, so with 1ms updates it will overflow in 24.9 days, not 231 [2^31/1000/60/60/24].
Hopefully not too OT... generally for games this may not really be an issue, but it is for other applications. A common mistake to be careful of is doing something like
extern volatile uint32_t counter;
uint32_t one_second_elapsed = counter + 1000;
while ( counter < one_second_elapsed ) do_something();
If counter + 1000 overflows and wraps around, then do_something() will not be called at all. The robust way to write this is to use unsigned subtraction, which yields the correct elapsed count even across a wraparound:
uint32_t start = counter;
while ( counter - start < 1000 ) do_something();
It's probably better to use an average over a small number of frames. You mentioned that you want to calculate an average, but there is really no reason to keep such a large number of samples around to calculate an average. Just keep a running total of frametimes over some small period of time (where small could be something between 10-50 frames - we typically use 16). You can then use that total to calculate an average frames per second. This method also helps smooth out frame time reports so that the numbers don't jump all over the place. One thing to watch out for though is that if you average over too long a time period, frame rate spikes become more "hidden", meaning it might be tougher to spot frames which cause the framerate to drop if those frames only happen every so often.
Something like this would be totally sufficient I think (non-tested code to follow):
// set up some variables once
const int Max_samples = 16;       // keep at most 16 frametime samples
int FPS_Samples = 0;
int Current_sample = 0;           // index of the oldest sample in the ring buffer
float Total_frametime = 0.0f;     // running sum of the stored frame times
float Frametimes[Max_samples];
for ( int i = 0; i < Max_samples; i++ ) {
    Frametimes[i] = 0.0f;
}
Then when you calculate your frametime, you could do something like this:
// current_frametime is the new frame time (in seconds) for this frame
Total_frametime -= Frametimes[Current_sample];   // drop the oldest sample from the sum
Total_frametime += current_frametime;            // add the newest sample
Frametimes[Current_sample] = current_frametime;
Current_sample = ( Current_sample + 1 ) % Max_samples; // move to next element in array
float Frames_per_second = Max_samples / Total_frametime;
It's a rough cut and could probably use some error checking, but it gives the general idea.
Related
I am having some difficulty understanding why an extremely simple program I've coded in C++ keeps looping. I'll describe the problem at hand first just to check if maybe my solution is incorrect and then I'll write the code:
The shooting efficiency of a soccer player is the percentage of
goals scored over all the shots on goal taken in all his professional career. It is a rational number between 0 and 100,
rounded to one decimal place. For example, a player who
made 7 shots on goal and scored 3 goals has a shooting
efficiency of 42.9.
Given the shooting efficiency of a player, we want to know which
is the minimum amount of shots on goal needed to get that
number (which must be greater than 0).
What I thought of is that if p is the percentage given, then in order to get the minimum number of shots n, the relationship np <= n must be satisfied since np would be the number of goals scored over a total of n.
I've coded the following program:
#include <iostream>
using namespace std;

int main(){
    float efficiency;
    cin >> efficiency;
    int i = 1;
    float tries = i*efficiency;
    while(tries > i){
        i++;
        tries = i*efficiency;
    }
    cout << i << endl;
    return 0;
}
This program never terminates since it keeps looping inside the while. Any suggestions on what might be wrong would be really appreciated.
You recompute tries = i*efficiency after incrementing i. Since efficiency is read as a percentage (e.g. 42.9) and is therefore greater than 1, i*efficiency grows much faster than i: every time i increases by 1, tries increases by efficiency. The condition tries > i therefore never becomes false, and the loop never ends.
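For what it's worth, one way to attack the stated problem is to search directly for the smallest number of shots that can produce the given rounded percentage. The following is only a rough sketch of that idea (my own code, not necessarily the intended solution); it compares efficiencies in tenths of a percent to avoid floating-point equality issues:

#include <cmath>
#include <iostream>
using namespace std;

int main(){
    double p;                                   // efficiency as given, e.g. 42.9
    cin >> p;
    long target = lround(p * 10.0);             // work in tenths of a percent
    for (int n = 1; ; ++n) {                    // candidate number of shots
        for (int g = 0; g <= n; ++g) {          // candidate number of goals
            if (lround(1000.0 * g / n) == target) {
                cout << n << endl;              // smallest n that can yield p
                return 0;
            }
        }
    }
}

For an input of 42.9 this prints 7 (3 goals out of 7 shots).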
How can I assign frequencies to my array from KissFFT? The sampling frequency is 44100 Hz and I need to map it onto my array realPartFFT. I have no idea how it works. I need to plot my spectrum chart to see if it computes correctly. When I plot it now, it still has only 513 numbers on the x axis, without the corresponding frequencies.
int windowCount = 1024;
float floatArray[windowCount], realPartFFT[(windowCount / 2) + 1];
kiss_fftr_cfg cfg = kiss_fftr_alloc(windowCount, 0, NULL, NULL);
kiss_fft_cpx cpx[(windowCount / 2) + 1];
kiss_fftr(cfg, floatArray, cpx);
for (int i = 0; i < (windowCount / 2) + 1; ++i)
    realPartFFT[i] = sqrtf(powf(cpx[i].r, 2.0) + powf(cpx[i].i, 2.0));
First of all: KissFFT doesn't know anything about the source of the data. You pass it an array of real numbers of a given size N, and you get in return an array of complex values of size N/2+1. The input array may be the weather forecast for the next N hours or the number of sunspots of the past N days. KissFFT doesn't care.
The mapping back to the real world needs to be done by you, so you have to interpret the data. As for your code snippet, you are passing in 1024 floats (I assume that floatArray contains the input data). You then get back an array of 513 (= 1024/2 + 1) pairs of floats.
If you are sampling at 44.1 kHz and pass KissFFT chunks of 1024 (your window size) samples, you will get 22.05 kHz as the highest frequency and about 43 Hz (44,100 / 1024) as the lowest, which is also the spacing between frequency bins. You can go even lower by passing bigger chunks to KissFFT, but keep in mind that processing time will grow (roughly as N·log N for the FFT itself)!
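As a rough sketch (my own addition, reusing the names from the snippet above): the frequency belonging to output bin i is simply i * sampleRate / windowCount, so you can build the x axis for your plot like this:

const float sampleRate = 44100.0f;
const int windowCount = 1024;
float frequencies[(windowCount / 2) + 1];
for (int i = 0; i < (windowCount / 2) + 1; ++i)
    frequencies[i] = i * sampleRate / windowCount;   // bin 0 = 0 Hz (DC), bin 512 = 22050 Hz (Nyquist)

frequencies[i] is then the x value to plot against realPartFFT[i].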
Btw: you may consider making your windowCount variable const, to allow the compiler to do some optimizations. Optimizations are very valuable when doing number crunching. In this case the effect may be negligible, but it's a good starting point.
I'm working on a bruteforce algorithm for solving a kind of Puzzle.
That puzzle is a rectangle, and for reasons irrelevant here, the number of possible solutions of a rectangle whose size is width*height is 2^(min(width, height)) instead of 2^(width*height).
Both dimensions can be considered as in range 1..50. (most often below 30 though)
This way, the number of solutions is, at worst, 2^50 (about 1 000 000 000 000 000 or so). I store a solution as an unsigned 64-bit number, a kind of "seed".
I have two working algorithms for brute-force solving.
Assume N is min(width, height) and isCorrect(uint64_t) is a predicate that returns whether the solution with the given seed is correct or not.
The most naive algorithm is roughly this:
vector<uint64_t> solutions;
for (uint64_t i = 0; i < (uint64_t(1) << N); ++i)   // 64-bit literal: a plain 1 << N would overflow for N >= 32
{
    if (isCorrect(i))
        solutions.push_back(i);
}
It works perfectly (assuming the predicate is actually implemented :D) but does not profit from multiple cores, so I'd like to have a multi-threaded approach.
I've come across QtConcurrent, which gives concurrent filter and map functions, that automatically create optimal number of threads to share burden.
So I have a new algorithm that is roughly this:
vector<uint64_t> solutionsToTry;
solutionsToTry.reserve(uint64_t(1) << N);
for (uint64_t i = 0; i < (uint64_t(1) << N); ++i)
    solutionsToTry.push_back(i);
// Now, filtering
QFuture<uint64_t> solutions = QtConcurrent::filtered(solutionsToTry, &isCorrect);
It does work too, and is a bit faster, but when N gets up around 30, there's simply not enough room in my RAM to allocate the vector (with N = 30 and 64-bit numbers, I need about 8.6 GB of RAM; it's okay with swap partitions etc., but since the size doubles every time N increases by 1, it can't go much further).
Is there a simple way to have concurrent filtering without bloating memory ?
If there isn't, I might rather hand-split the loop across 4 threads to get concurrency even without optimal balancing, or write the algorithm in Haskell to get lazy evaluation and filtering of infinite lists :-)
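As a sketch of the hand-split approach mentioned above (my own code; it assumes only the isCorrect predicate from the question), each thread walks its own sub-range of seeds directly, so nothing has to be materialized up front:

#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

bool isCorrect(uint64_t seed);   // the predicate from the question

std::vector<uint64_t> findSolutions(unsigned N, unsigned threadCount = 4)
{
    std::vector<uint64_t> solutions;
    std::mutex solutionsMutex;
    std::vector<std::thread> workers;
    const uint64_t total = uint64_t(1) << N;
    const uint64_t chunk = total / threadCount;

    for (unsigned t = 0; t < threadCount; ++t) {
        const uint64_t begin = t * chunk;
        const uint64_t end = (t == threadCount - 1) ? total : begin + chunk;
        workers.emplace_back([&, begin, end] {
            std::vector<uint64_t> local;         // per-thread results: no locking in the hot loop
            for (uint64_t i = begin; i < end; ++i)
                if (isCorrect(i))
                    local.push_back(i);
            std::lock_guard<std::mutex> lock(solutionsMutex);
            solutions.insert(solutions.end(), local.begin(), local.end());
        });
    }
    for (auto &w : workers)
        w.join();
    return solutions;
}

Each thread only touches its own local vector, so the only synchronization is one merge per thread at the end; memory use is proportional to the number of solutions found, not to 2^N.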
I have an algorithm that does some computations on elements of an array. I'd like to re-use the input-data buffer to write the results into it.
In terms of the data traversal pattern, it looks almost exactly like this (the only other things happening in that for-loop are increments to some pointers and counting variables):
int *inputData = /* input data is here */;
for (int i = 0; i < some_value; ++i)
{
    int result = do_some_computations(*inputData);
    *inputData = result;
    ++inputData;
}
Now the interesting part: inputData contains about six million elements. If I comment out the write to the inputData array, so the algorithm looks basically like this:
int *inputData = /* input data is here */;
for (int i = 0; i < some_value; ++i)
{
    int result = do_some_computations(*inputData);
    // *inputData = result;
    ++inputData;
}
The algorithm, over a series of ~100 measurements, takes on average about 7 milliseconds. However, if I leave the write in, the algorithm takes about 55 milliseconds. Writing "*inputData = do_some_computations(*inputData);" instead of the way it is now makes no difference in performance. Using a separate outputBuffer makes no difference either.
This is bad. The performance of this algorithm is absolutely critical to the requirements of the program. I was very happy with 7ms, however I am very unhappy with 55 ms.
Why does this single write-back cause such a large overhead, and how can I fix it?
Your code is being optimised to (almost) nothing in the non-write-back version. To show this, assume a 5 GHz single-core CPU:
7 ms = 35,000,000 cycles
6 million items: 35/6 ≈ 5.8 cycles per item = not a lot of work being done
For the slow version:
55 ms = 275,000,000 cycles
6 million items: 275/6 ≈ 45.8 cycles per item = far more work per item
If you want to verify this, look at the assembly output from the compiler.
I have the following tight loop that makes up the serial bottleneck of my code. Ideally I would parallelize the function that calls this, but that is not possible.
// n is about 60
for (int k = 0; k < n; k++)
{
    double fone = z[k*n+i+1];
    double fzer = z[k*n+i];
    z[k*n+i+1] = s*fzer + c*fone;
    z[k*n+i]   = c*fzer - s*fone;
}
Are there any optimizations that can be made such as vectorization or some evil inline that can help this code?
I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html
Short answer: Change the memory layout of your matrix from row-major order to column-major order.
Long answer:
It seems you are accessing the i-th and (i+1)-th column of a matrix stored in row-major order - probably a big matrix that doesn't as a whole fit into the CPU cache. Basically, on every loop iteration the CPU has to wait for RAM (on the order of a hundred cycles). After a few iterations, theoretically, the address prediction should kick in and the CPU should speculatively load the data items even before the loop accesses them. That should help with RAM latency. But that still leaves the problem that the code uses the memory bus inefficiently: CPU and memory never exchange single bytes, only cache lines (64 bytes on current processors). Of every 64-byte cache line loaded and stored, your code only touches 16 bytes (a quarter).
Transposing the matrix and accessing it in native major order would increase memory bus utilization four-fold. Since that is probably the bottle-neck of your code, you can expect a speedup of about the same order.
Whether it is worth it, depends on the rest of your algorithm. Other parts may of course suffer because of the changed memory layout.
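To make the suggestion concrete, here is a rough sketch (my own, assuming zT holds the same matrix transposed, i.e. stored column by column) of how the loop would look after the layout change; the k-loop now walks contiguous memory:

double *col     = &zT[i * n];         // column i: zT[i*n + k] for k = 0..n-1 is contiguous
double *colNext = &zT[(i + 1) * n];   // column i+1, also contiguous
for (int k = 0; k < n; k++)
{
    double fone = colNext[k];
    double fzer = col[k];
    colNext[k] = s*fzer + c*fone;
    col[k]     = c*fzer - s*fone;
}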
I take it you are rotating something (or rather, lots of things) by the same angle, s being a sine and c a cosine?
Counting backwards is always good fun and cuts out variable comparison for each iteration, and should work here. Making the counter the index might save a bit of time also (cuts out a bit of arithmetic, as said by others).
for (int k = (n-1)*n + i; k >= 0; k -= n)
{
    double fone = z[k+1];
    double fzer = z[k];
    z[k+1] = s*fzer + c*fone;
    z[k]   = c*fzer - s*fone;
}
Nothing dramatic here, but it looks tidier if nothing else.
As a first move, I'd cache pointers in this loop:
// n is about 60
double *cur_z = &z[0*n + i];
for (int k = 0; k < n; k++)
{
    double fone = *(cur_z + 1);
    double fzer = *cur_z;
    *(cur_z + 1) = s*fzer + c*fone;
    *cur_z       = c*fzer - s*fone;
    cur_z += n;
}
Second, I think it's better to make a templatized version of this function. As a result, you can get a good performance benefit if your matrix holds integer values (since FPU operations are slower).
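A rough sketch of that suggestion (my own code), combining the cached pointer with a template parameter for the element type:

template <typename T>
void rotateColumns(T *z, int n, int i, T s, T c)
{
    T *cur_z = &z[0*n + i];
    for (int k = 0; k < n; k++)
    {
        T fone = *(cur_z + 1);
        T fzer = *cur_z;
        *(cur_z + 1) = s*fzer + c*fone;
        *cur_z       = c*fzer - s*fone;
        cur_z += n;
    }
}

For a matrix of doubles this should compile to essentially the same code as above; instantiated with an integer type it avoids the FPU entirely.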