no performance improvement with std::thread - c++

I am working on a audio "real time" application and I would like to imrpove the performance of it. I actually already posted a topic but this is about std::thread specificly.
The audio processing ist mostly done by two seperate objects ( leftProcessor track and rightProcessor). Since these objects don't rely on each other, using two threads to process them should realy improve the performance with multi core CPUs. However, I currently get the opposite result.
Before I activated the compiler performance optimization (O2), using two threads got me about 50% more performance, but after I switched the optimization on, I got ~10-20% less performance, while the performance got drasticly better for both versions.
I measure the performance by taking the time at two points within the function and bruteforce printing the result to the screen so there could be some problems with this. ;)
My guess would be that creating a std::thread would take more time than I actually gain from running the processing on the second thread.
In this case, is it possible to improve the performance by using the same "thread" for every function call and just passing the thread the new arguments? I don't realy know if this is possible.
The function currently takes about 0.0005ms to 0.002ms to process.
Here is the code:
void AudioController::processAudio(int frameCount, float *output) {
#ifdef AUDIOCONTROLLER_MULTITHREADING
// CALCULATE RIGHT TRACK
std::thread rightProcessorThread;
if(rightLoaded) {
rightProcessorThread = std::thread(&AudioProcessor::tick, //function
rightProcessor, //object
rightFrameBuffer, //arg1
frameCount); //arg2
} else {
for(int i = 0; i < frameCount; i++) {
rightFrameBuffer[i].leftSample = 0.0f;
rightFrameBuffer[i].rightSample = 0.0f;
}
}
#else
// CALCULATE RIGHT TRACK
if(rightLoaded) {
rightProcessor->tick(rightFrameBuffer, frameCount);
} else {
for(int i = 0; i < frameCount; i++) {
rightFrameBuffer[i].leftSample = 0.0f;
rightFrameBuffer[i].rightSample = 0.0f;
}
}
#endif
// CALCULATE LEFT TRACK
Frame * leftFrameBuffer = (Frame*) output;
if(leftLoaded) {
leftProcessor->tick(leftFrameBuffer, frameCount);
} else {
for(int i = 0; i < frameCount; i++) {
leftFrameBuffer[i].leftSample = 0.0f;
leftFrameBuffer[i].rightSample = 0.0f;
}
}
#ifdef AUDIOCONTROLLER_MULTITHREADING
// CALCULATE RIGHT TRACK
if(rightLoaded) {
rightProcessorThread.join();
}
#endif
// MIX
for(int i = 0; i < frameCount; i++ ) {
leftFrameBuffer[i] = volume * (leftRightMix * leftFrameBuffer[i] + (1.0 - leftRightMix) * rightFrameBuffer[i]);
}
}

Related

Designing a multithreaded application that scales well

The code below is a demonstration of what I'm trying to do and it has the same problem as my original code (which is not included here). I have spectrogram code and I'm trying to improve its performance by using multiple threads (my computer has 4 cores). The spectrogram code basically computes an FFT over many overlapping frames (these frames correspond to sound samples at a particular time).
As an example let's say that we have 1000 frames which overlap by 50%.
If we're using 4 threads, then each thread should handle 250 frames. Overlapping frames just means that if our frames are 1024 samples in length, the first
frame has the range 0-1023, the second frame 512-1535, the third 1024-2047 etc (an overlap of 512 samples ).
The code creating and using the threads
void __fastcall TForm1::Button1Click(TObject *Sender)
{
numThreads = 4;
fftLen = 1024;
numWindows = 10000;
int startTime = GetTickCount();
numOverlappingWindows = numWindows*2;
overlap = fftLen/2;
const unsigned numElem = fftLen*numWindows+overlap;
rx = new float[numElem];
for(int i=0; i<numElem; i++) {
rx[i] = rand();
}
useThreads = true;
vWThread.reserve(numOverlappingWindows);
if(useThreads){
for(int i=0;i<numThreads;i++){
TWorkerThread *pWorkerThread = new TWorkerThread(true);
pWorkerThread->SetWorkerMethodCallback(&CalculateWindowFFTs);//this is called in TWorkerThread::Execute
vWThread.push_back(pWorkerThread);
}
pLock = new TCriticalSection();
for(int i=0;i<numThreads;i++){ //start the threads
vWThread.at(i)->Resume();
}
while(TWorkerThread::GetNumThreads()>0);
}else CalculateWindowFFTs();
int endTime = GetTickCount();
Label1->Caption = IntToStr(endTime-startTime);
}
void TForm1::CalculateWindowFFTs(){
unsigned startWnd = 0, endWnd = numOverlappingWindows, threadId;
if(useThreads){
threadId = TWorkerThread::GetCurrentThreadId();
unsigned wndPerThread = numOverlappingWindows/numThreads;
startWnd = (threadId-1)*wndPerThread;
endWnd = threadId*wndPerThread;
if(numThreads==threadId){
endWnd = numOverlappingWindows;
}
}
float *pReal, *pImg;
for(unsigned i=startWnd; i<endWnd; i++){
pReal = new float[fftLen];
pImg = new float[fftLen];
memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
memset(pImg, '0', fftLen);
FFT(pReal, pImg, fftLen); //perform an in place FFT
pLock->Acquire();
vWndFFT.push_back(pReal);
vWndFFT.push_back(pImg);
pLock->Release();
}
}
void TForm1::FFT(float *rx, float *ix, int fftSize)
{
int i, j, k, m;
float rxt, ixt;
m = log(fftSize)/log(2);
int fftSizeHalf = fftSize/2;
j = k = fftSizeHalf;
for (i = 1; i < (fftSize-1); i++){
if (i < j) {
rxt = rx[j];
ixt = ix[j];
rx[j] = rx[i];
ix[j] = ix[i];
rx[i] = rxt;
ix[i] = ixt;
}
k = fftSizeHalf;
while (k <= j){
j = j - k;
k = k/2;
}
j = j + k;
} //end for
int le, le2, l, ip;
float sr, si, ur, ui;
for (k = 1; k <= m; k++) {
le = pow(2, k);
le2 = le/2;
ur = 1;
ui = 0;
sr = cos(PI/le2);
si = -sin(PI/le2);
for (j = 1; j <= le2; j++) {
l = j - 1;
for (i = l; i < fftSize; i += le) {
ip = i + le2;
rxt = rx[ip] * ur - ix[ip] * ui;
ixt = rx[ip] * ui + ix[ip] * ur;
rx[ip] = rx[i] - rxt;
ix[ip] = ix[i] - ixt;
rx[i] = rx[i] + rxt;
ix[i] = ix[i] + ixt;
} //end for
rxt = ur;
ur = rxt * sr - ui * si;
ui = rxt * si + ui * sr;
}
}
}
While it's easy to divide this process over multiple threads, the performance is only marginally improved compared to the single-threaded version (<10%).
Interestingly if I increase the number of threads to, say, 100, I do get an increase in speed of about 25%, which is surprising because
I'd expect that thread context-switching overhead be a factor in this case.
At first I thought that the main reason for the poor performance is a lock on writing to a vector object so I experimented with an array of vectors (a
vector per thread), thus eliminiting the need for the locks but the performance remained pretty much the same.
pVfft = new vector<float*>[numThreads];//create an array of vectors
//and then in CalculateWindowFFTs, do something like
vector<float*> &vThr = pVfft[threadId-1];
for(unsigned i=startWnd; i<endWnd; i++){
pReal = new float[fftLen];
pImg = new float[fftLen];
memcpy(pReal, &rx[i*overlap], fftLen*sizeof(float));
memset(pImg, '0', fftLen);
FFT(pReal, pImg, fftLen); //perform an in place FFT
vThr.push_back(pReal);
}
I think I'm running into caching problems here though I'm not certain how to go about changing my design in order to have a solution that scales well.
I can also provide the code for TWorkerThread if you think that's important.
Any help is much appreciated.
Thanks
UPDATE:
As suggested by 1201ProgramAlarm I removed that while loop and got about 15-20% speed improvement on my system. Now my main thread is not actively waiting for the threads to finish but rather I have TWorkerThread execute code on the main thread via TThread::Synchronize after all the worker threads have finished (i.e.when numThreads has reached 0).
While this is looking better now, it's still far from being optimal.
The locks to write to vWndFFT will hurt, as will the repeated (leaking) calls to new assigned to pReal and pImg (these should be outside the for loop).
But the real performance killer is probably your loop waiting for the threads to finish: while(TWorkerThread::GetNumThreads()>0);. This will consume one available thread in a very unfriendly way.
One quick fix (not recommended) would be to add a sleep(1) (or 2, 5, or 10) so the loop is not continuous.
A better solution would be to have the main thread be one of your calculation threads, and have a way for that thread (once it is done with all processing) to simply wait for the other thread to finish without consuming a core, using something like WaitForMultipleObjects that is available on Windows.
One simple way to try out your threaded code is simply to run threaded, but only use one thread. Performance should be about the same as the non-threaded version, and the results should match.

How to use cv::parallel_for_ for execution time reduction

I created an image processing algorithm using OpenCV and currently I'm trying to improve the time efficiency of my own, simple function which is similar to LUT, but with interpolation between values (double calibRI::corr(double)).
I optimized the pixel loop according to the OpenCV docs.
Non parallel function (calib(cv::Mat) -an object of calibRI functor class) takes about 0.15s. I decided to use cv::parallel_for_ to make it shorter.
First I implemented it as image tiling -according to >> this document. The time was reduced to 0.12s (4 threads).
virtual void operator()(const cv::Range& range) const
{
for(int i = range.start; i < range.end; i++)
{
// divide image in 'thr' number of parts and process simultaneously
cv::Rect roi(0, (img.rows/thr)*i, img.cols, img.rows/thr);
cv::Mat in = img(roi);
cv::Mat out = retVal(roi);
out = calib(in); //loops over all pixels and does out[u,v]=calibRI::corr(in[u,v])
}
I though that running my function in parallel for subimages/tiles/ROIs is not yet optimal, so I implemented it as below:
template <typename T>
class ParallelPixelLoop : public cv::ParallelLoopBody
{
typedef boost::function<T(T)> pixelProcessingFuntionPtr;
private:
cv::Mat& image; //source and result image (to be overwritten)
bool cont; //if the image is continuous
size_t rows;
size_t cols;
size_t threads;
std::vector<cv::Range> ranges;
pixelProcessingFuntionPtr pixelProcessingFunction; //pixel modif. function
public:
ParallelPixelLoop(cv::Mat& img, pixelProcessingFuntionPtr fun, size_t thr = 4)
: image(img), cont(image.isContinuous()), rows(img.rows), cols(img.cols), pixelProcessingFunction(fun), threads(thr)
{
int groupSize = 1;
if (cont) {
cols *= rows;
rows = 1;
groupSize = ceil( cols / threads );
}
else {
groupSize = ceil( rows / threads );
}
int t = 0;
for(t=0; t<threads-1; ++t) {
ranges.push_back( cv::Range( t*groupSize, (t+1)*groupSize ) );
}
ranges.push_back( cv::Range( t*groupSize, rows<=1?cols:rows ) ); //last range must be to the end of image (ceil used before)
}
virtual void operator()(const cv::Range& range) const
{
for(int r = range.start; r < range.end; r++)
{
T* Ip = nullptr;
cv::Range ran = ranges.at(r);
if(cont) {
Ip = image.ptr<T>(0);
for (int j = ran.start; j < ran.end; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
else {
for(int i = ran.start; i < ran.end; ++i)
{
Ip = image.ptr<T>(i);
for (int j = 0; j < cols; ++j)
{
Ip[j] = pixelProcessingFunction(Ip[j]);
}
}
}
}
}
};
Then I run it on 1280x1024 64FC1 image, on i5 processor, Win8, and get the time in range of 0.4s using the code below:
double t = cv::getTickCount();
ParallelPixelLoop<double> loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
I have no idea why is my implementation so much slower than iterating all the pixels in subimages... Is there a bug in my code or the OpenCV ROIs are optimized in some special way?
I do not think there is a time measurement error issue, as described here. I'm using OpenCV time functions.
Is there any other way to reduce the time of this function?
Thanks in advance!
Generally it's really hard to say why using cv::parallel_for failed to speed up whole process. One possibility is that the problem is not related to processing/multithreading, but to time measurement. About 2 months ago i tried to optimize this algorithm and i noticed strange thing - first time i use it, it takes x ms, but if use use it second, third, ... time (of course without restarting application) it takes about x/2 (or even x/3) ms. I'm not sure what causes this behaviour - most likely (in my opinion) it's causes by branch prediction - when code is executed first time branch predictor "learns" which paths are usually taken, so next time it can predict which branch to take(and usually the guess will be correct). You can read more about it here - it's really good question and it can open your eyes for some quite important thing.
So, in your situation i would try few things:
measure it many times - 100 or 1000 should be enough (if it takes 0.12-0.4s it won't take much time) and see whether the last version of you code still is the slowest one. So just replace your code with this:
double t = cv::getTickCount();
for (unsigned int i=0; i<1000; i++) {
ParallelPixelLoop loop(V,boost::bind(&calibRI::corr,this,_1),4);
cv::parallel_for_(cv::Range(0,4),loop);
}
std::cout << "Exec time: " << (cv::getTickCount()-t)/cv::getTickFrequency() << "s\n";
test it on bigger image. Maybe in your situation you just "don't need" 4 cores, but on bigger image 4 cores will make positive difference.
Use profiler (for example Very Sleepy) to see what part of your code is critical

Parallelism vs Threading - Performance

I have been reading on the subject, but I haven't been able to find a concrete answer to my question. I am interested in using parallelism/multithreading to improve the performance of my game, but I have heard some contradicting facts. For example, that multithreading may not produce any improvement on the execution speed for a game. I
I have thought of two ways to do this:
putting the rendering component into a thread. There are some things
I would need to change, but I have a good idea of what needs to be
done.
using openMP to parallelize the rendering function. I have already code to do so, thus this might be easier option.
This being an Uni assessment, the target hardware are my Uni's computers, which are multi-core (4 cores), and therefore I am hoping to achieve some additional efficiency using either one of those techniques.
My question, is therefore, the following: Which one should I prefer? Which normally produces the best results?
EDIT: The main function I mean to parallelize/multithread away:
void Visualization::ClipTransBlit ( int id, Vector2i spritePosition, FrameData frame, View *view )
{
const Rectangle viewRect = view->GetRect ();
BYTE *bufferPtr = view->GetBuffer ();
Texture *txt = txtMan_.GetTexture ( id );
Rectangle clippingRect = Rectangle ( 0, frame.frameSize.x, 0, frame.frameSize.y );
clippingRect.Translate ( spritePosition );
clippingRect.ClipTo ( viewRect );
Vector2i negPos ( -spritePosition.x, -spritePosition.y );
clippingRect.Translate ( negPos );
if ( spritePosition.x < viewRect.left_ ) { spritePosition.x = viewRect.left_; }
if ( spritePosition.y < viewRect.top_ ) { spritePosition.y = viewRect.top_; }
if (clippingRect.GetArea() == 0) { return; }
//clippingRect.Translate ( frameData );
BYTE *destPtr = bufferPtr + ((abs(spritePosition.x) - abs(viewRect.left_)) + (abs(spritePosition.y) - abs(viewRect.top_)) * viewRect.Width()) * 4; // corner position of the sprite (top left corner)
BYTE *tempSPtr = txt->GetData() + (clippingRect.left_ + clippingRect.top_ * txt->GetSize().x) * 4;
int w = clippingRect.Width();
int h = clippingRect.Height();
int endOfLine = (viewRect.Width() - w) * 4;
int endOfSourceLine = (txt->GetSize().x - w) * 4;
for (int i = 0; i < h; i++)
{
for (int j = 0; j < w; j++)
{
if (tempSPtr[3] != 0)
{
memcpy(destPtr, tempSPtr, 4);
}
destPtr += 4;
tempSPtr += 4;
}
destPtr += endOfLine;
tempSPtr += endOfSourceLine;
}
}
instead of calling memcpy for each pixel consider just setting the value there. the overhead in calling a function that many times could be dominating the overall execution time for this loop. E.g:
for (int i = 0; i < h; i++)
{
for (int j = 0; j < w; j++)
{
if (tempSPtr[3] != 0)
{
*((DWORD*)destPtr) = *((DWORD*)tempSPtr);
}
destPtr += 4;
tempSPtr += 4;
}
destPtr += endOfLine;
tempSPtr += endOfSourceLine;
}
you could also avoid the conditional by employing one of the tricks mentioned here avoiding conditionals - in such a tight loop conditionals can be very expensive.
edit -
as to whether it's better to run several instances of ClipTransBlit concurrently or to parallelize ClipTransBlit internally, I would say generally speaking it's better to implement parallelization at as high a level as possible to reduce the overhead you incur by setting it up (creating threads, synchronizing them, etc.)
In your case though because it looks like you're drawing sprites, if they were to overlap then without additional synchronization your high level threading might lead to nasty visual artifacts and even a race condition on checking the alpha bit. In that case the low level parallelism might be a better choice.
Theoretically, they should produce the same effect. In practice, it might be quite different.
If you print out assembly code of an OpenMP program, OpenMP simply calls some function in the scope like #pragma omp parallel .... It is similar to folk.
OpenMP is parallel computing oriented, on the other hand, multi-thread is more general.
For example, if you want to write a GUI program, multithreading is necessary(Some frameworks may hide it. It still needs multiple threads). However you never want to implement it with OpenMP.

Particle Deposition Terrain Generation

I'm using Particle Deposition to try and create some volcano-like mountains procedurally but all I'm getting out of it is pyramid-like structures. Is anyone familiar with the algorithm that might be able to shed some light on what I might be doing wrong. I'm dropping each particle in the same place at the moment. If I don't they spread out in a very thin layer rather than any sort of mountain.
void TerrainClass::ParticalDeposition(int loops){
float height = 0.0;
//for(int k= 0; k <10; k++){
int dropX = mCurrentX = rand()%(m_terrainWidth-80) + 40;
int dropY = mCurrentZ = rand()%(m_terrainHeight-80) + 40;
int radius = 15;
float angle = 0;
int tempthing = 0;
loops = 360;
for(int i = 0; i < loops; i++){
mCurrentX = dropX + radius * cos(angle);
mCurrentZ = dropY + radius * sin(angle);
/*f(i%loops/5 == 0){
dropX -= radius * cos(angle);
dropY += radius * sin(angle);
angle+= 0.005;
mCurrentX = dropX;
mCurrentZ = dropY;
}*/
angle += 360/loops;
//dropX += rand()%5;
//dropY += rand()%5;
//for(int j = 0; j < loops; j++){
float newY = 0;
newY = (1 - (2.0f/loops)*i);
if(newY < 0.0f){
newY = 0.0f;
}
DepositParticle(newY);
//}
}
//}
}
void TerrainClass::DepositParticle(float heightIncrease){
bool posFound = false;
m_lowerList.clear();
while(posFound == false){
int offset = 10;
int jitter;
if(Stable(0.5f)){
m_heightMap[(m_terrainHeight*mCurrentZ)+mCurrentX].y += heightIncrease;
posFound = true;
}else{
if(!m_lowerList.empty()){
int element = rand()%m_lowerList.size();
int lowerIndex = m_lowerList.at(element);
MoveTo(lowerIndex);
}
}
}
}
bool TerrainClass::Stable(float deltaHeight){
int index[9];
float height[9];
index[0] = ((m_terrainHeight*mCurrentZ)+mCurrentX); //the current index
index[1] = ValidIndex((m_terrainHeight*mCurrentZ)+mCurrentX+1) ? (m_terrainHeight*mCurrentZ)+mCurrentX+1 : -1; // if the index to the right is valid index set index[] to index else set index[] to -1
index[2] = ValidIndex((m_terrainHeight*mCurrentZ)+mCurrentX-1) ? (m_terrainHeight*mCurrentZ)+mCurrentX-1 : -1; //to the left
index[3] = ValidIndex((m_terrainHeight*(mCurrentZ+1))+mCurrentX) ? (m_terrainHeight*(mCurrentZ+1))+mCurrentX : -1; // above
index[4] = ValidIndex((m_terrainHeight*(mCurrentZ-1))+mCurrentX) ? (m_terrainHeight*(mCurrentZ-1))+mCurrentX : -1; // bellow
index[5] = ValidIndex((m_terrainHeight*(mCurrentZ+1))+mCurrentX+1) ? (m_terrainHeight*(mCurrentZ+1))+mCurrentX+1: -1; // above to the right
index[6] = ValidIndex((m_terrainHeight*(mCurrentZ-1))+mCurrentX+1) ? (m_terrainHeight*(mCurrentZ-1))+mCurrentX+1: -1; // below to the right
index[7] = ValidIndex((m_terrainHeight*(mCurrentZ+1))+mCurrentX-1) ? (m_terrainHeight*(mCurrentZ+1))+mCurrentX-1: -1; // above to the left
index[8] = ValidIndex((m_terrainHeight*(mCurrentZ-1))+mCurrentX-1) ? (m_terrainHeight*(mCurrentZ-1))+mCurrentX-1: -1; // above to the right
for ( int i = 0; i < 9; i++){
height[i] = (index[i] != -1) ? m_heightMap[index[i]].y : -1;
}
m_lowerList.clear();
for(int i = 1; i < 9; i++){
if(height[i] != -1){
if(height[i] < height[0] - deltaHeight){
m_lowerList.push_back(index[i]);
}
}
}
return m_lowerList.empty();
}
bool TerrainClass::ValidIndex(int index){
return (index > 0 && index < m_terrainWidth*m_terrainHeight) ? true : false;
}
void TerrainClass::MoveTo(int index){
mCurrentX = index%m_terrainWidth;
mCurrentZ = index/m_terrainHeight;
}
Thats all the code thats used.
You should have a look at these two papers:
Fast Hydraulic Erosion Simulation and Visualization on GPU
Fast Hydraulic and Thermal Erosion on the GPU (read the first one first, the second one expands on it)
Don't get scared by the "on GPU", the algorithms work just fine on CPU (albeit slower). The algorithms don't do particle sedimentation per se (but you don't either ;) ) - they instead aggregate the particles into several layers of vector fields.
One important thing about this algorithm is that it erodes already existing heightmaps - for example generated with perlin noise. It fails miserably if the initial height field is completely flat (or even if it has insufficient height variation).
I had implemented this algorithm myself and had mostly success with it (still have more work to do, the algorithms are very hard to balance to give universally great results) - see the image below.
Note that perlin noise with the Thermal weathering component from the second paper may be well enough for you (and might save you a lot of trouble).
You can also find C++ CPU-based implementation of this algorithm in my project (specifically this file, mind the GPL license!) and its simplified description on pages 24-29 of my thesis.
Your particles will need to have some surface friction and/or stickiness (or similar) in their physics model if you want them to not spread out into a single-layer. This is performed in the collision detection and collision response parts of your code when updating your particle simulation.
A simple approach is to make the particles stick (attract each-other). Particles need to have a size too so that they don't simply converge to perfectly overlapping. If you want to make them attract each other, then you need to test the distance between particles.
You might benefit from looking through some of the DirectX SDK examples that use particles, and in particular (pun arf!) there is a great demo (by Simon Green?) in the NVidia GPU Computing SDK that implements sticky particles in CUDA. It includes a ReadMe document describing what they've done. You can see how the particles interact and ignore all the CUDA/GPU stuff if you aren't going for massive particle counts.
Also note that as soon as you use inter-particle forces, then you are checking approximately 0.5*n^2 combinations (pairs) of particles...so you may need to use a simple spatial partitioning scheme or similar to limit forces to nearby groups of particles only.
Good luck!

How to make timer for a game loop?

I want to time fps count, and set it's limit to 60 and however i've been looking throught some code via google, I completly don't get it.
If you want 60 FPS, you need to figure out how much time you have on each frame. In this case, 16.67 milliseconds. So you want a loop that completes every 16.67 milliseconds.
Usually it goes (simply put): Get input, do physics stuff, render, pause until 16.67ms has passed.
Its usually done by saving the time at the top of the loop and then calculating the difference at the end and sleeping or looping doing nothing for that duration.
This article describes a few different ways of doing game loops, including the one you want, although I'd use one of the more advanced alternatives in this article.
delta time is the final time, minus the original time.
dt= t-t0
This delta time, though, is simply the amount of time that passes while the velocity is changing.
The derivative of a function represents an infinitesimal change
in the function with respect to one of its variables.
The derivative of a function with respect to the variable is defined as
f(x + h) - f(x)
f'(x) = lim -----------------
h->0 h
http://mathworld.wolfram.com/Derivative.html
#include<time.h>
#include<stdlib.h>
#include<stdio.h>
#include<windows.h>
#pragma comment(lib,"winmm.lib")
void gotoxy(int x, int y);
void StepSimulation(float dt);
int main(){
int NewTime = 0;
int OldTime = 0;
float dt = 0;
float TotalTime = 0;
int FrameCounter = 0;
int RENDER_FRAME_COUNT = 60;
while(true){
NewTime = timeGetTime();
dt = (float) (NewTime - OldTime)/1000; //delta time
OldTime = NewTime;
if (dt > (0.016f)) dt = (0.016f); //delta time
if (dt < 0.001f) dt = 0.001f;
TotalTime += dt;
if(TotalTime > 1.1f){
TotalTime=0;
StepSimulation(dt);
}
if(FrameCounter >= RENDER_FRAME_COUNT){
// draw stuff
//Render();
gotoxy(1,2);
printf(" \n");
printf("OldTime = %d \n",OldTime);
printf("NewTime = %d \n",NewTime);
printf("dt = %f \n",dt);
printf("TotalTime = %f \n",TotalTime);
printf("FrameCounter = %d fps\n",FrameCounter);
printf(" \n");
FrameCounter = 0;
}
else{
gotoxy(22,7);
printf("%d ",FrameCounter);
FrameCounter++;
}
}
return 0;
}
void gotoxy(int x, int y){
COORD coord;
coord.X = x; coord.Y = y;
SetConsoleCursorPosition(GetStdHandle(STD_OUTPUT_HANDLE), coord);
return;
}
void StepSimulation(float dt){
// calculate stuff
//vVelocity += Ae * dt;
}
You shouldn't try to limit the fps. The only reason to do so is if you are not using delta time and you expect each frame to be the same length. Even the simplest game cannot guarantee that.
You can however take your delta time and slice it into fixed sizes and then hold onto the remainder.
Here's some code I wrote recently. It's not thoroughly tested.
void GameLoop::Run()
{
m_Timer.Reset();
while(!m_Finished())
{
Time delta = m_Timer.GetDelta();
Time frameTime(0);
unsigned int loopCount = 0;
while (delta > m_TickTime && loopCount < m_MaxLoops)
{
m_SingTick();
delta -= m_TickTime;
frameTime += m_TickTime;
++loopCount;
}
m_Independent(frameTime);
// add an exception flag later.
// This is if the game hangs
if(loopCount >= m_MaxLoops)
{
delta %= m_TickTime;
}
m_Render(delta);
m_Timer.Unused(delta);
}
}
The member objects are Boost slots so different code can register with different timing methods. The Independent slot is for things like key mapping or changing music Things that don't need to be so precise. SingTick is good for physics where it is easier if you know every tick will be the same but you don't want to run through a wall. Render takes the delta so animations run smooth, but must remember to account for it on the next SingTick.
Hope that helps.
There are many good reasons why you should not limit your frame rate in such a way. One reason being as stijn pointed out, not every monitor may run at exactly 60fps, another reason being that the resolution of timers is not sufficient, yet another reason being that even given sufficient resolutions, two separate timers (monitor refresh and yours) running in parallel will always get out of sync with time (they must!) due to random inaccuracies, and the most important reason being that it is not necessary at all.
Note that the default timer resolution under Windows is 15ms, and the best possible resolution you can get (by using timeBeginPeriod) is 1ms. Thus, you can (at best) wait 16ms or 17ms. One frame at 60fps is 16.6666ms How do you wait 16.6666ms?
If you want to limit your game's speed to the monitor's refresh rate, enable vertical sync. This will do what you want, precisely, and without sync issues. Vertical sync does have its pecularities too (such as the funny surprise you get when a frame takes 16.67ms), but it is by far the best available solution.
If the reason why you wanted to do this was to fit your simulation into the render loop, this is a must read for you.
check this one out:
//Creating Digital Watch in C++
#include<iostream>
#include<Windows.h>
using namespace std;
struct time{
int hr,min,sec;
};
int main()
{
time a;
a.hr = 0;
a.min = 0;
a.sec = 0;
for(int i = 0; i<24; i++)
{
if(a.hr == 23)
{
a.hr = 0;
}
for(int j = 0; j<60; j++)
{
if(a.min == 59)
{
a.min = 0;
}
for(int k = 0; k<60; k++)
{
if(a.sec == 59)
{
a.sec = 0;
}
cout<<a.hr<<" : "<<a.min<<" : "<<a.sec<<endl;
a.sec++;
Sleep(1000);
system("Cls");
}
a.min++;
}
a.hr++;
}
}