Framelimiter | Why are there extra milliseconds? - c++

After a lot of testing with this thing, I still can't figure out why there are extra milliseconds appended to the millisecond limit.
In this case, the whole running loop should last 4000 ms and then print 4000 followed by some other data; however, it is always around 4013 ms.
I currently know that the problem isn't the stress testing, since without it the result is still around 4013 ms. Besides, there is a limit on how long the stress testing can take, and that time is accounted for by how much rendering can be done in the remaining time. I also know that it isn't "SDL_GetTicks" including the time I'm initialising variables, since it only starts timing when it is first called. It's not the time it takes to call the function either, because I tested this with a very lightweight nanosecond timer as well, and the result is the same.
Here's some of my results, that are printed at the end:
4013 100 993 40
4013 100 1000 40
4014 100 1000 40
4012 100 992 40
4015 100 985 40
4013 100 1000 40
4022 100 986 40
4014 100 1000 40
4017 100 993 40
Unlike the third column (the number of frames rendered), the first column shouldn't vary by much more than the few nanoseconds it took to exit the loops and such, meaning it shouldn't even show a difference, since the timer's granularity here is milliseconds.
I recompiled between all of these runs, and the list pretty much continues in the same way.
Here's the code:
#include <iostream>
#include <SDL/SDL.h>
void stress(int n) {
n = n + n - n * n + n * n;
}
int main(int argc, char **argv) {
int running = 100,
timestart = 0, timestep = 0,
rendering = 0, logic = 0;
SDL_Init(SDL_INIT_EVERYTHING);
while(running--) { // - Running loop
timestart = SDL_GetTicks();
std::cout << "logic " << logic++ << std::endl;
for(int i = 0; i < 9779998; i++) { // - Stress testing
if(SDL_GetTicks() - timestart >= 30) { // - Maximum of 30 milliseconds spent running logic
break;
}
stress(i);
}
while(SDL_GetTicks() - timestart < 1) { // - Minimum of one millisecond to run through logic
;
}
timestep = SDL_GetTicks() - timestart;
while(40 > timestep) {
timestart = SDL_GetTicks();
std::cout << "rendering " << rendering++ << std::endl;
while(SDL_GetTicks() - timestart < 1) { // - Maximum of one rendering frame per millisecond
;
}
timestep += SDL_GetTicks() - timestart;
}
}
std::cout << SDL_GetTicks() << " " << logic << " " << rendering << " " << timestep << std::endl;
SDL_Quit();
return 0;
}

Elaborating on Crowder's comment: if your OS decides to switch tasks, you will end up with a random error between 0 and 1 ms (or -0.5 and 0.5 if SDL_GetTicks were rounding its internal timer, but the fact that your results are always greater than expected suggests it is actually truncating). These errors 'even out' within your next busy wait, but not near the end of the loop, since there is no 'next busy wait' there. To counteract it you need a reference point taken before you start your game loop, and you compare it with SDL_GetTicks to measure how much time has "leaked". Also, your approach of X milliseconds per frame with busy waits/breaks in the middle of computation isn't the cleanest I've seen. You should probably Google game loops and read a bit.
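To illustrate the "reference point" idea, here is a minimal sketch of mine (not the poster's code), assuming SDL 1.2's SDL_GetTicks(). Each frame waits until an absolute deadline computed from a single starting reference, so per-frame truncation and scheduling errors no longer accumulate:

#include <SDL/SDL.h>

int main(int argc, char **argv) {
    SDL_Init(SDL_INIT_TIMER);

    const Uint32 FRAME_MS = 40;        // target frame length
    const int    FRAMES   = 100;
    Uint32 loopStart = SDL_GetTicks(); // reference point taken once, before the loop

    for (int frame = 1; frame <= FRAMES; ++frame) {
        // ... logic and rendering for this frame would go here ...

        // Wait until the absolute deadline for this frame, not a relative 40 ms.
        // Any truncation or scheduling error inside one frame is absorbed by the
        // next, because every deadline is derived from the same reference point.
        Uint32 deadline = loopStart + frame * FRAME_MS;
        while (SDL_GetTicks() < deadline) {
            ; // busy-wait; SDL_Delay(1) would be kinder to the CPU
        }
    }
    // Total elapsed time now stays very close to FRAMES * FRAME_MS (4000 ms).
    SDL_Quit();
    return 0;
}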

Related

Code performance strict measurement

I'm creating a performance framework tool for measuring individual message processing times on CentOS 7. I reserved one CPU for this task with the isolcpus kernel option, and I run the tool using taskset.
OK, now the problem. I'm trying to measure the maximum processing time among several messages. The processing time is <= 1000 ns, but when I run many iterations I get very high results (> 10000 ns).
Here I created some simple code which does nothing interesting but shows the problem. Depending on the number of iterations I can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand where this difference comes from. I tried running this with real-time scheduling at the highest priority. Is there some way to prevent that?
#include <iostream>
#include <limits>
#include <algorithm> // std::max, std::min
#include <time.h>
const unsigned long long SEC = 1000L*1000L*1000L;
inline int64_t time_difference( const timespec &start,
const timespec &stop ) {
return ( (stop.tv_sec * SEC - start.tv_sec * SEC) +
(stop.tv_nsec - start.tv_nsec));
}
int main()
{
timespec start, stop;
int64_t max = 0, min = std::numeric_limits<int64_t>::max();
for(int i = 0; i < 100000000; ++i){
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
int64_t time = time_difference(start, stop);
max = std::max(max, time);
min = std::min(min, time);
}
std::cout << "max: " << max << " min: " << min << std::endl;
}
You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce this by messing with IRQ affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).
Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.
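For completeness, here is a sketch of how one might pin the measurement thread to the isolated CPU and request SCHED_FIFO from inside the program (my addition, not from the question or answer; it can reduce scheduler-induced outliers, but not the interrupt and C-state effects listed above). The CPU number and priority are arbitrary examples:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for CPU_ZERO/CPU_SET with glibc (g++ usually defines this already)
#endif
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one CPU and request SCHED_FIFO.
// Returns true on success; the priority change needs root or CAP_SYS_NICE.
bool pin_and_prioritize(int cpu, int priority = 80)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
    }
    sched_param sp = {};
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        std::perror("sched_setscheduler");
        return false;
    }
    return true;
}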

Odd results when adding artificial delays to C++ code. Embedded Linux

I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the main processing loop of the server is this -
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2ms block, it turns out to be roughly 20ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
The problem is, we changed the way the sleep is used so that we had a bit more control. I mean - 20ms delay for 2ms processing doesn't allow us to do much processing. With this new parameter set to a certain value it does something like this -
For i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code, in reality the number of sleeps is slightly different and it happens to work out at a 24s cycle to get through all the items - just as before.
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well perhaps this CPU reporting isn't accurate so I thought I'd sort a large text file at the same time and see how much it's slowed up by our server. This is a CPU bound process (98% CPU usage according to top). The results are very odd. With the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code, it only slows the sort down by about 12% but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results. It seems that the server which sleeps more frequently but for a minimum time is getting more priority.
I have a half baked theory on the last one - whatever the system call that is used to do the "sleep" is switching back to the server process when the time is elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() does the same.
Well I can at least answer problem 1 and problem 2 (as they are the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a stand-alone program that doesn't use Poco or any of our code. Just plain Linux system calls and standard C++ features. It implements the pseudo code above. The processing is replaced with a tight loop just checking the elapsed time to see if 2ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem, i.e. doing the same amount of processing but splitting up the way the sleep function is called produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds when using 1000 20ms sleeps, but 1.96875 seconds when the less frequent 1000ms sleeps were used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it is clearly a problem with our embedded system and the way it measures CPU time when a process yields so often (you get the same problem with sched_yield instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
#include <iostream>
#include <sys/time.h> // gettimeofday
#include <unistd.h>   // usleep
#include <time.h>     // clock_gettime, CLOCK_PROCESS_CPUTIME_ID
int sleepCount;
double getCPUTime( )
{
clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
struct timespec ts;
if ( id != (clockid_t)-1 && clock_gettime( id, &ts ) != -1 )
return (double)ts.tv_sec +
(double)ts.tv_nsec / 1000000000.0;
return -1;
}
double GetElapsedMilliseconds(const timeval& startTime)
{
timeval endTime;
gettimeofday(&endTime, NULL);
double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // us to ms
return elapsedTime;
}
void SleepMilliseconds(int milliseconds)
{
timeval startTime;
gettimeofday(&startTime, NULL);
usleep(milliseconds * 1000);
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > milliseconds + 0.3)
std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
sleepCount++;
}
void DoSomeProcessingForAnItem()
{
timeval startTime;
gettimeofday(&startTime, NULL);
double processingTimeMilliseconds = 2.0;
double elapsedMilliseconds;
do
{
elapsedMilliseconds = GetElapsedMilliseconds(startTime);
} while (elapsedMilliseconds <= processingTimeMilliseconds);
if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}
void RunLoop(bool longSleep)
{
int numberOfItems = 1000;
timeval startTime;
gettimeofday(&startTime, NULL);
timeval startMainLoopTime;
gettimeofday(&startMainLoopTime, NULL);
for (int i = 0; i < numberOfItems; i++)
{
DoSomeProcessingForAnItem();
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > 100)
{
std::cout << "Item count = " << i << "\n";
if (longSleep)
{
SleepMilliseconds(1000);
}
gettimeofday(&startTime, NULL);
}
if (longSleep == false)
{
// Does 1000 * 20 ms sleeps.
SleepMilliseconds(20);
}
}
double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
std::cout << "Main loop took " << elapsedMilliseconds / 1000 <<" seconds\n";
}
void DoTest(bool longSleep)
{
timeval startTime;
gettimeofday(&startTime, NULL);
double startCPUtime = getCPUTime();
sleepCount = 0;
int runLoopCount = 1;
for (int i = 0; i < runLoopCount; i++)
{
RunLoop(longSleep);
std::cout << "**** Done one loop of processing ****\n";
}
double endCPUtime = getCPUTime();
std::cout << "Elapsed time is " <<GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
std::cout << "Sleep count " << sleepCount << "\n";
}
void testLong()
{
std::cout << "Running testLong\n";
DoTest(true);
}
void testShort()
{
std::cout << "Running testShort\n";
DoTest(false);
}
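The listing stops at the two test functions and has no main(); a minimal driver to run both variants back to back (my addition, not part of the original post) could be:

// Hypothetical driver for the listing above: run both sleep patterns so the
// reported CPU time and wall time can be compared directly.
int main()
{
    testShort();   // 1000 items, 20 ms sleep after every item
    testLong();    // 1000 items, 1000 ms sleep roughly every 50 items
    return 0;
}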

Is this a good way to lock a loop on 60 loops per second?

I have a game with Bullet Physics as the physics engine. The game is online multiplayer, so I thought I'd try the Source Engine approach to deal with physics sync over the net. In the client I use GLFW, so the fps limit is working there by default (at least I think it's because of GLFW). But on the server side there are no graphics libraries, so I need to "lock" the loop that simulates the world and steps the physics engine to 60 "ticks" per second.
Is this the right way to lock a loop to run 60 times a second (a.k.a. 60 "fps")?
void World::Run()
{
m_IsRunning = true;
long limit = (1 / 60.0f) * 1000;
long previous = milliseconds_now();
while (m_IsRunning)
{
long start = milliseconds_now();
long deltaTime = start - previous;
previous = start;
std::cout << m_Objects[0]->GetObjectState().position[1] << std::endl;
m_DynamicsWorld->stepSimulation(1 / 60.0f, 10);
long end = milliseconds_now();
long dt = end - start;
if (dt < limit)
{
std::this_thread::sleep_for(std::chrono::milliseconds(limit - dt));
}
}
}
Is it ok to use std::thread for this task?
Is this way efficient enough?
Will the physics simulation be stepped 60 times a second?
P.S
The milliseconds_now() looks like this:
long long milliseconds_now()
{
static LARGE_INTEGER s_frequency;
static BOOL s_use_qpc = QueryPerformanceFrequency(&s_frequency);
if (s_use_qpc) {
LARGE_INTEGER now;
QueryPerformanceCounter(&now);
return (1000LL * now.QuadPart) / s_frequency.QuadPart;
}
else {
return GetTickCount();
}
}
Taken from: https://gamedev.stackexchange.com/questions/26759/best-way-to-get-elapsed-time-in-miliseconds-in-windows
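As an aside, the same helper can be written portably with std::chrono instead of QueryPerformanceCounter/GetTickCount; this is an editorial sketch, not from the original question:

#include <chrono>

// Portable equivalent of milliseconds_now(): milliseconds since an arbitrary,
// monotonic epoch (steady_clock never jumps backwards, unlike wall-clock time).
long long milliseconds_now_chrono()
{
    using namespace std::chrono;
    return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}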
If you want to limit the rendering to a maximum FPS of 60, it is very simple:
Each frame, just check whether the game is running too fast; if so, just wait. For example:
while ( timeLimitedLoop )
{
float framedelta = ( timeNow - timeLast )
timeLast = timeNow;
for each ( ObjectOrCalculation myObjectOrCalculation in allItemsToProcess )
{
myObjectOrCalculation->processThisIn60thOfSecond(framedelta);
}
render(); // if display needed
}
Please note that if vertical sync is enabled, rendering will already be limited to the frequency of your vertical refresh (perhaps 50 or 60 Hz).
If, however, you wish the logic locked at 60 fps, that's a different matter: you will have to segregate your display and logic code in such a way that the logic runs at a maximum of 60 fps, and modify the code so that you can have a fixed time-interval loop and a variable time-interval loop (as above). Good sources to look at are "fixed timestep" and "variable timestep" (Link 1, Link 2, and the old trusty Google search).
Note on your code:
Because you sleep for the whole of the remaining time (1/60th of a second minus the already elapsed time) in one go, you can easily miss the correct timing. Change the sleep to a loop running as follows:
instead of
if (dt < limit)
{
std::this_thread::sleep_for(std::chrono::milliseconds(limit - dt));
}
change to
while (dt < limit)
{
std::this_thread::sleep_for(std::chrono::milliseconds((limit - dt) / 10));
// or /100, or whatever fine-grained step you desire
dt = milliseconds_now() - start; // re-check the elapsed time each pass
}
Hope this helps; let me know if you need more info :)
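Another option, not mentioned in the original answer, is to sleep until an absolute deadline with std::chrono, which avoids both the fine-grained sleep loop and accumulated drift. A sketch with adapted names (RunFixedStep is mine, not the poster's function):

#include <chrono>
#include <thread>

void RunFixedStep(bool &isRunning)
{
    using clock = std::chrono::steady_clock;
    const auto step = std::chrono::duration_cast<clock::duration>(
        std::chrono::duration<double>(1.0 / 60.0));   // ~16.67 ms per tick

    auto next = clock::now() + step;
    while (isRunning)
    {
        // stepSimulation(1 / 60.0f, 10) or any other per-tick work goes here.

        // Sleep to an absolute deadline: if one tick overruns, later deadlines are
        // still anchored to the original schedule, so the average rate stays at 60 Hz.
        std::this_thread::sleep_until(next);
        next += step;
    }
}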

What's the best timing resolution can i get on Linux

I'm trying to measure the time difference between 2 signals on the parallel port, but first I have to know how accurate and precise my measuring system is (an AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ × 2 running SUSE 12.1 x64).
So after some reading I decided to use clock_gettime(). First I got the clock_getres() value using this code:
/*
* This program prints out the clock resolution.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( void )
{
struct timespec res;
if ( clock_getres( CLOCK_REALTIME, &res) == -1 ) {
perror( "clock get resolution" );
return EXIT_FAILURE;
}
printf( "Resolution is %ld nano seconds.\n",
res.tv_nsec);
return EXIT_SUCCESS;
}
and the output was: 1 nanosecond. I was so happy!!
But here is my problem: when I tried to check that with this other code:
#include <iostream>
#include <time.h>
using namespace std;
timespec diff(timespec start, timespec end);
int main()
{
timespec time1, time2, time3,time4;
int temp;
time3.tv_sec=0;
time4.tv_nsec=000000001L;
clock_gettime(CLOCK_REALTIME, &time1);
NULL;
clock_gettime(CLOCK_REALTIME, &time2);
cout<<diff(time1,time2).tv_sec<<":"<<diff(time1,time2).tv_nsec<<endl;
return 0;
}
timespec diff(timespec start, timespec end)
{
timespec temp;
if ((end.tv_nsec-start.tv_nsec)<0) {
temp.tv_sec = end.tv_sec-start.tv_sec-1;
temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
} else {
temp.tv_sec = end.tv_sec-start.tv_sec;
temp.tv_nsec = end.tv_nsec-start.tv_nsec;
}
return temp;
}
This one calculates the time between the two calls to clock_gettime(); time3 and time4 are declared but not used in this example because I was doing tests with them.
The output in this example fluctuates between 978 and 1467 ns. Both numbers are multiples of 489, which makes me think that 489 ns is my REAL resolution, far from the 1 ns obtained above.
My question: is there ANY WAY of getting better results? Am I missing something?
I really need at least 10 ns resolution for my project. Come on, a GPS can get better resolution than a PC??
I realise this topic is long dead, but wanted to throw in my findings. This is a long answer so I have put the short answer here and those with the patience can wade through the rest. The not-quite-the-answer to the question is 700 ns or 1500 ns depending on which mode of clock_gettime() you used. The long answer is way more complicated.
For reference, the machine I did this work on is an old laptop that nobody wanted. It is an Acer Aspire 5720Z running Ubuntu 14.041 LTS.
The hardware:
RAM: 2.0 GiB // This is how Ubuntu reports it in 'System Settings' → 'Details'
Processor: Intel® Pentium(R) Dual CPU T2330 # 1.60GHz × 2
Graphics: Intel® 965GM x86/MMX/SSE2
I wanted to measure time accurately in an upcoming project, and as a relative newcomer to PC hardware regardless of operating system, I thought I would do some experimentation on the resolution of the timing hardware. I stumbled across this question.
Because of this question, I decided that clock_gettime() looks like it meets my needs. But my experience with PC hardware in the past has left me under-whelmed so I started fresh with some experiments to see what the actual resolution of the timer is.
The method: Collect successive samples of the result from clock_gettime() and look for any patterns in the resolution. Code follows.
Results in a slightly longer Summary:
Not really a result. The stated resolution of the fields in the structure is in nanoseconds. The result of a call to clock_getres() is also tv_sec 0, tv_nsec 1. But previous experience has taught me not to trust the resolution from a structure alone. It is an upper limit on precision, and reality tends to be a whole lot more complex.
The actual resolution of the clock_gettime() result on my machine, with my program, with my operating system, on one particular day etc turns out to be 70 nanoseconds for mode 0 and 1. 70 ns is not too bad but unfortunately, this is not realistic as we will see in the next point. To complicate matters, the resolution appears to be 7 ns when using modes 2 and 3.
Duration of the clock_gettime() call is more like 1500 ns for modes 0 and 1. It doesn't make sense to me at all to claim 70 ns resolution on the time if it takes 20 times the resolution to get a value.
Some modes of clock_gettime() are faster than others. Modes 2 and 3 are clearly about half the wall-clock time of modes 0 and 1. Modes 0 and 1 are statistically indistinguishable from each other. Modes 2 and 3 are much faster than modes 0 and 1, with mode 3 being the fastest overall.
Before continuing, I better define the modes: Which mode is which?:
Mode 0 CLOCK_REALTIME // reference: http://linux.die.net/man/3/clock_gettime
Mode 1 CLOCK_MONOTONIC
Mode 2 CLOCK_PROCESS_CPUTIME_ID
Mode 3 CLOCK_THREAD_CPUTIME_ID
Conclusion: To me it doesn't make sense to talk about the resolution of the time intervals if the resolution is smaller than the length of time the function takes to get the time interval. For example, if we use mode 3, we know that the function completes within 700 nanoseconds 99% of the time. And we further know that the time interval we get back will be a multiple of 7 nanoseconds. So the 'resolution' of 7 nanoseconds, is 1/100th of the time to do the call to get the time. I don't see any value in the 7 nanosecond change interval. There are 3 different answers to the question of resolution: 1 ns, 7 or 70 ns, and finally 700 or 1500 ns. I favour the last figure.
After all is said and done, if you want to measure the performance of some operation, you need to keep in mind how long the clock_gettime() call takes – that is 700 or 1500 ns. There is no point trying to measure something that takes 7 nanoseconds for example. For the sake of argument, lets say you were willing to live with 1% error on your performance test conclusions. If using mode 3 (which I think I will be using in my project) you would have to say that the interval you need to be measuring needs to be 100 times 700 nanoseconds or 70 microseconds. Otherwise your conclusions will have more than 1% error. So go ahead and measure your code of interest, but if your elapsed time in the code of interest is less that 70 microseconds, then you better go and loop through the code of interest enough times so that the interval is more like 70 microseconds or more.
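In practice that means wrapping the code of interest in a loop and dividing by the iteration count; a rough sketch of the idea (the operation being timed is a placeholder of mine, not from the answer):

#include <time.h>

volatile long sink = 0;
void operation_of_interest() { ++sink; }   // placeholder for the real code under test

// Time `iterations` runs of the operation so the total interval dwarfs the
// ~700-1500 ns cost of clock_gettime() itself, then report the per-call average.
double average_ns(long iterations)
{
    timespec start, stop;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);   // "mode 3" in the text above
    for (long i = 0; i < iterations; ++i)
        operation_of_interest();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop);

    double total_ns = (stop.tv_sec - start.tv_sec) * 1e9
                    + (stop.tv_nsec - start.tv_nsec);
    return total_ns / iterations;
}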
Justification for these claims and some details:
Claim 3 first. This is simple enough. Just run clock_gettime() a large number of times and record the results in an array, then process the results. Do the processing outside the loop so that the time between clock_gettime() calls is as short as possible.
What does all that mean? See the graph attached. For mode 0 for example, the call to clock_gettime() takes less than 1.5 microseconds most of the time. You can see that mode 0 and mode 1 are basically the same. However, modes 2 and 3 are very different to modes 0 and 1, and slightly different to each other. Modes 2 and 3 take about half the wall-clock time for clock_gettime() compared to modes 0 and 1. Also note that mode 0 and 1 are slightly different to each other – unlike modes 2 and 3. Note that mode 0 and 1 differ by 70 nanoseconds – which is a number which we will come back to in claim #2.
The attached graph is range-limited to 2 microseconds, because otherwise the outliers in the data prevent the graph from conveying the previous point. Something the graph doesn't make clear, then, is that the outliers for modes 0 and 1 are much worse than the outliers for modes 2 and 3. In other words, not only are the average, the statistical 'mode' (the value which occurs the most) and the median (i.e. the 50th percentile) different for all these modes, so are their maximum values and their 99th percentiles.
The graph attached is for 100,001 samples for each of the four modes. Please note that the tests graphed were using a CPU mask of processor 0 only. Whether I used CPU affinity or not didn't seem to make any difference to the graph.
Claim 2: If you look closely at the samples collected when preparing the graph, you soon notice that the difference between the differences (i.e. the 2nd order differences) is relatively constant – at around 70 nanoseconds (for Modes 0 and 1 at least). To repeat this experiment, collect 'n' samples of clock time as before. Then calculate the differences between each sample. Now sort the differences into order (e.g. sort -g) and then derive the individual unique differences (e.g. uniq -c).
For example:
$ ./Exp03 -l 1001 -m 0 -k | sort -g | awk -f mergeTime2.awk | awk -f percentages.awk | sort -g
1.118e-06 8 8 0.8 0.8 // time,count,cumulative count, count%, cumulative count%
1.188e-06 17 25 1.7 2.5
1.257e-06 9 34 0.9 3.4
1.327e-06 570 604 57 60.4
1.397e-06 301 905 30.1 90.5
1.467e-06 53 958 5.3 95.8
1.537e-06 26 984 2.6 98.4
<snip>
The difference between the durations in the first column is often 7e-8 or 70 nanoseconds. This can become more clear by processing the differences:
$ <as above> | awk -f differences.awk
7e-08
6.9e-08
7e-08
7e-08
7e-08
7e-08
6.9e-08
7e-08
2.1e-07 // 3 lots of 7e-08
<snip>
Notice how all the differences are integer multiples of 70 nanoseconds? Or at least within rounding error of 70 nanoseconds.
This result may well be hardware dependent, but I don't actually know what limits this to 70 nanoseconds at this time. Perhaps there is a 14.28 MHz oscillator somewhere?
Please note that in practice I use a much larger number of samples such as 100,000, not 1000 as above.
Relevant code (attached):
'Expo03' is the program which calls clock_gettime() as fast as possible. Note that typical usage would be something like:
./Expo03 -l 100001 -m 3
This would call clock_gettime() 100,001 times so that we can compute 100,000 differences. Each call to clock_gettime() in this example would be using mode 3.
MergeTime2.awk is a useful command which is a glorified 'uniq' command. The issue is that the 2nd order differences are often in pairs of 69 and 1 nanosecond, not 70 (for Modes 0 and 1 at least) as I have led you to believe so far. Because there is no 68 nanosecond difference or a 2 nanosecond difference, I have merged these 69 and 1 nanosecond pairs into one number of 70 nanoseconds. Why the 69/1 behaviour occurs at all is interesting, but treating these as two separate numbers mostly added 'noise' to the analysis.
Before you ask, I have repeated this exercise avoiding floating point, and the same problem still occurs. The resulting tv_nsec as an integer has this 69/1 behaviour (or 1/7 and 1/6) so please don't assume that this is an artefact caused by floating point subtraction.
Please note that I am confident with this 'simplification' for 70 ns and for small integer multiples of 70 ns, but this approach looks less robust for the 7 ns case especially when you get 2nd order differences of 10 times the 7 ns resolution.
percentages.awk and differences.awk are attached in case they are needed.
Stop press: I can't post the graph as I don't have a 'reputation of at least 10'. Sorry 'bout that.
Rob Watson
21 Nov 2014
Expo03.cpp
/* Like Exp02.cpp except that here I am experimenting with
modes other than CLOCK_REALTIME
RW 20 Nov 2014
*/
/* Added CPU affinity to see if that had any bearing on the results
RW 21 Nov 2014
*/
#include <iostream>
using namespace std;
#include <iomanip>
#include <stdlib.h> // getopts needs both of these
#include <unistd.h>
#include <errno.h> // errno
#include <string.h> // strerror()
#include <assert.h>
#include <time.h> // clock_gettime(), timespec
#include <sched.h> // cpu_set_t, CPU_ZERO/CPU_SET, sched_setaffinity()
// #define MODE CLOCK_REALTIME
// #define MODE CLOCK_MONOTONIC
// #define MODE CLOCK_PROCESS_CPUTIME_ID
// #define MODE CLOCK_THREAD_CPUTIME_ID
int main(int argc, char ** argv)
{
int NumberOf = 1000;
int Mode = 0;
int Verbose = 0;
int c;
// l loops, m mode, h help, v verbose, k masK
int rc;
cpu_set_t mask;
int doMaskOperation = 0;
while ((c = getopt (argc, argv, "l:m:hkv")) != -1)
{
switch (c)
{
case 'l': // ell not one
NumberOf = atoi(optarg);
break;
case 'm':
Mode = atoi(optarg);
break;
case 'h':
cout << "Usage: <command> -l <int> -m <mode>" << endl
<< "where -l represents the number of loops and "
<< "-m represents the mode 0..3 inclusive" << endl
<< "0 is CLOCK_REALTIME" << endl
<< "1 CLOCK_MONOTONIC" << endl
<< "2 CLOCK_PROCESS_CPUTIME_ID" << endl
<< "3 CLOCK_THREAD_CPUTIME_ID" << endl;
break;
case 'v':
Verbose = 1;
break;
case 'k': // masK - sorry! Already using 'm'...
doMaskOperation = 1;
break;
case '?':
cerr << "XXX unimplemented! Sorry..." << endl;
break;
default:
abort();
}
}
if (doMaskOperation)
{
if (Verbose)
{
cout << "Setting CPU mask to CPU 0 only!" << endl;
}
CPU_ZERO(&mask);
CPU_SET(0,&mask);
assert((rc = sched_setaffinity(0,sizeof(mask),&mask))==0);
}
if (Verbose) {
cout << "Verbose: Mode in use: " << Mode << endl;
}
if (Verbose)
{
rc = sched_getaffinity(0,sizeof(mask),&mask);
// cout << "getaffinity rc is " << rc << endl;
// cout << "getaffinity mask is " << mask << endl;
int numOfCPUs = CPU_COUNT(&mask);
cout << "Number of CPU's is " << numOfCPUs << endl;
for (int i=0;i<sizeof(mask);++i) // sizeof(mask) is 128 RW 21 Nov 2014
{
if (CPU_ISSET(i,&mask))
{
cout << "CPU " << i << " is set" << endl;
}
//cout << "CPU " << i
// << " is " << (CPU_ISSET(i,&mask) ? "set " : "not set ") << endl;
}
}
clockid_t cpuClockID;
int err = clock_getcpuclockid(0,&cpuClockID);
if (Verbose)
{
cout << "Verbose: clock_getcpuclockid(0) returned err " << err << endl;
cout << "Verbose: clock_getcpuclockid(0) returned cpuClockID "
<< cpuClockID << endl;
}
timespec timeNumber[NumberOf];
for (int i=0;i<NumberOf;++i)
{
err = clock_gettime(Mode, &timeNumber[i]);
if (err != 0) {
int errSave = errno;
cerr << "errno is " << errSave
<< " NumberOf is " << NumberOf << endl;
cerr << strerror(errSave) << endl;
cerr << "Aborting due to this error" << endl;
abort();
}
}
for (int i=0;i<NumberOf-1;++i)
{
cout << timeNumber[i+1].tv_sec - timeNumber[i].tv_sec
+ (timeNumber[i+1].tv_nsec - timeNumber[i].tv_nsec) / 1000000000.
<< endl;
}
return 0;
}
MergeTime2.awk
BEGIN {
PROCINFO["sorted_in"] = "#ind_num_asc"
}
{array[$0]++}
END {
lastX = -1;
first = 1;
for (x in array)
{
if (first) {
first = 0
lastX = x; lastCount = array[x];
} else {
delta = x - lastX;
if (delta < 2e-9) { # this is nasty floating point stuff!!
lastCount += array[x];
lastX = x
} else {
Cumulative += lastCount;
print lastX "\t" lastCount "\t" Cumulative
lastX = x;
lastCount = array[x];
}
}
}
print lastX "\t" lastCount "\t" Cumulative+lastCount
}
percentages.awk
{ # input is $1 a time interval $2 an observed frequency (i.e. count)
# $3 is a cumulative frequency
b[$1]=$2;
c[$1]=$3;
sum=sum+$2
}
END {
for (i in b) print i,b[i],c[i],(b[i]/sum)*100, (c[i]*100/sum);
}
differences.awk
NR==1 {
old=$1;next
}
{
print $1-old;
old=$1
}
As far as I know, Linux running on a PC will generally not be able to give you timer accuracy in the nanoseconds range. This is mainly due to the type of task/process scheduler used in the kernel. This is as much a result of the kernel as it is of the hardware.
If you need timing with nanosecond resolution I'm afraid that you're out of luck. However you should be able to get micro-second resolution which should be good enough for most scenarios - including your parallel port application.
If you need timing in the nanosecond range to be accurate to the nanosecond, you will most likely need a dedicated hardware solution, with a really accurate oscillator (for comparison, the base clock frequency of most x86 CPUs is in the range of megahertz before the multipliers).
Finally, if you're looking to replace the functionality of an oscilloscope with your computer that's just not going to work beyond relatively low frequency signals. You'd be much better off investing in a scope - even a simple, portable, hand-held that plugs into your computer for displaying the data.
RDTSCP on your AMD Athlon 64 X2 will give you the time stamp counter with a resolution dependent upon your clock. However, accuracy is different from resolution; you need to lock thread affinity and disable interrupts (see IRQ routing).
This entails dropping down to assembler or, for Windows developers, using the MSVC 2008 intrinsics.
Red Hat introduced user-space shims in RHEL 5 that replace gettimeofday with high-resolution RDTSCP calls:
http://developer.amd.com/Resources/documentation/articles/Pages/1214200692_5.aspx
https://web.archive.org/web/20160812215344/https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
Also, check your hardware: an AMD 5200 has a 2.6 GHz clock, which gives a 0.4 ns interval, and the cost of gettimeofday with RDTSCP is 221 cycles, which equals 88 ns at best.
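For what it's worth, here is a minimal sketch of reading the TSC with the __rdtscp intrinsic on GCC/Clang and crudely calibrating it against CLOCK_MONOTONIC. It is my sketch, not from the answer, and it assumes an invariant TSC (the constant_tsc flag in /proc/cpuinfo) and a thread pinned to one core:

#include <x86intrin.h>   // __rdtscp (GCC/Clang, x86 only)
#include <time.h>
#include <stdio.h>

// Crude calibration: count TSC ticks over ~100 ms of CLOCK_MONOTONIC to get ticks/ns.
static double ticks_per_ns()
{
    unsigned int aux;
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    unsigned long long c0 = __rdtscp(&aux);

    timespec req = {0, 100 * 1000 * 1000};   // 100 ms
    nanosleep(&req, nullptr);

    unsigned long long c1 = __rdtscp(&aux);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return (c1 - c0) / ns;
}

int main()
{
    double tpn = ticks_per_ns();
    unsigned int aux;
    unsigned long long a = __rdtscp(&aux);
    unsigned long long b = __rdtscp(&aux);   // back-to-back reads
    printf("TSC ticks/ns: %.3f, back-to-back delta: %.1f ns\n", tpn, (b - a) / tpn);
    return 0;
}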

Chars per second, KBps, kilo[bits/bytes] per second (computer network): how to compute it correctly

Sorry for the maybe stupid question, but how are Kbps and the like (kilobits per second, and kilobytes per second) really computed?
This is the computation I have now:
DWORD ibytesin=0,ibytes_sttime=0,ibytes_st=0,ibps=0,ilastlen;
DWORD obytesin=0,obytes_sttime=0,obytes_st=0,obps=0,olastlen;
ibytesin - total bytes in for all time;
ibytes_sttime - the time when ibytes_st was last assigned;
ibytes_st - bytes count at time of ibytes_sttime;
ibps - kbps/bps/...;
ilastlen - because my protocol uses request packets, I do not want to count the last received length, to be more accurate;
The same rules for out traffic (o*).
First, gather the bytes, for example:
len = recv(ConnectSocket, (char*)p, readleft, 0);
if(len>0) {
ibytesin+=len;
ilastlen=len;
}
Same for out.
Then later, in some frequently executed place, in a stats thread for example:
if ((GetTickCount() - obytes_sttime) >= 1000) // update once per second
{
obps = (obytesin-obytes_st-olastlen) / 1024 * 8;
obytes_sttime = GetTickCount();
obytes_st = obytesin;
olastlen=0;
}
if ((GetTickCount() - ibytes_sttime) >= 1000) // update once per second
{
ibps = (ibytesin-ibytes_st-ilastlen) / 1024* 8; // get kilobytes*8 == Kbps ?
ibytes_sttime = GetTickCount();
ibytes_st = ibytesin;
ilastlen=0;
}
sprintf(str, "In/Out %3d / %-3d Kbps/sec", ibps, obps);
I get an erroneous speed when I try to increase how often the bps value is updated. For example, I want to recalculate it every 100 ms instead of every 1 s, so as far as I can imagine I need to divide not by 1024 but by 102 (since 1000/10 = 100, so 1024/10 = 102.4), but the rate does not come out 100% right; it is too high, or I am making a mistake in my first attempts.
How do I do this right?
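For reference, the arithmetic boils down to scaling the byte count observed in the window by the actual window length, which makes the update interval irrelevant; a small sketch (names are mine, not from the code above):

#include <cstdio>

// Bytes observed in a window of `elapsed_ms` milliseconds -> kilobits per second.
// Scaling by the real elapsed time makes the result independent of how often the
// statistics are refreshed (1000 ms, 100 ms, or anything else).
double window_to_kbps(unsigned long long bytes_in_window, unsigned long elapsed_ms)
{
    if (elapsed_ms == 0) return 0.0;
    double bytes_per_sec = bytes_in_window * 1000.0 / elapsed_ms;
    return bytes_per_sec * 8.0 / 1000.0;   // use / 1024.0 instead for Kibit/s
}

int main()
{
    // 12800 bytes seen in a 100 ms window ~= 128000 B/s ~= 1024 kbit/s
    std::printf("%.1f kbps\n", window_to_kbps(12800, 100));
    return 0;
}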
Sergey,
I'm assuming your counters don't have to be accurate to any specific tolerance or standard. And that you just want to have a "running average" displayed periodically. If that's the case, you can use my "Counter" class below which I put together from some previous code that I wrote that counted something else. It just recomputes the "rate" every couple of seconds.
Create an instance of "Counter". After your recv function, call "Increment" with the number of bytes received. Then just call Counter::GetRate whenever you want to print the average. (Divide the result by 1024 and multiply by 8 to convert from "bytes per second" to "kbps".)
You'll notice that the "average N per second" is recomputed every two seconds (instead of every one second). You can change this, but I find that keeping the moving average steady for 2 seconds at a time produces "smoother" results, as the counter printout doesn't appear as erratic when there is variance in the count. If you want a count closer to "how many kbits were received in the last second", then call counter.SetInterval(1000). I suppose you could set the interval as low as 100; you'll just get more erratic results due to network jitter.
class Counter
{
static const DWORD DEFAULT_INTERVAL = 2000; // 2000ms = 2 seconds
bool m_fFirst;
DWORD m_dwInterval; // how often we recompute the average
DWORD m_dwCount;
DWORD m_dwStartTime;
DWORD m_dwComputedRate;
public:
Counter()
{
Reset();
m_dwInterval = DEFAULT_INTERVAL;
}
void Reset()
{
m_dwCount = 0;
m_dwStartTime = 0;
m_dwComputedRate = 0;
m_fFirst = true;
}
void Increment(DWORD dwIncrement)
{
DWORD dwCurrentTime = GetTickCount();
DWORD dwActualInterval = dwCurrentTime - m_dwStartTime;
if (m_fFirst)
{
m_dwStartTime = dwCurrentTime;
m_fFirst = false;
}
else
{
m_dwCount += dwIncrement;
if (dwActualInterval >= m_dwInterval)
{
// "round up" by adding 500 to the formula below
// that way a computed average of "234.56" gets rounded up to "235"
// instead of "rounded down" to 234
const DWORD ROUND_UP = 500;
// multiply by 1000 to convert from milliseconds to seconds
m_dwComputedRate = (m_dwCount * 1000 + ROUND_UP) / dwActualInterval;
// reset counting
m_dwStartTime = dwCurrentTime;
m_dwCount = 0;
}
}
}
// returns rate in terms of "per second"
DWORD GetRate()
{
return m_dwComputedRate;
}
void SetInterval(DWORD dwInterval)
{
m_dwInterval = dwInterval;
}
};
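Hypothetical wiring of the Counter into the receive path from the question might look like this (the socket and buffer names are placeholders, not from the original code):

#include <winsock2.h>   // SOCKET, recv
#include <windows.h>    // DWORD
#include <cstdio>

Counter inCounter;   // one instance per direction; make a second one for outbound traffic

void onReceive(SOCKET connectSocket, char *buffer, int bytesWanted)
{
    int len = recv(connectSocket, buffer, bytesWanted, 0);
    if (len > 0)
        inCounter.Increment(static_cast<DWORD>(len));   // count what actually arrived
}

void printStats()
{
    DWORD bytesPerSec = inCounter.GetRate();   // average bytes per second
    DWORD kbps = bytesPerSec / 1024 * 8;       // convert as described above
    std::printf("In %lu Kbps\n", static_cast<unsigned long>(kbps));
}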