Strict code performance measurement - C++

I'm creating a performance measurement framework for measuring the processing time of individual messages on CentOS 7. I reserved one CPU for this task with the isolcpus kernel option, and I run the tool on that CPU using taskset.
OK, now the problem. I'm trying to measure the maximum processing time across several messages. The processing time is <= 1000 ns, but when I run many iterations I get much higher results (> 10000 ns).
Here is some simple code which does nothing interesting but shows the problem. Depending on the number of iterations I can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand where this difference comes from. I have already tried running with real-time scheduling at the highest priority. Is there some way to prevent these spikes?
#include <algorithm>   // std::max, std::min
#include <cstdint>     // int64_t
#include <iostream>
#include <limits>
#include <time.h>

const unsigned long long SEC = 1000L * 1000L * 1000L;

inline int64_t time_difference(const timespec &start, const timespec &stop) {
    return (stop.tv_sec * SEC - start.tv_sec * SEC) +
           (stop.tv_nsec - start.tv_nsec);
}

int main()
{
    timespec start, stop;
    int64_t max = 0, min = std::numeric_limits<int64_t>::max();
    for (int i = 0; i < 100000000; ++i) {
        clock_gettime(CLOCK_REALTIME, &start);
        clock_gettime(CLOCK_REALTIME, &stop);
        int64_t time = time_difference(start, stop);
        max = std::max(max, time);
        min = std::min(min, time);
    }
    std::cout << "max: " << max << " min: " << min << std::endl;
}

You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce these by messing with IRQ affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).

Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.
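For completeness, the user-space knobs mentioned above (CPU pinning, real-time priority, locked memory) can be combined into one setup routine. The following is only a rough sketch of my own, not code from the question: it assumes glibc/g++ (which enables the GNU extensions needed for sched_setaffinity) and root or CAP_SYS_NICE for the SCHED_FIFO call, and it reduces jitter rather than eliminating it, for the reasons listed above.

#include <cstdio>
#include <sched.h>      // sched_setaffinity, sched_setscheduler (GNU/Linux)
#include <sys/mman.h>   // mlockall

// Pin the calling thread to one isolated core, request SCHED_FIFO at the
// highest priority, and lock all pages in RAM so page faults do not show up
// as latency spikes. 'cpu' should be the core passed to isolcpus/taskset.
bool setup_low_jitter(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return false;
    }

    sched_param param;
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {  // needs root/CAP_SYS_NICE
        perror("sched_setscheduler");
        return false;
    }

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return false;
    }
    return true;
}

Even with all of this in place, the kernel documentation quoted above means occasional outliers should still be expected; IRQ affinity and C-state limits have to be handled outside the program.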

Related

Performance difference between float and double in x86 and x86_64

A while ago I heard that some compilers use the SSE2 extensions for floating point operations on the x86_64 architecture, so I used this simple code to determine the performance difference between float and double.
I disabled Intel SpeedStep in the BIOS, and the system load was approximately equal across my tests. I am using GCC 4.8 on 64-bit openSUSE.
I am writing a program with a lot of FPU operations and I would like to know whether this test is valid or not.
Any information about the performance difference between float and double on each architecture would also be appreciated.
Code:
#include <iostream>
#include <sys/time.h>
#include <vector>
#include <cstdlib>
using namespace std;

int main()
{
    timeval t1, t2;
    double elapsedTime;
    double TotalTime = 0;
    for (int j = 0; j < 100; j++)
    {
        // start timer
        gettimeofday(&t1, NULL);
        vector<float> RealVec;
        float temp;
        for (int i = 0; i < 1000000; i++)
        {
            temp = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
            RealVec.push_back(temp);
        }
        for (int i = 0; i < 1000000; i++)
        {
            RealVec[i] = (RealVec[i]*2 - 435.345345) / 15.75;
        }
        // stop timer
        gettimeofday(&t2, NULL);
        elapsedTime = (t2.tv_sec - t1.tv_sec) * 1000.0;      // sec to ms
        elapsedTime += (t2.tv_usec - t1.tv_usec) / 1000.0;   // us to ms
        TotalTime = TotalTime + elapsedTime;
    }
    cout << TotalTime/100 << " ms.\n";
    return 0;
}
and the results:
32 Bit Double
157.781 ms.
151.994 ms.
152.244 ms.
32 Bit Float
149.896 ms.
148.489 ms.
161.086 ms.
64 Bit Double
110.125 ms.
111.612 ms.
113.818 ms.
64 Bit Float
110.393 ms.
106.778 ms.
107.833 ms.
You're really not measuring much; perhaps just the degree of compiler optimization. For the measurements to be valid, you really have to do something with the results, or the compiler can optimize away all, or the major part, of your tests. What I would do is: 1) initialize the vector, 2) get the start time (probably using clock(), since that only takes CPU time into account), 3) execute the second loop 100 (or more... enough to last at least a couple of seconds) times, 4) get the end time, and finally 5) output the sum of the elements in the vector.
With regard to the differences you may find: independently of the floating point hardware, the 64-bit machine has more general-purpose registers for the compiler to play with. This could have an enormous impact. Unless you look at the generated assembler, you just can't know.
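A minimal sketch of that benchmark structure (my own illustration of the five steps above, not code from the original answer) could look like this; the sum printed at the end is only there so the optimizer cannot discard the work:

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>

int main()
{
    // 1) initialize the vector once, outside the timed region
    std::vector<float> v(1000000);
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);

    // 2) get the start time with clock(), which counts CPU time only
    std::clock_t start = std::clock();

    // 3) repeat the arithmetic loop enough times to run for a few seconds
    for (int rep = 0; rep < 100; ++rep)
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = (v[i] * 2 - 435.345345f) / 15.75f;

    // 4) get the end time
    std::clock_t end = std::clock();

    // 5) use the results so the work cannot be optimized away
    double sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];

    std::cout << "sum = " << sum << ", CPU time = "
              << static_cast<double>(end - start) / CLOCKS_PER_SEC << " s\n";
    return 0;
}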
Not really valid. You're basically testing the performance of the random number generator.
Also, you're not trying to enforce SSE2 SIMD operation, so you can't really claim this compares anything SSE-related.
Valid in what sense?
Measure actual usage, with your actual code.
Some artificial test suite probably won't help you assess the performance characteristics.
You can use a typedef for the floating-point type, then change the actual underlying type with the flick of a switch, as in the sketch below.
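A small illustration of that idea (names such as real_t and the USE_DOUBLE macro are made up for this sketch):

// real_t is a hypothetical alias; rebuild with -DUSE_DOUBLE to switch the type.
#ifdef USE_DOUBLE
typedef double real_t;
#else
typedef float real_t;
#endif

real_t scale(real_t x)
{
    return (x * 2 - static_cast<real_t>(435.345345)) / static_cast<real_t>(15.75);
}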

Odd results when adding artificial delays to C++ code. Embedded Linux

I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the server's main processing loop is this:
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2ms block, it turns out to be roughly 20ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
The problem is, we changed the way the sleep is used so that we have a bit more control. I mean, a 20ms delay for 2ms of processing doesn't allow us to do much processing. With a new parameter set to a certain value, it does something like this:
For i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code; in reality the number of sleeps is slightly different, and it happens to work out at a 24s cycle to get through all the items - just as before.
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well, perhaps this CPU reporting isn't accurate, so I thought I'd sort a large text file at the same time and see how much our server slows it down. Sorting is a CPU-bound process (98% CPU usage according to top). The results are very odd. With the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code: it only slows the sort down by about 12%, but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results? It seems that the server which sleeps more frequently, but for a minimal time, is getting more priority.
I have a half-baked theory on the last one - whatever system call is used to do the "sleep" switches back to the server process when the time has elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: I replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() behaves the same.
Well, I can at least answer problem 1 and problem 2 (as they are really the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU usage reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a stand-alone program that doesn't use Poco or any of our code - just plain Linux system calls and standard C++ features. It implements the pseudo code above. The processing is replaced with a tight loop that just checks the elapsed time to see whether 2ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem, i.e. doing the same amount of processing but splitting up the way the sleep function is called produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds when using 1000 20ms sleeps, but 1.96875 seconds when a less frequent 1000ms sleep was used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it is clearly a problem with our embedded system and the way it measures CPU time when a process yields so often (you get the same problem with sched_yield() instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
// Headers needed for std::cout, gettimeofday, clock_gettime and usleep.
#include <iostream>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

int sleepCount;

double getCPUTime()
{
    clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
    struct timespec ts;
    if (id != (clockid_t)-1 && clock_gettime(id, &ts) != -1)
        return (double)ts.tv_sec +
               (double)ts.tv_nsec / 1000000000.0;
    return -1;
}

double GetElapsedMilliseconds(const timeval& startTime)
{
    timeval endTime;
    gettimeofday(&endTime, NULL);
    double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0;   // sec to ms
    elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0;       // us to ms
    return elapsedTime;
}

void SleepMilliseconds(int milliseconds)
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    usleep(milliseconds * 1000);
    double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
    if (elapsedMilliseconds > milliseconds + 0.3)
        std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
    sleepCount++;
}

void DoSomeProcessingForAnItem()
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    double processingTimeMilliseconds = 2.0;
    double elapsedMilliseconds;
    do
    {
        elapsedMilliseconds = GetElapsedMilliseconds(startTime);
    } while (elapsedMilliseconds <= processingTimeMilliseconds);
    if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
        std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}

void RunLoop(bool longSleep)
{
    int numberOfItems = 1000;
    timeval startTime;
    gettimeofday(&startTime, NULL);
    timeval startMainLoopTime;
    gettimeofday(&startMainLoopTime, NULL);
    for (int i = 0; i < numberOfItems; i++)
    {
        DoSomeProcessingForAnItem();
        double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
        if (elapsedMilliseconds > 100)
        {
            std::cout << "Item count = " << i << "\n";
            if (longSleep)
            {
                SleepMilliseconds(1000);
            }
            gettimeofday(&startTime, NULL);
        }
        if (longSleep == false)
        {
            // Does 1000 * 20 ms sleeps.
            SleepMilliseconds(20);
        }
    }
    double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
    std::cout << "Main loop took " << elapsedMilliseconds / 1000 << " seconds\n";
}

void DoTest(bool longSleep)
{
    timeval startTime;
    gettimeofday(&startTime, NULL);
    double startCPUtime = getCPUTime();
    sleepCount = 0;
    int runLoopCount = 1;
    for (int i = 0; i < runLoopCount; i++)
    {
        RunLoop(longSleep);
        std::cout << "**** Done one loop of processing ****\n";
    }
    double endCPUtime = getCPUTime();
    std::cout << "Elapsed time is " << GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
    std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
    std::cout << "Sleep count " << sleepCount << "\n";
}

void testLong()
{
    std::cout << "Running testLong\n";
    DoTest(true);
}

void testShort()
{
    std::cout << "Running testShort\n";
    DoTest(false);
}

How to handle caching while timing an operation in C++ on Linux

I have to time the clock_gettime() function itself in order to estimate and profile other operations. It's for homework, so I can't use a profiler and have to write my own code.
The way I'm doing it is shown below:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
for (int i = 0; i <= n; i++)
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
cout << (end.tv_nsec - begin.tv_nsec) / n;   // time per clock_gettime()
The problem is that when n=100 the output is 370.63 ns; when n=100000 it is 330 ns; when n=1000000 it is 260 ns; when n=10000000 it is 55 ns; ... it keeps decreasing.
I understand that this is happening because of instruction caching, but I don't know how to handle it when profiling. For example, when I estimate the time of a function call using gettime, how do I know how much time gettime itself used?
Would taking a weighted mean of all these values be a good idea? (I could run the operation I want to measure the same number of times, take a weighted mean of that, subtract the weighted mean of gettime, and get a good estimate of the operation irrespective of caching.)
Any suggestions are welcome.
Thank you in advance.
When you compute the time difference as (end.tv_nsec - begin.tv_nsec)/n,
you are only taking the nanoseconds part of the elapsed time into account. You must also take the seconds into account, since the tv_nsec field only reflects the fractional part of a second:
int64_t end_ns = ((int64_t)end.tv_sec * 1000000000) + end.tv_nsec;
int64_t begin_ns = ((int64_t)begin.tv_sec * 1000000000) + begin.tv_nsec;
int64_t elapsed_ns = end_ns - begin_ns;
Actually, with your current code you should sometimes get negative results: whenever the nanoseconds part of end has wrapped around and is less than begin's nanoseconds part.
Fix that, and you'll be able to observe much more consistent results.
Edit: for the sake of completeness, here's the code I used for my tests, which gives me very consistent results (between 280 and 300 ns per call, whatever number of iterations I use):
#include <cstdint>
#include <iostream>
#include <time.h>

int main() {
    const int loops = 100000000;
    struct timespec begin;
    struct timespec end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
    for (int i = 0; i < loops; i++)
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    int64_t end_ns = ((int64_t)end.tv_sec * 1000000000) + end.tv_nsec;
    int64_t begin_ns = ((int64_t)begin.tv_sec * 1000000000) + begin.tv_nsec;
    int64_t elapsed_ns = end_ns - begin_ns;
    int64_t ns_per_call = elapsed_ns / loops;
    std::cout << ns_per_call << std::endl;
}

<time.h> / <ctime> are not counting ticks

EDIT: It appears to be functioning now. The code has been updated to show my revisions. Thank you all for your help.
I imagine I'm just being stupid, but I'm attempting to use ctime to count CPU ticks through my entire program. I'm writing an encryption algorithm for a school project and I'm trying to include a timer so that I can add noise processes, equalizing the amount of time taken across different key/plaintext combinations.
Here is a little test for ctime:
#include <iostream>
#include <string>
#include <ctime>

int main (int argc, char **argv)
{
    double elapsedTime;
    const clock_t start = clock();
    int uselessInt = 0;
    for (int i = 0; i <= 200; i++)
    {
        uselessInt = uselessInt * 2 / 3 + i;
        std::cout << uselessInt << std::endl;
    }
    clock_t end = clock();
    elapsedTime = static_cast<double>(end - start);
    std::cout << elapsedTime << " CPU ticks have elapsed since this application's initiation." << std::endl;
    return 0;
}
which prints:
0
1
2
4
/* ... long list of numbers ... */
591
594
0 CPU ticks have elapsed since this application's initiation.
[smalltock#localhost Desktop]$
I am using GCC (g++), and it appears that ctime/time.h simply isn't counting ticks the way I want it to. Can anybody identify the problem? I'm a relative amateur in this language.
My two cents: when you do cin.get() (in the original version of the code, before the edit above), it waits for you to input something on the console. Did you actually type anything, or simply press Enter?
I ran your code without typing any text, simply pressing Enter, and got the following output:
Test Text
It's a stone, Luigi... you didn't make it.
0 CPU ticks have elapsed since this application's initiation.
Real 0m0.700s
User 0m0.000s
Sys 0m0.061s
It may be because the resolution of clock() (1/CLOCKS_PER_SEC) is quite coarse compared to the small amount of CPU time your program actually uses.
Also, there is a syntax error in the duration line: you either missed a closing ) or should delete the first (.
BTW:
Real is wall clock time - time from start to finish of the call.
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process.
Sys is the amount of CPU time spent in the kernel within the process.
So you basically use 0 CPU time, because you are just waiting for I/O; there is no CPU computation.
elapsedTime in your program is a measure of time in seconds, not a count of clock ticks. If you want ticks, use duration.
Since your program (presumably) spends the vast majority of its time blocked on I/O, not very many seconds are going to have gone by.
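As a small illustration of the CPU-time point (my own sketch, not code from the thread): a CPU-bound loop does produce a non-zero clock() reading, and dividing the tick count by CLOCKS_PER_SEC converts it into seconds.

#include <ctime>
#include <iostream>

int main()
{
    const std::clock_t start = std::clock();

    // CPU-bound busy work (no I/O), so clock() actually accumulates ticks.
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x = x + static_cast<double>(i) * 0.5;

    const std::clock_t end = std::clock();
    std::cout << (end - start) << " ticks = "
              << static_cast<double>(end - start) / CLOCKS_PER_SEC
              << " seconds of CPU time\n";
    return 0;
}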

Calculating the time between operations in C++

The program is a middleware between a database and an application. For each database access I must calculate the elapsed time in milliseconds. The example below uses TDateTime from the C++Builder library, but I must, as far as possible, use only standard C++ libraries.
AnsiString TimeInMilliseconds(TDateTime t) {
    Word Hour, Min, Sec, MSec;
    DecodeTime(t, Hour, Min, Sec, MSec);
    long ms = MSec + Sec * 1000 + Min * 1000 * 60 + Hour * 1000 * 60 * 60;
    return IntToStr(ms);
}

// computing times
TDateTime SelectStart = Now();
sql_manipulation_statement();
TDateTime SelectEnd = Now();
On both Windows and POSIX-compliant systems (Linux, OS X, etc.), you can measure elapsed time in units of 1/CLOCKS_PER_SEC (timer ticks) using clock(), found in <ctime>. The return value of that call is the time elapsed since the program started running, expressed in timer ticks; two calls to clock() can then be subtracted from each other to calculate the running time of a given block of code.
So for example:
#include <ctime>
#include <cstdio>

clock_t time_a = clock();
//...run block of code
clock_t time_b = clock();

if (time_a == ((clock_t)-1) || time_b == ((clock_t)-1))
{
    perror("Unable to calculate elapsed time");
}
else
{
    unsigned int total_time_ticks = (unsigned int)(time_b - time_a);
}
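Since the question asks for milliseconds, the tick count from the snippet above can be converted like this (a small follow-up sketch reusing the time_a/time_b variables):

// CLOCKS_PER_SEC ticks make up one second, so scale by 1000 to get milliseconds.
double elapsed_ms = 1000.0 * (double)(time_b - time_a) / CLOCKS_PER_SEC;
std::printf("block took %.3f ms\n", elapsed_ms);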
Edit: You are not going to be able to directly compare the timings from a POSIX-compliant platform to a Windows platform, because on Windows clock() measures wall-clock time, whereas on a POSIX system it measures elapsed CPU time. But it is a function in the standard C++ library, and for comparing the performance of different blocks of code on the same platform it should fit your needs.
On Windows you can use GetTickCount() (MSDN), which gives the number of milliseconds that have elapsed since the system was started. Calling it before and after the database call gives you the number of milliseconds the call took.
DWORD start = GetTickCount();
//Do your stuff
DWORD end = GetTickCount();
cout << "the call took " << (end - start) << " ms";
Edit:
As Jason mentioned, clock() would be better because it is not Windows-specific.