How to effectively measure difference in a run-time - c++

One of the exercises in TC++PL asks:
Write a function that either returns a value or that throws that value based on an argument. Measure the difference in run-time between the two ways.
It's a great pity he never explains how to measure such things. I'm not sure whether I'm supposed to write a simple "time start, time end" counter, or whether there are more effective and practical ways?

For each of the functions,
Get the start time
Call the function a million times (or more... a million isn't that many, really)
Get the end time and subtract from it the start time
and compare the results. That's about as practical as performance measuring gets.
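For instance, here is a minimal sketch of that recipe using std::chrono; returnValue and throwValue are made-up stand-ins for the two versions of the exercise's function:
#include <chrono>
#include <iostream>

// Hypothetical pair of functions for the exercise: one returns the value, one throws it.
int returnValue(int x) { return x; }
int throwValue(int x) { throw x; }

int main()
{
    const int iterations = 1000000;
    volatile long long sink = 0;   // keep the results "used" so the loops aren't optimized away

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink += returnValue(i);
    auto middle = std::chrono::steady_clock::now();

    for (int i = 0; i < iterations; ++i) {
        try { throwValue(i); } catch (int v) { sink += v; }
    }
    auto end = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "return: " << ms(middle - start).count() << " ms\n"
              << "throw:  " << ms(end - middle).count() << " ms\n";
    return 0;
}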

Consider using boost.timer; it is about as simple as it gets.
#include <iostream>
#include <boost/timer.hpp>

int main()
{
    boost::timer timer;   // starts timing on construction
    for (int i = 0; i < 100000; ++i) {
        // ... whatever you want to measure
    }
    std::cout << timer.elapsed() << " seconds.\n";
}

He doesn't explain it because that's part of the exercise.
Seriously, though, I believe you should write a simple "time start, time end" measurement with a big loop inside.

The only measurement that ever matters is wall-clock time. Don't let anyone else trick you into believing anything else. Read Lean Thinking if you don't believe it yourself.

The simplest way would be to get the system time before and after your run. If that doesn't give enough resolution, you can look into some of the higher-resolution system timers.
If you need more details, you can use a commercial product like dotTrace: http://www.jetbrains.com/profiler/

Related

Dealing with heavy profiling of execution times in C++

I'm currently working on a scientific computing project involving huge data and complex algorithms, so I need to do a lot of code profiling. I'm currently relying on <ctime> and clock_t to time the execution of my code. I'm perfectly happy with this solution... except that I'm basically timing everything and thus for every line of real code I have to call start_time_function123 = clock(), end_time_function123 = clock() and cout << "function123 execution time: " << (end_time_function123-start_time_function123) / CLOCKS_PER_SEC << endl. This leads to heavy code bloating and quickly makes my code unreadable. How would you deal with that?
The only solution I can think of would be to find an IDE allowing me to mark portions of my code (at different locations, even in different files) and to toggle hide/show all marked code with one button. This would allow me to hide the part of my code related to profiling most of the time and display it only whenever I want to.
Have a RAII type that marks code as timed.
#include <ctime>
#include <iostream>

struct timed {
    char const* name;
    clock_t start;

    timed( char const* name_to_record ):
        name(name_to_record),
        start(clock())
    {}

    ~timed() {
        auto end = clock();
        std::cout << name << " execution time: "
                  << double(end - start) / CLOCKS_PER_SEC << std::endl;
    }
};
Then use it:
void foo() {
    timed timer(__func__);
    // code
}
Far less noise.
You can augment this with non-scope-based finish operations. When doing heavy profiling I sometimes like to include unique ids. Using cout, especially with endl, could end up dominating the timing; fast recording into a large buffer that is dumped out asynchronously may be optimal. If you need to time things at the millisecond level, even allocation, locks and string manipulation should be avoided. A rough sketch of such a buffered recorder follows.
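For example (a minimal sketch, not from the original answer; TimingLog, record and timed_buffered are made-up names, and the log is simply flushed when the program exits):
#include <cstdio>
#include <ctime>
#include <vector>

// Hypothetical buffered log: preallocated, filled while timing, flushed once at program exit.
struct TimingLog {
    struct Entry { char const* name; clock_t ticks; };
    std::vector<Entry> entries;

    TimingLog() { entries.reserve(1 << 20); }   // avoid reallocations while timing

    void record(char const* name, clock_t ticks) {
        entries.push_back(Entry{name, ticks});
    }

    ~TimingLog() {                              // dump everything at the end
        for (const Entry& e : entries)
            std::printf("%s execution time: %f\n", e.name, double(e.ticks) / CLOCKS_PER_SEC);
    }
};

TimingLog g_log;

// Same RAII idea as above, but records into the buffer instead of writing to cout.
struct timed_buffered {
    char const* name;
    clock_t start;
    timed_buffered(char const* n) : name(n), start(clock()) {}
    ~timed_buffered() { g_log.record(name, clock() - start); }
};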
You don't say so explicitly, but I assume you are looking for possible speedups - ways to reduce the time it takes.
You think you need to do this by measuring how much time different parts of it take. If you're interested, there's an orthogonal way to approach it.
Just get it running under a debugger (using a non-optimized debug build).
Manually interrupt it at random, by Ctrl-C, Ctrl-Break, or the IDE's "pause" button.
Display the call stack and carefully examine what the program is doing, at all levels.
Do this with a suspicion that whatever it's doing could be something wasteful that you could find a better way to do.
Then if you start it up again, and halt it again, and see it doing the same thing or something similar, you know you will get a substantial speedup if you fix it.
The fewer samples you took to see that thing twice, the more speedup you will get.
That's the random pausing technique, and the statistical rationale is here.
The reason you do it on a debug build is here.
After you've cut out the fat using this method, you can switch to an optimized build and get the extra margin it gives you.

C++ precise measure time in decimal ms

I am measuring the time of sorting algorithms like Bubble, Insertion, Selection and Quick sort.
I am using this for my purpose
long int before = GetTickCount();
QuickSort(pole,0,dlzka-1);
long int after = GetTickCount();
double dif = double((after - before));
cout << "Quick Sort with time "<< dif << " ms " << endl;
I am sorting an array of 30,000 integers, and it works fine for the other sorts, but QuickSort is probably so fast that it sorts 30k integers in less than 1 ms, and then my timer says 0 ms, which looks like a mistake.
I want it to print, for example, 0.01 ms, just to show that it ran correctly.
Thank you.
When you benchmark, you never benchmark just one run. Your timer is not precise/accurate enough to give meaningful results across that tiny amount of time.
For example, the documentation for GetTickCount says:
The resolution of the GetTickCount function is limited to the resolution of the system timer, which is typically in the range of 10 milliseconds to 16 milliseconds.
So, it is plainly obvious that obtaining a value of 0.01ms is folly.
Instead, benchmark many runs, then divide by the number of times you ran it.
Put your code into a loop that you run 1000 times, with the clock started and stopped outside of that loop. Then divide the result by 1000. Or, if you like, the result of the clock will now be in µs instead of in ms.
If your loop is very fast, you may need more than 1000 repetitions to get a meaningful measurement. You could run 10,000, 100,000, ... etc times until you get a "reasonable number of milliseconds".
When the piece of code you are testing is very fast, the overhead of the loop may become significant; in that case, you might run an "empty loop" and subtract the two results to give you the "net" timing of the inner part of the loop only.
It is rare, however, that this is something you need to do - most often you are trying to compare different algorithms, and as long as the overhead of the loop is the same it doesn't matter that it exists - the faster algorithm will still be faster.
One more thought - and this is pretty important: if you sort things in the first pass through the loop, and your algorithm speed depends on whether the data is sorted or not, you will get a different answer for multiple passes than you get for a single pass. Thus you need to make sure that you are using the same inputs for every pass through the algorithm. This might mean that you cannot use in-place sorting, or that you copy the unsorted data back into the "to be sorted" array at every pass of the algorithm.
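As a rough sketch of that approach (QuickSort, pole and dlzka are the asker's names; copying the unsorted input back before each pass is one way to reset the data, as assumed here):
#include <algorithm>
#include <iostream>
#include <vector>
#include <windows.h>   // GetTickCount

void QuickSort(int* pole, int left, int right);   // the asker's existing sort

void benchmarkQuickSort(const std::vector<int>& unsorted)
{
    const int runs = 1000;
    std::vector<int> pole(unsorted);

    DWORD before = GetTickCount();
    for (int r = 0; r < runs; ++r) {
        // restore the same unsorted input every pass, so each run does the same work
        std::copy(unsorted.begin(), unsorted.end(), pole.begin());
        QuickSort(&pole[0], 0, int(pole.size()) - 1);
    }
    DWORD after = GetTickCount();

    // average per run; to be stricter, time a copy-only loop as well and subtract it
    std::cout << "Quick Sort average time: "
              << double(after - before) / runs << " ms" << std::endl;
}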
Other option: there is a good article on high-precision timing that explains the use of the clock_gettime() function, with its various options and flavors. On some systems this will allow you to use higher-resolution measurements. It is still always a good idea to do multiple runs, or even multiple runs of multiple runs, so you can compute statistics and thus come up with a confidence interval.
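A minimal sketch of clock_gettime with CLOCK_MONOTONIC (POSIX only, so not available on Windows):
#include <time.h>      // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <iostream>

int main()
{
    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ... code to measure ...

    clock_gettime(CLOCK_MONOTONIC, &stop);
    double ms = (stop.tv_sec - start.tv_sec) * 1000.0
              + (stop.tv_nsec - start.tv_nsec) / 1.0e6;
    std::cout << "Elapsed: " << ms << " ms" << std::endl;
    return 0;
}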
If you are using C++11:
std::chrono::high_resolution_clock represents the clock with the smallest tick period provided by the implementation.
If your compiler supports C++11 with std::chrono, this is the best way to measure time at high accuracy; it is cross-platform and part of the standard library.
#include <chrono>
#include <iostream>
#include <iomanip>

void doWork();   // whatever you want to measure

int main()
{
    std::chrono::steady_clock::time_point startTime = std::chrono::steady_clock::now();
    doWork();
    std::chrono::steady_clock::duration elapsedTime = std::chrono::steady_clock::now() - startTime;

    double seconds = std::chrono::duration_cast< std::chrono::duration<double> >(elapsedTime).count();
    std::cout << std::fixed << std::setprecision(9)
              << "Milliseconds: " << seconds * 1000 << std::endl;
}
To use C++11 in GCC, you run g++ -std=c++11 -o app main.cpp. For the Visual Studio compiler, you need 2012 or higher to use chrono.

C++ , Timer, Milliseconds

#include <iostream>
#include <conio.h>
#include <ctime>
using namespace std;

double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks) / (CLOCKS_PER_SEC / 1000);
    return diffms;
}

int main()
{
    clock_t start = clock();
    for (int i = 0;; i++)
    {
        if (i == 10000) break;
    }
    clock_t end = clock();
    cout << diffclock(start, end) << endl;
    getch();
    return 0;
}
So my problem comes down to it returning 0. To be straight, I want to check how much time my program takes to run...
I found tons of crap over the internet; mostly it comes down to the same point of getting 0 because the start and the end are the same.
This problem is in C++, remember :<
There are a few problems here. The first is that you obviously switched the start and stop times when passing them to the diffclock() function. The second problem is optimization. Any reasonably smart compiler with optimizations enabled would simply throw the entire loop away, as it does not have any side effects. But even if you fix the above problems, the program would most likely still print 0. If you consider that modern CPUs do billions of operations per second, and throw in sophisticated out-of-order execution, branch prediction and tons of other technologies, even the CPU itself may effectively make your loop vanish. But even if it doesn't, you'd need a lot more than 10K iterations to make it run longer. You'd probably need your program to run for a second or two for clock() to reflect anything.
But the most important problem is clock() itself. That function is not suitable for any kind of performance measurement whatsoever. What it does is give you an approximation of the processor time used by the program. Aside from the vague nature of the approximation method that might be used by any given implementation (since the standard doesn't require anything specific of it), the POSIX standard also requires CLOCKS_PER_SEC to be equal to 1000000, independent of the actual resolution. In other words, it doesn't matter how precise the clock is, and it doesn't matter at what frequency your CPU is running. To put it simply, it is a totally useless number and therefore a totally useless function. It probably only still exists for historical reasons. So, please do not use it.
To achieve what you are looking for, people used to read the CPU time stamp counter, also known as "RDTSC" after the name of the corresponding CPU instruction used to read it. These days, however, this is also mostly useless because:
Modern operating systems can easily migrate the program from one CPU to another. You can imagine that reading the time stamp on one CPU after running for a second on another doesn't make a lot of sense. Only in the latest Intel CPUs is the counter synchronized across CPU cores. All in all, it is still possible to do this, but a lot of extra care must be taken (e.g. one can set up the affinity for the process, etc.).
Measuring the CPU instructions of the program oftentimes does not give an accurate picture of how much time it is actually using. This is because in real programs there could be system calls where the work is performed by the OS kernel on behalf of the process. In that case, that time is not included.
It could also happen that the OS suspends execution of the process for a long time. And even though it took only a few instructions to execute, to the user it seemed like a second. So such a performance measurement may be useless.
So what to do?
When it comes to profiling, a tool like perf must be used. It can track the number of CPU clocks, cache misses, branches taken, branches missed, the number of times the process was moved from one CPU to another, and so on. It can be used as a standalone tool, or embedded into your application (with something like PAPI).
And if the question is about actual time spent, people use a wall clock. Preferably a high-precision one that is also not subject to NTP adjustments (i.e. monotonic). That shows exactly how much time elapsed, no matter what was going on. For that purpose clock_gettime() can be used. It is part of SUSv2 and the POSIX.1-2001 standard. Given that you use getch() to keep the terminal open, I'd assume you are using Windows. There, unfortunately, you don't have clock_gettime(), and the closest thing is the performance counter API:
BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency);
BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount);
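A minimal usage sketch of those two calls (Windows-only; an illustration rather than part of the original answer):
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER frequency, start, stop;
    QueryPerformanceFrequency(&frequency);   // counts per second

    QueryPerformanceCounter(&start);
    // ... code to measure ...
    QueryPerformanceCounter(&stop);

    double ms = double(stop.QuadPart - start.QuadPart) * 1000.0 / frequency.QuadPart;
    std::cout << "Elapsed: " << ms << " ms" << std::endl;
    return 0;
}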
For a portable solution, the best bet is std::chrono::high_resolution_clock. It was introduced in C++11, but is supported by most industrial-grade compilers (GCC, Clang, MSVC).
Below is an example of how to use it. Please note that since I know my CPU will do 10000 increments of an integer far faster than a millisecond, I have changed the output to microseconds. I've also declared the counter as volatile in the hope that the compiler won't optimize it away.
#include <ctime>
#include <chrono>
#include <iostream>

int main()
{
    volatile int i = 0;   // "volatile" is to ask the compiler not to optimize the loop away
    auto start = std::chrono::steady_clock::now();
    while (i < 10000) {
        ++i;
    }
    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "It took me " << elapsed.count() << " microseconds." << std::endl;
}
When I compile and run it, it prints:
$ g++ -std=c++11 -Wall -o test ./test.cpp && ./test
It took me 23 microseconds.
Hope it helps. Good Luck!
At a glance, it seems like you are subtracting the larger value from the smaller value. You call:
diffclock( start, end );
But then diffclock is defined as:
double diffclock( clock_t clock1, clock_t clock2 ) {
    double diffticks = clock1 - clock2;
    double diffms = diffticks / ( CLOCKS_PER_SEC / 1000 );
    return diffms;
}
Apart from that, it may have something to do with the way you are converting units. The use of 1000 to convert to milliseconds differs from what is shown on this page:
http://en.cppreference.com/w/cpp/chrono/c/clock
The problem appears to be that the loop is just too short. I tried it on my system and it gave 0 ticks. I checked what diffticks was and it was 0. Increasing the loop size to 100000000 gave a noticeable time lag, and I got -290 as output (a bug: I think diffticks should be clock2 - clock1, so we should get 290 and not -290). I also tried changing "1000" to "1000.0" in the division and that didn't work.
Compiling with optimization does remove the loop, so you either have to not use optimization, or make the loop "do something", e.g. increment a counter other than the loop counter in the loop body. At least that's what GCC does.
Note: this is available since C++11.
You can use the std::chrono library.
std::chrono has two distinct concepts: time points and durations. A time point represents a point in time, and a duration, as the term suggests, represents an interval or span of time.
This C++ library allows us to subtract two time points to get the duration of the interval between them. So you can set a starting point and a stopping point, and convert the result into appropriate units.
Example using high_resolution_clock (which is one of the three clocks this library provides):
#include <chrono>
using namespace std::chrono;
//before running function
auto start = high_resolution_clock::now();
//after calling function
auto stop = high_resolution_clock::now();
Subtract stop and start timepoints and cast it into required units using the duration_cast() function. Predefined units are nanoseconds, microseconds, milliseconds, seconds, minutes, and hours.
auto duration = duration_cast<microseconds>(stop - start);
std::cout << duration.count() << std::endl;
First of all, you should subtract end - start, not vice versa.
The documentation says that if the value is not available, clock() returns -1; did you check that?
What optimization level do you use when compiling your program? If optimization is enabled, the compiler can eliminate your loop entirely.

Best way to test code speed in C++ without profiler, or does it not make sense to try?

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved and most Q & A ignore all but a few at a time, or don't justify their proposals.
What I'm wondering about: if I have two functions that do the same thing, and I'm curious about the difference in speed, does it make sense to test this without external tools, with timers, or will this compiled-in testing affect the results too much?
I ask this because, if it is sensible, as a C++ programmer I want to know how it should best be done, since timers are much simpler than external tools. If it makes sense, let's proceed with all the possible pitfalls:
Consider this example. The following code shows 2 ways of doing the same thing:
#include <algorithm>
#include <ctime>
#include <iostream>
typedef unsigned char byte;
inline
void
swapBytes( byte* in, size_t n )
{
    for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
        in[lo] ^= in[hi]
        , in[hi] ^= in[lo]
        , in[lo] ^= in[hi] ;
}

int
main()
{
    byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
    const int iterations = 100000000;

    clock_t begin = clock();
    for( int i=iterations; i!=0; --i )
        swapBytes( arr, 8 );
    clock_t middle = clock();
    for( int i=iterations; i!=0; --i )
        std::reverse( arr, arr+8 );
    clock_t end = clock();

    double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
    double secReve = (double) ( end-middle  ) / CLOCKS_PER_SEC;
    std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
              << " clock ticks, which is: " << secSwap << "sec." << std::endl;
    std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
              << " clock ticks, which is: " << secReve << "sec." << std::endl;
    std::cin.get();
    return 0;
}
// Output:
// Release:
// swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
// std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
// Debug:
// swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
// std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.
The issues:
Which timers to use and how get the cpu time actually consumed by the code under question?
What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code from the vc++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would this also be expected to run twice as fast as the custom function, and if so, why?
Contemplations:
It seems two high-precision timers are being proposed: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the GNU C library seems to contradict that, but when I put a breakpoint in vc++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under GNU). Am I missing alternative counters for this, or do we need at least special libraries or classes for this? If not, is clock good enough in this example, or would there be a reason to use QueryPerformanceCounter?
What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know why from theory than from testing.
Thanks for any directions.
update
Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy in the case of a byte needs only a register, and is thus very fast. Elegance can blind you.
inline
void
swapBytes( byte* in, size_t n )
{
    byte t;
    for( int i=0; i<7-i; ++i )
    {
        t = in[i];
        in[i] = in[7-i];
        in[7-i] = t;
    }
}
Thanks to a tip from ChrisW I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.
Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.
I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:
Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)
Test when the machine is more or less relatively idle except for whatever I'm testing.
Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
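One programmatic way to read per-thread CPU time on Windows (not from the original answer; just a sketch using the Win32 GetThreadTimes call):
#include <windows.h>

// Returns the CPU time (user + kernel) consumed so far by the calling thread, in seconds.
double threadCpuSeconds()
{
    FILETIME creation, exit, kernel, user;
    GetThreadTimes(GetCurrentThread(), &creation, &exit, &kernel, &user);

    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;

    // FILETIME counts 100-nanosecond intervals.
    return double(k.QuadPart + u.QuadPart) * 100e-9;
}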
What can we know for certain without debugging, disassembling and profiling tools?
Nearly nothing (except that I/O tends to be relatively slow).
To answer your main question: the "reverse" algorithm just swaps elements of the array, rather than operating on the elements of the array.
Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter accuracy depends on the CPU, but it can go up to per-clock-pulse resolution. However, profiling real-world operations is always a better idea.
Is it safe to say you're asking two questions?
Which one is faster, and by how much?
And why is it faster?
For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)
For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.
Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.
(This answer is specific to Windows XP and the 32-bit VC++ compiler.)
The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.
To read the time-stamp counter, use code like the following:
LARGE_INTEGER tsc;
__asm {
    cpuid
    rdtsc
    mov tsc.LowPart,eax
    mov tsc.HighPart,edx
}
(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)
There are four things worth noting about this approach.
Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it. An exercise for the reader; I don't know the details.)
Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)
Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.
Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.
I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect and so be on your way to wisdom.
So, anyway, perhaps I can answer your question about which clock to use on Windows.
clock() is not considered a high precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with 10000's of iterations. As you point out, if you try and repeat a simple method 10000's of times in order to get a time that can be measured with clock() the compiler is liable to step in and optimize the whole thing away.
So, really, the only clock to use is QueryPerformanceCounter()
Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give a trial of VTune a try. Try a call graph sampling test and look at self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).
Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.
As for relying on simplistic timing tests alone, they are extremely problematic. That current test is not so bad but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop or even nothing at all. You should have some knowledge to interpret the disassembly to make sure the compiler isn't doing this.
Also, timing tests like this have a tendency to bias the results significantly, since a lot of them just involve running your code over and over in the same loop, which tends to simply test the effect of your code when all the memory is in the cache and the branch prediction is working perfectly for it. It's often just showing you best-case scenarios without showing you the average, real-world case.
Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.
Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".
Besides, you should be using std::swap in the first place, which completely invalidates this whole pointless pursuit.
-1 for pointlessness.

optimize time(NULL) call in c++

I have a system that spends 66% of its time in a time(NULL) call.
Is there a way to cache or optimize this call?
Context: I'm playing with Protothreads for C++, trying to simulate threads with state machines, so I can't use native threads.
Here's the header:
#ifndef __TIMER_H__
#define __TIMER_H__

#include <time.h>
#include <iostream>

class Timer
{
private:
    time_t initial;
public:
    Timer();
    unsigned long passed();
};

#endif
and the source file:
#include "Timer.h"
using namespace std;
Timer::Timer()
{
initial = time(NULL);
}
unsigned long Timer::passed()
{
time_t current = time(NULL);
return (current - initial);
}
UPDATE:
Final solution!
The CPU cycles are going away somewhere, and if I spend them on being correct, that is not so bad after all.
#define start_timer() timer_start=time(NULL)
#define timeout(x) ((time(NULL)-timer_start)>=x)
I presume you are calling it within some loop which is otherwise stonkingly efficient.
What you could do is keep a count of how many iterations your loop goes through before the return value of time changes.
Then don't call it again until you've gone through that many iterations again.
You can dynamically adjust this count upwards or downwards if you find you're going adrift, but you should be able to engineer it so that on average, it calls time() once per second.
Here's a rough idea of how you might do it (there are many variations on this theme):
int iterations_per_sec = 10;   // wild guess
int iterations = 0;
time_t lasttime = time(NULL);

while (looping)
{
    // do the real work

    // check our timing
    if (++iterations > iterations_per_sec)
    {
        time_t t = time(NULL);
        if (t == lasttime)
        {
            iterations_per_sec++;
        }
        else
        {
            iterations_per_sec = iterations / (t - lasttime);
            iterations = 0;
            lasttime = t;
            // do whatever else you want to do on a per-second basis
        }
    }
}
That sounds like quite a lot, given that time() only has a precision of 1 second. It sounds like you call it way too often. One possible improvement would be to call it only every 500 ms; that way it will still catch every second.
So instead of calling it 100 times a second, start off a timer that rings every 500ms, taking the current time and storing it into an integer. Then, read that integer 100 times a second instead.
As pointed out, you cannot cache it, as the whole point of time() is to give you the current time, which obviously changes all the time.
The real question however probably is: Why is the program calling time() so frequently? I can't think of any good reason to do so.
Is it polling time()? In that case sleep() might be more appropriate.
Call it less often - unless you really need the current time hundreds of times a second, you shouldn't be calling it that often.
EDIT:
After trying it, I'm even more curious. I realize you might be on a small embedded system, but on my system I had no problems running 10,000,000 calls to time() in a second. You're likely doing something seriously wrong, given that time() is only going to change once a second. What exactly are you trying to achieve?
If you're on Unix, you may consider using gettimeofday (http://www.opengroup.org/onlinepubs/000095399/functions/gettimeofday.html) - it's faster and has better precision.
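A minimal sketch of using it as a drop-in for time(NULL) (the currentTime name is made up for illustration):
#include <sys/time.h>   // gettimeofday (POSIX)
#include <ctime>

// Finer-grained way to get the current epoch time on Unix-like systems.
time_t currentTime()
{
    timeval tv;
    gettimeofday(&tv, NULL);   // tv.tv_sec holds the same epoch seconds that time() returns
    return tv.tv_sec;          // tv.tv_usec would give you microseconds on top of that
}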
Caching will not help unless you don't actually need the current time. Can you post some code?
It really depends, but saving the result won't help if you always want the current time. time( NULL ) likely results in a system call, which will take time since you have to switch to/from kernel mode.
What you can do is read the TSC at the same time that you get the current time, then read the TSC again when you want the current time, and add (the number of elapsed cycles / CPU speed) to your saved time.
There are some answers about rdtsc on here that should help you.
Edit: see my answer in Timer to find elapsed time in a function call in C for more information about rdtsc.
Also note that I don't particularly recommend this unless you absolutely have to. It is highly likely that calling rdtsc, subtracting the previous rdtsc value, and converting that to a fractional number of seconds by dividing by your CPU speed will be slower than just calling time() again.
Typically what you can do is save the result of time off into a local variable, and then use that as your current time until you perform some blocking call, or some long running CPU intensive section of code.
What are you doing that you need to call time this often and can you post some code?
You could create a thread which called time() a few times a second and then slept, updating a shared variable.
A quick skim of Protothread implied that it didn't use OS threads, so you might get away with no memory barriers. Otherwise something like an efficient read/write lock should mean it's negligible cost.
You could use a separate thread which would run an endless loop that would sleep() for 1 second (or less if you need finer granularity) and then update the timestamp value.
Other threads would just check this timestamp value without any performance penalty.
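A minimal sketch of that idea using C++11 facilities (std::thread and std::atomic here stand in for whatever threading and synchronization primitives the actual system uses):
#include <atomic>
#include <chrono>
#include <ctime>
#include <thread>

std::atomic<time_t> g_now(time(NULL));   // shared timestamp, updated by the ticker thread
std::atomic<bool>   g_running(true);

void tickerThread()
{
    while (g_running) {
        g_now = time(NULL);
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
}

// Cheap replacement for time(NULL) on the hot path: just an atomic load.
inline time_t cachedTime() { return g_now.load(std::memory_order_relaxed); }

int main()
{
    std::thread ticker(tickerThread);

    // ... main loop calls cachedTime() as often as it likes ...

    g_running = false;
    ticker.join();
}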