Given a while loop and the function ordering as follows:
int k = 0;
int total = 100;
while (k < total) {
    doSomething();
    if (/* approx. t milliseconds elapsed */) { measure(); }
    ++k;
}
I want to perform 'measure' every t milliseconds. However, since 'doSomething' may still be running when the t-th millisecond since the last measurement arrives, it is acceptable to perform the measure after approximately t milliseconds have elapsed since the last measure.
My question is: how could this be achieved?
One solution would be to set a timer to zero and check it after every 'doSomething'. When it is within the acceptable range, I perform the measure and reset the timer. However, I'm not sure which C++ function I should use for such a task. As far as I can see, there are several candidate functions, but the debate on which one is the most appropriate is beyond my understanding. Note that some of these functions also count the time taken by other processes, but I want my timer to measure only the execution time of my C++ code (I hope that is clear). Another thing is the resolution of the measurements, as pointed out below. Assume the medium option of those suggested.
High-resolution timing is platform specific, and you have not specified a platform in the question. The standard library clock() function returns a count that increments at CLOCKS_PER_SEC ticks per second. On some platforms this may be fast enough to give you the resolution you need, but you should check your system's tick rate since it is implementation defined. If you find it is high enough, then:
#include <ctime>

#define SAMPLE_PERIOD_MS 100
#define SAMPLE_PERIOD_TICKS ((CLOCKS_PER_SEC * SAMPLE_PERIOD_MS) / 1000)

int k = 0;
int total = 100;
clock_t measure_time = clock() + SAMPLE_PERIOD_TICKS;

while (k < total)
{
    doSomething();

    if (clock() - measure_time > 0)
    {
        measure();
        measure_time += SAMPLE_PERIOD_TICKS;
    }

    ++k;
}
You might replace clock() with some other high-resolution clock source if necessary.
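For instance, if wall-clock sampling is acceptable (note that std::chrono::steady_clock measures elapsed wall time, not the CPU time consumed by your process), a C++11 sketch of the same polling approach might look like this, with doSomething() and measure() as placeholders from the question:

#include <chrono>

void doSomething();   // placeholder from the question
void measure();       // placeholder from the question

void run()
{
    using Clock = std::chrono::steady_clock;
    const auto sample_period = std::chrono::milliseconds(100);

    int k = 0;
    const int total = 100;
    auto next_measure = Clock::now() + sample_period;

    while (k < total)
    {
        doSomething();

        if (Clock::now() >= next_measure)
        {
            measure();
            next_measure += sample_period;
        }

        ++k;
    }
}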
However, note a couple of issues. This method is a "busy loop"; unless either doSomething() or measure() yields the CPU, the process will take all the CPU cycles it can get. If this is the only code running on the target, that may not matter. On the other hand, if this is running on a general-purpose OS such as Windows or Linux, which are not real-time, the process may be pre-empted by other processes, and this may affect the accuracy of the sampling periodicity. If you need accurate timing, using an RTOS and performing doSomething() and measure() in separate threads would be better, and even on a GPOS that approach would be an improvement. For example, a general pattern (using a made-up API in the absence of any specification) would be:
int main()
{
    StartThread( measure_thread, HIGH_PRIORITY ) ;

    for(;;)
    {
        doSomething() ;
    }
}

void measure_thread()
{
    for(;;)
    {
        measure() ;
        sleep( SAMPLE_PERIOD_MS ) ;
    }
}
The code for measure_thread() is only accurate if measure() takes negligible time to run. If it takes significant time, you may need to account for that. If it is non-deterministic, you may even have to measure its execution time in order to subtract it from the sleep period.
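With standard C++11 threads the same pattern can be written without a made-up API (although std::thread offers no portable priority control, so the HIGH_PRIORITY part is lost). A sketch, with doSomething() and measure() again as placeholders:

#include <chrono>
#include <thread>

void doSomething() { /* placeholder for the real work */ }
void measure()     { /* placeholder for the real sampling */ }

constexpr std::chrono::milliseconds kSamplePeriod(100);

void measure_thread_func()
{
    auto next = std::chrono::steady_clock::now() + kSamplePeriod;
    for (;;)
    {
        measure();
        std::this_thread::sleep_until(next);   // an absolute deadline absorbs the time measure() itself takes
        next += kSamplePeriod;
    }
}

int main()
{
    std::thread sampler(measure_thread_func);
    for (;;)
    {
        doSomething();
    }
    // never reached in this sketch; sampler.join() would be needed if the loops terminated
}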
Related
I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.
There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.
One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).
Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.
Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?
#include <chrono>
#include <iostream>
#include <thread>
void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);
namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;
int main() {
    int* x = new int[itemCount];
    int* y = new int[itemCount];

    // Initialize arrays
    for (int i = 0; i < itemCount; i++) {
        x[i] = 1;
        y[i] = 2;
    }

    // Call add() on multiple threads
    std::thread threads[threadCount];
    startTimer();
    for (int i = 0; i < threadCount; ++i) {
        threads[i] = std::thread(add, x, y, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    stopTimer();

    // Verify results
    for (int i = 0; i < itemCount; ++i) {
        if (y[i] != 3) {
            std::cout << "Error!";
        }
    }

    delete[] x;
    delete[] y;
}

void add(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx = firstIdx + itemsPerThread - 1;
    for (int i = firstIdx; i <= lastIdx; ++i) {
        y[i] = x[i] + y[i];
    }
}

void startTimer() {
    start = ch::steady_clock::now();
}

void stopTimer() {
    auto end = ch::steady_clock::now();
    auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
    std::cout << duration << " ms\n";
}
You may simply be hitting the memory transfer rate of your machine: you are doing 8 GB of reads and 4 GB of writes.
On my machine your test completes in about 500 ms, which works out to 24 GB/s (12 GB moved in half a second) and is similar to the results given by a memory bandwidth tester.
As you hit each memory address with a single read and a single write, the caches aren't much use because you aren't reusing memory.
Your problem is not the processor; you are running up against RAM read and write latency. Your cache can hold only some megabytes of data, and you exceed that storage by far. Multi-threading is only useful for as long as you can keep shovelling data into your processor. The cache in your processor is incredibly fast compared to your RAM, so once you exceed the cache capacity this turns into a RAM latency test.
If you want to see the advantages of multi-threading, you have to choose data sizes within the range of your cache size.
EDIT
Another thing to do would be to create a higher workload for the cores, so that the memory latency goes unnoticed.
Side note: keep in mind that your core has several execution units, one or more for each type of operation - integer, float, shift and so on. That means one core can execute more than one instruction per cycle, in particular one operation per execution unit. You can keep the size of the test data and do more work with it - be creative =) Filling the pipeline with integer operations only will give you an advantage in multi-threading. If you can vary when and where in your code you do the different operations, do it; that will also affect the speedup. Or avoid it, if you want to see a nice speedup from multi-threading.
To avoid any kind of optimization, you should use randomized test data, so that neither the compiler nor the processor itself can predict the outcome of your operations.
Also avoid branches like if and while. Every decision the processor has to predict and execute will slow you down and alter the result; with branch prediction you will never get a deterministic result. Later, in a "real" program, be my guest and do what you want, but while you are exploring the multi-threading world this could lead you to wrong conclusions. A sketch combining these ideas follows.
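To illustrate the points above, a hypothetical compute-heavier variant of the question's add() kernel (not code from the question) could look like this: several branch-free integer operations per element, operating on arrays that were filled with randomized values beforehand (e.g. x[i] = rand()):

// Hypothetical kernel: more integer work per element, so the cores spend more
// time computing relative to waiting on RAM, and no data-dependent branches.
void addHeavy(int* x, int* y, int firstIdx, int lastIdx) {
    for (int i = firstIdx; i < lastIdx; ++i) {
        unsigned v = static_cast<unsigned>(x[i]);
        v = (v * 2654435761u) ^ (v >> 13);   // arbitrary mix of multiplies, shifts and xors
        v = (v * 40503u) + (v << 7);
        y[i] += static_cast<int>(v & 0xffu);
    }
}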
BTW
Please use a delete for every new, to avoid memory leaks. And even better, avoid plain pointers, new and delete altogether: you should use RAII. I advise using std::vector or std::array - in short, a standard container. This will save you tons of debugging time and headaches.
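For example, the two arrays from the question could be declared like this (a sketch; itemCount as in the question's code):

#include <vector>

std::vector<int> x(itemCount, 1);   // allocated and initialized to 1 in one step
std::vector<int> y(itemCount, 2);   // allocated and initialized to 2
// x.data() and y.data() provide raw pointers if add() keeps its int* signature;
// no delete[] is needed - the vectors release their memory automatically.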
Speedup from parallelization is limited by the portion of the task that remains serial. This is called Amdahl's law. In your case, a decent amount of that serial time is spent initializing the array.
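Roughly, if a fraction p of the work can run in parallel on N threads, Amdahl's law caps the speedup at

speedup = 1 / ((1 - p) + p / N)

so with p = 0.5 and N = 8, for example, the best you can get is about 1.8x no matter how many cores you add.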
Are you compiling the code with -O3? If so, the compiler might be able to unroll and/or vectorize some of the loops. The loop strides are predictable, so hardware prefetching might help as well.
You might also want to explore whether using all 8 hyperthreads is useful or whether it's better to run 1 thread per core (I am going to guess that since the problem is memory-bound, you'll likely benefit from all 8 hyperthreads).
Nevertheless, you'll still be limited by memory bandwidth. Take a look at the roofline model. It'll help you reason about the performance and what speedup you can theoretically expect. In your case, you're hitting the memory bandwidth wall that effectively limits the ops/sec achievable by your hardware.
I have decided to compare the times of passing by value and by reference in C++ (g++ 5.4.0) with the following code:
#include <iostream>
#include <sys/time.h>
using namespace std;
int fooVal(int a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int fooRef(int & a) {
    for (size_t i = 0; i < 1000; ++i) {
        ++a;
        --a;
    }
    return a;
}

int main() {
    int a = 0;
    struct timeval stop, start;

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooVal(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    gettimeofday(&start, NULL);
    for (size_t i = 0; i < 10000; ++i) {
        fooRef(a);
    }
    gettimeofday(&stop, NULL);
    printf("The loop has taken %lu microseconds\n", stop.tv_usec - start.tv_usec);

    return 0;
}
It was expected that the fooRef execution would take much more time than the fooVal case because of the need to "look up" the referenced value in memory while performing operations inside fooRef. But the result proved unexpected to me:
The loop has taken 18446744073708648210 microseconds
The loop has taken 99967 microseconds
And the next time I run the code it can produce something like
The loop has taken 97275 microseconds
The loop has taken 99873 microseconds
Most of the time the produced values are close to each other (with fooRef being just a little bit slower), but sometimes spikes like the one in the output from the first run can happen (for both the fooRef and fooVal loops).
Could you please explain this strange result?
UPD: Optimizations were turned off (-O0).
If the gettimeofday() function relies on the operating system clock, that clock is not really designed for dealing with microseconds accurately. It is typically updated only often enough to give the appearance of showing seconds accurately for the purpose of working with date/time values. Sampling at the microsecond level may therefore be unreliable for a benchmark such as the one you are performing.
You should be able to work around this limitation by making your test time much longer; for example, several seconds.
Again, as mentioned in other answers and comments, the effects of which type of memory is accessed (register, cache, main, etc.) and whether or not various optimizations are applied, could substantially impact results.
As with working around the time sampling limitation, you might be able to somewhat work around the memory type and optimization issues by making your test data set much larger such that memory optimizations aimed at smaller blocks of memory are effectively bypassed.
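One further note on the numbers themselves: subtracting only the tv_usec fields wraps around whenever tv_sec rolls over between the two samples, which would also explain the enormous value in the first run. Combining both fields avoids that; a sketch using the same variables as the question's code:

long long elapsed_us = (long long)(stop.tv_sec - start.tv_sec) * 1000000LL
                     + (stop.tv_usec - start.tv_usec);
printf("The loop has taken %lld microseconds\n", elapsed_us);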
Firstly, you should look at the assembly language to see if there are any differences between passing by reference and passing by value.
Secondly, make the functions equivalent by passing by constant reference. Passing by value says that the original variable won't be changed. Passing by constant reference keeps the same principle.
My belief is that the two techniques should be equivalent in both assembly language and performance.
I'm no expert in this area, but I would tend to think that the reason the two times are roughly equivalent is cache memory.
When you need to access a memory location (say, address 0xaabbc125 on an IA-32 architecture), the CPU copies the cache line containing it (for example, addresses 0xaabbc100 to 0xaabbc13f with 64-byte lines) into your cache memory. Reading from and writing to main memory is very slow, but once the data has been copied into your cache, you can access it very quickly. This is useful because programs usually access the same range of addresses over and over.
Since you execute the same code over and over and your code doesn't touch a lot of memory, the first time the function is executed the memory block(s) is (are) copied to your cache once, which probably accounts for most of the 97000 time units. Any subsequent calls to your fooVal and fooRef functions will access addresses that are already in your cache, so they will require only a few nanoseconds (I'd figure roughly between 10 ns and 1 µs). Dereferencing the pointer (since a reference is implemented as a pointer) is about double the time compared to just accessing a value, but it's double of not much anyway.
Someone who is more of an expert may have a better or more complete explanation than mine, but I think this could help you understand what's going on here.
A little idea: try to run the fooVal and fooRef functions a few times (say, 10 times) before setting start and beginning the timed loop. That way (if my explanation is correct!) the memory block will (should) already be in cache when you begin looping, which means you won't be including the initial caching in your timings.
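Something like this, placed before the first gettimeofday(&start, NULL) call (a sketch of the warm-up described above):

// Warm-up: touch the code and data a few times so the timed loops
// start with everything already in the caches.
for (size_t i = 0; i < 10; ++i) {
    fooVal(a);
    fooRef(a);
}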
About the super-high value you got, I can't explain that. But the value is obviously wrong.
It's not a bug, it's a feature! =)
I am trying to read a struct from a binary byte buffer using cast and pack.
I was trying to keep track of the worst-case read time from the in-memory buffer, so I decided to keep a chrono high_resolution_clock nanosecond timer. Whenever the maximum increased I printed the value. It gave me a worst case of about 20 microseconds, which was huge considering the size of the struct.
When I measured the average time taken, it came out to be ~20 nanoseconds. Then I measured how many times I was breaching 50 nanoseconds. It turns out that of the ~20 million reads, I breached 50 nanoseconds only 500 times.
My question is: what can possibly cause this performance fluctuation - an average of 20 ns and a worst case of 20,000 ns?
Secondly, how can I ensure constant-time performance? I am compiling with -O3 and C++11.
// new approach
#pragma pack(push, 1)
typedef struct {
    char a;
    long b, c;
    char d, name[10];
    int  e, f;
    char g, h;
    int  i, j;
} myStruct;
#pragma pack(pop)

// in the function where I am using it
auto am1 = chrono::high_resolution_clock::now();

myStruct* tmp = (myStruct*)cTemp;
tmp->name[9] = 0;   // terminate the last byte of name (index 10 would be past the end of the array)

auto am2 = chrono::high_resolution_clock::now();
chrono::duration<long, nano> arM = chrono::duration_cast<chrono::nanoseconds>(am2 - am1);

if (arM.count() > maxMPO.count())
{
    cout << "myStruct read time increased: " << arM.count() << "\n";
    maxMPO = arM;
}
I am using g++ 4.8 with C++11 on an Ubuntu server.
what can possibly cause this performance fluctuation: avg of 20 and
worst of 20,000?
On a PC (or Mac, or any desktop), there are Ethernet interrupts, timers, mem-refresh, and dozens of other things going on over which you have no (or very little) control.
You might consider changing the target. If you use a single board computer (SBC) with only static ram, and a network connection which you can turn off and disconnect, and timers and clocks and every other kind of interrupt under your software control, you might achieve an acceptable result.
I once worked with a gal who wrote software for an 8085 SBC. When we hooked up a scope and saw the waveform stability of a software controlled bit, I thought she must have added logic chips. It was amazing.
You simply can not achieve 'jitter' free behaviour on a desktop.
What is the most accurate way to calculate the elapsed time in C++? I used clock() to calculate this, but I have a feeling this is wrong as I get 0 ms 90% of the time and 15 ms the rest of it which makes little sense to me.
Even if it is really small and very close to 0 ms, is there a more accurate method that will give me the exact the value rather than a rounded down 0 ms?
clock_t tic = clock();
/*
main programme body
*/
clock_t toc = clock();
double time = (double)(toc-tic);
cout << "\nTime taken: " << (1000*(time/CLOCKS_PER_SEC)) << " (ms)";
Thanks
With C++11, I'd use
#include <chrono>
auto t0 = std::chrono::high_resolution_clock::now();
...
auto t1 = std::chrono::high_resolution_clock::now();
auto dt = 1.e-9*std::chrono::duration_cast<std::chrono::nanoseconds>(t1-t0).count();
for the elapsed time in seconds.
For pre-2011 C++, you can use QueryPerformanceCounter() on Windows or gettimeofday() on Linux/OS X. For example (this is actually C, not C++):
timeval oldCount,newCount;
gettimeofday(&oldCount, NULL);
...
gettimeofday(&newCount, NULL);
double t = double(newCount.tv_sec -oldCount.tv_sec )
+ double(newCount.tv_usec-oldCount.tv_usec) * 1.e-6;
for the elapsed time in seconds.
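A corresponding sketch for Windows with QueryPerformanceCounter() (error checking omitted):

#include <windows.h>

LARGE_INTEGER freq, oldCount, newCount;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&oldCount);
...
QueryPerformanceCounter(&newCount);
double t = double(newCount.QuadPart - oldCount.QuadPart) / double(freq.QuadPart);

for the elapsed time in seconds.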
std::chrono::high_resolution_clock is as portable a solution as you can get, however it may not actually be higher resolution than what you already saw.
Pretty much any function which returns system time is going to jump forward whenever the system time is updated by the timer interrupt handler, and 10ms is a typical interval for that on modern OSes.
For better precision timing, you need to access either a CPU cycle counter or high precision event timer (HPET). Compiler library vendors ought to use these for high_resolution_clock, but not all do. So you may need OS-specific APIs.
(Note: specifically Visual C++ high_resolution_clock uses the low resolution system clock. But there are likely others.)
On Win32, for example, the QueryPerformanceFrequency() and QueryPerformanceCounter() functions are a good choice. For a wrapper that conforms to the C++11 timer interface and uses these functions, see Mateusz's answer to "Difference between std::system_clock and std::steady_clock?".
If you have C++11 available, use the chrono library.
Also, different platforms provide access to high precision clocks.
For example, in linux, use clock_gettime. In Windows, use the high performance counter api.
Example:
C++11:
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
auto start = high_resolution_clock::now();
... // do stuff
auto diff = duration_cast<milliseconds>(high_resolution_clock::now() - start);
clog << diff.count() << "ms elapsed" << endl;
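And an equivalent sketch with clock_gettime() on Linux, using the monotonic clock:

#include <time.h>

struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
... // do stuff
clock_gettime(CLOCK_MONOTONIC, &t1);
double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;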
All 10 questions, worth 5 marks each, need to be answered within the time limit, so the time consumed on each question and the remaining time should be displayed. Can anybody help?
A portable C++ solution would be to use chrono::steady_clock to measure time. This is available in C++11 in the header <chrono>, but may well be available to older compilers in TR1 in <tr1/chrono> or boost.chrono.
The steady clock always advances at a rate "as uniform as possible", which is an important consideration on a multi-tasking multi-threaded platform. The steady clock is also independent of any sort of "wall clock", like the system clock (which may be arbitrarily manipulated at any time).
(Note: if steady_clock isn't in your implementation, look for monotonic_clock.)
The <chrono> types are a bit fiddly to use, so here is a sample piece of code that returns a steady timestamp (or rather, a timestamp from whichever clock you like, e.g. the high_resolution_clock):
template <typename Clock>
long long int clockTick(int multiple = 1000)
{
    typedef typename Clock::period period;
    return (Clock::now().time_since_epoch().count() * period::num * multiple) / period::den;
}
typedef std::chrono::monotonic_clock myclock; // old
typedef std::chrono::steady_clock yourclock; // C++11
Usage:
long long int timestamp_ms = clockTick<myclock>(); // milliseconds by default
long long int timestamp_s = clockTick<yourclock>(1); // seconds
long long int timestamp_us = clockTick<myclock>(1000000); // microseconds
Use time().
This has the limitation that Kerrek has pointed out in his answer. But it's also very simple to use.
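For example:

#include <ctime>

time_t t0 = time(NULL);
... // do stuff
double seconds = difftime(time(NULL), t0);   // whole-second resolution only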