I want to run a loop inside a thread that calculates some data every millisecond. But I am having trouble with the sleep function. It is sleeping much too long.
I created a basic console application in Visual Studio:
#include <windows.h>
#include <tchar.h>
#include <cstdio>
#include <iostream>
#include <chrono>
#include <thread>
using namespace std;
typedef std::chrono::high_resolution_clock Clock;
int _tmain(int argc, _TCHAR* argv[])
{
    int iIdx = 0;
    bool bRun = true;
    auto aTimeStart = Clock::now();
    while (bRun){
        iIdx++;
        if (iIdx >= 500) bRun = false;
        //Sleep(1);
        this_thread::sleep_for(chrono::microseconds(10));
    }
    // count() returns a 64-bit value, so cast it and print it with %lld rather than %i
    printf("Duration: %lld ms\n", static_cast<long long>(chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - aTimeStart).count()));
    cin.get();
    return 0;
}
This prints out: Duration: 5000 ms
The same result is printed when I use Sleep(1);
I would expect the duration to be 500 ms, and not 5000 ms. What am I doing wrong here?
Update:
I was using Visual Studio 2013. Now I have installed Visual Studio 2015, and it's fine: it prints Duration: 500 ms (sometimes 527 ms).
However, this sleep_for still isn't very accurate, so I will look out for other solutions.
The typical time slice used by popular OSs is much longer than 1ms (say 20ms or so); the sleep sets a minimum for how long you want your thread to be suspended not a maximum. Once your thread becomes runnable it is up to the OS when to next schedule it.
If you need this level of accuracy you either need a real time OS, or set a very high priority on your thread (so it can pre-empt almost anything else), or write your code in the kernel, or use a busy wait.
But do you really need to do the calculation every ms? That sort of timing requirement normally comes from hardware. What goes wrong if you bunch up the calculations a bit later?
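The busy-wait option can be combined with an ordinary sleep so that the CPU is only spun for the last part of each interval. The following is a minimal sketch, not a definitive implementation: run_periodic and the 200 microsecond spin_margin are hypothetical names and values chosen for illustration, and it still gives no hard guarantee on a non-real-time OS.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: calls on_tick() `ticks` times, once per `period`.
// Deadlines are absolute, so lateness in one tick does not accumulate.
template <typename Fn>
void run_periodic(std::chrono::microseconds period, std::uint32_t ticks, Fn on_tick)
{
    using clock = std::chrono::steady_clock;
    // Assumed margin: stop sleeping this long before the deadline and spin instead.
    const auto spin_margin = std::chrono::microseconds(200);

    auto deadline = clock::now();
    for (std::uint32_t i = 0; i < ticks; ++i) {
        on_tick();
        deadline += period;
        // Coarse wait: give the CPU back to the scheduler for most of the interval.
        if (deadline - clock::now() > spin_margin)
            std::this_thread::sleep_until(deadline - spin_margin);
        // Fine wait: busy-wait until the deadline has actually passed.
        while (clock::now() < deadline) { }
    }
}
```

Calling run_periodic(std::chrono::microseconds(1000), 500, [] { /* calculation */ }) would correspond to the 500 iterations in the question. If the sleep itself overshoots the deadline, the spin loop exits immediately and the next deadline is still computed from the original schedule.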
On Windows, try timeBeginPeriod: https://msdn.microsoft.com/en-us/library/windows/desktop/dd757624(v=vs.85).aspx
It increases timer resolution.
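A rough sketch of how this could be wrapped, assuming an MSVC build linked against winmm: TimerResolutionGuard is a hypothetical name, and the guard is written so that it compiles to a no-op on other platforms. Each successful timeBeginPeriod call has to be matched by a timeEndPeriod call with the same value, which is why a destructor is used here.

```cpp
#include <chrono>
#include <thread>

#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>                 // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")     // MSVC-specific way to link winmm
#endif

// Hypothetical RAII guard: requests a finer timer resolution for its lifetime.
// On non-Windows builds it does nothing.
class TimerResolutionGuard {
public:
    explicit TimerResolutionGuard(unsigned period_ms) : period_ms_(period_ms) {
#ifdef _WIN32
        active_ = (timeBeginPeriod(period_ms_) == TIMERR_NOERROR);
#endif
    }
    ~TimerResolutionGuard() {
#ifdef _WIN32
        // Undo the request with the same value that was passed in.
        if (active_) timeEndPeriod(period_ms_);
#endif
    }
    TimerResolutionGuard(const TimerResolutionGuard&) = delete;
    TimerResolutionGuard& operator=(const TimerResolutionGuard&) = delete;

    bool active() const { return active_; }
    unsigned period_ms() const { return period_ms_; }

private:
    unsigned period_ms_;
    bool active_ = false;
};
```

Creating a TimerResolutionGuard guard(1); at the top of the timing loop's scope would request 1 ms resolution for as long as guard is alive. Sleeps remain a lower bound, so this narrows the overshoot rather than removing it.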
What am I doing wrong here?
Attempting to use sleep for precise timing.
sleep(n) does not pause your thread for precisely n time then immediately continue.
sleep(n) yields control of the thread back to the scheduler, and indicates that you do not want control back until at least n time has passed.
Now, the scheduler already divvies up thread processing time into time slices, and these are typically on the order of 25 milliseconds or so. That's the bare minimum you can expect your sleep to run.
sleep is simply the wrong tool for this job. Never use it for precise scheduling.
This thread is fairly old, but perhaps someone can still use this code.
It's written for C++11 and I've tested it on Ubuntu 15.04.
class MillisecondPerLoop
{
public:
void do_loop(uint32_t loops)
{
int32_t time_to_wait = 0;
next_clock = ((get_current_clock_ns() / one_ms_in_ns) * one_ms_in_ns);
for (uint32_t loop = 0; loop < loops; ++loop)
{
on_tick();
// Assume on_tick takes less than 1 ms to run
// calculate the next tick time and time to wait from now until that time
time_to_wait = calc_time_to_wait();
// check if we're already past the 1ms time interval
if (time_to_wait > 0)
{
// wait that many ns
std::this_thread::sleep_for(std::chrono::nanoseconds(time_to_wait));
}
++m_tick;
}
}
private:
void on_tick()
{
// TEST only: simulate the work done in every tick
// by waiting a random amount of time
std::this_thread::sleep_for(std::chrono::microseconds(distribution(generator)));
}
uint32_t get_current_clock_ns()
{
return std::chrono::duration_cast<std::chrono::nanoseconds>(
std::chrono::system_clock::now().time_since_epoch()).count();
}
int32_t calc_time_to_wait()
{
next_clock += one_ms_in_ns;
return next_clock - get_current_clock_ns();
}
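static constexpr uint32_t one_ms_in_ns = 1000000L;
uint32_t m_tick = 0;      // initialized so that ++m_tick is well defined
uint32_t next_clock = 0;
// The two members below are used by on_tick() but were missing from the listing;
// they require <random>. The 100-900 microsecond range is an assumed placeholder.
std::default_random_engine generator;
std::uniform_int_distribution<int> distribution{100, 900};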
};
A typical run shows a pretty accurate 1 ms loop with a 1-3 microsecond error. Your PC may be more accurate than this if it has a faster CPU.
Here's typical output:
One Second Loops:
Avg (ns) ms err(ms)
[ 0] 999703 0.9997 0.0003
[ 1] 999888 0.9999 0.0001
[ 2] 999781 0.9998 0.0002
[ 3] 999896 0.9999 0.0001
[ 4] 999772 0.9998 0.0002
[ 5] 999759 0.9998 0.0002
[ 6] 999879 0.9999 0.0001
[ 7] 999915 0.9999 0.0001
[ 8] 1000043 1.0000 -0.0000
[ 9] 999675 0.9997 0.0003
[10] 1000120 1.0001 -0.0001
[11] 999606 0.9996 0.0004
[12] 999714 0.9997 0.0003
[13] 1000171 1.0002 -0.0002
[14] 999670 0.9997 0.0003
[15] 999832 0.9998 0.0002
[16] 999812 0.9998 0.0002
[17] 999868 0.9999 0.0001
[18] 1000096 1.0001 -0.0001
[19] 999665 0.9997 0.0003
Expected total time: 20.0000ms
Actual total time : 19.9969ms
I have a more detailed write up here:
https://arrizza.org/wiki/index.php/One_Millisecond_Loop
I am trying to measure execution time.
I'm on Windows 10 and use the gcc compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
This is my code to measure the time taken by tree->insert().
The insert function works internally as follows (just pseudocode):
insert(){
_load_node(node);
// do something //
_save_node(node, addr);
}
_save_node(n){
ofstream file(name);
file.write(n);
file.close();
}
_load_node(n, addr){
ifstream file(name);
file.read_from(n, addr);
file.close();
}
The actual results are below.
read is the number of _load_node executions.
write is the number of _save_node executions.
time is in nanoseconds.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
I have no idea why I get these results and would like to understand them.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any single measurement will fluctuate a lot. The sane way is to average it out over multiple measurements. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the total by the number of repetitions.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to low resolution and its non-monotonic nature.
Use steady_clock instead. It is guaranteed to be monotonic and in practice has a fine enough resolution to be useful.
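Putting the two suggestions in this thread together (steady_clock plus averaging over many repetitions), a measurement helper might look roughly like this. It is only a sketch: mean_ns_per_call is a hypothetical name, and the per-call figure includes loop overhead.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs `op` `iterations` times and returns the mean
// duration of one call in nanoseconds, measured with steady_clock.
template <typename Op>
double mean_ns_per_call(Op op, std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        op();
    const auto stop = clock::now();
    // Convert the total to a floating-point nanosecond count before dividing.
    const std::chrono::duration<double, std::nano> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}
```

For the code in the question this could be called as mean_ns_per_call([&] { tree->insert(); }, 1000000). Only two clock reads are made for the whole run, so the clock's own resolution matters far less than when timing a single insert.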
I am writing some code which is computationally expensive, but highly parallelisable. Once parallelised, I intend to run it on a HPC, however to keep the runtime down to within a week, the problem needs to scale well, with the number of processors.
Below is a simple and ludicrous example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem;
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;
double int_theta(double E){
double result = 0;
for (int k = 0; k < 20000; k++)
result += E*k;
return result;
}
int main()
{
int n = 3500000;
int counter = 0;
time_t timer;
int start_time = time(&timer);
int myid, numprocs;
int k;
double integrate, result;
double end = 0.5;
double start = -2.;
double E;
double factor = (end - start)/(n*1.);
integrate = 0;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for (k = myid; k<n+1; k+=numprocs){
E = start + k*(end-start)/n;
if (( k == 0 ) || (k == n))
integrate += 0.5*factor*int_theta(E);
else
integrate += factor*int_theta(E);
counter++;
}
cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;
MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
if (myid == 0)
cout<<result<<endl;
MPI_Finalize();
return 0;
}
I have compiled the problem on my quadcore laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output;
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
To me it appears that there must be a barrier somewhere that I am not aware of. I get better performance with 2 processors over 3. Please can somebody offer any advice? Thanks
If I read the lscpu output you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs but only 2 hardware cores (1 socket x 2 cores per socket).
Using cat /proc/cpuinfo, you can find the make and model of the CPU and perhaps find out more.
The four logical CPUs might result from hyperthreading, which means that some hardware resources (e.g. the FPU, but I am not an expert on this) are shared between the two logical CPUs of each core. Thus, I would not expect any good parallel scaling beyond two processes.
For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.
From looking at your code, I would expect perfect scalability to any number of cores - At least as long as you do not include the time needed for process startup and the final MPI_Reduce. These will for sure become slower with more processes involved.
I wrote a program that should print a random number (1-10) every five seconds within a ten-second time frame. But it seems to be printing more than one random number every five seconds. Could anyone point me in the right direction?
clock_t start;
int random;
start = clock();
while (float(clock() - start) / CLOCKS_PER_SEC <= 10.0) {
if (fmod(float(clock() - start) / CLOCKS_PER_SEC, 5) == 0 && (float(clock() - start) / CLOCKS_PER_SEC) != 0) {
random = rand() % 10 + 1;
cout << random << endl;
}
}
return 0;
EDIT: I felt this answer was incomplete, because it does not answer your actual question. The first part now explains why your approach fails, the second part is about how to solve your problem in a better way.
You are using clock() in a way, where you wait for a number of specific points in time. Due to the nature of clock() and the limited precision of float, your check basically is equivalent to saying: Are we in a window [x-eps, x+eps], where x is a multiple of 5 and eps is generally small and depends on the floating point type used and how big (clock() - start) is. A way to increase eps is to add a constant like 1e6 to (clock() - start). If floating point numbers were precise, that should not affect your logic, because 1e6 is a multiple of 5, but in fact it will do so drastically.
On a fast machine, that condition can be true multiple times every 5 seconds; on a slow machine it may not be true every time 5 seconds passed.
The correct way to implement it is below; but if you wanted to do it using a polling approach (like you do currently), you would have to increment start by 5 * CLOCKS_PER_SEC in your if-block and change the condition to something like (clock() - start) / CLOCKS_PER_SEC >= 5.
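For illustration, the polling variant just described could be sketched as follows. count_intervals is a hypothetical helper that takes the interval and the total run time in clock ticks, so the same logic can be tried with short durations; it only counts the firings where the original program would print its random number.

```cpp
#include <ctime>

// Hypothetical helper: polls clock() for `total` ticks and counts how many
// times a full `interval` has elapsed. Note that clock() measures CPU time.
int count_intervals(std::clock_t interval, std::clock_t total)
{
    const std::clock_t begin = std::clock();
    std::clock_t start = begin;
    int fired = 0;
    while (std::clock() - begin <= total) {
        if (std::clock() - start >= interval) {
            start += interval;  // advance by exactly one interval, so this fires once per interval
            ++fired;            // the original program would print its random number here
        }
    }
    return fired;
}
```

With the question's numbers the call would be count_intervals(5 * CLOCKS_PER_SEC, 10 * CLOCKS_PER_SEC). Because the comparison is >= on integer ticks rather than == on a float, a late poll still fires exactly once for that interval.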
Apart from the clock()-specific issues that you have, I want to remind you that it measures CPU time or ticks and is hardly a reliable way to measure wall time. Fortunately, in modern C++, we have std::chrono:
auto t = std::chrono::steady_clock::now();
auto end = t + std::chrono::seconds( 10 );
while( t < end )
{
t += std::chrono::seconds( 5 );
std::this_thread::sleep_until( t );
std::cout << ( rand() % 10 + 1 ) << std::endl;
}
I also highly recommend replacing rand() with the more modern tools in <random>, e.g.:
std::random_device rd; // Hopefully a good source of entropy; used for seeding.
std::default_random_engine gen( rd() ); // Faster pseudo-random source.
std::uniform_int_distribution<> dist( 1, 10 ); // Specify the kind of random stuff that you want.
int random = dist( gen ); // equivalent to rand() % 10 + 1.
Your code seems to be fast enough and your calculation precision small enough that you do multiple iterations before the number you are calculating changes. Thus, when the condition matches, it will match several times at once.
However, this is not a good way to do this, as you are making your computer work very hard. This way of waiting will put a rather severe load on one processor, potentially slowing down your computer, and definitely draining more power. If you're on a quad-core desktop it is not that bad, but for a laptop it's hell on batteries. Instead of asking your computer "is it time yet? is it time yet? is it time yet?" as fast as you can, trust that your computer knows how to wait, and use sleep, usleep, sleep_for, or whatever the library you're using is calling it now. See here for an example.
I am scanning through every permutation of vectors and I would like to multithread this process (each thread would scan all the permutation of some vectors).
I managed to extract the code that does not speed up (I know it does not do anything useful, but it reproduces my problem).
int main(int argc, char *argv[]){
std::vector<std::string *> myVector;
for(int i = 0 ; i < 8 ; ++i){
myVector.push_back(new std::string("myString" + std::to_string(i)));
}
std::sort(myVector.begin(), myVector.end());
omp_set_dynamic(0);
omp_set_num_threads(8);
#pragma omp parallel for shared(myVector)
for(int i = 0 ; i < 100 ; ++i){
std::vector<std::string*> test(myVector);
do{ //here is a permutation
} while(std::next_permutation(test.begin(), test.end())); // tests all the permutations of this combination
}
return 0;
}
The result is :
1 thread : 15 seconds
2 threads : 8 seconds
4 threads : 15 seconds
8 threads : 18 seconds
16 threads : 20 seconds
I am working with an i7 processor with 8 cores. I can't understand how it could be slower with 8 threads than with 1... I don't think the cost of creating new threads is higher than the one to go through 40320 permutations.. so what is happening?
Thanks to everyone's help, I finally managed to find the answer.
There were two problems:
A quick performance profile showed that most of the time was spent in std::lockit, which Visual Studio uses for iterator debugging. To prevent that, add these options to the command line: /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0". That was why adding more threads made things slower.
Switching optimization on also improved the performance.
I want to write a program to get my cache size(L1, L2, L3). I know the general idea of it.
Allocate a big array
Access part of it of different size each time.
So I wrote a little program.
Here's my code:
#include <cstdio>
#include <time.h>
#include <sys/mman.h>
const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int steps = 8 * MB;
const int times = 8;
long long clock_time() {
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}
int main() {
// allocate memory and lock
void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
if (map == MAP_FAILED) {
return 0;
}
int* data = (int*)map;
// write all to avoid paging on demand
for (int i = 0;i< data_size / sizeof(int);i++) {
data[i]++;
}
int steps[] = { 1*KB, 4*KB, 8*KB, 16*KB, 24 * KB, 32*KB, 64*KB, 128*KB,
128*KB*2, 128*KB*3, 512*KB, 1 * MB, 2 * MB, 3 * MB, 4 * MB,
5 * MB, 6 * MB, 7 * MB, 8 * MB, 9 * MB};
for (int i = 0; i <= sizeof(steps) / sizeof(int) - 1; i++) {
double totalTime = 0;
for (int k = 0; k < times; k++) {
int size_mask = steps[i] / sizeof(int) - 1;
long long start = clock_time();
for (int j = 0; j < repeats; j++) {
++data[ (j * 16) & size_mask ];
}
long long end = clock_time();
totalTime += (end - start) / 1000000000.0;
}
printf("%d time: %lf\n", steps[i] / KB, totalTime);
}
munmap(map, (size_t)data_size);
return 0;
}
However, the result is so weird:
1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195
My CPU is Intel(R) Core(TM) i3-2350M. L1 Cache: 32K (for data), L2 Cache 256K, L3 Cache 3072K.
Seems like it doesn't follow any rule. I can't get information of cache size or cache level from that.
Could anybody give some help? Thanks in advance.
Update:
Following @Leeor's advice, I use j*64 instead of j*16. New results:
1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753
Two peaks, 4096K and 8192K. Still weird.
I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, making you almost always hit in L1 or L2 latencies.
More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
For your benchmark you should either disable them (through BIOS or any other means), or at least make your steps longer by replacing j*16 (* 4 bytes per int = 64B, one cache line, a classic unit stride for the stream detector) with j*64 (4 cache lines). The reason is that the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code is jumping over 2 lines, but becomes mostly useless with longer jumps (a stride of 3 lines isn't good because of your modulo: you need a divisor of step_size).
Update the questions with the new results and we can figure out if there's anything else here.
EDIT1:
Ok, I ran the fixed code and got -
1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817
It makes much more sense if you ignore a few rows: there is a jump after 32K (the L1 size), but instead of jumping after 256K (the L2 size), we get too good a result for 384K and jump only at 512K. The last jump is at 8M (my LLC size), but 9M is broken again.
This allows us to spot the next error: ANDing with the size mask only makes sense when the size is a power of 2. Otherwise you don't wrap around, but instead repeat some of the last addresses again (which gives optimistic results, since they are still fresh in the cache).
Try replacing the ... & size_mask with ... % (steps[i] / sizeof(int)); the modulo is more expensive, but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size).
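To make the difference between the two wrapping schemes concrete, here is a small sketch that counts how many distinct array slots a strided walk actually reaches. distinct_slots is a hypothetical helper written for this illustration; it is not part of the benchmark.

```cpp
#include <cstddef>
#include <vector>

// Counts how many distinct int slots a strided walk touches in a buffer of
// `elems` ints. `stride` is in ints; `use_mask` selects `& (elems - 1)`
// instead of `% elems` for wrapping.
std::size_t distinct_slots(std::size_t elems, std::size_t stride, bool use_mask)
{
    std::vector<bool> touched(elems, false);
    std::size_t count = 0;
    for (std::size_t j = 0; j < elems; ++j) {
        const std::size_t raw = j * stride;
        const std::size_t idx = use_mask ? (raw & (elems - 1)) : (raw % elems);
        if (!touched[idx]) { touched[idx] = true; ++count; }
    }
    return count;
}
```

Assuming 4-byte ints, the 384K step is 98304 elements. With a stride of 16, the modulo walk reaches every 16th slot of the whole buffer, while the masked walk reaches only as many slots as it would in a 256K buffer, which would be consistent with the overly good 384K timing. For power-of-two sizes both variants touch the same slots.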
I think you'd be better off looking at the CPUID instruction. It's not trivial, but there should be information on the web.
Also, if you're on Windows, you can use GetLogicalProcessorInformation function. Mind you, it's only present in Windows XP SP3 and above. I know nothing about Linux/Unix.
If you're using GNU/Linux, you can just read the contents of /proc/cpuinfo and, for further details, the files under /sys/devices/system/cpu/*. It is common under UNIX not to define an API where a plain file can do the job anyway.
I would also take a look at the source of util-linux; it contains a program named lscpu. This should give you an example of how to retrieve the required information.
// update
http://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/sys-utils/lscpu.c
I've just taken a look at the source there. It basically reads from the files mentioned above, that's all. Therefore it is absolutely valid to read from those files as well; they are provided by the kernel.
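As a rough sketch of the file-reading approach, the code below collects the cache level, type and size for cpu0. It assumes the Linux sysfs layout /sys/devices/system/cpu/cpu0/cache/index*/ with its level, type and size files; CacheInfo, read_first_line and read_cpu0_caches are hypothetical names, and on systems without these files the result is simply empty.

```cpp
#include <fstream>
#include <string>
#include <vector>

struct CacheInfo {
    std::string level;  // e.g. "1", "2", "3"
    std::string type;   // e.g. "Data", "Instruction", "Unified"
    std::string size;   // e.g. "32K"
};

// Reads the first line of a file; returns an empty string if it cannot be read.
static std::string read_first_line(const std::string& path)
{
    std::ifstream in(path);
    std::string line;
    std::getline(in, line);
    return line;
}

// Collects cache descriptions for cpu0 from sysfs (Linux only).
// Returns an empty vector where these files are absent.
std::vector<CacheInfo> read_cpu0_caches()
{
    const std::string base = "/sys/devices/system/cpu/cpu0/cache/index";
    std::vector<CacheInfo> caches;
    for (int i = 0; ; ++i) {
        const std::string dir = base + std::to_string(i) + "/";
        const std::string size = read_first_line(dir + "size");
        if (size.empty()) break;  // no more index directories (or not Linux)
        caches.push_back({read_first_line(dir + "level"),
                          read_first_line(dir + "type"), size});
    }
    return caches;
}
```

Iterating over the returned vector and printing level, type and size for each entry gives the information the benchmark was trying to infer, without any timing.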