What is the fastest way to get seconds passed in cpp? - c++

I made a factoring program that needs to loop as quickly as possible. However, I also want to track the progress with minimal code. To do this, I display the current value of i once per second by comparing the difference between a time_t start and a time_t end against an incrementing marker value.
using namespace std; // cause I'm a noob

// logic stuff
int divisor = 0, marker = 0;
int limit = sqrt(num);
for (int i = 1; i <= limit; i++) // odd number = odd factors
{
    if (num % i == 0)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        if (i != 1)
            cout << "\n";
        divisor = num / i;
        cout << i << "," << divisor << "\n";
    }
    end = time(&end); // PROBLEM HERE
    if ((end - start) > marker)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        cout << "\t\t\t\t" << i;
        marker++;
    }
}
Of course, the actual code is much more optimized and uses boost::multiprecision, but I don't think that's the problem. When I remove the line end = time(&end), I see a performance gain of at least 10%. I'm just wondering: how can I track the time (or at least approximate seconds) without unconditionally calling a function every iteration? Or is there a faster function?

You observe "When I remove the line end = time(&end), I see a performance gain of at least 10%." I am not surprised, reading time easily is taking inefficient time, compared to doing pure CPU calculations.
I assume hence that the time reading is actually what eats the performance which observe lost when removing the line.
You could use an estimation of the minimum number of iterations your loop does within a second and then only check the time if multiples of (half of) that number have looped.
I.e., if you only want to be aware of time in a resolution of seconds, then you should try to only marginally more often do the time-consuming reading of the time.
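A minimal sketch of that idea (not your exact factoring loop): the clock is read only every CHECK_INTERVAL iterations, so the cost of time() is amortized. CHECK_INTERVAL and the dummy loop bound are assumptions you would tune to your own workload.

#include <iostream>
#include <ctime>

int main()
{
    const long long CHECK_INTERVAL = 1000000; // assumed: well under one second of work
    const long long LIMIT = 2000000000;

    time_t start = time(nullptr);
    int marker = 0;

    for (long long i = 1; i <= LIMIT; i++)
    {
        // ... the actual per-iteration work goes here ...

        if (i % CHECK_INTERVAL == 0) // read the clock only occasionally
        {
            time_t now = time(nullptr);
            if ((now - start) > marker)
            {
                std::cout << "\t\t\t\t" << i << "\n";
                marker++;
            }
        }
    }
}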

I would use a totally different approach where you separate the measurement/display code from the loop completely and even run it on another thread.
Live demo here: https://onlinegdb.com/8nNsGy7EX
#include <iostream>
#include <chrono>   // for all things time
#include <atomic>   // for std::atomic, to share the loop counter safely
#include <thread>   // for std::this_thread::sleep_for
#include <future>   // for std::async, that allows us to run functions on other threads

void function()
{
    const std::size_t max_loop_count{ 500 };
    std::atomic<std::size_t> n{ 0ul }; // make access to loop counter threadsafe

    // start another thread that will do the reporting independent of the
    // actual work you are doing in your loop.
    // for this capture n (loop counter) by reference (so this thread can look at it)
    auto future = std::async(std::launch::async, [&n, max_loop_count]
    {
        while (n < max_loop_count)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            std::cout << "\rprogress = " << (100 * n) / max_loop_count << "%" << std::flush;
        }
    });

    // do not initialize n here again, since we share it with the reporting thread
    for (; n < max_loop_count; n++)
    {
        // do your loop's work; just a short sleep here to mimic actual work
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }

    // synchronize with reporting thread
    future.get();
}

int main()
{
    function();
    return 0;
}
If you have any questions regarding this example let me know.

Related

c++ threads safety and time efficiency: why does thread with mutex check sometimes works faster than without it?

I'm a beginner with threads in C++. I've read the basics about std::thread and mutex, and I think I understand the purpose of mutexes.
I decided to check whether threads really are so dangerous without mutexes (well, I believe the books, but I prefer to see it with my own eyes). As a test case of "what I shouldn't do in future" I created two versions of the same concept: there are two threads, one of which increments a number a fixed number of times (NUMBER_OF_ITERATIONS), while the other decrements the same number the same number of times, so we expect to see the same value after the code is executed as before it. The code is attached.
First I run two threads which do this in an unsafe manner, without any mutexes, just to see what can happen. After that part finishes I run two threads which do the same thing, but in a safe manner (with mutexes).
Expected results: without mutexes the result can differ from the initial value, because the data can be corrupted if two threads work on it simultaneously. This is especially likely for a huge NUMBER_OF_ITERATIONS, because the probability of corrupting the data is higher. That result I can understand.
I also measured the time spent by both the "safe" and "unsafe" parts. For a huge number of iterations the safe part takes much more time than the unsafe one, as I expected: there is some time spent on the mutex check. But for small numbers of iterations (400, 4000) the safe part's execution time is less than the unsafe one's. How is that possible? Is it something the operating system does? Or is there some compiler optimization I'm not aware of? I spent some time thinking about it and decided to ask here.
I use Windows and the MSVS12 compiler.
So the question is: why can the safe part execute faster than the unsafe part (for small NUMBER_OF_ITERATIONS < 1000*n)?
And another one: why is it related to NUMBER_OF_ITERATIONS: for smaller values (4000) the "safe" part with mutexes is faster, but for huge ones (400000) the "safe" part is slower?
main.cpp
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <windows.h>

/// change number of iterations for different results
const long long NUMBER_OF_ITERATIONS = 400;

/// time check counter
class Counter {
    double PCFreq_ = 0.0;
    __int64 CounterStart_ = 0;
public:
    Counter() {
        LARGE_INTEGER li;
        if (!QueryPerformanceFrequency(&li))
            std::cerr << "QueryPerformanceFrequency failed!\n";
        PCFreq_ = double(li.QuadPart) / 1000.0;
        QueryPerformanceCounter(&li);
        CounterStart_ = li.QuadPart;
    }
    double GetCounter() {
        LARGE_INTEGER li;
        QueryPerformanceCounter(&li);
        return double(li.QuadPart - CounterStart_) / PCFreq_;
    }
};

/// "dangerous" functions for unsafe threads: increment and decrement number
void incr(long long* j) {
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)++;
    std::cout << "incr finished" << std::endl;
}

void decr(long long* j) {
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)--;
    std::cout << "decr finished" << std::endl;
}

/// class for safe thread operations with increment and decrement
template<typename T>
class Safe_number {
public:
    Safe_number(int i) { number_ = T(i); }
    Safe_number(long long i) { number_ = T(i); }
    bool inc() {
        if (m_.try_lock()) {
            number_++;
            m_.unlock();
            return true;
        }
        else
            return false;
    }
    bool dec() {
        if (m_.try_lock()) {
            number_--;
            m_.unlock();
            return true;
        }
        else
            return false;
    }
    T val() { return number_; }
private:
    T number_;
    std::mutex m_;
};

///
template<typename T>
void incr(Safe_number<T>* n) {
    long long i = 0;
    while (i < NUMBER_OF_ITERATIONS) {
        if (n->inc()) i++;
    }
    std::cout << "incr <T> finished" << std::endl;
}

///
template<typename T>
void decr(Safe_number<T>* n) {
    long long i = 0;
    while (i < NUMBER_OF_ITERATIONS) {
        if (n->dec()) i++;
    }
    std::cout << "decr <T> finished" << std::endl;
}

using namespace std;

// run increments and decrements of the same number
// in threads in "safe" and "unsafe" way
int main()
{
    // init numbers to 0
    long long number = 0;
    Safe_number<long long> sNum(number);
    Counter cnt; // init time counter

    // run 2 unsafe threads for ++ and --
    std::thread t1(incr, &number);
    std::thread t2(decr, &number);
    t1.join();
    t2.join();
    // check time of execution of unsafe part
    double time1 = cnt.GetCounter();
    cout << "finished first thr" << endl;

    // run 2 safe threads for ++ and --, now we expect final value 0
    std::thread t3(incr<long long>, &sNum);
    std::thread t4(decr<long long>, &sNum);
    t3.join();
    t4.join();
    // check time of execution of safe part
    double time2 = cnt.GetCounter() - time1;

    cout << "unsafe part, number = " << number << " time1 = " << time1 << endl;
    cout << "safe part, Safe number = " << sNum.val() << " time2 = " << time2 << endl << endl;
    return 0;
}
You should not draw conclusions about the speed of any given algorithm if the input size is very small. What defines "very small" can be kind of arbitrary, but on modern hardware, under usual conditions, "small" can refer to any collection size less than a few hundred thousand objects, and "large" can refer to any collection larger than that.
Obviously, Your Mileage May Vary.
In this case, the overhead of constructing threads (which is usually slow, and can also be rather inconsistent) could be a larger factor in the speed of your code than what the actual algorithm is doing. It's possible that the compiler has some kind of powerful optimizations it can do on smaller input sizes (which it can definitely know about due to the input size being hard-coded into the code itself) that it cannot then perform on larger inputs.
The broader point is that you should always prefer larger inputs when testing algorithm speed, and also have the same program repeat its tests (preferably in random order!) to "smooth out" irregularities in the timings.
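As an illustration of that last point, here is a minimal sketch (my own, not from the original post) that times the same workload several times with std::chrono and averages the samples, so one-off effects such as thread startup get smoothed out. The workload function is just a placeholder.

#include <chrono>
#include <iostream>
#include <vector>
#include <numeric>

// placeholder for whatever you actually want to benchmark
void workload()
{
    volatile long long sink = 0;
    for (long long i = 0; i < 1000000; ++i)
        sink += i;
}

int main()
{
    const int repetitions = 20;
    std::vector<double> samples;

    for (int r = 0; r < repetitions; ++r)
    {
        auto begin = std::chrono::steady_clock::now();
        workload();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(end - begin).count());
    }

    double average = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
    std::cout << "average over " << repetitions << " runs: " << average << " ms\n";
}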

How to develop a program that uses only a single core?

I want to know how to properly implement a program in C++ in which I have a function func that I want to be executed in a single thread. I want to do this because I want to test the single-core speed of my CPU. I will loop this function (func) about 20 times and record the execution time of each repetition, then I will sum the results and get the average execution time.
#include <thread>

int func(long long x)
{
    int div = 0;
    for(long i = 1; i <= x / 2; i++)
        if(x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    std::thread one_thread (func,100000000);
    one_thread.join();
    return 0;
}
So, in this program, is func executed on a single particular core?
Here is the source code of my program:
#include <iostream>
#include <thread>
#include <iomanip>
#include <windows.h>
#include "font.h"
#include "timer.h"

using namespace std;

#define steps 20

int func(long long x)
{
    int div = 0;
    for(long i = 1; i <= x / 2; i++)
        if(x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    SetFontConsolas();        // Set font consolas
    ShowConsoleCursor(false); // Turn off the cursor
    timer t;
    short int number = 0;
    cout << number << "%";
    for(int i = 0 ; i < steps ; i++)
    {
        t.restart();                          // start recording
        std::thread one_thread (func,100000000);
        one_thread.join();                    // wait function return
        t.stop();                             // stop recording
        t.record();                           // save the time in vector
        number += 5;
        cout << "\r ";
        cout << "\r" << number << "%";
    }
    double time = 0.0;
    for(int i = 0 ; i < steps ; i++)
        time += t.times[i];                   // sum all recorded times
    time /= steps;                            // get the average execution time
    cout << "\nExecution time: " << fixed << setprecision(4) << time << '\n';
    double score = 0.0;
    score = (1.0 * 100) / time;               // calculating benchmark score
    cout << "Score: ";
    SetColor(12);
    cout << setprecision(2) << score << " pts";
    SetColor(15);
    cout << "\nPress any key to continue.\n";
    cin.get();
    return 0;
}
No, your program has at least two threads: main, and the one you've created to run func. Moreover, neither of these threads is guaranteed to be executed on a particular core. Depending on the OS scheduler, they may switch cores in an unpredictable manner (though the main thread will mostly just wait). If you want to pin thread execution to a particular core, you need to set the thread's core affinity via some platform-specific method such as SetThreadAffinityMask on Windows. But you don't really need to go that deep, because there is no core-switch-sensitive code in your example. There is not even a need to spawn a separate thread dedicated to performing the calculations.
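If you do want to pin the benchmark to one core anyway, here is a minimal Windows-only sketch (mine, not part of the answer above); the mask value 1, meaning core 0, is just an example:

#include <windows.h>
#include <iostream>

int func(long long x)
{
    int div = 0;
    for (long i = 1; i <= x / 2; i++)
        if (x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    // restrict the current thread to core 0 (bit 0 of the affinity mask)
    DWORD_PTR oldMask = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (oldMask == 0)
        std::cerr << "SetThreadAffinityMask failed\n";

    std::cout << func(100000000) << '\n';
    return 0;
}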
If your program doesn't have multiple threads in the source and if the compiler does not insert automatic parallelization, the program should run on a single core (at a time).
Now, depending on your compiler, you can use appropriate optimization levels to ensure that it doesn't parallelize.
On the other hand, what might happen is that the compiler completely eliminates the loop in the function if it can statically compute the result. That, however, doesn't seem to be the issue in your case.
I don't think any C++ compiler makes use of multiple cores behind your back; there would be large language issues in doing that. If you neither spawn threads nor use a parallel library such as MPI, the program should execute on only one core.

Conditional Statement is never triggered within Chrono Program

Abstract:
I wrote a short program dealing with the Chrono library in C++ for experimentation purposes. I want the CPU to count as high as it can within one second, display what it counted to, then repeat the process within an infinite loop.
Current Code:
#include <iostream>
#include <chrono>

int counter()
{
    int num = 0;
    auto startTime = std::chrono::system_clock::now();
    while (true)
    {
        num++;
        auto currentTime = std::chrono::system_clock::now();
        if (std::chrono::duration_cast<std::chrono::seconds>(currentTime - startTime).count() == 1)
            return num;
    }
}

int main()
{
    while(true)
        std::cout << "You've counted to " << counter() << "in one second!";
    return 0;
}
Problem:
The conditional statement in my program:
if (std::chrono::duration_cast<std::chrono::seconds>(currentTime - startTime).count() == 1)
isn't being triggered because the cast value of currentTime - startTime never equals or rises above one. This can be demonstrated by replacing the operator '==' with '<', which outputs an incorrect result, as opposed to outputting nothing at all. I don't understand why the condition isn't being met; if this program gathers the time from the system clock at one point and then repeatedly compares it to the current time, shouldn't the integer value of the difference equal one at some point?
You're hitting a cout issue, not a chrono issue. The problem is that you're printing with cout which doesn't flush if it doesn't feel like it.
cerr will flush on newline. Change to cerr and add a \n and you'll get what you expect.
std::cerr << "You've counted to " << counter() << "in one second!\n";

Openmp can't create threads automatically

I am trying to learn how to use OpenMP for multithreading.
Here is my code:
#include <iostream>
#include <math.h>
#include <omp.h>
//#include <time.h>
//#include <cstdlib>

using namespace std;

bool isprime(long long num);

int main()
{
    cout << "There are " << omp_get_num_procs() << " cores." << endl;
    cout << 2 << endl;
    //clock_t start = clock();
    //clock_t current = start;
    #pragma omp parallel num_threads(6)
    {
        #pragma omp for schedule(dynamic, 1000)
        for(long long i = 3LL; i <= 1000000000000; i = i + 2LL)
        {
            /*if((current - start)/CLOCKS_PER_SEC > 60)
            {
                exit(0);
            }*/
            if(isprime(i))
            {
                cout << i << " Thread: " << omp_get_thread_num() << endl;
            }
        }
    }
}

bool isprime(long long num)
{
    if(num == 1)
    {
        return 0;
    }
    for(long long i = 2LL; i <= sqrt(num); i++)
    {
        if (num % i == 0)
        {
            return 0;
        }
    }
    return 1;
}
The problem is that I want OpenMP to automatically create a number of threads based on how many cores are available. If I take out num_threads(6), it just uses 1 thread, yet omp_get_num_procs() correctly outputs 64.
How do I get this to work?
You neglected to mention which compiler and OpenMP implementation you are using. I'm going to guess you're using one of the ones, like PGI's, that do not automatically choose the number of threads to create for a default parallel region unless asked to do so. Since you did not specify the compiler, I cannot be certain these options will actually help you, but for PGI's compilers the necessary option is -mp=allcores when compiling and linking the executable. With that added, the runtime will create one thread per core for parallel regions that do not specify the number of threads or have the appropriate environment variable set.
The number you're getting from omp_get_num_procs is used by default to set the limit on the number of threads, but not necessarily the number created. If you want to dynamically set the number created, set the environment variable OMP_NUM_THREADS to the desired number before running your application, and it should behave as expected.
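As a concrete illustration of that second point, here is a minimal sketch (mine, not from the answer) that sets the thread count programmatically with omp_set_num_threads instead of via the environment; the environment-variable form is shown in a comment:

#include <omp.h>
#include <iostream>

int main()
{
    // Alternative to the OMP_NUM_THREADS environment variable
    // (e.g. running the program as: OMP_NUM_THREADS=64 ./a.out):
    // request one thread per processor before the first parallel region.
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        #pragma omp single
        std::cout << "running with " << omp_get_num_threads() << " threads\n";
    }
    return 0;
}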
I'm not sure if I understand your question correctly, but it seems that you are almost there. Do you mean something like:
#include <omp.h>
#include <iostream>

int main(){
    const int num_procs = omp_get_num_procs();
    std::cout << num_procs;
    // note: casting 1E20 to int would overflow, so use a large but representable bound
    #pragma omp parallel for num_threads(num_procs) default(none)
    for(int i = 0; i < 2000000000; ++i){
    }
    return 0;
}
Unless I'm rather badly mistaken, OpenMP normally serializes I/O (at least to a single stream) so that's probably at least part of where your problem is arising. Removing that from the loop, and massaging a bit of the rest (not much point in working at parallelizing until you have reasonably efficient serial code), I end up with something like this:
#include <iostream>
#include <math.h>
#include <omp.h>

using namespace std;

bool isprime(long long num);

int main()
{
    unsigned long long total = 0;
    cout << "There are " << omp_get_num_procs() << " cores.\n";
    #pragma omp parallel for reduction(+:total)
    for(long long i = 3LL; i < 100000000; i += 2LL)
        if(isprime(i))
            total += i;
    cout << "Total: " << total << "\n";
}

bool isprime(long long num) {
    if (num == 2)
        return 1;
    if(num == 1 || num % 2 == 0)
        return 0;
    unsigned long long limit = sqrt(num);
    for(long long i = 3LL; i <= limit; i+=2)
        if (num % i == 0)
            return 0;
    return 1;
}
This doesn't print out the thread number, but timing it I get something like this:
Real 78.0686
User 489.781
Sys 0.125
Note that the "User" time is more than 6x as large as the "Real" time, indicating that the load is being distributed across the 8 cores available on this machine with about 80% efficiency. With a little more work, you might be able to improve that further, but even with this simple version we're seeing considerably more than one core being used (on your 64-core machine, we should see at least a 50:1 improvement over single-threaded code, and probably quite a bit better than that).
The only problem I see with your code is that when you do the output, you need to put it in a critical section; otherwise multiple threads can write to the same line at the same time.
See my code corrections.
In terms of seeing only one thread, I think what you might be observing is due to using dynamic scheduling. A thread running over small numbers is much quicker than one running over large numbers. When the thread with small numbers finishes and gets another chunk of small numbers to run, it finishes quickly again while the thread with large numbers is still running. This does not mean you're only running one thread, though. In my output I see long streams of the same thread finding primes, but eventually others report as well. You have also set the chunk size to 1000, so if you, for example, only ran over 1000 numbers, only one thread would be used in the loop.
It looks to me like you're trying to find a list of primes or a sum of the number of primes. You're using trial division for that. That's much less efficient than using the "Sieve of Eratosthenes".
Here is an example of the Sieve of Eratosthenes which finds the primes in the first billion numbers in less than one second on my 4-core system with OpenMP.
http://create.stephan-brumme.com/eratosthenes/
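For reference, a minimal (serial, unoptimized) sketch of the sieve idea, written by me rather than taken from that link:

#include <iostream>
#include <vector>

int main()
{
    const long long n = 1000000;          // sieve limit (adjust as needed)
    std::vector<bool> composite(n + 1, false);

    long long count = 0;
    for (long long i = 2; i <= n; ++i)
    {
        if (composite[i])
            continue;                     // i was crossed out by a smaller prime
        ++count;                          // i is prime
        for (long long j = i * i; j <= n; j += i)
            composite[j] = true;          // cross out multiples of i
    }
    std::cout << "primes up to " << n << ": " << count << "\n";
}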
I cleaned up your code a bit but did not try to optimize anything since the algorithm is inefficient anyway.
int main() {
    //long long int n = 1000000000000;
    long long int n = 1000000;
    cout << "There are " << omp_get_num_procs() << " cores." << endl;
    double dtime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for(long long i = 3LL; i <= n; i = i + 2LL) {
            if(isprime(i)) {
                #pragma omp critical
                {
                    cout << i << "\tThread: " << omp_get_thread_num() << endl;
                }
            }
        }
    }
    dtime = omp_get_wtime() - dtime;
    cout << "time " << dtime << endl;
}

Working with timers

I am trying to create a timer that begins at a certain value and ends at another value, like this:
int pktctr = (unsigned char)unpkt[0];

if(pktctr == 2)
{
    cout << "timer-begin" << endl;
    //start timer here
}

if(pktctr == 255)
{
    cout << "timer-end" << endl;
    //stop timer here
    //timer display total time then reset.
}

cout << "displays total time it took from 1 to 255 here" << endl;
Any idea on how to achieve this?
void WINAPI MyUCPackets(char* unpkt, int packetlen, int iR, int arg)
{
    int pktctr = (unsigned char)unpkt[0];

    if(pktctr == 2)
    {
        cout << "timer-begin" << endl;
    }

    if(pktctr == 255)
    {
        cout << "timer-end" << endl;
    }

    return MyUC2Packets(unpkt,packetlen,iR,arg);
}
Every time this function is called, unpkt starts from 2, reaches a maximum of 255, then goes back to 1. I want to compute how long each revolution took.
This will happen a lot of times, but I just want to check how many seconds each one takes, because it won't be the same every time.
Note: This is done with MSDetours 3.0...
I'll assume you're using Windows (from the WINAPI in the code) in which case you can use GetTickCount:
/* or you could have this elsewhere, e.g. as a class member or
 * in global scope (yuck!) As it stands, this isn't thread safe!
 */
static DWORD dwStartTicks = 0;

int pktctr = (unsigned char)unpkt[0];

if(pktctr == 2)
{
    cout << "timer-begin" << endl;
    dwStartTicks = GetTickCount();
}

if(pktctr == 255)
{
    cout << "timer-end" << endl;
    DWORD dwDuration = GetTickCount() - dwStartTicks;
    /* use dwDuration - it's in milliseconds, so divide by 1000 to get
     * seconds if you so desire.
     */
}
Things to watch out for: overflow of GetTickCount is possible (it wraps back to 0 approximately every 49.7 days, so if you start your timer close to the rollover it may finish after the rollover). You can solve this in two ways: either use GetTickCount64, or simply notice when dwStartTicks > GetTickCount() and, if so, calculate how many milliseconds elapsed from dwStartTicks until the rollover and how many milliseconds from 0 to the result of GetTickCount(), and add those numbers together (bonus points if you can do this in a more clever way).
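A minimal sketch of the GetTickCount64 variant (64-bit tick count, so rollover is not a practical concern). The OnPacket wrapper and the Sleep in main are my own placeholders to make it self-contained, not part of your hook:

#include <windows.h>
#include <iostream>

static ULONGLONG ullStartTicks = 0;

void OnPacket(int pktctr)
{
    if (pktctr == 2)
    {
        ullStartTicks = GetTickCount64();  // start of a revolution
    }

    if (pktctr == 255)
    {
        ULONGLONG ullDuration = GetTickCount64() - ullStartTicks; // milliseconds
        std::cout << "revolution took " << ullDuration / 1000.0 << " seconds\n";
    }
}

int main()
{
    OnPacket(2);
    Sleep(1500);   // stand-in for the packets in between
    OnPacket(255);
}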
Alternatively, you can use the clock function. You can find out more on that, including an example of how to use it at http://msdn.microsoft.com/en-us/library/4e2ess30(v=vs.71).aspx and it should be fairly easy to adapt and integrate into your code.
Finally, if you're interested in a more "standard" solution, you can use the <chrono> stuff from the C++ standard library. Check out http://en.cppreference.com/w/cpp/chrono for an example.
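For instance, a minimal sketch with std::chrono::steady_clock (my own example, not from the linked page), measuring the same 2-to-255 interval; again the OnPacketChrono wrapper and the sleep are placeholders:

#include <chrono>
#include <thread>
#include <iostream>

static std::chrono::steady_clock::time_point revolutionStart;

void OnPacketChrono(int pktctr)
{
    if (pktctr == 2)
    {
        revolutionStart = std::chrono::steady_clock::now(); // start timing
    }

    if (pktctr == 255)
    {
        auto elapsed = std::chrono::steady_clock::now() - revolutionStart;
        std::cout << "revolution took "
                  << std::chrono::duration<double>(elapsed).count()
                  << " seconds\n";
    }
}

int main()
{
    OnPacketChrono(2);
    std::this_thread::sleep_for(std::chrono::milliseconds(1500));
    OnPacketChrono(255);
}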
If you want to use the Windows API, use GetSystemTime(). Declare a SYSTEMTIME struct and pass it to GetSystemTime(), which fills it in:
#include <Windows.h>
...
SYSTEMTIME sysTime;
GetSystemTime(&sysTime);
// use sysTime and create differences
Look here for GetSystemTime(); there is a link for SYSTEMTIME there, too.
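Computing a difference between two SYSTEMTIME values is easiest by converting them to FILETIME (100-nanosecond units) first; a minimal sketch of that conversion path (the Sleep stands in for the work being timed):

#include <Windows.h>
#include <iostream>

// convert a SYSTEMTIME to a 64-bit count of 100-nanosecond intervals
ULONGLONG ToTicks(const SYSTEMTIME& st)
{
    FILETIME ft;
    SystemTimeToFileTime(&st, &ft);
    ULARGE_INTEGER uli;
    uli.LowPart  = ft.dwLowDateTime;
    uli.HighPart = ft.dwHighDateTime;
    return uli.QuadPart;
}

int main()
{
    SYSTEMTIME begin, end;
    GetSystemTime(&begin);
    Sleep(1500);                       // stand-in for the work being timed
    GetSystemTime(&end);

    double seconds = (ToTicks(end) - ToTicks(begin)) / 10000000.0;
    std::cout << "elapsed: " << seconds << " s\n";
}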
I think boost timer is the best solution for you.
You can check the elapsed time like this:
#include <boost/timer.hpp>

int main() {
    boost::timer t; // start timing
    ...
    double elapsed_time = t.elapsed();
    ...
}