Why is std::cout so time consuming? - c++

I have made a program to compute the permutations of an 8 character string "sharjeel".
#include <iostream>
#include <time.h>

char string[] = "sharjeel";
int len = 8;
int count = 0;

void swap(char& a, char& b) {
    char t = a;
    a = b;
    b = t;
}

void permute(int pos) {
    if (pos == len - 1) {
        std::cout << ++count << "\t" << string << std::endl;
        return;
    }
    else {
        for (int i = pos; i < len; i++)
        {
            swap(string[i], string[pos]);
            permute(pos + 1);
            swap(string[i], string[pos]);
        }
    }
}

int main() {
    clock_t start = clock();
    permute(0);
    std::cout << "Permutations: " << count << std::endl;
    std::cout << "Time taken: " << (double)(clock() - start) / (double)CLOCKS_PER_SEC << std::endl;
    return 1;
}
If I print each permutation as it is generated, it takes about 9.8 seconds for the execution to complete.
40314 lshaerej
40315 lshareej
40316 lshareje
40317 lshareej
40318 lshareje
40319 lsharjee
40320 lsharjee
Permutations: 40320
Time taken: 9.815
Now if I replace the line:
std::cout << ++count << "\t" << string << std::endl;
with this:
++count;
and then recompile, the output is:
Permutations: 40320
Time taken: 0.001
Running again:
Permutations: 40320
Time taken: 0.002
Compiled using g++ with -O3
Why is std::cout so time consuming by comparison? Is there a way to print that is faster?
EDIT: Made a C# version of the program
/*
* Permutations
* in c#
* much faster than the c++ version
*/
using System;
using System.Diagnostics;

namespace Permutation_C
{
    class MainClass
    {
        private static uint len;
        private static char[] input;
        private static int count = 0;

        public static void Main (string[] args)
        {
            Console.Write ("Enter a string to permute: ");
            input = Console.ReadLine ().ToCharArray();
            len = Convert.ToUInt32(input.Length);
            Stopwatch clock = Stopwatch.StartNew();
            permute (0u);
            Console.WriteLine("Time Taken: {0} seconds", clock.ElapsedMilliseconds / 1000.0);
        }

        static void permute(uint pos)
        {
            if (pos == len - 1u) {
                Console.WriteLine ("{0}.\t{1}", ++count, new string(input));
                return;
            } else {
                for (uint i = pos; i < len; i++) {
                    swap (Convert.ToInt32(i), Convert.ToInt32(pos));
                    permute (pos + 1);
                    swap (Convert.ToInt32(i), Convert.ToInt32(pos));
                }
            }
        }

        static void swap(int a, int b) {
            char t = input[a];
            input[a] = input[b];
            input[b] = t;
        }
    }
}
Output:
40313. lshaerje
40314. lshaerej
40315. lshareej
40316. lshareje
40317. lshareej
40318. lshareje
40319. lsharjee
40320. lsharjee
Time Taken: 4.628 seconds
Press any key to continue . . .
From here, Console.WriteLine() seems almost twice as fast when compared with the results from std::cout. What seems to be slowing std::cout down?

std::cout ultimately results in the operating system being invoked.
If you want something to compute fast, you have to make sure that no external entities are involved in the computation, especially entities that have been written with versatility more than performance in mind, like the operating system.
Want it to run faster? You have a few options:
Replace << std::endl; with << '\n'. This will refrain from flushing the internal buffer of the C++ runtime to the operating system on every single line. It should result in a huge performance improvement.
Use std::ios::sync_with_stdio(false); as user Galik Mar suggests in a comment.
Collect as much as possible of your outgoing text in a buffer, and output the entire buffer at once with a single call (a sketch combining this with the first option follows this list).
Write your output to a file instead of the console, and then keep that file displayed by a separate application such as Notepad++ which can keep track of changes and keep scrolling to the bottom.
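For illustration, here is a minimal sketch combining the first and third suggestions, assuming the same 8-character permutation program; output is collected in a std::ostringstream (a stand-in chosen just for this sketch) and written to the console once at the end, using '\n' instead of std::endl:

#include <iostream>
#include <sstream>

std::ostringstream buffer;                // collect output here instead of printing immediately

void report(int count, const char* s)
{
    buffer << count << '\t' << s << '\n'; // '\n' instead of std::endl: no flush per line
}

int main()
{
    for (int i = 1; i <= 40320; ++i)      // stands in for the permutation loop
        report(i, "sharjeel");
    std::cout << buffer.str();            // a single large write to the console at the end
    return 0;
}

Calling std::cout once means the console overhead is paid for one large write instead of 40320 small, individually flushed ones.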
As for why it is so "time consuming", (in other words, slow,) that's because the primary purpose of std::cout (and ultimately the operating system's standard output stream) is versatility, not performance. Think about it: std::cout is a C++ library function which will invoke the operating system; the operating system will determine that the file being written to is not really a file, but the console, so it will send the data to the console subsystem; the console subsystem will receive the data and it will start invoking the graphics subsystem to render the text in the console window; the graphics subsystem will be drawing font glyphs on a raster display, and while rendering the data, there will be scrolling of the console window, which involves copying large amounts of video RAM. That's an awful lot of work, even if the graphics card takes care of some of it in hardware.
As for the C# version, I am not sure exactly what is going on, but what is probably happening is something quite different: in C# you are not invoking Console.Out.Flush(), so your output is cached and you are not suffering the overhead incurred by C++'s std::cout << std::endl, which causes each line to be flushed to the operating system. However, when the buffer does become full, C# must flush it to the operating system, and then it is hit not only by the overhead represented by the operating system, but also by the formidable managed-to-native and native-to-managed transition that is inherent in the way its virtual machine works.

Related

What is the fastest way to get seconds passed in cpp?

I made a factoring program that needs to loop as quickly as possible. However, I also want to track the progress with minimal code. To do this, I display the current value of i every second by comparing the difference between a time_t start and a time_t end against an incrementing marker value.
using namespace std; // cause I'm a noob
// logic stuff
int divisor = 0, marker = 0;
int limit = sqrt(num);
for (int i = 1; i <= limit; i++) // odd number = odd factors
{
    if (num % i == 0)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        if (i != 1)
            cout << "\n";
        divisor = num / i;
        cout << i << "," << divisor << "\n";
    }
    end = time(&end); // PROBLEM HERE
    if ((end - start) > marker)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        cout << "\t\t\t\t" << i;
        marker++;
    }
}
Of course, the actual code is much more optimized and uses boost::multiprecision, but I don't think that's the problem. When I remove the line end = time(&end), I see a performance gain of at least 10%. I'm just wondering, how can I track the time (or at least approximate seconds) without unconditionally calling a function every loop? Or is there a faster function?
You observe "When I remove the line end = time(&end), I see a performance gain of at least 10%." I am not surprised, reading time easily is taking inefficient time, compared to doing pure CPU calculations.
I assume hence that the time reading is actually what eats the performance which observe lost when removing the line.
You could use an estimation of the minimum number of iterations your loop does within a second and then only check the time if multiples of (half of) that number have looped.
I.e., if you only want to be aware of time in a resolution of seconds, then you should try to only marginally more often do the time-consuming reading of the time.
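A minimal sketch of that idea, assuming a hypothetical estimate of 10,000,000 iterations per second (the real value would come from measuring your own loop); the expensive time() call is made only once per block of iterations:

#include <ctime>
#include <iostream>

int main()
{
    const long long check_interval = 10000000;    // hypothetical: roughly one second of iterations
    const long long total = 1000000000LL;
    std::time_t start = std::time(nullptr);

    for (long long i = 1, next_check = check_interval; i <= total; ++i)
    {
        // ... the actual per-iteration factoring work goes here ...

        if (i == next_check)                      // only now pay for a time() call
        {
            std::cout << "\r" << std::time(nullptr) - start << "s elapsed, i = " << i << std::flush;
            next_check += check_interval;
        }
    }
    std::cout << '\n';
}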
I would use a totally different approach, where you separate the measurement/display code from the loop completely and even run it on another thread.
Live demo here : https://onlinegdb.com/8nNsGy7EX
#include <iostream>
#include <chrono>   // for all things time
#include <future>   // for std::async, which allows us to run functions on other threads
#include <atomic>   // for std::atomic (thread-safe loop counter)
#include <thread>   // for std::this_thread::sleep_for

void function()
{
    const std::size_t max_loop_count{ 500 };
    std::atomic<std::size_t> n{ 0ul }; // make access to the loop counter thread-safe

    // start another thread that will do the reporting independent of the
    // actual work you are doing in your loop.
    // for this, capture n (loop counter) by reference (so this thread can look at it)
    auto future = std::async(std::launch::async, [&n, max_loop_count]
    {
        while (n < max_loop_count)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            std::cout << "\rprogress = " << (100 * n) / max_loop_count << "%";
        }
    });

    // do not initialize n here again, since we share it with reporting
    for (; n < max_loop_count; n++)
    {
        // do your loop's work; just a short sleep here to mimic actual work
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }

    // synchronize with the reporting thread
    future.get();
}

int main()
{
    function();
    return 0;
}
If you have any questions regarding this example let me know.

Optimizing .txt files creation speed

I've written the following simple testing code, that creates 10 000 empty .txt files in a subdirectory.
#include <iostream>
#include <time.h>
#include <string>
#include <fstream>

void CreateFiles()
{
    int i = 1;
    while (i <= 10000) {
        int filename = i;
        std::string string_i = std::to_string(i);
        std::string file_dir = ".\\results\\" + string_i + ".txt";
        std::ofstream outfile(file_dir);
        i++;
    }
}

int main()
{
    clock_t tStart1 = clock();
    CreateFiles();
    printf("\nHow long it took to make files: %.2fs\n", (double)(clock() - tStart1)/CLOCKS_PER_SEC);
    std::cin.get();
    return 0;
}
Everything works fine. All 10 000 .txt files are created within ~3.55 seconds. (using my PC)
Question 1: Ignoring the conversion from int to std::string etc., is there anything I could optimize here so the program creates the files faster? I specifically mean the std::ofstream outfile usage - perhaps using something else would be noticeably faster?
Regardless, ~3.55 seconds is satisfying compared to the following:
I have modified the function so right now it would also fill the .txt files with some random i integer data and some constant text:
void CreateFiles()
{
    int i = 1;
    while (i <= 10000) {
        int filename = i;
        std::string string_i = std::to_string(i);
        std::string file_dir = ".\\results\\" + string_i + ".txt";
        std::ofstream outfile(file_dir);
        // Here is the part where I am filling the .txt with some data
        outfile << i << " some " << i << " constant " << i << " text " << i << " . . . "
                << i << " --more text-- " << i << " --even more-- " << i;
        i++;
    }
}
And now everything (creating the .txt files and filling it with short data) executes within... ~37 seconds. That's a huge difference. And that's only 10 000 files.
Question 2: Is there anything I can optimize here? Perhaps there exist some alternative that would fill the .txt files quicker. Or perhaps I have forgotten about something very obvious that slows down the entire process?
Or, perhaps I am exaggerating a little bit and ~37 seconds seems normal and optimized?
Thanks for sharing your insights!
The speed of file creation is hardware dependent: the faster the drive, the faster you can create files.
This is evident from the fact that I ran your code on an ARM processor (a Snapdragon 636, on a mobile phone using Termux); mobile phones have flash memory that is very fast when it comes to I/O. So it ran in under 3 seconds most of the time, and sometimes around 5 seconds. This variation is expected, as the drive has to handle reads and writes from multiple processes. You reported that it took 47 seconds on your hardware, so you can safely conclude that I/O speed depends significantly on the hardware.
Nonetheless, I thought I would do some optimization of your code, and I used two different approaches:
Using a C counterpart for I/O
Using C++, but writing the whole line in one go.
I ran the simulation on my phone. I ran it 50 times and here are the results.
C was the fastest, taking 2.73928 seconds on average to write your text to 10000 files, using fprintf.
C++ writing the complete line in one go took 2.7899 seconds. I used sprintf to get the complete line into a char[] and then wrote it using the << operator on the ofstream.
C++ Normal (Your Code) took 2.8752 seconds
This behaviour is expected; writing in chunks is faster. Read this answer as to why. C was the fastest, no doubt.
You may note that the difference here is not that significant, but on hardware with slow I/O it becomes significant.
Here is the code I used for the simulation. You can test it yourself, but make sure to replace the std::system arguments with your own commands (they differ on Windows).
#include <iostream>
#include <time.h>
#include <string>
#include <fstream>
#include <stdio.h>
#include <cstdlib>   // for std::system

void CreateFiles()
{
    int i = 1;
    while (i <= 10000) {
        // int filename = i;
        std::string string_i = std::to_string(i);
        std::string file_dir = "./results/" + string_i + ".txt";
        std::ofstream outfile(file_dir);
        // Here is the part where I am filling the .txt with some data
        outfile << i << " some " << i << " constant " << i << " text " << i << " . . . "
                << i << " --more text-- " << i << " --even more-- " << i;
        i++;
    }
}

void CreateFilesOneGo(){
    int i = 1;
    while (i <= 10000) {
        std::string string_i = std::to_string(i);
        std::string file_dir = "./results3/" + string_i + ".txt";
        char buffer[256];
        sprintf(buffer, "%d some %d constant %d text %d . . . %d --more text-- %d --even more-- %d", i, i, i, i, i, i, i);
        std::ofstream outfile(file_dir);
        outfile << buffer;
        i++;
    }
}

void CreateFilesFast(){
    int i = 1;
    while (i <= 10000) {
        // int filename = i;
        std::string string_i = std::to_string(i);
        std::string file_dir = "./results2/" + string_i + ".txt";
        FILE *f = fopen(file_dir.c_str(), "w");
        fprintf(f, "%d some %d constant %d text %d . . . %d --more text-- %d --even more-- %d", i, i, i, i, i, i, i);
        fclose(f);
        i++;
    }
}

int main()
{
    double normal = 0, one_go = 0, c = 0;
    for (int u = 0; u < 50; u++) {
        std::system("mkdir results results2 results3");

        clock_t tStart1 = clock();
        CreateFiles();
        //printf("\nNormal : How long it took to make files: %.2fs\n", (double)(clock() - tStart1)/CLOCKS_PER_SEC);
        normal += (double)(clock() - tStart1)/CLOCKS_PER_SEC;

        tStart1 = clock();
        CreateFilesFast();
        //printf("\nIn C : How long it took to make files: %.2fs\n", (double)(clock() - tStart1)/CLOCKS_PER_SEC);
        c += (double)(clock() - tStart1)/CLOCKS_PER_SEC;

        tStart1 = clock();
        CreateFilesOneGo();
        //printf("\nOne Go : How long it took to make files: %.2fs\n", (double)(clock() - tStart1)/CLOCKS_PER_SEC);
        one_go += (double)(clock() - tStart1)/CLOCKS_PER_SEC;

        std::system("rm -rf results results2 results3");
        std::cout << "Completed " << u + 1 << "\n";
    }
    std::cout << "C on average took : " << c/50 << "\n";
    std::cout << "Normal on average took : " << normal/50 << "\n";
    std::cout << "One Go C++ took : " << one_go/50 << "\n";
    return 0;
}
Also I used clang-7.0 as the compiler.
If you have any other approach, let me know and I will test that too. If you find a mistake, do let me know and I will correct it as soon as possible.

C++ process time [duplicate]

This question already has answers here: Measuring execution time of a function in C++ (14 answers). Closed 4 years ago.
In C# I would fire up the Stopwatch class to do some quick-and-dirty timing of how long certain methods take.
What is the equivalent of this in C++? Is there a high precision timer built in?
I used boost::timer for measuring the duration of an operation. It provides a very easy way to do the measurement, and at the same time it is platform independent. Here is an example:
boost::timer myTimer;
doOperation();
std::cout << myTimer.elapsed();
P.S. To overcome precision errors, it would be great to measure operations that take a few seconds. Especially when you are trying to compare several alternatives. If you want to measure something that takes very little time, try putting it into a loop. For example run the operation 1000 times, and then divide the total time by 1000.
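As a minimal sketch of that loop-and-divide suggestion (doOperation() is the same hypothetical operation as in the example above):

boost::timer myTimer;
const int repetitions = 1000;
for (int i = 0; i < repetitions; ++i)
    doOperation();                            // the operation being measured
std::cout << myTimer.elapsed() / repetitions; // average seconds per call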
I've implemented a timer for situations like this before: I actually ended up with a class with two different implementations, one for Windows and one for POSIX.
The reason was that Windows has the QueryPerformanceCounter() function which gives you access to a very accurate clock which is ideal for such timings.
On POSIX, however, this isn't available, so I just used boost.datetime's classes to store the start and end times and then calculated the duration from those. It offers a "high resolution" timer, but the resolution is undefined and varies from platform to platform.
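The Windows side of that approach looks roughly like the following sketch (not the answerer's actual class, just the QueryPerformanceCounter() pattern it describes):

#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, begin, finish;
    QueryPerformanceFrequency(&freq);     // counts per second of the performance counter
    QueryPerformanceCounter(&begin);

    // ... code being timed ...

    QueryPerformanceCounter(&finish);
    double seconds = double(finish.QuadPart - begin.QuadPart) / double(freq.QuadPart);
    std::cout << "Took " << seconds << " s\n";
    return 0;
}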
I use my own version of Python's time_it function. The advantage of this function is that it repeats a computation as many times as necessary to obtain meaningful results. If the computation is very fast, it will be repeated many times. In the end you obtain the average time of all the repetitions. It does not use any non-standard functionality:
#include <ctime>

double clock_diff_to_sec(long clock_diff)
{
    return double(clock_diff) / CLOCKS_PER_SEC;
}

template<class Proc>
double time_it(Proc proc, int N = 1) // returns time in microseconds
{
    std::clock_t const start = std::clock();
    for (int i = 0; i < N; ++i)
        proc();
    std::clock_t const end = std::clock();
    if (clock_diff_to_sec(end - start) < .2)
        return time_it(proc, N * 5);
    return clock_diff_to_sec(end - start) * (1e6 / N);
}
The following example uses the time_it function to measure the performance of different STL containers:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
#include <list>
#include <deque>
#include <set>
#include <unordered_set>   // provides std::tr1::unordered_set on VS2008 SP1 (on GCC: <tr1/unordered_set>)
#include <boost/bind.hpp>

void dummy_op(int i)
{
    if (i == -1)
        std::cout << i << "\n";
}

template<class Container>
void test(Container const & c)
{
    std::for_each(c.begin(), c.end(), &dummy_op);
}

template<class OutIt>
void init(OutIt it)
{
    for (int i = 0; i < 1000; ++i)
        *it = i;
}

int main(int argc, char ** argv)
{
    {
        std::vector<int> c;
        init(std::back_inserter(c));
        std::cout << "vector: "
                  << time_it(boost::bind(&test<std::vector<int> >, c)) << "\n";
    }
    {
        std::list<int> c;
        init(std::back_inserter(c));
        std::cout << "list: "
                  << time_it(boost::bind(&test<std::list<int> >, c)) << "\n";
    }
    {
        std::deque<int> c;
        init(std::back_inserter(c));
        std::cout << "deque: "
                  << time_it(boost::bind(&test<std::deque<int> >, c)) << "\n";
    }
    {
        std::set<int> c;
        init(std::inserter(c, c.begin()));
        std::cout << "set: "
                  << time_it(boost::bind(&test<std::set<int> >, c)) << "\n";
    }
    {
        std::tr1::unordered_set<int> c;
        init(std::inserter(c, c.begin()));
        std::cout << "unordered_set: "
                  << time_it(boost::bind(&test<std::tr1::unordered_set<int> >, c)) << "\n";
    }
}
In case anyone is curious here is the output I get (compiled with VS2008 in release mode):
vector: 8.7168
list: 27.776
deque: 91.52
set: 103.04
unordered_set: 29.76
You can use the ctime library to get the time in seconds. Getting the time in milliseconds is implementation-specific. Here is a discussion exploring some ways to do that.
See also: How to measure time in milliseconds using ANSI C?
High-precision timers are platform-specific and so aren't specified by the C++ standard, but there are libraries available. See this question for a discussion.
I humbly submit my own micro-benchmarking mini-library (on Github). It's super simple -- the only advantage it has over rolling your own is that it already has the high-performance timer code implemented for Windows and Linux, and abstracts away the annoying boilerplate.
Just pass in a function (or lambda), the number of times it should be called per test run (default: 1), and the number of test runs (default: 100). The fastest test run (measured in fractional milliseconds) is returned:
// Example that times the compare-and-swap atomic operation from C++11
// Sample GCC command: g++ -std=c++11 -DNDEBUG -O3 -lrt main.cpp microbench/systemtime.cpp -o bench
#include "microbench/microbench.h"
#include <cstdio>
#include <atomic>

int main()
{
    std::atomic<int> x(0);
    int y = 0;
    printf("CAS takes %.4fms to execute 100000 iterations\n",
        moodycamel::microbench(
            [&]() { x.compare_exchange_strong(y, 0); }, /* function to benchmark */
            100000, /* iterations per test run */
            100 /* test runs */
        )
    );
    // Result: Clocks in at 1.2ms (12ns per CAS operation) in my environment
    return 0;
}
#include <time.h>
clock_t start, end;
start = clock();
//Do stuff
end = clock();
printf("Took: %f\n", (float)((end - start) / (float)CLOCKS_PER_SEC));
This might be an OS-dependent issue rather than a language issue.
If you're on Windows then you can access a timer with roughly 10- to 16-millisecond resolution through GetTickCount() or GetTickCount64(). Just call it once at the start and once at the end, and subtract.
That was what I used before if I recall correctly. The linked page has other options as well.
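A minimal sketch of that pattern, assuming nothing beyond the Windows API calls named above:

#include <windows.h>
#include <iostream>

int main()
{
    ULONGLONG begin = GetTickCount64();   // milliseconds since boot, ~10-16 ms resolution

    // ... code being timed ...

    std::cout << "Took " << (GetTickCount64() - begin) << " ms\n";
    return 0;
}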
You may find this class useful.
Using the RAII idiom, it prints the text given at construction when the destructor is called, filling the elapsed-time placeholder with the proper value.
Example of use:
int main()
{
    trace_elapsed_time t("Elapsed time: %ts.\n");
    usleep(1.005 * 1e6);
}
Output:
Elapsed time: 1.00509s.
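The linked class itself is not reproduced here; the following is a simplified sketch of the same RAII idea using std::chrono, with a hypothetical name and a plain label instead of the %t placeholder handling:

#include <chrono>
#include <iostream>
#include <string>

// Hypothetical simplified stand-in for trace_elapsed_time: prints the given
// label followed by the elapsed seconds when it goes out of scope.
class scoped_elapsed_time
{
public:
    explicit scoped_elapsed_time(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
    ~scoped_elapsed_time()
    {
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start_;
        std::cout << label_ << elapsed.count() << "s\n";
    }
private:
    std::string label_;
    std::chrono::time_point<std::chrono::steady_clock> start_;
};

int main()
{
    scoped_elapsed_time t("Elapsed time: ");
    // ... work being timed; usleep(1.005 * 1e6) in the original example ...
}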

How to develop a program that uses only one single core?

I want to know how to properly implement a program in C++ in which I have a function func that I want to be executed in a single thread. I want to do this because I want to test the single-core speed of my CPU. I will loop this function (func) about 20 times, record the execution time of each repetition, then sum the results and get the average execution time.
#include <thread>

int func(long long x)
{
    int div = 0;
    for (long i = 1; i <= x / 2; i++)
        if (x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    std::thread one_thread (func, 100000000);
    one_thread.join();
    return 0;
}
So, in this program, is func executed on a single particular core?
Here is the source code of my program:
#include <iostream>
#include <thread>
#include <iomanip>
#include <windows.h>
#include "font.h"
#include "timer.h"

using namespace std;

#define steps 20

int func(long long x)
{
    int div = 0;
    for (long i = 1; i <= x / 2; i++)
        if (x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    SetFontConsolas();          // Set font consolas
    ShowConsoleCursor(false);   // Turn off the cursor
    timer t;
    short int number = 0;
    cout << number << "%";
    for (int i = 0; i < steps; i++)
    {
        t.restart();            // start recording
        std::thread one_thread (func, 100000000);
        one_thread.join();      // wait function return
        t.stop();               // stop recording
        t.record();             // save the time in vector
        number += 5;
        cout << "\r ";
        cout << "\r" << number << "%";
    }
    double time = 0.0;
    for (int i = 0; i < steps; i++)
        time += t.times[i];     // sum all recorded times
    time /= steps;              // get the average execution time
    cout << "\nExecution time: " << fixed << setprecision(4) << time << '\n';
    double score = 0.0;
    score = (1.0 * 100) / time; // calculating benchmark score
    cout << "Score: ";
    SetColor(12);
    cout << setprecision(2) << score << " pts";
    SetColor(15);
    cout << "\nPress any key to continue.\n";
    cin.get();
    return 0;
}
No, your program has at least two threads: main, and the one you've created to run func. Moreover, neither of these threads is guaranteed to be executed on a particular core. Depending on the OS scheduler, they may switch cores in an unpredictable manner, though the main thread will mostly just wait. If you want to lock thread execution to a particular core, you need to set the thread's core affinity with some platform-specific method such as SetThreadAffinityMask on Windows. But you don't really need to go that deep, because there is no core-switch-sensitive code in your example. There is not even a need to spawn a separate thread dedicated to the calculations.
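For completeness, here is a Windows-only sketch of the affinity approach mentioned above, pinning the calling thread to logical core 0 via SetThreadAffinityMask:

#include <windows.h>
#include <iostream>

int main()
{
    // Pin the calling thread to logical core 0 (bit 0 of the affinity mask).
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0)
        std::cerr << "SetThreadAffinityMask failed\n";

    // ... run and time func(100000000) here; it now stays on core 0 ...
    return 0;
}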
If your program doesn't have multiple threads in the source and if the compiler does not insert automatic parallelization, the program should run on a single core (at a time).
Now depending on your compiler you can use appropriate optimization levels to ensure that it doesn't parallelize.
On the other hand what might happen is that the compiler can completely eliminate the loop in the function if it can statically compute the result. That however doesn't seem to be the issue with your case.
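If that ever does become an issue, one common guard (a sketch, not something the question's code needs today) is to store the result somewhere the optimizer must respect, for example a volatile object:

#include <iostream>

// Writing the result to a volatile object forces the compiler to actually
// compute it, so the benchmark loop cannot be optimized away entirely.
volatile int sink;

int func(long long x)
{
    int div = 0;
    for (long long i = 1; i <= x / 2; i++)
        if (x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    sink = func(100000000);   // the result is "used", so the work must happen
    std::cout << sink << '\n';
    return 0;
}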
I don't think any C++ compiler makes use of multiple cores behind your back. There would be large language issues in doing that. If you neither spawn threads nor use a parallel library such as MPI, the program should execute on only one core.

How do I properly use "time.h" to return a number greater than zero? c++

I am trying to calculate the time a certain function takes to run
#include <iostream>
#include <cstdlib>
#include <ctime>
#include "time.h"

int myFunction(int n)
{
    .............
}

int n;
time_t start;

std::cout << "What number would you like to enter ";
std::cout << std::endl;
std::cin >> n;

start = clock();
std::cout << myFunction(n) << std::endl;
std::cout << "Time it took: " << (clock() - start) / (double)(CLOCKS_PER_SEC / 1000) << std::endl;
std::cout << std::endl;
This works fine in Xcode (getting numbers such as 4.2, 2.6, ...), but not on a Linux-based server, where I'm always getting 0. Any ideas why that is and how to fix it?
The "tick" of clock may be more than 1/CLOCKS_PER_SEC seconds, for example it could be 10ms, 15.832761ms or 32 microseconds.
If the time consumed by your code is smaller than this "tick", then the time taken will appear to be zero.
There's no simple way to find out what that tick is, other than calling clock repeatedly until the return value differs from the previous one; that may not be entirely reliable, and if the clock tick is VERY small it may not be accurate in that direction, but if the tick is quite long, you may be able to find out.
For measuring very short times (a few milliseconds), assuming the function is entirely CPU/memory-bound and not spending time waiting for file I/O or sending/receiving packets over a network, std::chrono can be used to measure the time. For extremely short times, the processor's time-stamp counter can also be used, although that can be quite tricky, because it varies in speed depending on load and can have different values between different cores.
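As a minimal sketch of that std::chrono suggestion applied to the question's snippet (myFunction and n are the question's own names):

#include <chrono>
#include <iostream>

int main()
{
    auto begin = std::chrono::steady_clock::now();

    // std::cout << myFunction(n) << std::endl;   // the call being timed

    auto finish = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(finish - begin).count();
    std::cout << "Time it took: " << us / 1000.0 << " ms" << std::endl;
    return 0;
}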
In my compiler project, I'm using this to measure the time of things:
This part goes into a header:
class TimeTraceImpl;    // defined in the source file below
extern bool timetrace;  // global switch that enables/disables timing

class TimeTrace
{
public:
    TimeTrace(const char *func) : impl(0)
    {
        if (timetrace)
        {
            createImpl(func);
        }
    }
    ~TimeTrace() { if (impl) destroyImpl(); }
private:
    void createImpl(const char *func);
    void destroyImpl();
    TimeTraceImpl *impl;
};
This goes into a source file.
#include <chrono>
#include <iostream>
#include <iomanip>
#include <cstdint>

class TimeTraceImpl
{
public:
    TimeTraceImpl(const char *func) : func(func)
    {
        start = std::chrono::steady_clock::now();
    }
    ~TimeTraceImpl()
    {
        end = std::chrono::steady_clock::now();
        uint64_t elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cerr << "Time for " << func << " "
                  << std::fixed << std::setprecision(3) << elapsed / 1000.0 << " ms" << std::endl;
    }
private:
    std::chrono::time_point<std::chrono::steady_clock> start, end;
    const char* func;
};

void TimeTrace::createImpl(const char *func)
{
    impl = new TimeTraceImpl(func);
}

void TimeTrace::destroyImpl()
{
    delete impl;
}
The reason for the rather complex pImpl implementation is that I don't want to burden the code with extra work when the timing is turned off (timetrace is false).
Of course, the smallest actual tick of std::chrono also varies, but in most Linux implementations, it will be nanoseconds or some small multiple thereof, so (much) better precision than clock.
The drawback is that it measures the elapsed time, not the CPU-usage. This is fine for when the bottleneck is the CPU and memory, but not for things that depend on external hardware to perform something [unless you actually WANT that measurement].
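To make that distinction concrete, here is a small sketch (not part of the TimeTrace code above) contrasting CPU time from std::clock with wall-clock time from std::chrono around a call that merely waits:

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main()
{
    std::clock_t c0 = std::clock();                  // CPU time used by this process
    auto w0 = std::chrono::steady_clock::now();      // wall-clock time

    std::this_thread::sleep_for(std::chrono::milliseconds(500));  // waits, uses almost no CPU

    double cpu_s  = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double wall_s = std::chrono::duration<double>(std::chrono::steady_clock::now() - w0).count();
    std::cout << "CPU time:  " << cpu_s  << " s\n";  // close to 0
    std::cout << "Wall time: " << wall_s << " s\n";  // about 0.5
    return 0;
}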